Create a CI/CD pipeline for Databricks using Azure DevOps

Written by Philippe Dupart | 24-Feb-2021 23:00:00

Prerequisites

In Azure:

A Resource Group with a Databricks instance
An Azure DevOps Repo
Configure your repo following this tutorial
Create a Databricks Access Token

CI/CD pipeline

The goal of the CI pipeline is to ensure the validity of the code. To do so, we start by testing and checking the code with tools such as Pytest and Black.

Due to the specificity of our project, we had to run a “CI Integration Test” job in Databricks to validate the code.

This job trains an AI model and outputs a score: if the score is above a threshold the job succeeds; otherwise it fails.
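
The pass/fail rule of such a job can be sketched as follows. This is only an illustration, not the project's actual notebook: the function name `check_score` and the default threshold of 0.8 are hypothetical.

```python
def check_score(score: float, threshold: float = 0.8) -> float:
    """Return the score if it is above the threshold, otherwise raise.

    An uncaught exception in a notebook makes the Databricks job run fail,
    which is how the CI pipeline can detect a model that is not good enough.
    """
    if score <= threshold:
        raise ValueError(
            f"Score {score:.3f} is not above the threshold {threshold:.3f}"
        )
    return score
```

Calling `check_score(model_score)` as the last step of the notebook would then turn a low score into a failed run.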

Running this job takes several steps:

Deploy notebooks in a temporary folder in your Databricks workspace
Deploy the “CI” Job linked to a notebook in the temporary folder
Run the “CI” Job and wait for its results

Deploy Notebooks

When we started the project, the feature that links a Git repo to a Databricks workspace was still in Preview, so we chose to add all our notebooks to our Git repository.

Now that we have our Notebooks in our Repository we need to synchronize them with our Databricks workspace.

To deploy all our notebooks in a temporary folder we use the databricks CLI:
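
The original command is not reproduced here. As a rough sketch, assuming the legacy Databricks CLI and its `workspace import_dir` command, the call could be assembled like this. The folder names `notebooks` and `/tmp/ci/notebooks` are hypothetical placeholders.

```python
from typing import List


def import_dir_command(source: str, target: str, profile: str = "AZDO") -> List[str]:
    """Build the legacy Databricks CLI call that uploads a local folder."""
    return [
        "databricks", "workspace", "import_dir",
        source,                # local folder holding the notebooks
        target,                # temporary folder in the Databricks workspace
        "--overwrite",         # replace notebooks left over from a previous run
        "--profile", profile,  # profile name written to ~/.databrickscfg
    ]


# Hypothetical paths, for illustration only.
command = import_dir_command("notebooks", "/tmp/ci/notebooks")
print(" ".join(command))
```

In a pipeline step, the list could be handed to `subprocess.run(command, check=True)` so that a failed upload fails the step.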

You might notice the unusual profile name; it comes from the Azure DevOps task!

When you install the Databricks CLI using the task provided by Azure DevOps, it does not configure the default profile but a profile called AZDO in the “~/.databrickscfg” file.

Deploy and Run a CI job

To deploy the job we will use the dbx CLI, which helps you deploy jobs with your library attached to them. If you’ve followed the prerequisites, your repo should already be configured with dbx.

Here is a short description of the folders in your repo:

src: contains the library that will be packaged and attached to your jobs
conf: contains the definitions of all jobs
tools: contains the dbx installer
notebooks: contains all our notebooks, required for the previous CI stage
tests: contains all the unit tests

First, you need to define the job you want to deploy using dbx. By default, the configuration file is ‘conf/deployment.json’. Several examples are given in the dbx tutorial. If you want to reuse a job you’ve already created, the dbx job definition is nearly identical to the settings section of the job description returned by the following command:
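
The command itself is not preserved in this copy of the post. One command that returns a job description with a `settings` section is `databricks jobs get --job-id <id>` from the legacy Databricks CLI; assuming that is the one meant, the section could be extracted as sketched below. The sample JSON is a hypothetical, heavily shortened job description.

```python
import json
from typing import Any, Dict


def job_settings(job_description_json: str) -> Dict[str, Any]:
    """Extract the "settings" section from a job description."""
    description = json.loads(job_description_json)
    return description["settings"]


# Hypothetical, shortened output of `databricks jobs get --job-id 42`.
sample = '{"job_id": 42, "settings": {"name": "ci-integration-test", "max_retries": 0}}'
print(json.dumps(job_settings(sample), indent=2))
```

The printed section is a starting point for the dbx job definition, not a drop-in replacement: the two formats are described above as nearly identical, not identical.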

Once you’ve defined a job, you need to configure dbx:

Then you need to deploy the selected job:

Now your job is deployed and you can see it in the Job interface in Databricks. The last step is to run it:

If you use the “--trace” option, the Azure DevOps task fails when the job run fails.
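
The three dbx steps above (configure, deploy, launch) could be scripted roughly as follows. This is a sketch under assumptions: apart from `--trace`, which is mentioned above, the flag names (`--profile`, `--jobs`, `--job`) are recalled from dbx releases of that period and differ in other versions, so check `dbx --help` before relying on them. The job name is a hypothetical placeholder.

```python
from typing import List


def dbx_configure(profile: str = "AZDO") -> List[str]:
    """Point dbx at the CLI profile created by the Azure DevOps task."""
    return ["dbx", "configure", "--profile", profile]


def dbx_deploy(job_name: str) -> List[str]:
    """Deploy one job defined in conf/deployment.json."""
    return ["dbx", "deploy", f"--jobs={job_name}"]


def dbx_launch(job_name: str, trace: bool = True) -> List[str]:
    """Launch the deployed job; with trace, dbx waits for the run result."""
    command = ["dbx", "launch", f"--job={job_name}"]
    if trace:
        command.append("--trace")
    return command


def ci_commands(job_name: str) -> List[List[str]]:
    """Return the three commands in the order the pipeline runs them."""
    return [dbx_configure(), dbx_deploy(job_name), dbx_launch(job_name)]


# "ci-integration-test" is a hypothetical job name.
for command in ci_commands("ci-integration-test"):
    print(" ".join(command))
```

Running each command with `subprocess.run(command, check=True)` lets a non-zero exit code from the traced launch stop the pipeline.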

Deploy in a new workspace

As you can see, our CI pipeline is now quite complete. In fact, to deploy to a new environment you only need to reuse some of its steps.

Configure the Databricks CLI and dbx
Update notebooks
Update jobs
Restart jobs (this needs a little scripting to retrieve all job names)
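
The small script mentioned in the last step could look like the sketch below. It assumes the job list comes from the legacy CLI command `databricks jobs list --output JSON`, whose output wraps the jobs in a top-level `jobs` array; the sample data is made up.

```python
import json
from typing import List


def job_names(jobs_list_json: str) -> List[str]:
    """Return the sorted names of all jobs found in a jobs-list JSON document."""
    payload = json.loads(jobs_list_json)
    # A workspace without jobs returns no "jobs" key at all.
    return sorted(job["settings"]["name"] for job in payload.get("jobs", []))


# Hypothetical, shortened output of `databricks jobs list --output JSON`.
sample_jobs = json.dumps({
    "jobs": [
        {"job_id": 2, "settings": {"name": "train-model"}},
        {"job_id": 1, "settings": {"name": "ingest-data"}},
    ]
})
print(job_names(sample_jobs))
```

Each returned name can then be passed to the launch step of the pipeline.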

Configure Databricks

Configuring our Databricks Workspace took a lot more time than expected. We’ve encountered the following issues.

How to remove credentials from the code

The best solution I found is to link a Databricks Secret Scope with an Azure Key Vault.

Here is a link to the official documentation on how to do it.

Once you have configured the Secret Scope, you can use dbutils to access all the linked Key Vault secrets. Here is an example:

If you have multiple Databricks workspaces to separate your environments, you can create a Secret Scope with the same name in each workspace, each one linked to the Key Vault of the corresponding environment!
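
The original example is not reproduced here; a minimal sketch using `dbutils.secrets.get` could look like this. The scope name `my-scope` and the key `db-password` are hypothetical. Since `dbutils` only exists inside a Databricks notebook or job, it is passed in as a parameter so the function can also be exercised elsewhere.

```python
def read_secret(dbutils, key: str, scope: str = "my-scope") -> str:
    """Read one Key Vault secret through the linked Databricks Secret Scope."""
    return dbutils.secrets.get(scope=scope, key=key)


# Inside a notebook, where `dbutils` is predefined:
# password = read_secret(dbutils, "db-password")
```

Because the scope name is the only environment-specific part, the same code runs unchanged in every workspace that defines a scope with that name.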

How Azure pipeline can access Databricks

Using a predefined Databricks token is not the best solution in terms of security and durability. To fix this issue we needed a service account to log in to Databricks. I found this documentation, which explains how to grant a service principal access to Databricks.

Now that you have a service principal that can access Databricks, you need to generate a Databricks token. To do so, you can follow this documentation.
Be aware that this token is only valid for a short time, so you will need to add a step to your pipeline that generates it!
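
As an illustration of such a pipeline step, the sketch below prepares the Azure AD client-credentials request for a service principal. It assumes the token in question is an Azure AD access token requested for the Azure Databricks resource; the tenant ID, client ID and secret are placeholders, and no network call is made here.

```python
from typing import Tuple
from urllib.parse import urlencode

# Application ID of the Azure Databricks resource in Azure AD.
AZURE_DATABRICKS_RESOURCE_ID = "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d"


def aad_token_request(tenant_id: str, client_id: str, client_secret: str) -> Tuple[str, bytes]:
    """Build the URL and form body of the client-credentials token request."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    form = {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": f"{AZURE_DATABRICKS_RESOURCE_ID}/.default",
    }
    return url, urlencode(form).encode("utf-8")


# Placeholder values; real ones would come from pipeline secret variables.
url, body = aad_token_request("my-tenant-id", "my-client-id", "my-client-secret")
print(url)
```

Posting `body` to `url` returns a JSON document whose `access_token` field can be used as the bearer token for the Databricks CLI and REST API until it expires.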

All-purpose clusters library management

This issue may be the most time-consuming one I had. When users started to deploy jobs regularly in our environment, clusters began failing to install the libraries needed by the jobs.

To fix this issue, we added several steps to our pipeline:

Stop all running jobs: this avoids problems with streaming jobs or jobs that are writing to a file (we ended up with a corrupted file once).
Remove all dbx artifacts: this prevents jobs that restart on a schedule from installing an old artifact.
Uninstall the libraries and restart the clusters.
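
The steps above can be sketched as an ordered list of Databricks REST API 2.0 calls. This is not the pipeline we actually ran: the function name, the artifact path and the idea of building a plan before sending it are illustrative assumptions, and dbx artifacts are assumed to live in a DBFS folder.

```python
from typing import Any, Dict, List, Tuple

ApiCall = Tuple[str, str, Dict[str, Any]]  # (HTTP method, endpoint, JSON body)


def cleanup_plan(
    active_run_ids: List[int],
    artifact_path: str,
    cluster_ids: List[str],
    libraries: List[Dict[str, str]],
) -> List[ApiCall]:
    """List the API calls for the cleanup, in the order they should be sent."""
    plan: List[ApiCall] = []
    # 1. Stop all running jobs so nothing is writing while we clean up.
    for run_id in active_run_ids:
        plan.append(("POST", "/api/2.0/jobs/runs/cancel", {"run_id": run_id}))
    # 2. Remove old dbx artifacts so restarted jobs cannot install them.
    plan.append(("POST", "/api/2.0/dbfs/delete", {"path": artifact_path, "recursive": True}))
    # 3. Uninstall the libraries, then restart each cluster to apply the change.
    for cluster_id in cluster_ids:
        plan.append(("POST", "/api/2.0/libraries/uninstall",
                     {"cluster_id": cluster_id, "libraries": libraries}))
        plan.append(("POST", "/api/2.0/clusters/restart", {"cluster_id": cluster_id}))
    return plan
```

The IDs of active runs could come from `/api/2.0/jobs/runs/list` with `active_only=true`; sending each call with the pipeline's bearer token is left out of the sketch.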

To avoid this issue altogether, I strongly recommend using Job Compute, which is also much cheaper!

Deploying this CI/CD pipeline was quite challenging, but in the end the feedback from the data scientists is great and deployment to a new environment is fully automated.

This tutorial helps you create a CI/CD pipeline on an existing infrastructure. The next step will be to turn this infrastructure into IaC (Infrastructure as Code). A Databricks provider exists in Terraform for this purpose. If you wish to implement it, I advise you to read this article on Terraform first!
