In Azure:
The goal of the CI pipeline is to ensure the validity of the code. To do so, we start by testing and linting the code: Pytest, Black …
Due to the specificities of our project, we also had to run a “CI Integration Test” job in Databricks to validate the code.
This job trains an AI model and outputs a score: if the score is above a threshold, the job succeeds; otherwise, it fails.
Executing this job requires several steps:
When we started the project, the feature to link a Git repository to a Databricks workspace was still in preview. So, we chose to add all our notebooks to our Git repository.
Now that we have our notebooks in our repository, we need to synchronize them with our Databricks workspace.
To deploy all our notebooks to a temporary folder, we use the Databricks CLI:
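A minimal sketch of that step with the classic Databricks CLI, assuming the notebooks live in a notebooks/ folder of the repository and are pushed to a temporary workspace folder (both paths are examples):

```bash
# Import the repository's notebooks into a temporary workspace folder,
# overwriting any previous version (paths are placeholders).
databricks workspace import_dir --profile AZDO --overwrite \
  ./notebooks /Shared/ci-tmp-notebooks
```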
You might notice the unusual profile specified here: it comes from the Azure DevOps tasks!
When you install the Databricks CLI using the task provided by Azure DevOps, it will not configure the default profile but a profile called AZDO in the “~/.databrickscfg” file.
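For reference, the resulting ~/.databrickscfg looks roughly like this (host and token values are placeholders):

```ini
[AZDO]
host = https://adb-1234567890123456.7.azuredatabricks.net
token = dapi0123456789abcdef0123456789abcdef
```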
To deploy the job we will use the dbx CLI, a tool that helps you deploy jobs with your library attached to them. If you’ve followed the prerequisites, you should already have your repository configured with dbx.
Here is a short description of the folders in your repo:
First, you need to define the job you want to deploy using dbx. By default, the configuration file is ‘conf/deployment.json’. Several examples are given in the dbx tutorial. If you want to reuse a job you’ve already created, the dbx job definition is nearly identical to the settings section of that job’s description, which you can retrieve with the Databricks CLI.
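For example (the job ID is a placeholder, and AZDO is the profile created earlier):

```bash
# Print an existing job's JSON description; its "settings" section maps closely
# to a dbx job definition in conf/deployment.json.
databricks jobs get --profile AZDO --job-id 123
```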
Once you’ve configured a job, You need to configure dbx:
Then you need to deploy the selected job.
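For instance, assuming the job is named ci-integration-test in conf/deployment.json (older dbx versions use --jobs, newer ones take the workflow name as an argument):

```bash
# Deploy only the CI job; dbx builds the project's package and attaches it.
dbx deploy --environment default --jobs ci-integration-test
```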
Now your job is deployed and you can see it in the Jobs interface in Databricks. The last step is to run it.
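With dbx, that looks roughly like this (same hypothetical job name):

```bash
# Trigger the deployed job and follow the run; with --trace the command fails
# if the run fails, which in turn fails the Azure DevOps task.
dbx launch --environment default --job ci-integration-test --trace
```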
If you use the “--trace” option, the Azure DevOps task will fail whenever the job run fails.
As you can see, our CI pipeline is now quite complete. In fact, to deploy your environment you only need to reuse some of these steps:
Configure databricks CLI and dbx
Update Notebooks
Update Jobs
Restart jobs (this needs a little scripting to retrieve all job names; see the sketch below)
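For that last point, a possible sketch using the classic Databricks CLI and jq (in practice you will probably want to filter which jobs to restart):

```bash
# List every job in the workspace and trigger a new run for each of them.
databricks jobs list --profile AZDO --output JSON \
  | jq -r '.jobs[].job_id' \
  | while read -r job_id; do
      databricks jobs run-now --profile AZDO --job-id "$job_id"
    done
```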
Configuring our Databricks workspace took a lot more time than expected. We encountered the following issues.
For managing secrets, the best solution I found is to link a Databricks Secret Scope to an Azure Key Vault.
Here is a link to the official documentation on how to do it.
Once you are done configuring the Secret Scope, you can use dbutils to access all the linked Key Vault secrets.
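Here is an example (the scope and key names are placeholders; dbutils is available by default in Databricks notebooks):

```python
# Read a Key Vault-backed secret from a notebook.
db_password = dbutils.secrets.get(scope="my-keyvault-scope", key="db-password")
```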
If you have multiple Databricks workspaces to separate different environments, you can create a Secret Scope with the same name in each workspace, each linked to the Key Vault of the corresponding environment!
Using a predefined Databricks token is not the best solution in terms of security and durability. To fix this issue, we needed a service account to log in to Databricks. I found this documentation that will help you grant a service principal access to Databricks.
Now that you have a service principal that can access Databricks, you need to generate a Databricks token. To do so you can follow this documentation.
Be careful: this token is not valid for long, so you will need to add a step to your pipeline to generate it!
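A minimal sketch of such a step with the Azure CLI (the client ID, secret and tenant are pipeline variables; 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d is the well-known application ID of the Azure Databricks resource):

```bash
# Log in as the service principal, then request an Azure AD access token for the
# Azure Databricks resource; it can be used wherever a Databricks token is expected.
az login --service-principal -u "$CLIENT_ID" -p "$CLIENT_SECRET" --tenant "$TENANT_ID"
export DATABRICKS_TOKEN=$(az account get-access-token \
  --resource 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d \
  --query accessToken --output tsv)
```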
This issue may be the most time-consuming one I faced. When users started to deploy jobs regularly in our environment, clusters began to fail to install the libraries needed by the jobs.
To fix this issue, we added several steps to our pipeline:
To avoid this issue, I strongly recommend using Job Compute, which is also way cheaper!
Deploying this CI/CD pipeline was quite challenging, but in the end the feedback from the data scientists is great and deployment to a new environment is fully automated.
This tutorial helps you create a CI/CD pipeline on an already existing infrastructure. The next step will be to transform this existing infrastructure into IaC (Infrastructure as Code). To do so, a Databricks provider exists for Terraform. If you wish to implement it, I advise you to read this article on Terraform first!