Databricks: CI/CD Python Library to Share Functionality (2/3)

Tibor Vekony
8 min read · Jan 20, 2022


In my [previous article], I explored another option to share code / functionality between Databricks notebooks: using a Shared Function Notebook.

Summary of the Article

In this part of the series, I'll showcase a method to share code / functions between notebooks in Databricks using a custom Python library (other languages could be used as well).

Instead of repeating the same code over and over again, it's much easier to maintain & reuse it if it's defined in a single place.

Problem Statement

In many cases, I've seen that the first 30–40% of a Databricks notebook is basically prep work: defining variables & functions, importing common libraries, environment configuration, etc.

This adds unnecessary clutter to the notebooks, which makes them less readable. After a while, copy-pasting the same setup/configuration steps for each notebook gets tedious. Not to mention that if the setup changes, each notebook needs to be changed individually.

How to Share Functions Across Notebooks

I’ll showcase three ways to share code between Notebooks in Databricks — with their pros & cons:

  1. [Creating a shared functions notebook.]
  2. Compiling a library from code & uploading it.
  3. [Creating a library & uploading the code — no compilation needed.]

In this article, we’ll take a look at Option 2) Compiling a library from code & uploading it.

Compiling a Library

First, we’ll need a Python library, which encapsulates the functions, classes & other components we want to use in the Databricks Notebooks.

In this demo, I'm using a very simple Python library that I've uploaded to [GitHub]. This library (SharedFunctionsLib) contains only 4 files:

  • __init__.py :: An empty file that lets the Python interpreter know the directory is a Python package whose modules can be imported
  • setup.py :: A Python script that ships with the library and describes it to Python's packaging tools (name, version, which packages to include), so the library can be built & installed correctly; a minimal sketch follows after this list.
  • akv.py :: This is a module in the custom library that I've written. Its name can be whatever you'd like; I named it “akv” (Azure Key Vault) because it contains a helper class for working with AKV in Databricks.
  • miscellaneous.py :: Like akv.py, this is a custom module in the library. It contains only a single miscellaneous function (hence the name), which can later be imported & used in Databricks.
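
To make that structure concrete, here is a minimal, hedged sketch of what such a setup.py can look like. The metadata values are illustrative; the actual file in the GitHub repo may differ.

```python
# setup.py -- minimal packaging script (illustrative sketch; the real repo may differ)
from setuptools import setup, find_packages

setup(
    name="SharedFunctionsLib",       # becomes part of the compiled .whl file's name
    version="0.0.1",                 # bump this whenever the shared code changes
    description="Shared helper functions for Databricks notebooks",
    packages=find_packages(),        # picks up every directory that contains an __init__.py
)
```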

Now that we’ve encapsulated all the functions & classes in a Python library, it needs to be complied, before it can be installed onto a Databricks cluster & used in Notebooks.

The library could be compiled on your desktop computer, then manually uploaded to Databricks to be installed on a cluster, but instead of all those manual steps, let's set up a CI/CD pipeline.

Continuous Integration (CI)

To automatically compile the library after changes are applied to the main branch of the GitHub repo (though it could be any branch, or even tags; it all depends on the trigger set in the build pipeline), I'll use an Azure Build pipeline:

Azure (Build) Pipeline in Azure DevOps.

After hitting the “Create Pipeline” button, Azure DevOps will take you through a fairly simple setup:

  • Select the location of your code: in my case it’s GitHub
  • Then, select the repo that contains the library to compile
  • Next, select the type of configuration: Python Package. This will create a nice template pipeline in YAML, as a starting point.
Template Azure Build Pipeline written in YAML.

I’ll slightly tweak this pipeline to work with my minimalistic Python library:

YAML definition of the Azure Build pipeline used to compile the Python library.

The trigger is the main branch of the GitHub repo: whenever it changes, the build pipeline is executed.

The library is built for multiple Python versions (2.7, 3.5, 3.6, 3.7) to ensure compatibility.

In the first script step (“Install Tools”), I'm installing the tools on the agent that are needed to compile the code it pulls from the GitHub repo. These are wheel and setuptools, hinting that we'll end up with a .whl file that we can import into Databricks.

The second script step (“Building Package”) compiles the library, using the tools installed in the previous step. This step uses the setup.py file, discussed earlier. The output (compiled library) is placed into a “dist” folder.

In the third script step (“Copy built package from dist”), the compiled library is simply moved from the dist folder into a staging area, the contents of which will be output at the end of the build pipeline as an artifact.

In the last step, the build artifact, which contains the compiled library, is published. It could include other things as well, for example configuration files, scripts to be used in the Release (CD) Pipeline, etc.

This all depends on your requirements: you can put whatever you need in the Release (CD) Pipeline into this artifact for easy access; the libraries could even be zipped.

Contents of the artifact published by the Azure Build Pipeline. The artifact itself is named “SharedFunctionsLib”, the compiled Python library’s files are in the “libs” folder.
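
Putting the pieces above together, the whole build pipeline might look roughly like the following sketch. It's an assumption-laden reconstruction rather than a copy of the YAML in the screenshot: the agent image, the exact matrix syntax & the folder names may differ in your setup.

```yaml
# azure-pipelines.yml -- illustrative reconstruction of the build pipeline described above
trigger:
- main                                    # any change on main kicks off the build

pool:
  vmImage: 'ubuntu-latest'                # assumption: a Microsoft-hosted Linux agent

strategy:
  matrix:                                 # build against several Python versions for compatibility
    Python27:
      python.version: '2.7'
    Python35:
      python.version: '3.5'
    Python36:
      python.version: '3.6'
    Python37:
      python.version: '3.7'

steps:
- task: UsePythonVersion@0
  displayName: 'Use Python $(python.version)'
  inputs:
    versionSpec: '$(python.version)'

- script: python -m pip install --upgrade pip setuptools wheel
  displayName: 'Install Tools'

- script: python setup.py bdist_wheel     # writes the compiled .whl into dist/
  displayName: 'Building Package'

- script: |
    mkdir -p '$(Build.ArtifactStagingDirectory)/libs'
    cp dist/*.whl '$(Build.ArtifactStagingDirectory)/libs/'
  displayName: 'Copy built package from dist'

- task: PublishBuildArtifacts@1
  displayName: 'Publish Artifact'
  inputs:
    PathtoPublish: '$(Build.ArtifactStagingDirectory)'
    ArtifactName: 'SharedFunctionsLib'
```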

Continuous Deployment (CD)

Once the Azure Build Pipeline has run & the build artifact is available, it can be used in a Release (CD) Pipeline to deploy its contents.

Azure Release Pipelines menu in Azure DevOps.

Let’s create a new Release Pipeline, by clicking on the “New” button highlighted in the picture above, and selecting “New release pipeline” from the dropdown menu.

Azure Release Pipeline templates menu. Make sure to select an “Empty Job”!

It’ll prompt you to give the stage (named “Stage 1” by default) a proper name; I’ve named it “Databricks”. You can also rename the Release Pipeline itself by clicking on its default “New release pipeline” name.

Newly created empty Release Pipeline in Azure DevOps.

Next, we need to select the artifact created & published by the Build Pipeline. To do that, just click the “Add an artifact” button.

Adding a new artifact to the Release Pipeline. In the “Project” and “Source (build pipeline)” dropdown menus select your Project & the Build Pipeline created as part of the previous step!
Continuous deployment trigger on the Artifact in the Release Pipeline.

If you don’t want to manually trigger the Release Pipeline, set up a Continuous deployment trigger, which will create a release whenever a new build artifact is available.

In tandem with the Build Pipeline’s trigger, this means the Python library will be compiled & deployed whenever the code changes in the repo.

Next, define the steps to be executed (deployment steps) in the “Databricks” stage; there will be only 2:

Steps in the Databricks stage in the Release Pipeline.

The first step is to use a specific Python version; in my case it’s 3.7, but use whatever is compatible with your cluster in Databricks. Most likely it’ll be a 3.x version as well.

The second step is to actually take the .whl files from the artifact & copy them to Databricks’ DBFS (Databricks File System), as this is one of the places the cluster can search for libraries.

“Deploy Library to DBFS” step in the Databricks stage, in the Release Pipeline.

This step isn’t natively available among the tasks offered in Release Pipelines; it comes from a [free 3rd party plug-in, developed by Data Thirst].

If you’re in an organization, you might not have the required privileges to add it yourself, but considering that this plug-in shows up in Microsoft’s official documentation & is used by many, the request to add it usually goes through without any hiccups. (If it doesn’t, a scripted alternative is sketched after the configuration list below.)

As for the configuration of this step:

  • For the Azure Region select the region in which your Databricks Workspace is located. For me, it’s West US (“westus”).
  • For the Local Root Folder, specify the access path to the folder that you want to copy to DBFS. The easiest way to select this is to hit the 3 dots on the right-hand side of the text field & select it from the menu.
  • The Target folder in DBFS should be the path to the target folder in DBFS. I just copy the library into a “libs” folder in the root of DBFS; if it doesn’t exist yet, it’ll be created.
  • The Databricks bearer token authenticates this copy activity. To create one, follow the [official documentation] on Databricks’ site. It’s easy, though it requires elevated privileges; if you don’t have them & can’t create one, you’ll have to ask your admin to do so.
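
If the plug-in isn’t an option in your environment, the same copy can be scripted in a plain script step against the DBFS REST API. Below is a minimal, hedged sketch; the workspace URL, token, file name and target path are placeholders, not values taken from this demo.

```python
# Upload the compiled wheel to DBFS via the REST API (illustrative alternative to the plug-in).
import base64
import requests

DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"   # placeholder workspace URL
TOKEN = "<databricks-bearer-token>"                                 # created per the official docs
LOCAL_WHEEL = "SharedFunctionsLib-0.0.1-py3-none-any.whl"           # placeholder wheel file name
DBFS_TARGET = "/libs/SharedFunctionsLib-0.0.1-py3-none-any.whl"     # clusters see this as dbfs:/libs/...

with open(LOCAL_WHEEL, "rb") as f:
    contents = base64.b64encode(f.read()).decode("utf-8")

# /api/2.0/dbfs/put accepts inline, base64-encoded contents for small files, which a wheel like this is.
response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/dbfs/put",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"path": DBFS_TARGET, "contents": contents, "overwrite": True},
)
response.raise_for_status()
```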

Once all the above is done, a release can be created (i.e. the Release Pipeline executed), which will copy the compiled Python library to DBFS.

The compiled Python library in DBFS.

Click the arrow on the right-hand side of the compiled library (py3 for me: “SharedFunctionsLib-0.0.1-py3-no…”), then “Copy Path”, and make sure to copy the Spark API Format path:

File Paths to the compiled Python library in DBFS.

Now that the library is available in DBFS, it can be installed onto a cluster:

Installing the library to a cluster.

Make sure to select the correct Library Source (DBFS/ADLS) & Library Type (Python Whl)! Note: the cluster must be running to install libraries!

Once the library has finished installing on the cluster, it can be used in Notebooks:

Using the custom library in a Databricks Notebook.
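
As an illustration, usage in a notebook cell could look roughly like this. The import paths and names below are placeholders based on my demo modules; adjust them to whatever your library actually exposes.

```python
# The wheel installed on the cluster imports like any other Python package.
# All names below are illustrative placeholders -- they depend on how your modules are written.
from SharedFunctionsLib.miscellaneous import some_helper_function  # hypothetical shared function
from SharedFunctionsLib.akv import AkvHelper                       # hypothetical Key Vault helper class

print(some_helper_function())

akv = AkvHelper(dbutils)  # dbutils is available in every Databricks notebook
secret_value = akv.get_secret(scope="my-scope", key="my-secret")   # illustrative call
```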

This article is the second chapter of a 3-part series, in which I explore various options to share code / functions between Databricks notebooks. If you’d like to learn about the other options (creating & importing libraries, without compiling them), don’t forget to follow me to get notified when the next chapter is published!!!

If you’ve learned something new, share this article to show what you’ve just learned & to make sure others will see it as well!!!
