Databricks: Share Functionality with Custom Python Library Without Compilation (3/3)
In my previous articles, I’ve explored options to share code / functionality between Databricks notebooks:
- Using a [Shared Function Notebook] in the 1st Part
- Writing a [custom Python library, then building and deploying it to DBFS using CI/CD pipelines] in the 2nd Part.
Summary of the Article
In the last part of the series, I’ll showcase a method to share code / functions between notebooks in Databricks, using a custom Python library (other languages could be used as well), but without compiling it.
Why is it good to encapsulate common classes, functions, configuration & other logic into a library? Well, instead of repeating the same code over and over again across dozens of places (in this case, dozens of notebooks), it’s much easier to maintain & use if it’s defined in a single place.
Problem Statement
In many cases, I’ve seen that the first 30–40% of a Databricks notebook is basically prep work: defining variables & functions, importing common libraries, environment configuration, etc.
This adds unnecessary clutter to the notebooks, which makes them less readable. After a while, copy-pasting the same setup/configuration steps into each notebook gets tedious. Not to mention that, if the setup changes, each notebook needs to be changed individually.
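To make this concrete, the kind of “prep” cell I’m talking about often looks something like the sketch below. The variable & function names here are hypothetical; it’s only meant to illustrate the boilerplate that tends to get copy-pasted between notebooks.

```python
# A typical setup cell repeated at the top of many notebooks (hypothetical example).
# `spark` is provided by the Databricks notebook runtime.
from pyspark.sql import functions as F

# Environment configuration (illustrative values)
STORAGE_ACCOUNT = "mystorageaccount"
CONTAINER = "raw"
BASE_PATH = f"abfss://{CONTAINER}@{STORAGE_ACCOUNT}.dfs.core.windows.net/"

# A helper function that ends up duplicated in every notebook
def read_delta(relative_path: str):
    """Read a Delta table from the configured base path."""
    return spark.read.format("delta").load(BASE_PATH + relative_path)
```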
How to Share Functions Across Notebooks
I’ll showcase three ways to share code between Notebooks in Databricks — with their pros & cons:
- [Creating a shared functions notebook.]
- [Compiling a library from code & uploading it.]
- Creating a library & uploading the code — no compilation needed.
In this article, we’ll take a look at Option 3) Creating a library & uploading the code — no compilation needed.
Creating a Library
First, we’ll need a Python library, which encapsulates the functions, classes & other components we want to use in the Databricks Notebooks.
In this demo, I’m using a very simple Python library that I’ve uploaded to [GitHub]. I used this library in the [previous article] as well, to showcase how to compile & deploy custom Python libraries to Databricks with CI/CD pipelines. The library (SharedFunctionsLib) contains only 4 files (a minimal sketch of the layout & modules follows the list):
__init__.py :: This is an empty file, which lets the Python interpreter know that the directory contains a Python package, so its modules can be imported.
setup.py :: This is a Python script that is usually shipped with Python libraries or programs. Its purpose is the correct installation of the software.
akv.py :: This is a module in the custom library that I’ve written. Its name can be whatever you’d like; I chose “akv” (Azure Key Vault), because it contains a helper class to work with AKV in Databricks.
miscellaneous.py :: Similarly to akv.py, this is a custom module in the library. It contains only a miscellaneous function (hence the name), which can later be imported & used in Databricks.
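To give a better feel for the structure, here’s a minimal sketch of the layout & of the two modules. The real contents are in the GitHub repo; the class & function below are only illustrative stand-ins, not the exact code.

```python
# SharedFunctionsLib/
# ├── __init__.py
# ├── setup.py
# ├── akv.py
# └── miscellaneous.py

# akv.py (illustrative sketch): a thin wrapper around Databricks secret scopes.
class AzureKeyVaultHelper:
    def __init__(self, dbutils, scope: str):
        # dbutils is passed in from the calling notebook, where it's available by default
        self._dbutils = dbutils
        self._scope = scope

    def get_secret(self, key: str) -> str:
        return self._dbutils.secrets.get(scope=self._scope, key=key)

# miscellaneous.py (illustrative sketch): a standalone helper function.
def add_constant_columns(df, columns: dict):
    """Add constant-valued columns to a Spark DataFrame."""
    from pyspark.sql import functions as F
    for name, value in columns.items():
        df = df.withColumn(name, F.lit(value))
    return df
```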
Now that we’ve encapsulated all the functions & classes in a Python library, I won’t compile it, unlike in the previous part.
Creating a Repo in Databricks
Databricks has had Git provider integration in the past as well, but it worked on the Notebook level & was rather clunky.
Databricks overhauled the old integration, and the new one became Generally Available in October 2021. It makes using Git providers with Databricks a whole lot easier: in short, reviewing & committing code is quicker, there’s less navigation between windows, etc.
Besides making Git usage a whole lot smoother, a new feature was added as well: the option to store arbitrary files in the Repo (Files in Repos).
Unlike the regular Workspace, this new feature allows us to store files in the Repo, including Python & R libraries, which we can then import in Notebooks within the same Repo.
Disclaimer: As of now (January 2022), the Files in Repos feature is in Public Preview and needs to be enabled by an admin of your Databricks Workspace! Obviously, Repos needs to be enabled as well; the option for that is right above the “Files in Repos” option.
Let’s create a Repo now; I’ll add the GitHub Repo with my Python library:
In the dialog window, simply paste the URL of your Repo, choose the correct Git provider from the dropdown menu (if the dialog couldn’t infer it for some reason) & change the name of the Repo if you’d like it to have an alternative name in Databricks.
When all is done, click Create & behold your very own new repo:
In the regular Databricks Workspace, only Notebooks, (compiled) Libraries, Folders & MLflow Experiments are allowed, but in Repos, any type of file can be stored. If I navigate into the “SharedFunctionsLib” folder, I can see the .py files there:
No further setup is needed: Notebooks within the same Repo (my-python-library in this case) can import the “SharedFunctionsLib” folder & use the functions, classes & other components within.
As you can see in the picture above, using from … import * statements, the Python library can be imported & used without us needing to compile it first.
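For reference, the import cell looks roughly like this (the package & module names are those of my demo library; swap in your own). Databricks adds the Repo root to the Python path for notebooks inside the Repo, which is what makes the import work.

```python
# Notebook cell in the same Repo as the SharedFunctionsLib folder
from SharedFunctionsLib.akv import *
from SharedFunctionsLib.miscellaneous import *

# From here on, the classes & functions defined in the library can be used
# directly in the notebook, without installing a compiled wheel or egg.
```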
Using custom libraries has never been easier! Besides that, files in Repos can be used to store small datasets (<100 MB, as of now) and to define an environment in a requirements.txt file, which can be run with pip to set up the environment.
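As a rough example, installing such an environment from a notebook could look like the cell below; the path is hypothetical, so replace the user & repo names with your own.

```python
# Notebook cell: install the packages listed in the Repo's requirements.txt
# (the path is illustrative; adjust it to your own user & repo)
%pip install -r /Workspace/Repos/<your-user>/my-python-library/requirements.txt
```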
Files smaller than 10 MB can also be edited within the Databricks Repo, though the editor doesn’t really have any options or features; I only ever use it to quickly check the contents of a file or to make minor edits.
Closing Thoughts
Depending on your current setup, you might use any one of the 3 options to share code / functionality that I’ve showcased in this series.
If you already have most of your environment configuration & common functions defined in a Notebook, you can [encapsulate that in a Shared Functions notebook] easily.
If you want the dynamic approach & absolute control over your libraries, how they are built, deployed & served, then you might want to [create a CI/CD pipeline to build & deploy your libraries].
If nothing is set in stone yet, you’re just starting out & want to be up & running ASAP, but you still want good version control & the flexibility to adapt to changes in the future, then consider using Databricks Repos’ Files in Repos feature to use libraries without compilation.
Resources / Interesting Things to Check Out:
This article is the third & final chapter of a 3-part series, in which I explored various options to share code / functions between Databricks notebooks. If you’d like to learn about other interesting things, tips & tricks, don’t forget to follow me!
If you’ve learned something new, share this article to show what you’ve just learned & to make sure others will see it as well!