Databricks: Share Functions across Notebooks (1/3)

Tibor Vekony
3 min read · Jan 13, 2022

Summary of the Article

In this series of articles, I’ll showcase a few methods to share functions & variables between Databricks notebooks, in order to reduce code repetition and increase consistency, efficiency & reusability.

Problem Statement

In many cases, I’ve seen that the first 30–40% of a Databricks notebook is basically prep work: defining variables & functions, importing common libraries, configuring the environment, etc.

This adds unnecessary clutter to the notebooks and makes them less readable. After a while, copy-pasting the same setup/configuration steps into each notebook gets tedious. Not to mention that, if the setup changes, every notebook needs to be updated individually.

How to Share Functions Across Notebooks

I’ll showcase three ways to share code between Notebooks in Databricks — with their pros & cons:

  1. Creating a shared functions notebook.
  2. Compiling a library from code & uploading it.
  3. Creating a library & uploading the code, no compilation needed.

In this article, we’ll take a look at Option 1) Creating a shared functions notebook.

Creating a Shared Functions Notebook

This is a beginner-friendly, straightforward approach.

Using the “%run” magic command, a notebook can execute another notebook. When it does, the variables & functions defined in the executed notebook become available in the calling notebook as well.
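For example, a cell like the one below (the notebook name and path are illustrative) pulls a _Shared notebook into the current notebook’s execution context. Note that %run has to be the only code in its cell.

```python
# Execute the _Shared notebook inline, in this notebook's execution context.
# A relative path works when both notebooks sit in the same folder;
# an absolute workspace path (e.g. /Shared/_Shared) works as well.
%run ./_Shared
```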

This happens because, in the background, whenever a notebook is attached to a Spark cluster, an execution context is created, which holds the state of the REPL environments (one for each supported programming language: SQL, Python, Scala, R).

When Notebook A executes Notebook B, Notebook B runs in the same execution context that was created when Notebook A was attached to the cluster. As a result, the two notebooks share the same state and can access the same variables, functions, etc.

Let’s see a simple example:

“_Main” Notebook in Databricks, which will execute another Notebook (cmd3).
“_Shared” Notebook in Databricks, containing the variables, functions, setup process, etc. to be shared across other Notebooks.
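Since the screenshots themselves aren’t reproduced here, here is a minimal sketch of what the two notebooks could look like (notebook, variable & function names are illustrative, not necessarily the exact ones used in the screenshots):

```python
# _Shared notebook
# cmd1: a variable & a function to be reused by other notebooks
shared_variable = "defined in _Shared"

def print_both(caller_value):
    # Print a value passed in by the calling notebook, then the shared one
    print(caller_value)
    print(shared_variable)

# cmd2: this works when _Shared is invoked from _Main via %run, because the
# execution context (and therefore main_variable) is shared; run standalone,
# it would raise a NameError
print(main_variable)
```

```python
# _Main notebook
# cmd2: a variable local to _Main
main_variable = "defined in _Main"

# cmd3: execute _Shared in the same execution context
# (in the real notebook, %run sits alone in its own cell):
#   %run ./_Shared

# cmd4: the function & variable defined in _Shared are now available here
print_both(main_variable)
```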

The expected results are:

  • To see the variable defined in the _Main Notebook printed as the result of cmd3.
  • To see both the variable defined in _Main and the one defined in _Shared printed, using the function defined in _Shared, as the result of cmd4.

Results of the test, as expected.

One very important limitation: because each programming language supported in Databricks (SQL, Python, Scala, R) has its own REPL environment, variables defined in one language cannot be accessed by commands written in another. E.g.: variables defined in Python cannot be accessed by commands written in Scala.

To somewhat circumvent this limitation, the values of variables can be written to the preferred storage (to files) and read back by commands written in another language.
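As a rough sketch of that workaround (path and value are illustrative): a Python cell can persist a value to DBFS with dbutils, and a cell in another language can read the same file back.

```python
# Python cell: write the value to a file on DBFS so that cells written
# in other languages can pick it up (True = overwrite an existing file)
dbutils.fs.put("/tmp/shared_state/run_date.txt", "2022-01-13", True)

# A Scala cell could then read it back, e.g.:
#   %scala
#   val runDate = dbutils.fs.head("/tmp/shared_state/run_date.txt")
```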

Summary (Pros & Cons)

Shared Functions Notebook:

  • [+] Easy to set up: just encapsulate the shared code in a separate notebook.
  • [+] Readable to the naked eye; no need to compile/build it. No mandatory auxiliary files (license, README, setup, etc.).
  • [+] Familiar source control process, as it’s just another notebook.
  • [-] If it gets large, it becomes hard to maintain.
  • [-] Running it at the start of each execution, to get it into the execution context, can add a lot of overhead and slow down the overall process.

This article is the first chapter of a 3-part series, in which I explore various options to share code/functions between Databricks notebooks. If you’d like to learn about the other options (creating & importing libraries), don’t forget to follow me to get notified when the next chapter is published!!!

If you’ve learned something new, share this article to show what you’ve just learned & to make sure others will see it as well!!!
