# Basic Python Template
To create a project from this template, please run the following command:

```
dbx init --template=python_basic
```
The new project will be located in a folder with the chosen project name.
## Project file structure
Your generated project will have some generic parts and some CI-tool-specific parts.
The clean project structure, without any CI-related files, looks like this:
```
.
├── .dbx
│   ├── lock.json    # please note that this file shall be ignored and not added to your git repository
│   └── project.json
├── .gitignore
├── README.md
├── conf
│   ├── deployment.yml
│   └── test
│       └── sample.yml
├── pytest.ini
├── sample_project
│   ├── __init__.py  # <- this is the root folder of your Python package
│   ├── common.py    # <- this file contains a generic class called Job, which provides all necessary tools, such as Spark and DBUtils
│   └── tasks
│       ├── __init__.py
│       └── sample_task.py
├── setup.py
└── tests
    └── unit
        └── sample_test.py
```
Here are some comments about this structure:

- `.dbx` is an auxiliary folder where metadata about environments and the execution context is located.
- `sample_project` is the Python package with your code (the directory name will follow your project name).
- `tests` is the directory with your package tests.
- `conf/deployment.yml` is the deployment configuration file. Please note that this file is used to configure the job deployment properties, such as dependent libraries, tasks, cluster sizes etc.

Please note that this project mostly follows the classical Python package structure as described here.
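Under the hood, the `Job` class in `common.py` typically wires together a Spark session, `dbutils`, a logger, and a configuration object for every concrete task. The sketch below is only an assumption-level illustration of that mechanism; the names and signatures may differ from the generated file:

```python
# A minimal sketch of what sample_project/common.py provides (assumed
# names and signatures; consult the generated file for the real ones).
import logging
from abc import ABC, abstractmethod

from pyspark.sql import SparkSession


class Job(ABC):
    """Base class that wires up Spark, DBUtils, logging, and config."""

    def __init__(self, spark=None, init_conf=None):
        # Reuse the session provided by Databricks, or build a local one.
        self.spark = spark or SparkSession.builder.getOrCreate()
        # Configuration can be injected directly, which is handy in tests.
        self.conf = init_conf or {}
        self.logger = logging.getLogger(self.__class__.__name__)
        self.dbutils = self._get_dbutils()

    def _get_dbutils(self):
        # DBUtils only exists inside a Databricks runtime;
        # local runs fall back to None.
        try:
            from pyspark.dbutils import DBUtils  # type: ignore
            return DBUtils(self.spark)
        except ImportError:
            return None

    @abstractmethod
    def launch(self):
        """Entry point that every concrete task must implement."""
```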
The `conf/deployment.yml` file is one of the main components that allows you to flexibly describe your job definitions. All CI tools in this template follow the same concepts, described in this section.
Depending on your choice of CI tool during project creation, your project structure will look like this:
```
<file structure as above>
├── .github
│   └── workflows
│       ├── onpush.yml
│       └── onrelease.yml
```
Some explanations regarding structure:
.github/workflows/
- workflow definitions for GitHub Actions, in particular:
.github/workflows/onpush.yml
defines the CI pipeline logic
.github/workflows/onrelease.yml
defines the CD pipeline logic
```
<file structure as above>
├── azure-pipelines.yml
```
Some explanations regarding the structure:

- `azure-pipelines.yml` is the Azure DevOps Pipelines workflow definition with both the CI and CD pipeline logic.
```
<file structure as above>
├── .gitlab-ci.yml
```
Some explanations regarding the structure:

- `.gitlab-ci.yml` is the GitLab CI/CD workflow definition with both the CI and CD pipeline logic.
After generating the project, we'll need to set up the local development environment.
## Local development environment
Create a new conda environment and activate it:

```
conda create -n <your-environment-name> python=3.7.5
conda activate <your-environment-name>
```
If you would like to be able to run local unit tests, you'll need a JDK. If you don't have one, it can be installed, for example, via `conda`:

```
conda install -c conda-forge openjdk
```
Move the shell to the project directory:

```
cd <project_name>
```
Install the package in development mode, so your IDE can provide you all required introspection:

```
pip install -e ".[dev]"
```
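The `dev` in `pip install -e ".[dev]"` refers to an extra defined in the generated `setup.py`. The exact dependency lists are template-specific, so treat the names below as assumptions; this is just a minimal sketch of the mechanism:

```python
# A minimal sketch of the relevant part of the generated setup.py
# (assumed contents; the template defines the actual dependency lists).
from setuptools import find_packages, setup

setup(
    name="sample_project",
    version="0.0.1",
    packages=find_packages(exclude=["tests", "tests.*"]),
    install_requires=[],  # runtime dependencies of the package itself
    extras_require={
        # development-only dependencies; this is the "dev" extra
        # that `pip install -e ".[dev]"` resolves
        "dev": ["pytest", "pytest-cov", "dbx"],
    },
)
```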
At this stage, you have the following:

- Configured Python package
- Configured environment for local development
## Running local tests and writing code
Now you can open the project in your IDE. Don't forget to point the IDE to the given conda environment for full code introspection.
Take a look at the code sample in `<project_name>/tasks/sample_task.py`. This entrypoint file contains an example of an implemented job, based on the generic abstract `Job` class. You can see that a configuration object named `self.conf` is referenced in this job; these parameters will be provided from the `conf/test/sample.yml` file during a Databricks run.
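For orientation, here is a minimal sketch of what such a task might look like; the `SampleJob` class name and the parameter names (`output_path`, `n_rows`) are assumptions for illustration, not necessarily what the template generates:

```python
# A minimal sketch of sample_project/tasks/sample_task.py (assumed
# names and parameters; the generated sample may differ).
from sample_project.common import Job


class SampleJob(Job):
    def launch(self):
        self.logger.info("Launching sample job")
        # These parameters are read from self.conf, which dbx populates
        # from conf/test/sample.yml during a Databricks run.
        output_path = self.conf["output_path"]
        n_rows = self.conf.get("n_rows", 100)
        df = self.spark.range(n_rows)
        df.write.format("parquet").mode("overwrite").save(output_path)
        self.logger.info("Sample job finished")


if __name__ == "__main__":
    SampleJob().launch()
```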
In the local test, you can override this configuration; please find examples in `tests/unit/sample_test.py`.
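Under the same assumed names as above, a minimal sketch of such a test could inject the configuration via the constructor and run against a local SparkSession instead of the values from `conf/test/sample.yml`:

```python
# A minimal sketch of a local unit test in tests/unit/sample_test.py
# (assumed structure; the generated test may differ).
from pyspark.sql import SparkSession

from sample_project.tasks.sample_task import SampleJob


def test_sample_job(tmp_path):
    # Local Spark session for the test run.
    spark = (
        SparkSession.builder.master("local[1]")
        .appName("unit-tests")
        .getOrCreate()
    )
    output_path = str(tmp_path / "output")
    # Override the configuration instead of reading conf/test/sample.yml.
    job = SampleJob(
        spark=spark,
        init_conf={"output_path": output_path, "n_rows": 10},
    )
    job.launch()
    assert spark.read.parquet(output_path).count() == 10
```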
To launch the local tests, simply use the `pytest` framework from the root directory of the project:

```
pytest tests/unit --cov <project_name>
```
At this stage, you have the following:

- Configured Python package
- Configured environment for local development
- Python package is tested locally
Now, it’s time to launch our code on the Databricks clusters.
## Running your code on the Databricks clusters
To upload your code from the local environment to Databricks and execute it, there are multiple options:

- execute your code on an interactive (also called all-purpose) cluster
- launch your code as a job on an automated cluster
- launch your code as a job on an interactive cluster
The third option is generally a bad idea, for a very simple reason: your local package will be installed at the cluster-wide level, which means that:

- other users won't be able to override your code unless you restart the interactive cluster
- you won't be able to install another version of the same library unless you restart the interactive cluster
Therefore, we'll consider only the first two options.
Option #1 (execution on an interactive cluster) is suitable when you would like to run your code on an interactive cluster during development to verify that the code works properly in a real environment. Your library will be installed in a separate context, which means that other users won't be affected, and you will still be able to install newer versions.
Use this command to execute a specific job on an interactive cluster:

```
dbx execute --job=<job-name> --cluster-name=<cluster-name>
```
Now, if you would like to launch your job on an automated cluster, you will probably want to configure some specific cluster properties, such as size, environment etc. To do this, please take a look at the `conf/deployment.yml` file. In general, this file follows the Databricks API structures, but it has some additional features, described throughout this documentation.
After setting the configuration in the deployment file, it's time to launch the job. However, we probably don't want to affect the real job object in our environment. Instead, we're going to perform a so-called jobless deployment by providing the `--files-only` option. Please take a look at this section for more details:

```
dbx deploy --jobs=<job-name> --files-only
```
Now the job can be launched in run-submit mode:

```
dbx launch --as-run-submit --job=<job-name>
```
At this stage, you have the following:

- Configured Python package
- Configured environment for local development
- Python package is tested locally
- Job has been launched on an interactive cluster
- Job has been deployed and launched in jobless (also called ephemeral or run-submit) mode
## Setting up the CI tool
Depending on your CI tool, please follow the corresponding instructions:
Please do the following:

1. Create a new repository on GitHub
2. Configure the `DATABRICKS_HOST` and `DATABRICKS_TOKEN` secrets for your project in the GitHub UI
3. Add a remote origin to the local repo
4. Push the code
5. Open the GitHub Actions page for your project to verify the state of the deployment pipeline

Warning

There is no need to manually create releases via the UI. The release pipeline will create the release automatically.
Please do the following:

1. Create a new repository on GitHub or in Azure DevOps (or in any Azure DevOps-compatible git system)
2. Configure the `DATABRICKS_HOST` and `DATABRICKS_TOKEN` secrets for your project in Azure DevOps. Note that secret variables must be mapped to env as mentioned in the official documentation.
3. Push the code
4. Open the Azure DevOps UI to check the deployment status
Please do the following:

1. Create a new repository on GitLab
2. Configure the `DATABRICKS_HOST` and `DATABRICKS_TOKEN` secrets for your project in the GitLab UI
3. Add a remote origin to the local repo
4. Push the code
5. Open the GitLab CI/CD UI to check the deployment status
Please note that to create a release and deploy the job in normal mode, tag the latest commit in the main branch and push the tags:

```
git fetch
git checkout main
git pull
git tag -a v0.0.1 -m "Release for version 0.0.1"
git push --tags
```