Basic Python Template

To create a project from this template, please run the following command:

dbx init --template=python_basic

The new project will be located in a folder with the chosen project name.

Project file structure

The generated project will contain some generic parts, while others are specific to the chosen CI tool.

The clean project structure, without any CI-related files, looks like this:

.
├── .dbx
│   ├── lock.json  # please note that this file shall be ignored and not added to your git repository.
│   └── project.json
├── .gitignore
├── README.md
├── conf
│   ├── deployment.yml
│   └── test
│       └── sample.yml
├── pytest.ini
├── sample_project
│   ├── __init__.py # <- this file marks the root of your Python package
│   ├── common.py # <- this file contains a generic class called Job, which provides you with all the necessary tools, such as Spark and DBUtils
│   └── jobs
│       ├── __init__.py
│       └── sample
│           ├── __init__.py
│           └── entrypoint.py
├── setup.py
├── tests
│   ├── integration
│   │   └── sample_test.py
│   └── unit
│       └── sample_test.py
└── unit-requirements.txt

Here are some comments about this structure:

  • .dbx is an auxiliary folder where metadata about environments and the execution context is stored.

  • sample_project - Python package with your code (the directory name will follow your project name)

  • tests - directory with your package tests

  • conf/deployment.yml - deployment configuration file. It configures the job deployment properties, such as dependent libraries, tasks, cluster sizes, etc.

Please note that this project mostly follows the classical Python package structure as described here.
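
For orientation, the Job class in sample_project/common.py follows roughly the pattern below. This is a simplified sketch, not the exact generated code - consult common.py in your generated project for the authoritative implementation:

from abc import ABC, abstractmethod
import logging
from typing import Any, Dict, Optional

from pyspark.sql import SparkSession


class Job(ABC):
    """Base class that wires up Spark, DBUtils, logging and configuration."""

    def __init__(self, spark: Optional[SparkSession] = None, init_conf: Optional[Dict[str, Any]] = None):
        # Reuse an existing SparkSession (e.g. one created in a unit test),
        # otherwise get or create one via the builder.
        self.spark = spark or SparkSession.builder.getOrCreate()
        self.logger = logging.getLogger(self.__class__.__name__)
        # Configuration: either passed in directly (local tests) or, in the
        # generated template, read from the config file provided by dbx at
        # runtime (simplified away here).
        self.conf: Dict[str, Any] = init_conf or {}
        self.dbutils = self._get_dbutils()

    def _get_dbutils(self):
        # DBUtils is only available on Databricks; fall back to None locally.
        try:
            from pyspark.dbutils import DBUtils  # type: ignore
            return DBUtils(self.spark)
        except ImportError:
            return None

    @abstractmethod
    def launch(self):
        """Entry point to be implemented by each concrete job."""

Concrete jobs subclass Job and implement launch(), as the sample entrypoint described further below does.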

The conf/deployment.yml file is one of the main components; it allows you to flexibly describe your job definitions.

All CI tools supported by this template follow the same concepts described in this section.

Depending on the CI tool chosen during project creation, additional files will be added. With GitHub Actions, the project structure will look like this:

<file structure as above>
├── .github
│   └── workflows
│       ├── onpush.yml
│       └── onrelease.yml

Some explanations regarding this structure:

  • .github/workflows/ - workflow definitions for GitHub Actions, in particular:

    • .github/workflows/onpush.yml defines the CI pipeline logic

    • .github/workflows/onrelease.yml defines the CD pipeline logic

After generating the project, we’ll need to set up the local development environment.

Local development environment

  • Create a new conda environment and activate it:

conda create -n <your-environment-name> python=3.7.5
conda activate <your-environment-name>
  • If you would like to be able to run local unit tests, you’ll need a JDK. If you don’t have one, it can be installed via:

conda install -c anaconda "openjdk=8.0.152"
  • Change to the project directory:

cd <project_name>
  • Install dependencies for local development:

pip install -r unit-requirements.txt
  • Install the package in development mode, so your IDE can provide full code introspection:

pip install -e .

At this stage, you have the following:

  • Configured Python package

  • Configured environment for local development

Running local tests and writing code

Now, you can open the project in your IDE. Don’t forget to point the IDE to the conda environment you created, so that code introspection works fully.

Take a look at the code sample in <project_name>/jobs/sample/entrypoint.py. This entrypoint file contains an example of an implemented job, based on the abstract Job class. You can see that a configuration object named self.conf is referenced in this job - its parameters are provided from the conf/test/sample.yml file during a Databricks run. In local tests, you can override this configuration - see the examples in tests/unit/sample_test.py.
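
In simplified form, such a job looks roughly like the sketch below. The configuration keys (output_path, output_format) are illustrative - the exact keys and logic live in the generated entrypoint.py and conf/test/sample.yml:

from sample_project.common import Job


class SampleJob(Job):
    def launch(self):
        self.logger.info("Launching sample job")

        # Parameters come from self.conf: during a Databricks run they are
        # read from conf/test/sample.yml, in local tests they are passed in
        # directly via init_conf (the keys below are illustrative).
        output_path = self.conf["output_path"]
        output_format = self.conf["output_format"]

        df = self.spark.range(0, 100)
        df.write.format(output_format).mode("overwrite").save(output_path)

        self.logger.info("Sample job finished")


if __name__ == "__main__":
    # In the generated template, the configuration is loaded from the file
    # that dbx passes at runtime; this sketch omits that plumbing.
    SampleJob().launch()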

To launch local tests, simply use the pytest framework from the root directory of the project:

pytest tests/unit --cov <project_name>
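
A local unit test typically constructs a local SparkSession and passes an overridden configuration into the job via init_conf. The sketch below is illustrative (including the configuration keys); the generated tests/unit/sample_test.py is the authoritative example:

import tempfile
import unittest

from pyspark.sql import SparkSession

from sample_project.jobs.sample.entrypoint import SampleJob


class SampleJobUnitTest(unittest.TestCase):
    def setUp(self):
        # A local SparkSession - this is why a JDK is required for unit tests.
        self.spark = SparkSession.builder.master("local[1]").getOrCreate()
        self.test_dir = tempfile.TemporaryDirectory()
        # Override the configuration that conf/test/sample.yml would provide
        # during a Databricks run (keys are illustrative).
        self.test_config = {
            "output_format": "parquet",
            "output_path": self.test_dir.name,
        }
        self.job = SampleJob(spark=self.spark, init_conf=self.test_config)

    def test_sample(self):
        self.job.launch()
        output_count = (
            self.spark.read.format(self.test_config["output_format"])
            .load(self.test_config["output_path"])
            .count()
        )
        self.assertGreater(output_count, 0)

    def tearDown(self):
        self.test_dir.cleanup()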

At this stage, you have the following:

  • Configured Python package

  • Configured environment for local development

  • Python package is tested locally

Now, it’s time to launch our code on the Databricks clusters.

Running your code on the Databricks clusters

To upload your code from the local environment to Databricks and execute it, there are multiple options:

  • execute your code on an interactive (also called all-purpose) cluster

  • launch your code as a job on an automated cluster

  • launch your code as a job on an interactive cluster

In general, the third option is a bad idea, for a simple reason: your local package will be installed at the cluster-wide level, which means that:

  • other users won’t be able to override your code, unless you restart the interactive cluster

  • you won’t be able to install another version of the same library, unless you restart the interactive cluster

Therefore, we’ll consider only the first two options.

Option #1 (execution on an interactive cluster) is well suited for development, when you would like to run your code on an interactive cluster to verify that it works properly in a real environment. Your library will be installed in a separate context, which means that other users won’t be affected, and you will still be able to install newer versions.

Use this command to execute a specific job on an interactive cluster:

dbx execute --job=<job-name> --cluster-name=<cluster-name>

Now, if you would like to launch your job on an automated cluster, you will probably want to configure some specific cluster properties, such as size, environment, etc. To do this, please take a look at the conf/deployment.yml file. In general, this file follows the Databricks API structures, but it has some additional features, described throughout this documentation.

After setting the configuration in the deployment file, it’s time to launch the job. However, we probably don’t want to affect the real job object in our environment yet. Instead, we’ll perform a so-called jobless deployment by providing the --files-only option. Please take a look at this section for more details:

dbx deploy --jobs=<job-name> --files-only

Now the job can be launched in a run submit mode:

dbx launch --as-run-submit --job=<job-name>

At this stage, you have the following:

  • Configured Python package

  • Configured environment for local development

  • Python package is tested locally

  • Job has been launched on an interactive cluster

  • Job has been deployed and launched in a jobless (also called ephemeral or run-submit) mode

Setting up the CI tool

Depending on your CI tool, please choose the corresponding instructions. For GitHub Actions, do the following:

  • Create a new repository on GitHub

  • Configure DATABRICKS_HOST and DATABRICKS_TOKEN secrets for your project in GitHub UI

  • Add a remote origin to the local repo

  • Push the code

  • Open the GitHub Actions for your project to verify the state of the deployment pipeline

Warning

There is no need to manually create releases via the GitHub UI. The release pipeline will create the release automatically.

Please note that to create a release and deploy the job in normal mode, tag the latest commit on the main branch and push the tags:

git fetch
git checkout main
git pull
git tag -a v0.0.1 -m "Release for version 0.0.1"
git push --tags