# Basic Python Template
To create a project from this template, please run the following command:

```
dbx init --template=python_basic
```
The new project will be located in a folder with the chosen project name.
## Project file structure
Your generated project will have some generic parts and some CI-tool-specific parts.
The clean project structure, without any CI-related files, looks like this:
```
.
├── .dbx
│   ├── lock.json    # please note that this file shall be ignored and not added to your git repository
│   └── project.json
├── .gitignore
├── README.md
├── conf
│   ├── deployment.yml
│   └── test
│       └── sample.yml
├── pytest.ini
├── sample_project
│   ├── __init__.py  # <- this is the root folder of your Python package
│   ├── common.py    # <- this file contains a generic class called Job, which provides all necessary tools, such as Spark and DBUtils
│   └── tasks
│       ├── __init__.py
│       └── sample_task.py
├── setup.py
└── tests
    └── unit
        └── sample_test.py
```
Here are some comments about this structure:

- `.dbx` is an auxiliary folder where metadata about environments and the execution context is located.
- `sample_project` is the Python package with your code (the directory name will follow your project name).
- `tests` is the directory with your package tests.
- `conf/deployment.yml` is the deployment configuration file. Please note that this file is used to configure the job deployment properties, such as dependent libraries, tasks, cluster sizes etc.

Please note that this project mostly follows the classical Python package structure as described here.
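Under the hood, the `Job` class in `common.py` typically wires together a Spark session, `dbutils`, a logger, and a configuration object for every concrete task. The sketch below is only an assumption-level illustration of that mechanism; the names and signatures may differ from the generated file:

```python
# A minimal sketch of what sample_project/common.py provides (assumed
# names and signatures; consult the generated file for the real ones).
import logging
from abc import ABC, abstractmethod

from pyspark.sql import SparkSession


class Job(ABC):
    """Base class that wires up Spark, DBUtils, logging, and config."""

    def __init__(self, spark=None, init_conf=None):
        # Reuse the session provided by Databricks, or build a local one.
        self.spark = spark or SparkSession.builder.getOrCreate()
        # Configuration can be injected directly, which is handy in tests.
        self.conf = init_conf or {}
        self.logger = logging.getLogger(self.__class__.__name__)
        self.dbutils = self._get_dbutils()

    def _get_dbutils(self):
        # DBUtils only exists inside a Databricks runtime;
        # local runs fall back to None.
        try:
            from pyspark.dbutils import DBUtils  # type: ignore
            return DBUtils(self.spark)
        except ImportError:
            return None

    @abstractmethod
    def launch(self):
        """Entry point that every concrete task must implement."""
```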
The `conf/deployment.yml` file is one of the main components that allows you to flexibly describe your job definitions. All CI tools in this template follow the same concepts, described in this section.
Depending on your choice of CI tool during project creation, your project structure will look like this:
```
<file structure as above>
├── .github
│   └── workflows
│       ├── onpush.yml
│       └── onrelease.yml
```
Some explanations regarding structure:
.github/workflows/
- workflow definitions for GitHub Actions, in particular:
.github/workflows/onpush.yml
defines the CI pipeline logic
.github/workflows/onrelease.yml
defines the CD pipeline logic
```
<file structure as above>
├── azure-pipelines.yml
```
Some explanations regarding the structure:

- `azure-pipelines.yml` is the Azure DevOps Pipelines workflow definition with both the CI and CD pipeline logic.
```
<file structure as above>
├── .gitlab-ci.yml
```
Some explanations regarding the structure:

- `.gitlab-ci.yml` is the GitLab CI/CD workflow definition with both the CI and CD pipeline logic.
After generating the project, we'll need to set up the local development environment.
## Local development environment
Create a new conda environment and activate it:

```
conda create -n <your-environment-name> python=3.7.5
conda activate <your-environment-name>
```
If you would like to be able to run local unit tests, you'll need a JDK. If you don't have one, it can be installed, for example, via `conda`:

```
conda install -c conda-forge openjdk
```
Move the shell to the project directory:

```
cd <project_name>
```
Install the package in development mode, so your IDE can provide you all required introspection:

```
pip install -e ".[dev]"
```
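The `dev` in `pip install -e ".[dev]"` refers to an extra defined in the generated `setup.py`. The exact dependency lists are template-specific, so treat the names below as assumptions; this is just a minimal sketch of the mechanism:

```python
# A minimal sketch of the relevant part of the generated setup.py
# (assumed contents; the template defines the actual dependency lists).
from setuptools import find_packages, setup

setup(
    name="sample_project",
    version="0.0.1",
    packages=find_packages(exclude=["tests", "tests.*"]),
    install_requires=[],  # runtime dependencies of the package itself
    extras_require={
        # development-only dependencies; this is the "dev" extra
        # that `pip install -e ".[dev]"` resolves
        "dev": ["pytest", "pytest-cov", "dbx"],
    },
)
```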
At this stage, you have the following:

- Configured Python package
- Configured environment for local development
## Running local tests and writing code
Now you can open the project in your IDE. Don't forget to point the IDE to the given conda environment for full code introspection.
Take a look at the code sample in `<project_name>/tasks/sample_task.py`. This entrypoint file contains an example of an implemented job, based on the generic abstract `Job` class. You can see that a configuration object named `self.conf` is referenced in this job; these parameters will be provided from the `conf/test/sample.yml` file during a Databricks run.
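For orientation, here is a minimal sketch of what such a task might look like; the `SampleJob` class name and the parameter names (`output_path`, `n_rows`) are assumptions for illustration, not necessarily what the template generates:

```python
# A minimal sketch of sample_project/tasks/sample_task.py (assumed
# names and parameters; the generated sample may differ).
from sample_project.common import Job


class SampleJob(Job):
    def launch(self):
        self.logger.info("Launching sample job")
        # These parameters are read from self.conf, which dbx populates
        # from conf/test/sample.yml during a Databricks run.
        output_path = self.conf["output_path"]
        n_rows = self.conf.get("n_rows", 100)
        df = self.spark.range(n_rows)
        df.write.format("parquet").mode("overwrite").save(output_path)
        self.logger.info("Sample job finished")


if __name__ == "__main__":
    SampleJob().launch()
```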
In the local test, you can override this configuration; please find examples in `tests/unit/sample_test.py`.
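Under the same assumed names as above, a minimal sketch of such a test could inject the configuration via the constructor and run against a local SparkSession instead of the values from `conf/test/sample.yml`:

```python
# A minimal sketch of a local unit test in tests/unit/sample_test.py
# (assumed structure; the generated test may differ).
from pyspark.sql import SparkSession

from sample_project.tasks.sample_task import SampleJob


def test_sample_job(tmp_path):
    # Local Spark session for the test run.
    spark = (
        SparkSession.builder.master("local[1]")
        .appName("unit-tests")
        .getOrCreate()
    )
    output_path = str(tmp_path / "output")
    # Override the configuration instead of reading conf/test/sample.yml.
    job = SampleJob(
        spark=spark,
        init_conf={"output_path": output_path, "n_rows": 10},
    )
    job.launch()
    assert spark.read.parquet(output_path).count() == 10
```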
To launch the local tests, simply use the `pytest` framework from the root directory of the project:

```
pytest tests/unit --cov <project_name>
```
At this stage, you have the following:

- Configured Python package
- Configured environment for local development
- Python package is tested locally
Now, it’s time to launch our code on the Databricks clusters.
## Running your code on the Databricks clusters
To upload your code from the local environment to Databricks and execute it, there are multiple options:

- execute your code on an interactive (also called all-purpose) cluster
- launch your code as a job on an automated cluster
- launch your code as a job on an interactive cluster
The third option is generally a bad idea, for a very simple reason: your local package will be installed at the cluster-wide level, which means that:

- other users won't be able to override your code unless you restart the interactive cluster
- you won't be able to install another version of the same library unless you restart the interactive cluster
Therefore, we'll consider only the first two options.
Option #1 (execution on an interactive cluster) is suitable when you would like to run your code on an interactive cluster during development to verify that the code works properly in a real environment. Your library will be installed in a separate context, which means that other users won't be affected, and you will still be able to install newer versions.
Use this command to execute a specific job on an interactive cluster:

```
dbx execute --job=<job-name> --cluster-name=<cluster-name>
```
Now, if you would like to launch your job on an automated cluster, you will probably want to configure some specific cluster properties, such as size, environment etc. To do this, please take a look at the `conf/deployment.yml` file. In general, this file follows the Databricks API structures, but it has some additional features, described throughout this documentation.
After setting the configuration in the deployment file, it's time to launch the job. However, we probably don't want to affect the real job object in our environment. Instead, we're going to perform a so-called jobless deployment by providing the `--files-only` option. Please take a look at this section for more details:

```
dbx deploy --jobs=<job-name> --files-only
```
Now the job can be launched in run-submit mode:

```
dbx launch --as-run-submit --job=<job-name>
```
At this stage, you have the following:

- Configured Python package
- Configured environment for local development
- Python package is tested locally
- Job has been launched on an interactive cluster
- Job has been deployed and launched in jobless (also called ephemeral or run-submit) mode
## Setting up the CI tool
Depending on your CI tool, please follow the corresponding instructions:
Please do the following:

1. Create a new repository on GitHub
2. Configure the `DATABRICKS_HOST` and `DATABRICKS_TOKEN` secrets for your project in the GitHub UI
3. Add a remote origin to the local repo
4. Push the code
5. Open the GitHub Actions page for your project to verify the state of the deployment pipeline

Warning

There is no need to manually create releases via the UI. The release pipeline will create the release automatically.
Please do the following:

1. Create a new repository on GitHub or in Azure DevOps (or in any Azure DevOps-compatible git system)
2. Configure the `DATABRICKS_HOST` and `DATABRICKS_TOKEN` secrets for your project in Azure DevOps. Note that secret variables must be mapped to env as mentioned in the official documentation.
3. Push the code
4. Open the Azure DevOps UI to check the deployment status
Please do the following:

1. Create a new repository on GitLab
2. Configure the `DATABRICKS_HOST` and `DATABRICKS_TOKEN` secrets for your project in the GitLab UI
3. Add a remote origin to the local repo
4. Push the code
5. Open the GitLab CI/CD UI to check the deployment status
Please note that to create a release and deploy the job in normal mode, tag the latest commit in the main branch and push the tags:

```
git fetch
git checkout main
git pull
git tag -a v0.0.1 -m "Release for version 0.0.1"
git push --tags
```