Quickstart
Prerequisites
Python >=3.6 environment on your local machine
databricks-cli with a configured profile
This guide is based on Databricks Runtime 7.3 LTS ML.
If you don’t need ML libraries, we still recommend using the ML-based runtime because of its %pip magic support.
Installing dbx
Install dbx via pip:
pip install dbx
Starting from a template (Python)
If you already have an existing project, you can skip this step and move directly to the next one.
dbx comes with a set of pre-defined templates and a command to use them straight away.
However, if you don’t like the project structure defined in the provided templates, feel free to follow the instructions below for full customization.
Configuring environments
Move the shell into the project directory and configure dbx.
Note
dbx heavily relies on databricks-cli and uses the same set of profiles.
Please configure your profiles in advance using the databricks configure command as described here.
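For example, a profile can be created with a personal access token (the profile name here is illustrative):

```shell
# Prompts for the Databricks host and a personal access token,
# then stores them under the named profile in ~/.databrickscfg
databricks configure --token --profile test
```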
Create a new environment configuration via the following command:
dbx configure \
--profile=test # name of your profile, omit if you would like to use the DEFAULT one
This command writes the project configuration to the .dbx/project.json file. Feel free to repeat this command multiple times to reconfigure the environment.
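For reference, the generated file has roughly the following shape (keys and values here are illustrative; inspect your own .dbx/project.json for the exact contents):

```json
{
  "environments": {
    "test": {
      "profile": "test",
      "workspace_dir": "/Shared/dbx/projects/sample",
      "artifact_location": "dbfs:/dbx/sample"
    }
  }
}
```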
Preparing Deployment Config
The next step is to configure your deployment objects. The main idea of the deployment file is to provide a flexible way to configure jobs and their dependencies. To make this process easy and flexible, we support two options for defining the configuration:
JSON:
conf/deployment.json
YAML:
conf/deployment.(yml|yaml)
If one of the above options is located in the project root directory, it will be auto-discovered; otherwise, you will need to explicitly point dbx deploy to the file using the --deployment-file=./path/to/file.(json|yml|yaml) option.
The --deployment-file option also allows you to use multiple different deployment files.
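For instance, a config stored outside the default locations can be passed explicitly (the file path below is illustrative):

```shell
# Deploy using an explicitly specified deployment file
dbx deploy --deployment-file=./conf/deployment-prod.yml --environment=default
```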
Note
Within the deployment config, if you find duplicated parts, such as cluster definitions, retry configs, or permissions, that are hard to manage, we recommend using either YAML or Jsonnet.
YAML is supported natively by dbx, whereas with Jsonnet you are responsible for generating the JSON file through the Jsonnet compilation process.
Note
dbx supports passing environment variables into both JSON- and YAML-based deployment files. Please read more about this functionality here.
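As a rough illustration of the idea (not dbx’s actual implementation; its placeholder syntax is covered in the linked docs), substituting environment variables into a JSON config can be sketched with the standard library alone:

```python
import json
import os
from string import Template

# Hypothetical deployment fragment with a $-style placeholder
raw = '{"new_cluster": {"node_type_id": "$NODE_TYPE", "num_workers": 2}}'

os.environ["NODE_TYPE"] = "i3.xlarge"  # normally set by the shell or CI system
config = json.loads(Template(raw).substitute(os.environ))
print(config["new_cluster"]["node_type_id"])  # i3.xlarge
```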
Note
Since version 0.4.1, dbx supports Jinja2 rendering for JSON- and YAML-based configurations.
JSON
Here are some samples of deployment files for different cloud providers (AWS, Azure, and GCP, respectively):
{
"default": {
"jobs": [
{
"name": "your-job-name",
"new_cluster": {
"spark_version": "7.3.x-cpu-ml-scala2.12",
"node_type_id": "i3.xlarge",
"aws_attributes": {
"first_on_demand": 0,
"availability": "SPOT"
},
"num_workers": 2
},
"libraries": [],
"max_retries": 0,
"spark_python_task": {
"python_file": "tests/deployment-configs/placeholder_1.py"
}
}
]
}
}
{
"default": {
"jobs": [
{
"name": "your-job-name",
"new_cluster": {
"spark_version": "7.3.x-cpu-ml-scala2.12",
"node_type_id": "Standard_F4s",
"num_workers": 2
},
"libraries": [],
"max_retries": 0,
"spark_python_task": {
"python_file": "path/to/entrypoint.py"
}
}
]
}
}
{
"default": {
"jobs": [
{
"name": "your-job-name",
"new_cluster": {
"spark_version": "7.3.x-cpu-ml-scala2.12",
"node_type_id": "n1-highmem-4",
"num_workers": 2
},
"libraries": [],
"max_retries": 0,
"spark_python_task": {
"python_file": "path/to/entrypoint.py"
}
}
]
}
}
The expected structure of the deployment file is the following:
{
// you may have multiple environments defined per one deployment.json file
"<environment-name>": {
"jobs": [
// here goes a list of jobs, every job is one dictionary
{
"name": "this-parameter-is-required!",
// everything else is as per Databricks Jobs API
// however, you might reference any local file (such as entrypoint or job configuration)
"spark_python_task": {
"python_file": "path/to/entrypoint.py" // references entrypoint file relatively to the project root directory
},
"parameters": [
"--conf-file",
"conf/test/sample.json" // references configuration file relatively to the project root directory
]
}
]
}
}
As you can see, we simply follow the Databricks Jobs API with one enhancement: any local files can be referenced, and they will be uploaded to DBFS in a versioned way during the dbx deploy command.
YAML
You can define re-usable definitions in YAML. Here is an example YAML file and its JSON equivalent:
Note
The YAML file needs to have a top-level environments key under which all environments are listed.
The rest of the definition is the same as for the JSON-based config: it follows the Databricks Jobs API, with the same automatic versioning and upload of local files referenced within the config.
# http://yaml.org/spec/1.2/spec.html
# https://learnxinyminutes.com/docs/yaml/
custom:
basic-cluster-props: &basic-cluster-props
spark_version: "7.3.x-cpu-ml-scala2.12"
node_type_id: "some-node-type"
aws_attributes:
first_on_demand: 0
availability: "SPOT"
basic-auto-scale-props: &basic-auto-scale-props
autoscale:
min_workers: 2
max_workers: 5
basic-static-cluster: &basic-static-cluster
new_cluster:
<<: *basic-cluster-props
num_workers: 2
basic-autoscale-cluster: &basic-autoscale-cluster
new_cluster:
<<: # merge these two maps and place them here.
- *basic-cluster-props
- *basic-auto-scale-props
basic-cluster-libraries: &basic-cluster-libraries
libraries:
- pypi:
package: "pydash"
environments:
default:
jobs:
- name: "your-job-name"
<<: *basic-static-cluster
libraries: []
max_retries: 0
spark_python_task:
python_file: "tests/deployment-configs/placeholder_1.py"
- name: "your-job-name-2"
<<: *basic-static-cluster
libraries: []
max_retries: 0
spark_python_task:
python_file: "tests/deployment-configs/placeholder_2.py"
- name: "your-job-name-3"
<<:
- *basic-static-cluster
- *basic-cluster-libraries
max_retries: 5
spark_python_task:
python_file: "tests/deployment-configs/placeholder_2.py"
- name: "your-job-name-4"
<<:
- *basic-autoscale-cluster
- *basic-cluster-libraries
max_retries: 5
spark_python_task:
python_file: "tests/deployment-configs/placeholder_2.py"
{
"default": {
"jobs": [
{
"name": "your-job-name",
"new_cluster": {
"spark_version": "7.3.x-cpu-ml-scala2.12",
"node_type_id": "some-node-type",
"aws_attributes": {
"first_on_demand": 0,
"availability": "SPOT"
},
"num_workers": 2
},
"libraries": [],
"max_retries": 0,
"spark_python_task": {
"python_file": "tests/deployment-configs/placeholder_1.py"
}
},
{
"name": "your-job-name-2",
"new_cluster": {
"spark_version": "7.3.x-cpu-ml-scala2.12",
"node_type_id": "some-node-type",
"aws_attributes": {
"first_on_demand": 0,
"availability": "SPOT"
},
"num_workers": 2
},
"libraries": [],
"max_retries": 0,
"spark_python_task": {
"python_file": "tests/deployment-configs/placeholder_2.py"
}
},
{
"name": "your-job-name-3",
"new_cluster": {
"spark_version": "7.3.x-cpu-ml-scala2.12",
"node_type_id": "some-node-type",
"aws_attributes": {
"first_on_demand": 0,
"availability": "SPOT"
},
"num_workers": 2
},
"libraries": [
{
"pypi": {"package": "pydash"}
}
],
"max_retries": 5,
"spark_python_task": {
"python_file": "tests/deployment-configs/placeholder_2.py"
}
},
{
"name": "your-job-name-4",
"new_cluster": {
"spark_version": "7.3.x-cpu-ml-scala2.12",
"node_type_id": "some-node-type",
"aws_attributes": {
"first_on_demand": 0,
"availability": "SPOT"
},
"autoscale": {
"min_workers": 2,
"max_workers": 5
}
},
"libraries": [
{
"pypi": {"package": "pydash"}
}
],
"max_retries": 5,
"spark_python_task": {
"python_file": "tests/deployment-configs/placeholder_2.py"
}
}
]
}
}
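If you want to see how the anchors and merge keys above resolve, a quick local check is possible with PyYAML (assuming the pyyaml package is installed; the snippet below is a simplified, self-contained analogue of the config):

```python
import yaml  # third-party: pip install pyyaml

doc = """
custom:
  basic-cluster-props: &basic-cluster-props
    spark_version: "7.3.x-cpu-ml-scala2.12"
    num_workers: 2
environments:
  default:
    jobs:
      - name: "your-job-name"
        new_cluster:
          <<: *basic-cluster-props
        max_retries: 0
"""
# safe_load resolves the &anchor / <<: *alias merge into a plain dict
job = yaml.safe_load(doc)["environments"]["default"]["jobs"][0]
print(job["new_cluster"]["spark_version"])  # 7.3.x-cpu-ml-scala2.12
```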
Interactive execution
Note
dbx expects that the cluster used for interactive execution supports the %pip and %conda magic commands.
The dbx execute command executes a given job on an interactive cluster.
You need to provide either --cluster-id or --cluster-name, and a --job parameter.
dbx execute \
--cluster-name=some-name \
--job=your-job-name
You can also provide parameters to install .whl packages before launching code from the source file, as well as to install dependencies from a pip-formatted requirements file or a conda environment YAML config.
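For example (the --requirements-file flag name is an assumption here; verify it against dbx execute --help for your dbx version):

```shell
# Execute a job on an interactive cluster, installing pip requirements first.
# --requirements-file is illustrative; confirm with `dbx execute --help`.
dbx execute \
  --cluster-name=some-name \
  --job=your-job-name \
  --requirements-file=requirements.txt
```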
Deployment
After you’ve configured the deployment file, it’s time to perform an actual deployment:
dbx deploy \
--environment=test
You can optionally provide a requirements.txt file; all requirements will be automatically added to the job definition. Please refer to the full description of the deploy command in the CLI section for more options on setup.
Launch
Finally, after deploying all your job-related files, you can launch the job via the following command:
dbx launch --environment=test --job=sample
Please refer to the full description of the launch command in the CLI section for more options.
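If you want the command to wait for the run and report its status, dbx launch supports tracing:

```shell
# --trace makes dbx poll the triggered run until it finishes
dbx launch --environment=test --job=sample --trace
```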