Quickstart¶
Prerequisites¶
Python >=3.6 environment on your local machine
databricks-cli with a configured profile
This guide is based on Databricks Runtime 7.3 LTS ML.
Even if you don't need the ML libraries, we still recommend the ML-based runtime because it supports the %pip magic command.
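If these tools are not installed yet, a minimal setup sketch (assuming a pip-based local environment; pin versions as appropriate for your project) looks like this:
pip install databricks-cli  # provides the databricks command and profile management
pip install dbx             # the dbx tool itself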
Starting from a template (Python)¶
If you already have an existing project, you can skip this step and move directly to the next one.
For Python-based deployments, we recommend using cicd-templates as a quickstart. However, if the project structure defined in cicd-templates doesn't suit you, feel free to follow the instructions below for full customization.
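At the time of writing, cicd-templates projects are generated with cookiecutter; a sketch of that flow (check the cicd-templates README for the exact, current invocation) is:
pip install cookiecutter
cookiecutter https://github.com/databrickslabs/cicd-templates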
Configuring environments¶
In your shell, change into the project directory and configure dbx.
Note
dbx heavily relies on databricks-cli and uses the same set of profiles. Please configure your profiles in advance using the databricks configure command as described here.
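For example, a token-based profile named test can be created as follows (you will be prompted for the workspace host and a personal access token):
databricks configure --token --profile test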
Create a new environment configuration via the following command:
dbx configure \
--profile=test # name of your profile, omit if you would like to use the DEFAULT one
This command will create a project configuration file at .dbx/project.json. Feel free to repeat this command multiple times to reconfigure the environment.
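The generated file maps each dbx environment to a databricks-cli profile plus the workspace and artifact locations used during deployments. The exact fields depend on your dbx version; a typical project.json looks roughly like this (names and paths are illustrative):
{
    "environments": {
        "test": {
            "profile": "test",
            "workspace_dir": "/Shared/dbx/projects/sample-project",
            "artifact_location": "dbfs:/dbx/sample-project"
        }
    }
}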
Preparing Deployment Config¶
The next step is to configure your deployment objects. To make this process easy and flexible, we support two options for defining the configuration.
JSON: conf/deployment.json. This is the default file, which will be picked up automatically.
YAML: conf/deployment.yaml. To use a [ yaml | yml ] file, you will need to specify it explicitly using the --deployment-file=./conf/deployment.yaml option.
Note
If your deployment config contains duplicated parts (cluster definitions, retry configs, permissions, or anything else) and the duplication becomes hard to manage, we recommend using either YAML or Jsonnet. YAML is supported natively by dbx, whereas with Jsonnet you are responsible for generating the JSON file yourself through the Jsonnet compilation process.
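For the Jsonnet route, a minimal sketch (the deployment.jsonnet file name is just an illustration) is to compile the source into the JSON file that dbx then picks up:
# compile the Jsonnet source into the default JSON deployment file
jsonnet conf/deployment.jsonnet > conf/deployment.json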
JSON¶
By default, the deployment configuration is stored in conf/deployment.json.
The main idea of the deployment file is to provide a flexible way to configure a job together with its dependencies.
You can use multiple deployment files, providing the file name to dbx deploy via the --deployment-file=/path/to/file.json option.
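For example, to deploy from a separate staging config (the file name here is purely illustrative):
dbx deploy \
    --deployment-file=./conf/deployment-staging.json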
Here are some samples of deployment files for different cloud providers:
AWS:
{
    "default": {
        "jobs": [
            {
                "name": "your-job-name",
                "new_cluster": {
                    "spark_version": "7.3.x-cpu-ml-scala2.12",
                    "node_type_id": "i3.xlarge",
                    "aws_attributes": {
                        "first_on_demand": 0,
                        "availability": "SPOT"
                    },
                    "num_workers": 2
                },
                "libraries": [],
                "max_retries": 0,
                "spark_python_task": {
                    "python_file": "tests/deployment-configs/placeholder_1.py"
                }
            }
        ]
    }
}
Azure:
{
    "default": {
        "jobs": [
            {
                "name": "your-job-name",
                "new_cluster": {
                    "spark_version": "7.3.x-cpu-ml-scala2.12",
                    "node_type_id": "Standard_F4s",
                    "num_workers": 2
                },
                "libraries": [],
                "max_retries": 0,
                "spark_python_task": {
                    "python_file": "path/to/entrypoint.py"
                }
            }
        ]
    }
}
GCP:
{
    "default": {
        "jobs": [
            {
                "name": "your-job-name",
                "new_cluster": {
                    "spark_version": "7.3.x-cpu-ml-scala2.12",
                    "node_type_id": "n1-highmem-4",
                    "num_workers": 2
                },
                "libraries": [],
                "max_retries": 0,
                "spark_python_task": {
                    "python_file": "path/to/entrypoint.py"
                }
            }
        ]
    }
}
The expected structure of the deployment file is as follows:
{
    // you may have multiple environments defined per one deployment.json file
    "<environment-name>": {
        "jobs": [
            // here goes a list of jobs, every job is one dictionary
            {
                "name": "this-parameter-is-required!",
                // everything else is as per the Databricks Jobs API
                // however, you might reference any local file (such as an entrypoint or a job configuration)
                "spark_python_task": {
                    "python_file": "path/to/entrypoint.py" // references the entrypoint file relative to the project root directory
                },
                "parameters": [
                    "--conf-file",
                    "conf/test/sample.json" // references the configuration file relative to the project root directory
                ]
            }
        ]
    }
}
As you can see, we simply follow the Databricks Jobs API with one enhancement: any local files can be referenced and will be uploaded to DBFS in a versioned way during the dbx deploy command.
YAML¶
If you want to use YAML, you will have to specify the file explicitly via the --deployment-file=/path/to/file.yaml option, which is available on the dbx deploy and dbx execute commands.
You can define reusable blocks in YAML using anchors. Here is an example YAML file and its JSON equivalent:
Note
The YAML file needs to have a top-level environments key under which all environments are listed. The rest of the definition is the same as for the JSON config: it follows the Databricks Jobs API, with the same automatic versioning and upload of local files referenced within the config.
# http://yaml.org/spec/1.2/spec.html
# https://learnxinyminutes.com/docs/yaml/
custom:
  basic-cluster-props: &basic-cluster-props
    spark_version: "7.3.x-cpu-ml-scala2.12"
    node_type_id: "i3.xlarge"
    aws_attributes:
      first_on_demand: 0
      availability: "SPOT"
  basic-auto-scale-props: &basic-auto-scale-props
    autoscale:
      min_workers: 2
      max_workers: 5
  basic-static-cluster: &basic-static-cluster
    new_cluster:
      <<: *basic-cluster-props
      num_workers: 2
  basic-autoscale-cluster: &basic-autoscale-cluster
    new_cluster:
      <<: # merge these two maps and place them here.
        - *basic-cluster-props
        - *basic-auto-scale-props
  basic-cluster-libraries: &basic-cluster-libraries
    libraries:
      - "pydash"
environments:
  default:
    jobs:
      - name: "your-job-name"
        <<: *basic-static-cluster
        libraries: []
        max_retries: 0
        spark_python_task:
          python_file: "tests/deployment-configs/placeholder_1.py"
      - name: "your-job-name-2"
        <<: *basic-static-cluster
        libraries: []
        max_retries: 0
        spark_python_task:
          python_file: "tests/deployment-configs/placeholder_2.py"
      - name: "your-job-name-3"
        <<:
          - *basic-static-cluster
          - *basic-cluster-libraries
        max_retries: 5
        spark_python_task:
          python_file: "tests/deployment-configs/placeholder_2.py"
      - name: "your-job-name-4"
        <<:
          - *basic-autoscale-cluster
          - *basic-cluster-libraries
        max_retries: 5
        spark_python_task:
          python_file: "tests/deployment-configs/placeholder_2.py"
{
    "default": {
        "jobs": [
            {
                "name": "your-job-name",
                "new_cluster": {
                    "spark_version": "7.3.x-cpu-ml-scala2.12",
                    "node_type_id": "i3.xlarge",
                    "aws_attributes": {
                        "first_on_demand": 0,
                        "availability": "SPOT"
                    },
                    "num_workers": 2
                },
                "libraries": [],
                "max_retries": 0,
                "spark_python_task": {
                    "python_file": "tests/deployment-configs/placeholder_1.py"
                }
            },
            {
                "name": "your-job-name-2",
                "new_cluster": {
                    "spark_version": "7.3.x-cpu-ml-scala2.12",
                    "node_type_id": "i3.xlarge",
                    "aws_attributes": {
                        "first_on_demand": 0,
                        "availability": "SPOT"
                    },
                    "num_workers": 2
                },
                "libraries": [],
                "max_retries": 0,
                "spark_python_task": {
                    "python_file": "tests/deployment-configs/placeholder_2.py"
                }
            },
            {
                "name": "your-job-name-3",
                "new_cluster": {
                    "spark_version": "7.3.x-cpu-ml-scala2.12",
                    "node_type_id": "i3.xlarge",
                    "aws_attributes": {
                        "first_on_demand": 0,
                        "availability": "SPOT"
                    },
                    "num_workers": 2
                },
                "libraries": [
                    "pydash"
                ],
                "max_retries": 5,
                "spark_python_task": {
                    "python_file": "tests/deployment-configs/placeholder_2.py"
                }
            },
            {
                "name": "your-job-name-4",
                "new_cluster": {
                    "spark_version": "7.3.x-cpu-ml-scala2.12",
                    "node_type_id": "i3.xlarge",
                    "aws_attributes": {
                        "first_on_demand": 0,
                        "availability": "SPOT"
                    },
                    "autoscale": {
                        "min_workers": 2,
                        "max_workers": 5
                    }
                },
                "libraries": [
                    "pydash"
                ],
                "max_retries": 5,
                "spark_python_task": {
                    "python_file": "tests/deployment-configs/placeholder_2.py"
                }
            }
        ]
    }
}
Interactive execution¶
Note
dbx expects that the cluster used for interactive execution supports the %pip and %conda magic commands.
The dbx execute command runs the given job on an interactive cluster. You need to provide either --cluster-id or --cluster-name, and the --job parameter.
dbx execute \
--cluster-name=some-name \
--job=your-job-name
You can also provide parameters to install .whl packages before launching the code from the source file, as well as to install dependencies from a pip-formatted requirements file or a conda environment yml config.
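As a hedged sketch, dependencies can be picked up from a requirements file like this (the --requirements-file flag name reflects older dbx releases and may differ in yours; consult dbx execute --help):
dbx execute \
    --cluster-name=some-name \
    --job=your-job-name \
    --requirements-file=requirements.txt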
Deployment¶
After you've configured the deployment file, it's time to perform an actual deployment:
dbx deploy \
--environment=test
You can optionally provide a requirements.txt file; all requirements will be automatically added to the job definition. Please refer to the full description of the deploy command in the CLI section for more options on setup.
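A sketch of such a deployment (again, the --requirements-file flag name may vary with your dbx version; see dbx deploy --help):
dbx deploy \
    --environment=test \
    --requirements-file=requirements.txt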
Launch¶
Finally, after deploying all your job-related files, you can launch the job via the following command:
dbx launch --environment=test --job=sample
Please refer to the full description of the launch command in the CLI section for more options.
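If you want the launch to be traced until the run reaches a terminal state, dbx launch supports a --trace flag (behaviour may differ slightly between dbx versions):
dbx launch --environment=test --job=sample --trace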