Quickstart

Prerequisites

This guide is based on Databricks Runtime 7.3 LTS ML. Even if you don’t need the ML libraries, we still recommend using an ML-based runtime because of its %pip magic support.

Installing dbx

Install dbx via pip:

pip install dbx

Starting from a template (Python)

If you already have an existing project, you can skip this step and move directly to the next one.

dbx comes with a set of pre-defined templates and a command to use them straight away. However, if the project structure defined in the provided templates doesn’t suit you, feel free to follow the instructions below for full customization.
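
As a sketch, recent dbx releases provide a dbx init command that scaffolds a new project from the built-in templates interactively. The exact command name and the available templates depend on your installed dbx version, so verify it against dbx --help before relying on it:

dbx init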

Configuring environments

Move the shell into the project directory and configure dbx.

Note

dbx heavily relies on databricks-cli and uses the same set of profiles. Please configure your profiles in advance using the databricks configure command as described here.
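
For example, a token-based profile named test can be created as follows; databricks-cli will prompt you for the workspace host and a personal access token:

databricks configure --token --profile test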

Create a new environment configuration via the following command:

dbx configure \
    --profile=test # name of your profile, omit if you would like to use the DEFAULT one

This command creates a project configuration file at .dbx/project.json. Feel free to repeat this command multiple times to reconfigure the environment.
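
As an illustration, the generated file typically looks roughly like the following. The exact fields may vary between dbx versions, and the workspace and artifact paths below are placeholders:

{
    "environments": {
        "default": {
            "profile": "test",
            "workspace_dir": "/Shared/dbx/projects/sample-project",
            "artifact_location": "dbfs:/dbx/sample-project"
        }
    }
}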

Preparing Deployment Config

The next step is to configure your deployment objects. The main idea of the deployment file is to provide a flexible way to configure jobs together with their dependencies. To keep this process simple and flexible, two options are supported for defining the configuration:

  1. JSON: conf/deployment.json

  2. YAML: conf/deployment.(yml|yaml)

If one of the files above exists at that location relative to the project root directory, it will be auto-discovered; otherwise you need to explicitly point dbx deploy at the file using the option --deployment-file=./path/to/file.(json|yml|yaml). The --deployment-file option also lets you maintain multiple different deployment files and choose one per deployment.
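
For example, a deployment against a non-default configuration file might look like this (the file name below is illustrative):

dbx deploy \
    --environment=default \
    --deployment-file=./conf/deployment-staging.yml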

Note

Within the deployment config, if you find duplicated parts such as cluster definitions, retry configs, or permissions, and the duplication is becoming hard to manage, we recommend using either YAML or Jsonnet.

YAML is supported natively by dbx, whereas with Jsonnet you are responsible for generating the JSON file through the Jsonnet compilation process.
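
A minimal sketch of the Jsonnet route is shown below. The file name and the compilation step are up to you, since dbx only reads the resulting JSON; here the file is compiled manually with, for example, jsonnet conf/deployment.jsonnet > conf/deployment.json:

// conf/deployment.jsonnet -- shared definitions live in a local binding
local basicCluster = {
  spark_version: "7.3.x-cpu-ml-scala2.12",
  node_type_id: "i3.xlarge",
  num_workers: 2,
};

{
  "default": {
    jobs: [
      {
        name: "your-job-name",
        new_cluster: basicCluster,
        libraries: [],
        max_retries: 0,
        spark_python_task: {
          python_file: "file://tests/deployment-configs/placeholder_1.py",
        },
      },
    ],
  },
}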

Note

Since version 0.4.1, dbx supports Jinja2 rendering for JSON- and YAML-based configurations.
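
As a sketch of what this enables, the fragment below templates a cluster property from an environment variable. It assumes that your dbx version exposes environment variables through an env object in Jinja templates and discovers Jinja configs by a .j2 file suffix; both the variable name DBX_NUM_WORKERS and the file naming are assumptions to verify against the documentation for your release:

// fragment of conf/deployment.json.j2 (file name is an assumption)
"new_cluster": {
    "spark_version": "7.3.x-cpu-ml-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": {{ env['DBX_NUM_WORKERS'] | default(2) }}
}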

JSON

Here is a sample deployment file based on an AWS workspace; node types and cloud-specific attributes will differ for other cloud providers:

{
    "default": {
        "jobs": [
            {
                "name": "your-job-name",
                "new_cluster": {
                    "spark_version": "7.3.x-cpu-ml-scala2.12",
                    "node_type_id": "i3.xlarge",
                    "aws_attributes": {
                        "first_on_demand": 0,
                        "availability": "SPOT"
                    },
                    "num_workers": 2
                },
                "libraries": [],
                "max_retries": 0,
                "spark_python_task": {
                    "python_file": "file://tests/deployment-configs/placeholder_1.py"
                }
            }
        ]
    }
}

The expected structure of the deployment file is the following:

{
    // you may have multiple environments defined per one deployment.json file
    "<environment-name>": {
        "jobs": [
            // here goes a list of jobs, every job is one dictionary
            {
                "name": "this-parameter-is-required!",
                // everything else is as per Databricks Jobs API
                // however, you might reference any local file (such as entrypoint or job configuration)
                "spark_python_task": {
                    "python_file": "path/to/entrypoint.py" // references entrypoint file relatively to the project root directory
                },
                "parameters": [
                    "--conf-file",
                    "conf/test/sample.json" // references configuration file relatively to the project root directory
                ]
            }
        ]
    }
}

As you can see, we simply follow the Databricks Jobs API with one enhancement: any local file can be referenced and will be uploaded to DBFS in a versioned way during the dbx deploy command.

YAML

You can define reusable definitions in YAML using anchors and aliases. Here is an example:

Note

The YAML file needs to have a top-level environments key under which all environments are listed. The rest of the definition is the same as for the JSON config: it follows the Databricks Jobs API, with the same automatic versioning and upload of local files referenced within the config.

# http://yaml.org/spec/1.2/spec.html
# https://learnxinyminutes.com/docs/yaml/

custom:
  basic-cluster-props: &basic-cluster-props
    spark_version: "7.3.x-cpu-ml-scala2.12"
    node_type_id: "some-node-type"

  basic-static-cluster: &basic-static-cluster
    new_cluster:
      <<: *basic-cluster-props
      num_workers: 2

environments:
  default:
    jobs:
      - name: "your-job-name"
        tasks:
          - task_key: "first-task"
            <<: *basic-static-cluster
            spark_python_task:
              python_file: "file://placeholder_1.py"
          - task_key: "second-task"
            <<: *basic-static-cluster
            spark_python_task:
              python_file: "file://placeholder_2.py"
            depends_on:
              - task_key: "second-task"

Interactive execution

Note

dbx expects that the cluster used for interactive execution supports the %pip and %conda magic commands.

The dbx execute command runs a given job on an interactive cluster. You need to provide either --cluster-id or --cluster-name, and a --job parameter.

dbx execute \
    --cluster-name=some-name \
    --job=your-job-name

You can also provide parameters to install .whl packages before launching the code from the source file, as well as to install dependencies from a pip-formatted requirements file or a conda environment YAML config.
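
A sketch of such an invocation, assuming your dbx version exposes a --requirements-file option on dbx execute (check dbx execute --help for the exact flags in your release):

dbx execute \
    --cluster-name=some-name \
    --job=your-job-name \
    --requirements-file=requirements.txt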

Deployment

After you’ve configured the deployment file, it’s time to perform an actual deployment:

dbx deploy \
    --environment=test

You can optionally provide a requirements.txt file; all requirements will be automatically added to the job definition. Please refer to the full description of the deploy command in the CLI section for more options on setup.
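
For example, assuming your dbx version accepts a --requirements-file option on dbx deploy (verify against dbx deploy --help):

dbx deploy \
    --environment=test \
    --requirements-file=requirements.txt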

Launch

Finally, after deploying all your job-related files, you can launch the job via the following command:

dbx launch --environment=test --job=sample

Please refer to the full description of the launch command in the CLI section for more options.
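
If you want the command to follow the run until it finishes, recent dbx versions provide a --trace flag; this is an assumption to verify against dbx launch --help for your release:

dbx launch --environment=test --job=sample --trace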