CLI Reference
dbx provides access to its functions in a CLI-oriented fashion. Each individual command has a detailed help screen accessible via dbx <command_name> --help.
We encourage you to use dbx both for local development and in CI/CD pipelines.
Note
dbx works with your PAT (Personal Access Token) in exactly the same way as databricks-cli. This means that if the following environment variables are defined:
DATABRICKS_HOST
DATABRICKS_TOKEN
dbx will use them to perform actions.
This allows you to securely store these variables in your CI/CD tool and access them from within the pipeline. In general, we don’t recommend storing your tokens in the config file inside the CI pipeline, since this might be insecure. For Azure-based environments, you can also consider using AAD-based authentication. For local development, please use Databricks CLI profiles - they are very convenient when you work with multiple environments.
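For example, a CI pipeline could export these variables before invoking dbx. The host and token values below are placeholders; in a real pipeline they would come from your CI tool’s secret store:

```shell
# Placeholder values - substitute your workspace URL and a secret-managed token
export DATABRICKS_HOST="https://my-workspace.cloud.databricks.com"
export DATABRICKS_TOKEN="dapi0123456789abcdef"
echo "Host configured: $DATABRICKS_HOST"
```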
dbx
dbx [OPTIONS] COMMAND [ARGS]...
Options
- --version
Show the version and exit.
configure
Configures the project environment in the current folder.
This command may be used multiple times to change the configuration of a given environment. If the project file (located in .dbx/project.json) does not exist, it will be initialized.
There is no strict requirement to configure the project file via this command; you can also edit it directly in any file editor.
dbx configure [OPTIONS]
Options
- --workspace-dir <workspace_dir>
Workspace directory for the MLflow experiment. If not provided, the default directory /Shared/dbx/projects/<current-folder-name> will be used.
- --artifact-location <artifact_location>
Artifact location in DBFS. If not provided, the default location dbfs:/dbx/<current-folder-name> will be used.
- -e, --environment <environment>
Environment name. If not provided, default will be used.
- --debug
Debug Mode. Shows full stack trace on error.
- --profile <profile>
CLI connection profile to use. The default profile is DEFAULT.
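As a sketch, a project file created by this command might look roughly like the following (field names may differ between dbx versions; the profile and paths shown are illustrative placeholders):

```json
{
  "environments": {
    "default": {
      "profile": "DEFAULT",
      "workspace_dir": "/Shared/dbx/projects/my-project",
      "artifact_location": "dbfs:/dbx/my-project"
    }
  }
}
```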
datafactory
Azure Data Factory integration utilities.
dbx datafactory [OPTIONS] COMMAND [ARGS]...
reflect
Reflects job definitions to Azure Data Factory.
During the reflection, the following actions will be performed:
- The input specs file will be parsed
- Per each defined cluster, a new linked service will be created
- Per each defined job, a job object will be reflected into the ADF pipeline
Please note that chaining jobs into a pipeline shall be done on the ADF side. No other steps in the Data Factory pipeline will be changed by execution of this command.
dbx datafactory reflect [OPTIONS]
Options
- --specs-file <specs_file>
Required Path to deployment result specification file
- --subscription-name <subscription_name>
Required Name of Azure subscription
- -g, --resource-group <resource_group>
Required Resource group name
- --factory-name <factory_name>
Required Factory name
- -n, --name <name>
Required Pipeline name
- --debug
Debug Mode. Shows full stack trace on error.
- -e, --environment <environment>
Environment name. If not provided, default will be used.
deploy
Deploy project to artifact storage.
This command takes the project in the current folder (the file .dbx/project.json shall exist) and performs a deployment to the given environment.
During the deployment, the following actions will be performed:
- The Python package will be built and stored in the dist/* folder (can be disabled via --no-rebuild)
- The deployment configuration will be taken for the given environment (see -e for details) from the deployment file, defined via --deployment-file. You can specify the deployment file in JSON, YAML, or Jinja-based JSON or YAML; [.json, .yaml, .yml, .j2] are all valid file types.
- Per each job defined in --jobs, all local file references will be checked
- Any found file references will be uploaded to MLflow as artifacts of the current deployment run
- [DEPRECATED] If --requirements-file is provided, all requirements will be added to the job definition
- The wheel file location will be added to the libraries section (can be disabled with --no-package)
- If a job with the given name exists, it will be updated; if not, it will be created
- If --write-specs-to-file is provided, the final job spec will be written to the given file. For example, this option can look like this: --write-specs-to-file=.dbx/deployment-result.json.
dbx deploy [OPTIONS]
Options
- --job <job>
Deploy a single job by its name. --jobs and --job cannot be provided together.
- --jobs <jobs>
Comma-separated list of job names to be deployed. If not provided, all jobs from the deployment file will be deployed. --jobs and --job cannot be provided together.
- --requirements-file <requirements_file>
[DEPRECATED]
- --no-rebuild
Disable package rebuild
- --no-package
Do not add package reference into the job description
- --files-only
Do not create jobs, only deploy files.
- --tags <tags>
Additional tags for deployment in format (tag_name=tag_value). Option might be repeated multiple times.
- --write-specs-to-file <write_specs_to_file>
Writes final job definitions into a given local file. Helpful when final representation of a deployed job is needed for other integrations. Please note that output file will be overwritten if it exists.
- --branch-name <branch_name>
The name of the current branch. If not provided or empty, dbx will try to detect the branch name.
- --jinja-variables-file <jinja_variables_file>
Path to a file with variables for Jinja template. Only works when Jinja-based deployment file is used. Read more about this functionality in the Jinja2 support doc.
- --debug
Debug Mode. Shows full stack trace on error.
- -e, --environment <environment>
Environment name. If not provided, default will be used.
- --deployment-file <deployment_file>
Path to deployment file.
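For reference, a minimal deployment file for the default environment might look roughly like this (the job name, entrypoint file, and cluster settings are illustrative placeholders; consult the deployment file documentation for the exact schema of your dbx version):

```yaml
environments:
  default:
    jobs:
      - name: "my-sample-job"
        new_cluster:
          spark_version: "10.4.x-scala2.12"
          node_type_id: "i3.xlarge"
          num_workers: 1
        spark_python_task:
          python_file: "file://my_project/entrypoint.py"
```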
execute
Executes the given job on an interactive cluster.
This command is well suited for interactively executing your code on interactive clusters.
Warning
There are some limitations for dbx execute:
- Only clusters which support the %pip magic can work with execute
- Currently, only Python-based execution is supported
The following set of actions will be done during execution:
- If the interactive cluster is stopped, it will be automatically started
- The package will be rebuilt from the source (can be disabled via --no-rebuild)
- The job configuration will be taken from the deployment file for the given environment
- All referenced files will be uploaded to the MLflow experiment
- The code will be executed in a separate context. Other users can work with the same package on the same cluster without any limitations or overlapping.
- Execution results will be printed out in the shell. If the result was an error, the command will exit with an error code.
dbx execute [OPTIONS]
Options
- --cluster-id <cluster_id>
Cluster ID.
- --cluster-name <cluster_name>
Cluster name.
- --job <job>
Required Job name to be executed
- --task <task>
Task name (the task_key field) inside the job to be executed. Required if --job refers to a multitask job.
- --requirements-file <requirements_file>
[DEPRECATED]
- --no-rebuild
Disable package rebuild
- --no-package
Do not add package reference into the job description
- --upload-via-context
Upload files via execution context
- -e, --environment <environment>
Environment name. If not provided, default will be used.
- --debug
Debug Mode. Shows full stack trace on error.
- --deployment-file <deployment_file>
Path to deployment file.
- --jinja-variables-file <jinja_variables_file>
Path to a file with variables for Jinja template. Only works when Jinja-based deployment file is used. Read more about this functionality in the Jinja2 support doc.
init
Generates a new project from a template.
Launching this command without the --template-parameters argument will open a cookiecutter dialogue to enter the required parameters.
dbx init [OPTIONS]
Options
- --template <template>
Built-in dbx template used to kickoff the project.
- Options
python_basic
- --path <path>
External template used to kickoff the project. Cannot be used together with the --template option.
- --package <package>
Python package containing an external template used to kickoff the project. Cannot be used together with the --template option.
- --checkout <checkout>
Checkout argument for cookiecutter. Used only if --path is used.
- -p, --parameters <parameters>
Additional parameters for project creation, in the format parameter=value.
- --no-input
- --debug
Debug Mode. Shows full stack trace on error.
launch
Finds the job deployment and launches it on an automated or interactive cluster.
This command will launch the given job by its name in a given environment.
Note
The job shall be deployed prior to being launched.
dbx launch [OPTIONS]
Options
- --job <job>
Required Job name.
- --trace
Trace the job until it finishes.
- --kill-on-sigterm
If provided, kills the job on SIGTERM (Ctrl+C) signal
- --existing-runs <existing_runs>
Strategy to handle existing active job runs.
Options behaviour:
- wait will wait for all existing job runs to be finished
- cancel will cancel all existing job runs
- pass will simply pass the check and try to launch the job directly
- Options
wait | cancel | pass
- --as-run-submit
Run the job as run submit.
- --tags <tags>
Additional tags to search for the latest deployment. Format: --tags="tag_name=tag_value". Option might be repeated multiple times.
- --parameters <parameters>
Parameters of the job. If provided, default job arguments will be overridden. Format: --parameters="parameter1=value1". Option might be repeated multiple times.
- --parameters-raw <parameters_raw>
Parameters of the job as a raw string. If provided, default job arguments will be overridden and the --parameters argument will be ignored. Example command: dbx launch --job="my-job-name" --parameters-raw='{"key1": "value1", "key2": 2}'. Please note that no parameter preprocessing will be done.
- --branch-name <branch_name>
The name of the current branch. If not provided or empty, dbx will try to detect the branch name.
- --include-output <include_output>
If provided, adds run output to the console output of the launch command. Please note that this option is only supported for Jobs V2.X+. For jobs created without a tasks section, output won’t be printed. If not provided, run output will be omitted.
Options behaviour:
- stdout will add stdout and stderr to the console output
- stderr will add only stderr to the console output
- Options
stdout | stderr
- -e, --environment <environment>
Environment name. If not provided, default will be used.
- --debug
Debug Mode. Shows full stack trace on error.
sync
Sync local files to Databricks and watch for changes, with support for syncing to either a path in DBFS or a Databricks Repo via the dbfs and repo subcommands. This enables one to incrementally sync local files to Databricks in order to enable quick, iterative development in an IDE with the ability to test changes almost immediately in Databricks notebooks.
Suppose you are using the Repos for Git integration feature
and have cloned a git repo within Databricks where you have Python notebooks stored as well as various Python
modules that the notebooks import. You can edit any of these files directly in Databricks.
The dbx sync repo command provides an additional option: edit the files in a local repo on your computer in an IDE of your choice and sync the changes to the repo in Databricks as you make them.
For example, when run from a local git clone, the following will sync all the files to an existing repo named myrepo in Databricks and watch for changes:
dbx sync repo -d myrepo
At the top of your notebook you can turn on autoreload so that execution of cells will automatically pick up the changes:
%load_ext autoreload
%autoreload 2
The dbx sync repo command syncs to a repo in Databricks. If that repo is a git clone, you can see the changes made to the files, as if you’d made the edits directly in Databricks. Alternatively, you can use dbx sync dbfs to sync the files to a path in DBFS. This keeps the files independent from the repos but still allows you to use them in notebooks, either in a repo or in notebooks existing in your workspace.
For example, when run from a local git clone in a myrepo directory under a user first.last@somewhere.com, the following will sync all the files to the DBFS path /tmp/users/first.last/myrepo:
dbx sync dbfs
The destination path can also be specified, as in: -d /tmp/myrepo.
When executing notebooks in a repo, the root of the repo is automatically added to the Python path so that
imports work relative to the repo root. This means that aside from turning on autoreload you don’t need to do
anything else special for the changes to be reflected in the cell’s execution. However, when syncing to DBFS,
for the imports to work you need to update the Python path to include this target directory you’re syncing to.
For example, to import from the /tmp/users/first.last/myrepo path used above, use the following at the top of your notebook:
import sys
if "/dbfs/tmp/users/first.last/myrepo" not in sys.path:
sys.path.insert(0, "/dbfs/tmp/users/first.last/myrepo")
The dbx sync commands have many options for controlling which files/directories to include in or exclude from syncing, which are documented below. For convenience, all patterns listed in a .gitignore at the source will be excluded from syncing. The .git directory is excluded as well.
dbx sync [OPTIONS] COMMAND [ARGS]...
dbfs
Syncs from a source directory to DBFS.
dbx sync dbfs [OPTIONS]
Options
- --use-gitignore, --no-use-gitignore
Controls whether the .gitignore is used to automatically exclude file/directories from syncing.
- --polling-interval <polling_interval_secs>
Use file system polling instead of file system events and set the polling interval (in seconds)
- --watch, --no-watch
Controls whether the tool should watch for file changes after the initial sync. With --watch, which is the default, it will watch for file system changes and rerun the sync whenever any changes occur to files or directories matching the filters. With --no-watch the tool will quit after the initial sync.
- -ep, --exclude-pattern <exclude_patterns>
A pattern specifying files and/or directories to exclude from syncing, relative to the source directory. This uses the same format as gitignore. For examples, see the documentation of --include-pattern.
- --allow-delete-unmatched, --disallow-delete-unmatched
Specifies how to handle files/directories that would be deleted in the remote destination because they don’t match the current set of filters.
For example, suppose you have used the option -i foo to sync only the foo directory and later quit the tool. Then suppose you restart the tool using -i bar to sync only the bar directory. In this situation, it is unclear whether your intention is to 1) sync over bar and remove foo in the destination, or 2) sync over bar and leave foo alone in the destination. Due to this ambiguity, the tool will ask to confirm your intentions.
To avoid having to confirm, you can use either of these options:
- --allow-delete-unmatched will delete files/directories in the destination that are not present locally with the current filters. So for the example above, this would remove foo in the destination when syncing with -i bar.
- --disallow-delete-unmatched will NOT delete files/directories in the destination that are not present locally with the current filters. So for the example above, this would leave foo in the destination when syncing with -i bar.
- -fip, --force-include-pattern <force_include_patterns>
A pattern specifying files and/or directories to sync, relative to the source directory, regardless of whether these files and/or directories would otherwise be excluded.
See the documentation of --include-pattern for usage.
- -ip, --include-pattern <include_patterns>
A pattern specifying files and/or directories to sync, relative to the source directory. This uses the same format as gitignore. When this option is used, no files or directories will be synced unless specifically included by this or other include options.
For example:
- foo will match any file or directory named foo anywhere under the source
- /foo/ will only match a directory named foo directly under the source
- *.py will match all Python files
- /foo/*.py will match all Python files directly under the foo directory
- /foo/**/*.py will match all Python files anywhere under the foo directory
You may also store a list of patterns inside a .syncinclude file under the source path. Patterns in this file will be used as the default patterns to include. This essentially behaves as the opposite of a gitignore file, but with the same format.
- -e, --exclude <exclude_dirs>
A directory to exclude from syncing, relative to the source directory. This directory must exist.
For example:
- -e foo will exclude the directory foo directly under the source directory from syncing
- -e foo/bar will exclude the directory foo/bar directly under the source directory from syncing
- -fi, --force-include <force_include_dirs>
A directory to sync, relative to the source directory. This directory must exist. When this option is used, no files or directories will be synced unless specifically included by this or other include options.
Unlike --include, this will sync a directory regardless of files/directories that are excluded from syncing. This can be useful when, for example, the .gitignore lists a directory that you want to have synced. The patterns in the .gitignore are used by default to exclude files/directories from syncing.
For example:
- -fi foo will sync the directory foo directly under the source directory
- -fi foo/bar will sync the directory foo/bar directly under the source directory
- -i, --include <include_dirs>
A directory to sync, relative to the source directory. This directory must exist. When this option is used, no files or directories will be synced unless specifically included by this or other include options.
For example:
- -i foo will sync the directory foo directly under the source directory
- -i foo/bar will sync the directory foo/bar directly under the source directory
- --dry-run
Log what the tool would do without making any changes.
- --full-sync
Ignores any existing sync state and syncs all files and directories matching the filters to the destination.
- -s, --source <source>
The local source path to sync from. If the current working directory is a git repo, then the tool by default uses that path as the source. Otherwise the source path will need to be specified.
- --profile <profile>
The Databricks CLI connection profile containing the host and API token to use to connect to Databricks.
- -d, --dest <dest_path>
A path in DBFS to sync to. For example, -d /tmp/project would sync from the local source path to the DBFS path /tmp/project.
Specifying this path is optional. By default the tool will sync to the destination /tmp/users/<user_name>/<source_base_name>. For example, given the local source path /foo/bar and the Databricks user first.last@somewhere.com, this would sync to /tmp/users/first.last/bar. This path is chosen as a safe default option that is unlikely to overwrite anything important.
When constructing this default destination path, the user name is determined using the Databricks API. If it cannot be determined, or to use a different user for the path, you may use the --user option.
- -u, --user <user_name>
Specify the user name to use when constructing the default destination path. This has no effect when --dest is already specified. If this is an email address then the domain is ignored. For example, -u first.last and -u first.last@somewhere.com will both result in first.last as the user name.
repo
Syncs from source directory to a Databricks Repo.
dbx sync repo [OPTIONS]
Options
- --use-gitignore, --no-use-gitignore
Controls whether the .gitignore is used to automatically exclude file/directories from syncing.
- --polling-interval <polling_interval_secs>
Use file system polling instead of file system events and set the polling interval (in seconds)
- --watch, --no-watch
Controls whether the tool should watch for file changes after the initial sync. With --watch, which is the default, it will watch for file system changes and rerun the sync whenever any changes occur to files or directories matching the filters. With --no-watch the tool will quit after the initial sync.
- -ep, --exclude-pattern <exclude_patterns>
A pattern specifying files and/or directories to exclude from syncing, relative to the source directory. This uses the same format as gitignore. For examples, see the documentation of --include-pattern.
- --allow-delete-unmatched, --disallow-delete-unmatched
Specifies how to handle files/directories that would be deleted in the remote destination because they don’t match the current set of filters.
For example, suppose you have used the option -i foo to sync only the foo directory and later quit the tool. Then suppose you restart the tool using -i bar to sync only the bar directory. In this situation, it is unclear whether your intention is to 1) sync over bar and remove foo in the destination, or 2) sync over bar and leave foo alone in the destination. Due to this ambiguity, the tool will ask to confirm your intentions.
To avoid having to confirm, you can use either of these options:
- --allow-delete-unmatched will delete files/directories in the destination that are not present locally with the current filters. So for the example above, this would remove foo in the destination when syncing with -i bar.
- --disallow-delete-unmatched will NOT delete files/directories in the destination that are not present locally with the current filters. So for the example above, this would leave foo in the destination when syncing with -i bar.
- -fip, --force-include-pattern <force_include_patterns>
A pattern specifying files and/or directories to sync, relative to the source directory, regardless of whether these files and/or directories would otherwise be excluded.
See the documentation of --include-pattern for usage.
- -ip, --include-pattern <include_patterns>
A pattern specifying files and/or directories to sync, relative to the source directory. This uses the same format as gitignore. When this option is used, no files or directories will be synced unless specifically included by this or other include options.
For example:
- foo will match any file or directory named foo anywhere under the source
- /foo/ will only match a directory named foo directly under the source
- *.py will match all Python files
- /foo/*.py will match all Python files directly under the foo directory
- /foo/**/*.py will match all Python files anywhere under the foo directory
You may also store a list of patterns inside a .syncinclude file under the source path. Patterns in this file will be used as the default patterns to include. This essentially behaves as the opposite of a gitignore file, but with the same format.
- -e, --exclude <exclude_dirs>
A directory to exclude from syncing, relative to the source directory. This directory must exist.
For example:
- -e foo will exclude the directory foo directly under the source directory from syncing
- -e foo/bar will exclude the directory foo/bar directly under the source directory from syncing
- -fi, --force-include <force_include_dirs>
A directory to sync, relative to the source directory. This directory must exist. When this option is used, no files or directories will be synced unless specifically included by this or other include options.
Unlike --include, this will sync a directory regardless of files/directories that are excluded from syncing. This can be useful when, for example, the .gitignore lists a directory that you want to have synced. The patterns in the .gitignore are used by default to exclude files/directories from syncing.
For example:
- -fi foo will sync the directory foo directly under the source directory
- -fi foo/bar will sync the directory foo/bar directly under the source directory
- -i, --include <include_dirs>
A directory to sync, relative to the source directory. This directory must exist. When this option is used, no files or directories will be synced unless specifically included by this or other include options.
For example:
- -i foo will sync the directory foo directly under the source directory
- -i foo/bar will sync the directory foo/bar directly under the source directory
- --dry-run
Log what the tool would do without making any changes.
- --full-sync
Ignores any existing sync state and syncs all files and directories matching the filters to the destination.
- -s, --source <source>
The local source path to sync from. If the current working directory is a git repo, then the tool by default uses that path as the source. Otherwise the source path will need to be specified.
- --profile <profile>
The Databricks CLI connection profile containing the host and API token to use to connect to Databricks.
- -d, --dest-repo <dest_repo>
Required The name of the Databricks Repo to sync to.
Repos exist in the Databricks workspace under a path of the form /Repos/<user>/<repo>. This specifies the <repo> portion of the path.
- -u, --user <user_name>
The user who owns the Databricks Repo to sync to.
Repos exist in the Databricks workspace under a path of the form /Repos/<user>/<repo>. This specifies the <user> portion of the path.
This is optional, as the user name is determined automatically using the Databricks API. If it cannot be determined, or to use a different user for the path, the user name may be specified using this option.