Jinja2 Support: variables, logic and loops

Basic template support

Since version 0.4.1 dbx supports Jinja2 rendering for JSON and YAML-based configurations. This allows you to use environment variables in the deployment, add variable-based conditions, and apply Jinja filters and for loops to make your deployments more flexible for CI pipelines.

To add Jinja2 support to your deployment file, please add the .j2 suffix to its name, for example deployment.yml.j2.

Deployment files stored at conf/deployment.(json|yml|yaml).j2 will be auto-discovered.

Please find examples of how to use Jinja2 templates below:

{
  "default": {
    "jobs": [
      {
        "name": "your-job-name",
        "timeout_seconds": "{{ env['TIMEOUT'] }}",
          {% if (env['ENVIRONMENT'] == "production") %}
            "email_notifications": {
              "on_failure": [
                "presetEmail@test.com",
                "test@test.com"
              ]
            },
          {% endif %}
        "new_cluster": {
          "spark_version": "7.3.x-cpu-ml-scala2.12",
          "node_type_id": "some-node-type",
          "aws_attributes": {
            "first_on_demand": 0,
            "availability": "{{ env['AVAILABILITY'] | default('SPOT') }}"
          },
          "num_workers": 2
        },
        "libraries": [],
          {% if (env['ENVIRONMENT'] == "production") %}
            "max_retries": {{ env['MAX_RETRY'] | default(-1) }},
          {% else %}
            "max_retries": {{ env['MAX_RETRY'] | default(3) }},
          {% endif %}
        "spark_python_task": {
          "python_file": "file://tests/deployment-configs/placeholder_1.py"
        }
      }
    ]
  }
}

Support for includes

Jinja2-based templates also support the include clause, which allows you to share common bits of configuration across multiple files and improve the modularity of your configurations.

For example, your main deployment file can look like this:

{
    "default": {
        "jobs": [
            {
                "name": "your-job-name",
                "new_cluster": {% include 'includes/cluster-test.json.j2' %},
                "libraries": [],
                "max_retries": 0,
                "spark_python_task": {
                    "python_file": "file://placeholder_1.py"
                }
            }
        ]
    }
}

In the includes folder you can then provide the cluster configuration component:

{
    "spark_version": "7.3.x-cpu-ml-scala2.12",
    "node_type_id": "some-node-type",
    "aws_attributes": {
        "first_on_demand": 0,
        "availability": "SPOT"
    },
    "num_workers": 2
}

Environment variables

Since version 0.6.0 dbx supports passing environment variables into the deployment configuration, giving you an additional level of flexibility. Environment variables can be passed into both JSON and YAML-based configurations written in Jinja2 template format. This allows you to parametrize the deployment and make it more flexible for CI pipelines.

{
    "default": {
        "jobs": [
            {
                "name": "your-job-name",
                "timeout_seconds": "{{ env['TIMEOUT'] }}",
                "email_notifications": {
                    "on_failure": [
                        "{{ env['ALERT_EMAIL'] | lower }}",
                        "presetEmail@test.com"
                    ]
                },
                "new_cluster": {
                    "spark_version": "7.3.x-cpu-ml-scala2.12",
                    "node_type_id": "some-node-type",
                    "aws_attributes": {
                        "first_on_demand": 0,
                        "availability": "{{ env['AVAILABILITY'] | default('SPOT') }}"
                    },
                    "num_workers": 2
                },
                "libraries": [],
                "max_retries": "{{ env['MAX_RETRY'] | default(3) }}",
                "spark_python_task": {
                    "python_file": "file://tests/deployment-configs/placeholder_1.py"
                }
            }
        ]
    }
}
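
The values referenced via env[...] are read from the environment of the process that runs dbx. As a minimal sketch of a CI step, assuming the deployment file is auto-discovered at conf/deployment.json.j2 and using hypothetical variable values:

# hypothetical values, set in the shell or CI environment before running dbx
export TIMEOUT=3600
export ALERT_EMAIL=Alerts@Test.com
export AVAILABILITY=SPOT
dbx deploy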

Variable file

Since version 0.6.6 dbx supports the --jinja-variables-file option, which passes variables from a file into the Jinja-based deployment file. Variables shall be stored in a YAML file and can then be referenced from inside the main Jinja definition.

Consider the following variables file:

TIMEOUT: 12000
ON_FAILURE_EMAILS:
  - example@dbx.com
  - presetEmail@test.com

Variables from this file can be referenced once its path is passed to dbx deploy via the --jinja-variables-file option.
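
For example, assuming the variables are stored at conf/vars.yml (a hypothetical path):

dbx deploy --jinja-variables-file=conf/vars.yml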

Referencing is done using the special var["VAR_NAME"] syntax:

{
    "default": {
        "jobs": [
            {
                "name": "your-job-name",
                "timeout_seconds": {{ var["TIMEOUT"] }},
                "email_notifications": {
                    "on_failure": [
                        {% for email in var["ON_FAILURE_EMAILS"] %}
                        "{{ email | lower }}"
                        {{ "," if not loop.last }}
                        {% endfor %}
                    ]
                },
                "new_cluster": {
                    "spark_version": "7.3.x-cpu-ml-scala2.12",
                    "node_type_id": "some-node-type",
                    "aws_attributes": {
                        "first_on_demand": 0,
                        "availability": "{{ var['AVAILABILITY'] | default('SPOT') }}"
                    },
                    "num_workers": 2
                },
                "libraries": [],
                "max_retries": {{ var["MAX_RETRY"] | default(3) }},
                "spark_python_task": {
                    "python_file": "tests/deployment-configs/placeholder_1.py"
                }
            }
        ]
    }
}

Variables from the file can also be combined with the environment variables mentioned above.
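
For instance, a single template can mix both sources. The snippet below is an illustrative sketch (field values are hypothetical), not part of the examples above:

"timeout_seconds": {{ var["TIMEOUT"] }},
"max_retries": {{ env['MAX_RETRY'] | default(3) }},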