Packaging arbitrary files in Python Packages

Whilst writing a Python Package with dbx, you might be in a need to add some arbitrary files to your Python package.

Such arbitrary files can include:

*.sql files where your Spark SQL logic resides
small static data files that are used in your pipelines
test files, e.g. when you need to have a file in a specific format

Standard Python packaging tools allow to simply collect, combine and package such arbitrary files together with the main package code.

Note

This example is written for setup.py-based packaging. For tools like poetry and another packaging formats please check their respective docs.

Referencing files

First of all, we’ll need to reference files in the setup.py files.

Imagine having the following project structure:

.
├── <package-name>
│       ├── __init__.py
│       └── resources
│           ├── raw
│           │   └── username.csv
│           └── sql
│               └── create_table.sql
├── setup.py

It’s a good practice to keep all arbitrary files in a separate directory (in this case it’s located in <package-name>/resources.

In the setup.py the package_data field is responsible for referencing files from this folder:

from setuptools import setup
setup(
    ...
    package_data={'': ['resources/sql/*.sql', "resources/raw/*.csv"]},
    ...
)

Using the referenced files

To access the referenced files, do the following in Python:

import pkg_resources

raw_csv_path = pkg_resources.resource_filename(
    "<package-name>", "resources/raw/username.csv"
)
query_path = pkg_resources.resource_filename(
    "<package-name>", "resources/sql/create_table.sql"
)

The provided paths can be used to locally read these files for any purpose.