Packaging arbitrary files in Python Packages
Whilst writing a Python Package with dbx
, you might be in a need to add some arbitrary files to your Python package.
Such arbitrary files can include:
*.sql
files where your Spark SQL logic residessmall static data files that are used in your pipelines
test files, e.g. when you need to have a file in a specific format
Standard Python packaging tools allow to simply collect, combine and package such arbitrary files together with the main package code.
Note
This example is written for setup.py
-based packaging.
For tools like poetry
and another packaging formats please check their respective docs.
Referencing files
First of all, we’ll need to reference files in the setup.py
files.
Imagine having the following project structure:
.
├── <package-name>
│ ├── __init__.py
│ └── resources
│ ├── raw
│ │ └── username.csv
│ └── sql
│ └── create_table.sql
├── setup.py
It’s a good practice to keep all arbitrary files in a separate directory (in this case it’s located in <package-name>/resources
.
In the setup.py
the package_data
field is responsible for referencing files from this folder:
from setuptools import setup
setup(
...
package_data={'': ['resources/sql/*.sql', "resources/raw/*.csv"]},
...
)
Using the referenced files
To access the referenced files, do the following in Python:
import pkg_resources
raw_csv_path = pkg_resources.resource_filename(
"<package-name>", "resources/raw/username.csv"
)
query_path = pkg_resources.resource_filename(
"<package-name>", "resources/sql/create_table.sql"
)
The provided paths can be used to locally read these files for any purpose.