Describe the bug
We are trying to use All-Purpose clusters for dbt Python models. The `PythonCommandSubmitter` class currently ignores the `packages` defined in the model config.
We were able to work around this by using the `upload_notebook` flag (internally handled by `PythonNotebookUploader`), which installs the packages at the cluster's system scope.
This represents a significant problem for us: as the project grows, different Python models will require different packages or versions, eventually leading to dependency collisions. While this is less of an issue in production (where we use ephemeral Job Clusters), it severely hinders the Development Experience (DX). We want to use shared All-Purpose clusters in development to speed up iteration without package conflicts.
Databricks provides Notebook-Scoped libraries specifically to avoid this. Since dbt can already compile the code into a Databricks Notebook, the submitters should be updated to leverage this feature.
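For reference, notebook-scoped installation only requires a magic command at the top of the notebook. A compiled notebook could begin with a cell like the following (the restart call is needed before newly installed packages become importable):

```
%pip install <package>
dbutils.library.restartPython()
```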
Steps To Reproduce
You can reproduce the issue by running this simple model:

```python
import pandas as pd
from rapidfuzz import fuzz


def get_fuzz_ratio(a, b):
    return fuzz.ratio(a, b)


def model(dbt, session):
    """
    Simple test model to verify rapidfuzz works on the cluster.
    Should complete in under 5 seconds.
    """
    # Configuration - using the same cluster and package
    dbt.config(
        materialized="table",
        cluster_id="<an_all_purpose_cluster_id>",
        packages=['rapidfuzz==3.13.0'],
        submission_method='all_purpose_cluster',
        create_notebook=True,
        timeout=600,
    )

    # Create a tiny test dataset (just 3 rows)
    test_data = pd.DataFrame({
        'id': [1, 2, 3],
        'text1': ['hello world', 'databricks test', 'rapidfuzz check'],
        'text2': ['hello world!', 'databrick test', 'rapid fuzz check'],
    })

    # Compute simple fuzzy similarity scores
    test_data['similarity_score'] = [
        get_fuzz_ratio(a, b)
        for a, b in zip(test_data['text1'], test_data['text2'])
    ]
    return test_data
```
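The scoring step is a plain per-row comprehension, so it can be sanity-checked locally without a cluster. Here is a minimal sketch using the stdlib's `difflib.SequenceMatcher` as a stand-in for rapidfuzz (scaled to rapidfuzz's 0-100 range), just to show the pattern the model relies on:

```python
# Local sanity check of the model's scoring pattern; difflib is a
# stand-in for rapidfuzz, so no cluster or pip install is needed.
from difflib import SequenceMatcher


def get_fuzz_ratio(a, b):
    # rapidfuzz's fuzz.ratio returns 0-100; scale SequenceMatcher to match.
    return SequenceMatcher(None, a, b).ratio() * 100


text1 = ['hello world', 'databricks test', 'rapidfuzz check']
text2 = ['hello world!', 'databrick test', 'rapid fuzz check']
scores = [get_fuzz_ratio(a, b) for a, b in zip(text1, text2)]
# Near-identical strings score close to 100.
```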
- Set `create_notebook` to `False` on a "vanilla" cluster: the model fails because the package is not found.
- Set `create_notebook` to `True`: the model works, but the package is installed at the cluster's system level (affecting all other users and models).
Expected behavior
The compiled code should include a `%pip install <package>` command at the beginning of the notebook/script.
Ideally, we could introduce a flag like `use_notebook_scoped_libraries: true`. This would ensure that:
- Packages are isolated to the specific dbt run.
- The solution works across all compute types (Job Clusters, Serverless, and All-Purpose), since `%pip` is the standard on modern Databricks runtimes.
- The solution works regardless of the data security access mode (shared or single-user).
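To make the request concrete, here is a hypothetical sketch (not the actual dbt-databricks API; the helper name and signature are invented for illustration) of how a submitter could prepend notebook-scoped installs to the compiled source when the proposed flag is enabled:

```python
# Hypothetical sketch: prepend one `%pip install` magic per configured
# package so the libraries are scoped to the notebook session rather
# than installed at the cluster's system scope.
def with_notebook_scoped_packages(compiled_code: str, packages: list) -> str:
    if not packages:
        return compiled_code
    pip_cells = "\n".join(f"%pip install {pkg}" for pkg in packages)
    # A Python restart is required before the new packages are importable.
    return f"{pip_cells}\ndbutils.library.restartPython()\n\n{compiled_code}"
```

With `packages=['rapidfuzz==3.13.0']`, the compiled notebook would then start with `%pip install rapidfuzz==3.13.0` followed by the restart call, and the model body would run unchanged below it.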
Screenshots and log output
Running the model with `create_notebook=True` works, but the dependency persists in the cluster environment even after the job ends:
System information
The output of `dbt --version`:

```
(dbt-data-model) ➜  dbt_databricks git:(test/python-model) ✗ uv run dbt --version
Core:
  - installed: 1.11.2
  - latest:    1.11.2 - Up to date!

Plugins:
  - databricks: 1.11.4 - Up to date!
  - spark:      1.10.0 - Up to date!
```
The operating system you're using: macOS Tahoe 26.2
The output of `python --version`: Python 3.11.14