Add Notebook-scoped packages for command submits or notebook job run #1321

fedemgp wants to merge 3 commits into databricks:main
Conversation
tejassp-db left a comment
Have you taken this into account? The Databricks docs state "Starting with Databricks Runtime 13.0 %pip commands do not automatically restart the Python process. If you install a new package or update an existing package, you may need to use dbutils.library.restartPython() to see the new packages."
- https://docs.databricks.com/aws/en/libraries/notebooks-python-libraries#manage-libraries-with-pip-commands
- https://docs.databricks.com/aws/en/libraries/restart-python-process
Also, have you considered the recommendation on driver node sizes for notebook-scoped libraries?
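For reference, a minimal sketch of the pattern those docs describe, as it would appear in generated notebook source (the package names here are placeholders, not anything from this PR):

```python
%pip install some-package

# COMMAND ----------
# Restart the Python process so the freshly installed package is importable
# on DBR >= 13.0, per the docs linked above.
dbutils.library.restartPython()

# COMMAND ----------
import some_package  # runs in the restarted process
```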
sd-db left a comment
Thanks for the PR. I took a look at the code changes (didn't review the tests yet) and added comments.
```python
logger.debug("Submitting Python model using the Command API.")

# Prepare code with notebook-scoped package installation if needed
code_to_execute = self._prepare_code_with_packages(compiled_code)
```
We should ensure we only call this path if notebook_scoped_libraries is set.
Now the private function checks whether notebook_scoped_libraries is set before adding this header.
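A sketch of that guard, assuming the submitter keeps the flag and package list on self (the attribute names are assumptions, not the merged code):

```python
def _prepare_code_with_packages(self, compiled_code: str) -> str:
    # Hypothetical guard: return the compiled code unchanged unless
    # notebook-scoped libraries are enabled and packages were requested.
    if not self.notebook_scoped_libraries or not self.packages:
        return compiled_code
    pip_install_cmd = "%pip install " + " ".join(self.packages)
    return f"{pip_install_cmd}\n\n# COMMAND ----------\n{compiled_code}"
```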
```python
    parsed_model.config.packages if parsed_model.config.notebook_scoped_libraries else []
)

def _prepare_code_with_packages(self, compiled_code: str) -> str:
```
Instead of duplicating this, it's better to extract it into a reusable function. Maybe something like:

```python
def prepare_code_with_packages(compiled_code: str, packages: list[str], separator: str = "\n\n") -> str:
```
Added it to the base class PythonSubmitter; I had to make some adjustments so any submitter is able to use it. A sketch of the helper is below.
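A minimal sketch of what the shared helper might look like, following the signature suggested above (the body is an assumption, not the code as merged):

```python
def prepare_code_with_packages(
    compiled_code: str, packages: list[str], separator: str = "\n\n"
) -> str:
    """Prepend a %pip install cell to compiled model code (sketch)."""
    if not packages:
        return compiled_code
    pip_install_cmd = "%pip install " + " ".join(packages)
    return f"{pip_install_cmd}{separator}# COMMAND ----------\n{compiled_code}"
```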
```python
# Only add packages to cluster-level libraries if not using notebook-scoped
if not notebook_scoped_libraries:
    for package in packages:
        if index_url:
```
Let's look at adding support for index_url in notebook-scoped packages as well. Since this is supported today, it would be weird if a model silently broke when a user who has set it switches to notebook-scoped packages.
I already added support for index_url in notebook-scoped packages as well.
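One way the %pip command could honor it, mirroring pip's standard --index-url flag (the variable names are assumptions):

```python
# Hypothetical: pass the configured index_url through to %pip as well,
# so behavior matches the existing cluster-level library path.
pip_install_cmd = "%pip install "
if index_url:
    pip_install_cmd += f"--index-url {index_url} "
pip_install_cmd += " ".join(packages)
```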
```python
logger.debug(f"Adding notebook-scoped package installation: {pip_install_cmd}")

# Prepend the pip install command to the compiled code
return f"{pip_install_cmd}\n\n# COMMAND ----------\n{compiled_code}"
```
As recommended by @tejassp-db, let's add dbutils.library.restartPython() after the pip_install_cmd.
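A sketch of the generated source once the restart is added; the restart sits in its own cell so the model code runs in the refreshed Python process (the exact cell layout is an assumption the end-to-end tests would need to confirm):

```python
return (
    f"{pip_install_cmd}\n\n"
    "# COMMAND ----------\n"
    "dbutils.library.restartPython()\n\n"
    "# COMMAND ----------\n"
    f"{compiled_code}"
)
```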
```python
return compiled_code

# Build the %pip install command for notebook-scoped packages
pip_install_cmd = "%pip install " + " ".join(self.packages)
```
Maybe try %pip install -q {packages} to reduce noise in the notebook output.
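That would be a small tweak to the command construction sketched above (-q is a standard pip flag that suppresses progress output):

```python
# Hypothetical: quiet install to cut pip's progress noise in the notebook
pip_install_cmd = "%pip install -q " + " ".join(self.packages)
```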
I didn't take the recommendation to restart the Python process into account, thanks for that. I will refactor this draft PR to make it more production-ready (avoiding duplicated code) and incorporate it. As for how large the driver node should be, this solution is agnostic to cluster size; it depends on the team that previously defined the cluster (whether an all-purpose cluster, a job cluster, or anything else).
Force-pushed from e96a5f0 to 19d86fc
Signed-off-by: Federico Manuel Gomez Peter <federico.gomez@payclip.com>
Force-pushed from 19d86fc to 6fa5301
To be honest, I was not able to run the end-to-end tests; I will try to do it tomorrow.
Signed-off-by: Federico Manuel Gomez Peter <federico.gomez@payclip.com>
Resolves #1320
Description
Add a feature to install packages in a notebook-scoped environment for the PythonCommandSubmitter and PythonNotebookUploader classes.
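For illustration, a Python model could opt in roughly like this (a hypothetical sketch: packages is the existing config key, notebook_scoped_libraries is the new flag from this PR):

```python
def model(dbt, session):
    dbt.config(
        packages=["pandas==2.2.2"],       # installed via %pip at run time
        notebook_scoped_libraries=True,   # new flag added by this PR
    )
    return session.table("samples.nyctaxi.trips").limit(10)
```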
Execution Test
I made the changes and tested them in our environment, verifying that the compiled code now has the prepended package installation:
- All-purpose cluster
- Serverless cluster
- Job cluster
Checklist
I have updated CHANGELOG.md and added information about my change to the "dbt-databricks next" section.