Add local execution mode for ML training #410
Conversation
Pull request overview
Adds a “local execution” path for ML model training initiated from the dashboard, controlled via a new local_execution.ml_training section in the experiment config, so training can run without submitting jobs through the Superfacility API/Perlmutter.
Changes:
- Introduces `state.model_training_local` and derives it from `local_execution.ml_training` in the experiment config.
- Extends `ModelManager` to route training either to Perlmutter (remote) or to a local `python ml/train_model.py ...` subprocess.
- Updates the Train button disable logic to only require Perlmutter to be active for remote training.
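The routing described in these bullets can be sketched roughly as follows. This is a hedged, self-contained sketch, not the actual `ModelManager` from the PR: the constructor signature, the kernel method names, and the return values are assumptions for illustration only.

```python
import asyncio

# Minimal sketch of the local/remote routing described above.
# The real kernels would spawn `python ml/train_model.py ...` locally,
# or submit a job through the Superfacility API, respectively.
class ModelManager:
    def __init__(self, model_training_local: bool):
        self.model_training_local = model_training_local

    async def training_kernel(self):
        # Gate the training path by the local-execution flag.
        if self.model_training_local:
            return await self._training_kernel_local()
        return await self._training_kernel_sfapi()

    async def _training_kernel_local(self):
        return "local"   # placeholder for the local subprocess path

    async def _training_kernel_sfapi(self):
        return "sfapi"   # placeholder for the Superfacility API path

route = asyncio.run(ModelManager(model_training_local=True).training_kernel())
print(route)  # prints "local"
```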
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `ml/train_model.py` | Minor training log message capitalization. |
| `dashboard/state_manager.py` | Initializes new `state.model_training_local` flag. |
| `dashboard/model_manager.py` | Adds local-training subprocess kernel, refactors config prep, and gates training path by `model_training_local`. |
| `dashboard/app.py` | Reads `local_execution.ml_training` from config and passes the flag into `ModelManager`. |
Configuration options could be
…L#410) Replace `__is_neural_network`, `__is_gaussian_process`, and `__is_neural_network_ensemble` with direct comparisons against the existing `__model_type` attribute, which already holds the same information. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
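The refactor described in that commit message could look roughly like the sketch below. The type-string values (`"neural_network"`, `"gaussian_process"`) are assumptions for illustration and are not copied from the repository.

```python
# Sketch: replace stored boolean flags with direct comparisons
# against the single __model_type attribute.
class ModelManager:
    def __init__(self, model_type: str):
        self.__model_type = model_type

    @property
    def is_neural_network(self) -> bool:
        # Before the refactor this was a separate __is_neural_network flag.
        return self.__model_type == "neural_network"

    @property
    def is_gaussian_process(self) -> bool:
        return self.__model_type == "gaussian_process"

mm = ModelManager("gaussian_process")
print(mm.is_neural_network, mm.is_gaussian_process)  # prints "False True"
```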
```python
            yaml.dump(config_dict, f)
        return config_path

    async def _training_kernel(self):
```
Should we rename this function?
```suggestion
    async def _training_kernel_sfapi(self):
```
Agree; initially I had only local true/false, but now we have local/sfapi/irapi. I will fix the name.
```python
        f"Starting local training: {train_model_path} --model {model_type}"
    )
    proc = await asyncio.create_subprocess_exec(
        "conda",
```
Why does this need to be formatted this way (one word per line)?
Could we instead have:

```python
proc = await asyncio.create_subprocess_exec(
    "conda run --no-capture-output -n synapse-ml",
    f"python {train_model_path} --config_file {config_path} --model {model}",
    cwd=...
```
See also: https://github.com/BLAST-AI-ML/synapse/blob/main/tests/test_ml_pipeline.py#L145
(it does not use asyncio but is otherwise similar)
I think this was formatted automatically by my pre-commit check, but I can see if I can force it otherwise.
Nope, I had misunderstood your question.
If I try your code, I get the following error:

```
Error occurred when executing local training: [Errno 2] No such file or directory: 'conda run --no-capture-output -n synapse-ml'
```
Here's the explanation I found:
`asyncio.create_subprocess_exec` does not parse strings like a shell. Its first positional argument is the executable path only, and every subsequent positional argument is a separate argv element. Here the call passes the entire string `"conda run --no-capture-output -n synapse-ml"` (spaces and all) as the executable name; the OS then tries to find a file literally named `conda run --no-capture-output -n synapse-ml`, which doesn't exist. The second positional argument has the same problem: `"python /path/to/train_model.py --config_file ... --model ..."` is passed as a single `argv[1]` string rather than being split into individual arguments. The fix is to pass each token as a separate argument.
Let me know if you manage to make it work on your end.
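To illustrate the explanation above, here is a runnable sketch of passing each token as a separate argv element. The commented-out call mirrors the fixed invocation from the discussion (the `conda run -n synapse-ml` environment and the path variables come from this thread); the live part of the sketch spawns the current Python interpreter instead, so it runs without conda installed.

```python
import asyncio
import sys

# Fixed call from the discussion, with each token as its own argument:
#
#     proc = await asyncio.create_subprocess_exec(
#         "conda", "run", "--no-capture-output", "-n", "synapse-ml",
#         "python", train_model_path,
#         "--config_file", config_path,
#         "--model", model,
#     )
#
# Runnable demonstration using the current interpreter:
async def run_split_args() -> str:
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", "print('argv split correctly')",
        stdout=asyncio.subprocess.PIPE,
    )
    out, _ = await proc.communicate()
    return out.decode().strip()

result = asyncio.run(run_split_args())
print(result)  # prints "argv split correctly"
```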
I think it would be good to merge this PR before we do more refactoring to plug in the IRI API connection, because this PR does build some underlying infrastructure.
```python
        print("Initializing model manager...")
        self.__model = None
        self.__model_type = model_type
        self.__model_training_mode = model_training_mode
```
Since we store this info in the state anyway, could we avoid storing it also in `__model_training_mode`? (and use state whenever needed instead?)
End-to-end workflow tested successfully in combination with #425 and using the "right" AmSC MLflow API key.
```diff
 class ModelManager:
-    def __init__(self, config_dict, model_type):
+    def __init__(self, config_dict, model_type, model_training_mode):
```
`model_training_mode` is not used within the constructor; we can simply drop this argument.
```python
    async def training_kernel(self):

    def _prepare_training_config(self, temp_dir):
        """Prepare a merged training config YAML in the given temp directory.
```
The term "merged" is a bit ambiguous here.
Maybe: "Prepare a training config YAML in the given temp directory, updated with information from the dashboard."
```diff
-def check_evaluate(config_dict, model_type):
+def check_evaluate(config_dict, model_type, model_training_mode):
```
I don't think that `model_training_mode` is changing anything in the logic now.
I would remove this argument, and instead import state (`from state_manager import state`) and set `state.model_training_mode = "local"`.
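The suggestion above could be sketched as follows. This is a hedged illustration only: `state` here is a stand-in for `from state_manager import state`, and the config keys used to decide the mode are assumptions, not the actual `check_evaluate` logic.

```python
from types import SimpleNamespace

# Stand-in for the dashboard's shared state singleton
# (`from state_manager import state` in the real code).
state = SimpleNamespace(model_training_mode="sfapi")

def check_evaluate(config_dict, model_type):
    # Instead of taking model_training_mode as an argument, set it on
    # the shared state when local execution is requested in the config.
    if config_dict.get("local_execution", {}).get("ml_training", False):
        state.model_training_mode = "local"
    # ... rest of the original validation logic would follow here ...
    return state.model_training_mode

mode = check_evaluate({"local_execution": {"ml_training": True}}, "neural_network")
print(mode)  # prints "local"
```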
Overview
Add a local execution mode for ML training, such that the training launched from the dashboard is executed locally instead of on Perlmutter through the Superfacility API.
This assumes the experiment configuration file has a new section of the form
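The original config snippet is not shown in this thread; a minimal sketch of what that section might look like (the `local_execution.ml_training` key comes from this PR, but the exact value shape is an assumption):

```yaml
local_execution:
  ml_training: true  # run training locally instead of via the Superfacility API
```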
To do
Related PRs