Skip to content

Add local execution mode for ML training#410

Merged
RemiLehe merged 25 commits intoBLAST-AI-ML:mainfrom
EZoni:local_mode_training
Apr 13, 2026
Merged

Add local execution mode for ML training#410
RemiLehe merged 25 commits intoBLAST-AI-ML:mainfrom
EZoni:local_mode_training

Conversation

@EZoni
Copy link
Copy Markdown
Member

@EZoni EZoni commented Mar 25, 2026

Overview

Add a local execution mode for ML training, such that the training launched from the dashboard is executed locally instead of on Perlmutter through the Superfacility API.

This assumes the experiment configuration file has a new section of the form

execution_mode:
  ml_training: local
  simulation: sfapi

To do

  • Review code changes and test end-to-end workflow.
  • Define format and naming of the new section in the configuration file.
  • Define where the locally-trained model is saved: standard MLflow server.
  • Control the execution mode through a new section in the configuration file.
  • Rebase after Rename model_type_tag to model_type in dashboard code #409.

Related PRs

@EZoni EZoni added ml Changes related to the ML models dashboard Changes related to the dashboard labels Mar 25, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a “local execution” path for ML model training initiated from the dashboard, controlled via a new local_execution.ml_training section in the experiment config, so training can run without submitting jobs through the Superfacility API/Perlmutter.

Changes:

  • Introduces state.model_training_local and derives it from local_execution.ml_training in the experiment config.
  • Extends ModelManager to route training either to Perlmutter (remote) or to a local python ml/train_model.py ... subprocess.
  • Updates the Train button disable logic to only require Perlmutter to be active for remote training.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

File Description
ml/train_model.py Minor training log message capitalization.
dashboard/state_manager.py Initializes new state.model_training_local flag.
dashboard/model_manager.py Adds local-training subprocess kernel, refactors config-prep, and gates training path by model_training_local.
dashboard/app.py Reads local_execution.ml_training from config and passes the flag into ModelManager.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread dashboard/model_manager.py Outdated
Comment thread dashboard/model_manager.py
Comment thread dashboard/model_manager.py
Comment thread dashboard/model_manager.py Outdated
Comment thread dashboard/model_manager.py
Comment thread dashboard/model_manager.py
Comment thread dashboard/app.py Outdated
Comment thread dashboard/app.py Outdated
Comment thread dashboard/model_manager.py
@EZoni
Copy link
Copy Markdown
Member Author

EZoni commented Mar 26, 2026

Configuration options could be local, sfapi, iriapi.

Comment thread dashboard/model_manager.py
RemiLehe added a commit to RemiLehe/synapse that referenced this pull request Mar 27, 2026
…L#410)

Replace __is_neural_network, __is_gaussian_process, and
__is_neural_network_ensemble with direct comparisons against the
existing __model_type attribute, which already holds the same
information.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread tests/check_model.py Outdated
@EZoni EZoni changed the title [WIP] Add local execution mode for ML training Add local execution mode for ML training Apr 1, 2026
@EZoni EZoni requested a review from RemiLehe April 1, 2026 23:00
Comment thread dashboard/app.py
Comment thread dashboard/model_manager.py Outdated
yaml.dump(config_dict, f)
return config_path

async def _training_kernel(self):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we rename this function?

Suggested change
async def _training_kernel(self):
async def _training_kernel_sfapi(self):

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, initially I had only local true/false, now we have local/sfapi/irapi. I will fix the name.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied in b9ca5b4.

f"Starting local training: {train_model_path} --model {model_type}"
)
proc = await asyncio.create_subprocess_exec(
"conda",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this need to be formatted this way (one word per line)?
Could we instead have:

proc = await asyncio.create_subprocess_exec(
    "conda run --no-capture-output -n synapse-ml", 
    f"python {train_model_path} --config_file {config_path} --model {model}",
    cwd=...

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See also: https://github.com/BLAST-AI-ML/synapse/blob/main/tests/test_ml_pipeline.py#L145
(it does not use asyncio but is otherwise similar)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was formatted automatically by my pre-commit check, but I can see if I can force it otherwise.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, I had misunderstood your question.

If I try your code, I get the following error:

Error occurred when executing local training: [Errno 2] No such file or directory: 'conda run --no-capture-output -n synapse-ml'

Here's the explanation I found:

asyncio.create_subprocess_exec does not parse strings like a shell. Its first positional argument is the executable path only, and every subsequent positional argument is a separate argv element. Here the call passes the entire string "conda run --no-capture-output -n synapse-ml" (spaces and all) as the executable name - the OS then tries to find a file literally named conda run --no-capture-output -n synapse-ml, which doesn't exist. The second positional argument has the same problem: "python /path/to/train_model.py --config_file ... --model ..." is passed as a single argv[1] string rather than being split into individual arguments. The fix is to pass each token as a separate argument

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if you manage to make it work on your end.

@EZoni
Copy link
Copy Markdown
Member Author

EZoni commented Apr 8, 2026

I think it would be good to merge this PR before we do more refactoring to plug in the IRI API connection, because this PR does build some underlying infrastructure.

Comment thread dashboard/model_manager.py Outdated
print("Initializing model manager...")
self.__model = None
self.__model_type = model_type
self.__model_training_mode = model_training_mode
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we store this info in the state anyway, could we avoid storing it also in __model_training_mode? (and use state whenever needed instead?)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied in 20a50b9.

@EZoni EZoni requested a review from RemiLehe April 9, 2026 20:48
@EZoni
Copy link
Copy Markdown
Member Author

EZoni commented Apr 10, 2026

End-to-end workflow tested successfully in combination with #425 and using the "right" AmSC MLflow API key.

Comment thread dashboard/app.py
Comment thread dashboard/model_manager.py Outdated

class ModelManager:
def __init__(self, config_dict, model_type):
def __init__(self, config_dict, model_type, model_training_mode):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

model_training_mode is not used within the constructors ; we can simply drop this argument.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied in 91fcdf9.

Comment thread dashboard/model_manager.py Outdated

async def training_kernel(self):
def _prepare_training_config(self, temp_dir):
"""Prepare a merged training config YAML in the given temp directory.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The term "merged" is a bit ambiguous here.

Maybe Prepare a training config YAML in the given temp directory, updated with information from the dashboard

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied in 8a1f00a.

Comment thread tests/check_model.py Outdated


def check_evaluate(config_dict, model_type):
def check_evaluate(config_dict, model_type, model_training_mode):
Copy link
Copy Markdown
Contributor

@RemiLehe RemiLehe Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that model_training_mode is changing anything in the logic now.

I would remove this argument, and instead import state (from state_manager import state) and set state.model_training_mode = "local".

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied in 91fcdf9.

@EZoni EZoni requested a review from RemiLehe April 13, 2026 23:51
@RemiLehe RemiLehe merged commit c227338 into BLAST-AI-ML:main Apr 13, 2026
3 checks passed
@EZoni EZoni deleted the local_mode_training branch April 13, 2026 23:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dashboard Changes related to the dashboard ml Changes related to the ML models

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants