Add local execution mode for ML training #410
Conversation
Pull request overview
Adds a “local execution” path for ML model training initiated from the dashboard, controlled via a new local_execution.ml_training section in the experiment config, so training can run without submitting jobs through the Superfacility API/Perlmutter.
Changes:
- Introduces `state.model_training_local` and derives it from `local_execution.ml_training` in the experiment config.
- Extends `ModelManager` to route training either to Perlmutter (remote) or to a local `python ml/train_model.py ...` subprocess.
- Updates the Train button disable logic to only require Perlmutter to be active for remote training.
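The routing described in these bullets can be sketched roughly as follows. This is a hedged, self-contained sketch, not the actual `ModelManager` from the PR: the constructor signature, the kernel method names, and the return values are assumptions for illustration only.

```python
import asyncio

# Minimal sketch of the local/remote routing described above.
# The real kernels would spawn `python ml/train_model.py ...` locally,
# or submit a job through the Superfacility API, respectively.
class ModelManager:
    def __init__(self, model_training_local: bool):
        self.model_training_local = model_training_local

    async def training_kernel(self):
        # Gate the training path by the local-execution flag.
        if self.model_training_local:
            return await self._training_kernel_local()
        return await self._training_kernel_sfapi()

    async def _training_kernel_local(self):
        return "local"   # placeholder for the local subprocess path

    async def _training_kernel_sfapi(self):
        return "sfapi"   # placeholder for the Superfacility API path

route = asyncio.run(ModelManager(model_training_local=True).training_kernel())
print(route)  # prints "local"
```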
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `ml/train_model.py` | Minor training log message capitalization. |
| `dashboard/state_manager.py` | Initializes new `state.model_training_local` flag. |
| `dashboard/model_manager.py` | Adds local-training subprocess kernel, refactors config prep, and gates training path by `model_training_local`. |
| `dashboard/app.py` | Reads `local_execution.ml_training` from config and passes the flag into `ModelManager`. |
Configuration options could be
…L#410) Replace `__is_neural_network`, `__is_gaussian_process`, and `__is_neural_network_ensemble` with direct comparisons against the existing `__model_type` attribute, which already holds the same information. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
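The refactor described in that commit message could look roughly like the sketch below. The type-string values (`"neural_network"`, `"gaussian_process"`) are assumptions for illustration and are not copied from the repository.

```python
# Sketch: replace stored boolean flags with direct comparisons
# against the single __model_type attribute.
class ModelManager:
    def __init__(self, model_type: str):
        self.__model_type = model_type

    @property
    def is_neural_network(self) -> bool:
        # Before the refactor this was a separate __is_neural_network flag.
        return self.__model_type == "neural_network"

    @property
    def is_gaussian_process(self) -> bool:
        return self.__model_type == "gaussian_process"

mm = ModelManager("gaussian_process")
print(mm.is_neural_network, mm.is_gaussian_process)  # prints "False True"
```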
```python
            yaml.dump(config_dict, f)
        return config_path

    async def _training_kernel(self):
```
Should we rename this function?
```suggestion
    async def _training_kernel_sfapi(self):
```
Agree; initially I had only local true/false, but now we have local/sfapi/irapi. I will fix the name.
```python
        f"Starting local training: {train_model_path} --model {model_type}"
    )
    proc = await asyncio.create_subprocess_exec(
        "conda",
```
Why does this need to be formatted this way (one word per line)?
Could we instead have:

```python
proc = await asyncio.create_subprocess_exec(
    "conda run --no-capture-output -n synapse-ml",
    f"python {train_model_path} --config_file {config_path} --model {model}",
    cwd=...
```
See also: https://github.com/BLAST-AI-ML/synapse/blob/main/tests/test_ml_pipeline.py#L145
(it does not use asyncio but is otherwise similar)
I think this was formatted automatically by my pre-commit check, but I can see if I can force it otherwise.
Nope, I had misunderstood your question.
If I try your code, I get the following error:

```
Error occurred when executing local training: [Errno 2] No such file or directory: 'conda run --no-capture-output -n synapse-ml'
```
Here's the explanation I found:
`asyncio.create_subprocess_exec` does not parse strings like a shell. Its first positional argument is the executable path only, and every subsequent positional argument is a separate argv element. Here the call passes the entire string `"conda run --no-capture-output -n synapse-ml"` (spaces and all) as the executable name; the OS then tries to find a file literally named `conda run --no-capture-output -n synapse-ml`, which doesn't exist. The second positional argument has the same problem: `"python /path/to/train_model.py --config_file ... --model ..."` is passed as a single `argv[1]` string rather than being split into individual arguments. The fix is to pass each token as a separate argument.
Let me know if you manage to make it work on your end.
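To illustrate the explanation above, here is a runnable sketch of passing each token as a separate argv element. The commented-out call mirrors the fixed invocation from the discussion (the `conda run -n synapse-ml` environment and the path variables come from this thread); the live part of the sketch spawns the current Python interpreter instead, so it runs without conda installed.

```python
import asyncio
import sys

# Fixed call from the discussion, with each token as its own argument:
#
#     proc = await asyncio.create_subprocess_exec(
#         "conda", "run", "--no-capture-output", "-n", "synapse-ml",
#         "python", train_model_path,
#         "--config_file", config_path,
#         "--model", model,
#     )
#
# Runnable demonstration using the current interpreter:
async def run_split_args() -> str:
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", "print('argv split correctly')",
        stdout=asyncio.subprocess.PIPE,
    )
    out, _ = await proc.communicate()
    return out.decode().strip()

result = asyncio.run(run_split_args())
print(result)  # prints "argv split correctly"
```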
I think it would be good to merge this PR before we do more refactoring to plug in the IRI API connection, because this PR does build some underlying infrastructure.
```python
        print("Initializing model manager...")
        self.__model = None
        self.__model_type = model_type
        self.__model_training_mode = model_training_mode
```
Since we store this info in the state anyway, could we avoid storing it also in `__model_training_mode`? (and use state whenever needed instead?)
End-to-end workflow tested successfully in combination with #425 and using the "right" AmSC MLflow API key.
```diff
 class ModelManager:
-    def __init__(self, config_dict, model_type):
+    def __init__(self, config_dict, model_type, model_training_mode):
```
`model_training_mode` is not used within the constructor; we can simply drop this argument.
```python
    async def training_kernel(self):

    def _prepare_training_config(self, temp_dir):
        """Prepare a merged training config YAML in the given temp directory.
```
The term "merged" is a bit ambiguous here.
Maybe: "Prepare a training config YAML in the given temp directory, updated with information from the dashboard."
```diff
-def check_evaluate(config_dict, model_type):
+def check_evaluate(config_dict, model_type, model_training_mode):
```
I don't think that `model_training_mode` is changing anything in the logic now.
I would remove this argument, and instead import state (`from state_manager import state`) and set `state.model_training_mode = "local"`.
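The suggestion above could be sketched as follows. This is a hedged illustration only: `state` here is a stand-in for `from state_manager import state`, and the config keys used to decide the mode are assumptions, not the actual `check_evaluate` logic.

```python
from types import SimpleNamespace

# Stand-in for the dashboard's shared state singleton
# (`from state_manager import state` in the real code).
state = SimpleNamespace(model_training_mode="sfapi")

def check_evaluate(config_dict, model_type):
    # Instead of taking model_training_mode as an argument, set it on
    # the shared state when local execution is requested in the config.
    if config_dict.get("local_execution", {}).get("ml_training", False):
        state.model_training_mode = "local"
    # ... rest of the original validation logic would follow here ...
    return state.model_training_mode

mode = check_evaluate({"local_execution": {"ml_training": True}}, "neural_network")
print(mode)  # prints "local"
```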
Overview
Add a local execution mode for ML training, such that the training launched from the dashboard is executed locally instead of on Perlmutter through the Superfacility API.
This assumes the experiment configuration file has a new section of the form
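The original config snippet is not shown in this thread; a minimal sketch of what that section might look like (the `local_execution.ml_training` key comes from this PR, but the exact value shape is an assumption):

```yaml
local_execution:
  ml_training: true  # run training locally instead of via the Superfacility API
```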
To do
Related PRs