THIS ONE AFTER THE LAST ONE: fix(#517): clear error when workload YAML missing; document supported matrix #563

Open
FileSystemGuy wants to merge 1 commit into main from feat/missing-workload-yaml-error

@FileSystemGuy

Contributor

Summary

Resolves #517.

mlpstorage closed training unet3d datasize --accelerator-type mi355 … was producing a bare:

Configuration file not found: …/configs/dlio/workload/unet3d_mi355.yaml

…which is technically true but tells the user nothing about why the file is missing (because we don't support that combination in v3.0) or what they should do instead.

Code change

DLIOBenchmark.process_dlio_params now does an explicit existence check on the resolved workload YAML and raises ConfigurationError(CONFIG_FILE_NOT_FOUND) with a specific "combination not supported" message before the lower-level loader can emit its generic one. The message:

  • Names the rejected (model, accelerator-type) pair.
  • Points at the exact missing YAML so the failure is unambiguous.
  • Lists the three v3.0-submittable combinations: unet3d/b200, retinanet/b200, retinanet/mi355.
  • Notes that other pairs are available under `whatif` if a workload file exists for them.

The helper branches on `BENCHMARK_TYPE` so checkpointing (model-only filename) gets the same treatment with a shorter message.
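
A minimal sketch of the up-front check described above, not the code in this PR. `BenchmarkType`, `workload_yaml_path`, and `ensure_workload_exists` are hypothetical names, `ConfigurationError` is a local stand-in whose real signature may differ, and the `<model>.yaml` filename for checkpointing is an assumption (the description only says the filename is model-only).

```python
from enum import Enum
from pathlib import Path


class BenchmarkType(Enum):
    """Hypothetical stand-in for the values BENCHMARK_TYPE can take."""
    TRAINING = "training"
    CHECKPOINTING = "checkpointing"


class ConfigurationError(Exception):
    """Local stand-in; the project's ConfigurationError may take other arguments."""


def workload_yaml_path(workload_dir, benchmark_type, model, accelerator=None):
    # Training workloads are keyed by model and accelerator: <model>_<accel>.yaml.
    # Checkpointing is model-only; "<model>.yaml" is assumed here.
    if benchmark_type is BenchmarkType.TRAINING:
        name = f"{model}_{accelerator}.yaml"
    else:
        name = f"{model}.yaml"
    return Path(workload_dir) / name


def ensure_workload_exists(workload_dir, benchmark_type, model, accelerator=None):
    """Return the workload path, or raise a specific error before any loader runs."""
    path = workload_yaml_path(workload_dir, benchmark_type, model, accelerator)
    if path.is_file():
        return path
    if benchmark_type is BenchmarkType.TRAINING:
        message = (
            f"The combination --model={model} "
            f"--accelerator-type={accelerator} is not supported. "
            f"Missing workload definition: {path}"
        )
    else:
        # Shorter variant: there is no accelerator to report.
        message = (
            f"The model --model={model} is not supported. "
            f"Missing workload definition: {path}"
        )
    raise ConfigurationError(message)
```

Checking the path before calling the generic loader is what lets the caller attach domain context (which pair was rejected) that the loader itself cannot know.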

Example output

[E103] The combination --model=unet3d --accelerator-type=mi355 is not supported.
  Details: Parameter: model+accelerator-type; Actual: unet3d + mi355
  Suggestion: Missing workload definition: …/configs/dlio/workload/unet3d_mi355.yaml
  v3.0 submittable combinations (CLOSED or OPEN):
    --model unet3d    --accelerator-type b200
    --model retinanet --accelerator-type b200
    --model retinanet --accelerator-type mi355
  Other (model, accelerator) pairs work under `whatif` if a workload definition
  file exists for them; this combination has none.
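
For illustration, the three-part layout of the output above (code and message, then `Details:`, then `Suggestion:`) could be produced by an exception along these lines. This is a sketch under assumptions: `StructuredError` and its field names are hypothetical, and the project's real error class may format its output differently.

```python
class StructuredError(Exception):
    """Hypothetical error type that renders as '[CODE] message' plus two detail lines."""

    def __init__(self, code, message, details, suggestion):
        super().__init__(message)
        self.code = code
        self.message = message
        self.details = details
        self.suggestion = suggestion

    def __str__(self):
        # Indent the follow-up lines by two spaces, matching the example output.
        return (
            f"[{self.code}] {self.message}\n"
            f"  Details: {self.details}\n"
            f"  Suggestion: {self.suggestion}"
        )


error = StructuredError(
    code="E103",
    message="The combination --model=unet3d --accelerator-type=mi355 is not supported.",
    details="Parameter: model+accelerator-type; Actual: unet3d + mi355",
    suggestion="Missing workload definition: …/configs/dlio/workload/unet3d_mi355.yaml",
)
print(error)
```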

Docs

ManPage.md (near `--accelerator-type`) and `training/README.md` (top of §"Training Models") gain the full support matrix:

Model a100 h100 b200 mi355
unet3d whatif v3.0
retinanet v3.0 v3.0
cosmoflow whatif whatif
resnet50 whatif whatif
dlrm whatif whatif
flux whatif whatif
  • v3.0 = submittable in CLOSED or OPEN
  • whatif = planning-only via `mlpstorage whatif …`; not submittable
  • = no workload definition file; this is the path that now errors out clearly.
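
As a rough sketch, the v3.0-submittable part of the matrix can be expressed as a small lookup. Only the three pairs named in this PR are encoded; `V3_SUBMITTABLE`, `is_v3_submittable`, and `submittable_lines` are hypothetical names, not part of mlpstorage.

```python
# The three (model, accelerator-type) pairs listed as v3.0 submittable.
V3_SUBMITTABLE = (
    ("unet3d", "b200"),
    ("retinanet", "b200"),
    ("retinanet", "mi355"),
)


def is_v3_submittable(model, accelerator):
    """True if the pair may be submitted in CLOSED or OPEN for v3.0."""
    return (model.lower(), accelerator.lower()) in V3_SUBMITTABLE


def submittable_lines():
    """Render the pairs as CLI flags, one per line, with model names padded to align."""
    width = max(len(model) for model, _ in V3_SUBMITTABLE)
    return [
        f"--model {model:<{width}} --accelerator-type {accel}"
        for model, accel in V3_SUBMITTABLE
    ]
```

Keeping the pairs in one tuple means the error message and any validation share a single source of truth, so the list in the message cannot drift from the check.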

Test plan

…t supported matrix

Before: passing a (model, accelerator-type) combination that has no
configs/dlio/workload/<model>_<accel>.yaml file produced a generic
"Configuration file not found: …/unet3d_mi355.yaml" error from
read_config_from_file. The user had no way to tell whether this was
a packaging bug, an install problem, or an unsupported combination.

After: DLIOBenchmark.process_dlio_params checks the resolved workload
path up front and raises ConfigurationError(CONFIG_FILE_NOT_FOUND)
with a specific "combination not supported" message that:
  - names the (model, accelerator-type) pair
  - points at the missing YAML so the failure is unambiguous
  - lists the three v3.0 submittable combinations (unet3d/b200,
    retinanet/b200, retinanet/mi355)
  - notes that other pairs are available under `whatif` if a workload
    file exists for them.

Docs (ManPage.md, training/README.md) gain the full (model x accelerator)
support matrix, marking each cell as v3.0-submittable, whatif-only, or
not supported.

Checkpointing uses a model-only filename so it gets a shorter variant
of the message via the same helper, branching on BENCHMARK_TYPE.
@FileSystemGuy FileSystemGuy requested a review from a team June 27, 2026 01:13
@github-actions

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@FileSystemGuy FileSystemGuy changed the title fix(#517): clear error when workload YAML missing; document supported matrix THIS ONE AFTER THE LAST ONE: fix(#517): clear error when workload YAML missing; document supported matrix Jun 27, 2026

Development

Successfully merging this pull request may close these issues.

unet3d_mi355.yaml configuration file not found when running mlpstorage closed training unet3d datasize