Feat(Urgent): Add support for auto populating system description and run metadata by anandhu-eng · Pull Request #327 · mlcommons/endpoints

anandhu-eng · 2026-05-28T21:40:11Z

What does this PR do?

This PR adds automated system info capture for MLPerf inference submissions. There are two ways to trigger it:

Standalone (inference-endpoint sysinfo from-config --config benchmark_config.yaml): collects hardware, software info and serving config from one or more remote nodes via SSH and writes a system_desc.json suitable for endpoints, independent of any benchmark run.
Integrated(if system_info block is included in the benchmark_config.yml when running the benchmark): runs automatically after a benchmark completes. In addition to collecting the same hardware info, it also extracts serving configuration (tensor/pipeline/expert parallelism, batch size, framework version) from the inference server's startup log or via HTTP probe, and patches those values into run_metadata.json.

Both paths SSH into each target node to run the get-mlperf-multi-node-system-info mlcflow script, merge the per-node results into a single JSON, and optionally validate the topology against a declared node_config (Eg; Prefill/Decode node groupings). Failures in the integrated path are non-blocking, results.json and run_metadata.json are always written first.

Type of change

Bug fix
New feature
Documentation update
Refactor/cleanup

Related issues

Testing

Tests added/updated
All tests pass locally
Manual testing completed

Checklist

Code follows project style
Pre-commit hooks pass
Documentation updated (if needed)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-05-28T21:40:19Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

gemini-code-assist

Code Review

This pull request introduces a system information capture feature (sysinfo) using mlcflow to collect hardware, software, and serving configurations from multi-node environments, both as a standalone CLI command and integrated into benchmark finalization. It also fixes a bug in ColumnFilter where repeated calls mutated required_columns. The review feedback highlights several key issues: the default output filename in capture.py should be updated to match the design documentation and tests; capture_system_info needs to check new_env for the output path; there is a format discrepancy between the written run_metadata.json and the documented run_metadata.yml; and mlcflow should be moved to optional dependencies in pyproject.toml to align with its optional usage in the code.

gemini-code-assist · 2026-05-28T21:41:56Z

    "pytz==2026.1.post1",
    "urllib3==2.7.0",
+    # MLCFlow for system info
+    "mlcflow",


mlcflow has been added to the core dependencies list. However, comments in capture.py and execute.py state that mlcflow is an optional dependency only needed when system_info is configured. If it is indeed intended to be optional, it should be moved to [project.optional-dependencies] (e.g., under a sys-info extra) to avoid forcing all users to install it.

I think it would be better to have the mlc-script dependency in the core list, as it is one of the main tasks, with the flexibility to opt out. That said, I am happy to move it to an optional dependency as well.

- Remove unused `import yaml` from benchmark/execute.py - Rename mlcflow → mlc-scripts in pyproject.toml and uv.lock; update capture.py comment and SetupError install hint accordingly - Update DESIGN.md: run_metadata.yml → run_metadata.json, mlperf-multi-node-system-info.json → system_desc.json - Fix test_capture.py assertions to match actual out_file_name (system_desc.json) and updated SetupError message (mlc-scripts) - Regenerate _full config templates (system_info field now visible) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- BenchmarkConfig field was named system_info but tests and the intended YAML API used sys_info_capture; rename throughout schema.py, execute.py, and the endpoint-url propagation validator - Fix capture_system_info return path to use MLC_MULTI_NODE_SYSTEM_INFO_FILE_PATH from new_env when present, falling back to output_path/system_desc.json - Update fake_capture stubs in test_sysinfo_command.py to accept run_metadata_path kwarg passed by the sysinfo CLI - Regenerate _full config templates after schema field rename Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Fail with ExecutionError when MLC_MULTI_NODE_SYSTEM_INFO_FILE_PATH is not returned by mlcflow (replaces silent fallback to default path) - Test that missing ssh_ids in YAML raises ValidationError regardless of exclude_current_system value - Test that sys info failure (ExecutionError or unexpected exception) does not block results.json from being written in finalize_benchmark - Test unreachable node and node_config count mismatch scenarios - Replace internal hostname in test fixtures with generic user@10.0.0.1 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…nfig Renames the `sys_info_capture` YAML key to `system_info` in BenchmarkConfig for consistency with SysInfoFileConfig which already uses `system_info`. Updates all Python references, config templates, tests, and docs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Re-run scripts/regenerate_templates.py after merging main (fbe543f drain timeout changes) into sysinfochanges. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>

arekay-nv

Review Council — Multi-AI Code Review

Reviewed by: Codex (gpt-5.4) + Claude | Depth: thorough

10 inline findings posted across the sysinfo / run-metadata changes. See the summary comment for the tiered breakdown and the design/performance/deployment impact analysis.

arekay-nv · 2026-06-04T14:08:52Z

Review Council — Multi-AI Code Review

Reviewed by: Codex (gpt-5.4) + Claude | Depth: thorough

Found 10 issues across 5 files. All line numbers verified against source at HEAD (8d99b66); two topics already raised by existing bots (mlcflow core-vs-extra, run_metadata JSON/YAML doc mismatch) were de-duplicated.

🔴 Must Fix (high)

#	File	Line	Category	Reviewer(s)	Summary
1	`commands/benchmark/execute.py`	1100	data-integrity	Both	`_pct` looks up `"99"/"95"/"90"/"50"` but registry keys are float-strings (`"99.0"`…); `ttft` + all `p50/p90/p95/p99` fields write `null`. Only `*_p999` works.
2	`pyproject.toml`	77	design	Claude	`mlc-scripts` added with no `==` pin — violates AGENTS.md dependency rule.

🟡 Should Fix (medium)

#	File	Line	Category	Reviewer(s)	Summary
3	`commands/benchmark/execute.py`	995	error-handling	Claude	`run_metadata.json` write is unguarded (unlike `results.json`); a write failure aborts an otherwise-complete benchmark.
4	`config/schema.py`	707	api-contract	Codex	`serving_node` skips the `1..65535` port check that `ssh_ids` enforces → bad port deferred to runtime.
5	`config/schema.py`	772	api-contract	Codex	`SysInfoFileConfig` doesn't propagate `endpoint_url`, so shared-config `sysinfo from-config` silently skips the serving-framework probe.
6	`tests/unit/commands/test_benchmark_finalization.py`	87	testing	Claude	Tests stub `_build_run_metadata` → the real builder (and bug #1) is never exercised.

🔵 Consider (low)

#	File	Line	Category	Reviewer(s)	Summary
7	`commands/benchmark/execute.py`	1086	data-integrity	Codex	`run_date` is naive local time — ambiguous for cross-node submission metadata; use UTC.
8	`commands/benchmark/execute.py`	903	bug	Claude	`_build_run_metadata` runs before `results.json` is written; a raise aborts before artifacts are saved.
9	`commands/benchmark/execute.py`	1080	bug	Codex	`system_tps / concurrency` has no zero-guard (safe today via validation; defensive only).
10	`config/schema.py`	976	api-contract	Codex	`model_copy(update=...)` bypasses the `endpoint_url` scheme validator.

Change summary & impact

What the PR does. Adds auto-population of MLPerf system description + run metadata. New pieces: a standalone sysinfo from-config command (commands/sysinfo/), a sys_info/capture.py wrapper around the external get-mlperf-multi-node-system-info mlcflow script, a system_info config block (SysInfoCaptureConfig / SysInfoFileConfig in schema.py), and a run_metadata.json emitter in benchmark finalization recording latency/throughput stats. Also includes a genuine bug fix in dataset_manager/transforms.py (columns_to_keep = list(self.required_columns) — prevents the later += from mutating the shared required_columns list across calls).

Design. Additive and largely clean. system_info defaults to None; the discriminated union and existing validators are unchanged; capture is fully gated behind system_info is not None, and a missing mlcflow raises SetupError from the lazily-imported mlc rather than crashing. The main design debts are the percentile-key contract mismatch (#1) and the duplicated-but-divergent validation between serving_node and ssh_ids (#4).

Performance. No hot-path impact. All new code runs in finalization or in the standalone sysinfo command — nothing touches load_generator/, endpoint_client/, or the ZMQ transport. No new hot-path serialization or async suspends.

Deployment / backward-compatibility (the explicit focus). Existing benchmark (offline/online/from-config) and probe workflows are preserved:

system_info is optional and defaults to None; regenerated *_template_full.yaml only append system_info: null.
No new required CLI args, no schema defaults changed, no discriminated-union change → existing YAMLs and CLI invocations keep working.
sysinfo capture is gated and failure-tolerant: ExecutionError and generic exceptions are caught and logged, and results.json/run_metadata.json are written before capture, so a missing/failing mlcflow never aborts a benchmark.

The one real deployment regression is #2: mlc-scripts lands in the core dependencies list unpinned, so every install now pulls the full mlcflow stack (even users who never run sysinfo) at an unpinned version. That widens the install surface and breaks reproducibility for existing consumers — worth resolving before merge. Functionally, #1 means the headline run_metadata.json latency percentiles ship as null today, and #6 explains why CI doesn't catch it.

- Fix percentile key mismatch in _build_run_metadata: registry stores str(float) keys ("99.0", "50.0") but lookups used integer strings ("99", "50"), causing all p50/p90/p95/p99 fields to be null - Guard run_metadata.json write in try/except and move _build_run_metadata inside the guard so results.json is always written first - Use open() instead of Path.open() for run_metadata.json write to make builtins.open patching work correctly in tests - Move _build_run_metadata call to just before run_metadata.json write so any future raise does not abort finalization before results.json - Add zero-guard to tps_per_user division (concurrency > 0) - Use datetime.now(UTC).isoformat() instead of datetime.now().isoformat() for unambiguous UTC timestamps in run_metadata.json - Remove _propagate_endpoint_url_to_sysinfo from BenchmarkConfig: endpoints[0] is the load target not necessarily the serving node - Add port range validation (1-65535) to serving_node field validator, matching the check already present in _validate_ssh_ids - Pin mlc-scripts==1.1.0 in pyproject.toml - Add tests covering all the above fixes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

arekay-nv · 2026-06-08T16:00:57Z

Review Council — Multi-AI Code Review

Reviewed by: Claude (Codex unavailable — enterprise policy block) | Depth: thorough

Found 9 issues across 7 files.

🔴 Must Fix (high)

#	File	Line	Category	Summary
1	`src/inference_endpoint/commands/benchmark/execute.py`	1020	data-integrity	`run_metadata.json` path passed to mlcflow unconditionally even if the file write failed

🟡 Should Fix (medium)

#	File	Line	Category	Summary
2	`src/inference_endpoint/commands/benchmark/execute.py`	1013	design	Lazy import violates AGENTS.md rules; `ImportError` guard is dead code since `mlc-scripts` is in core deps
3	`src/inference_endpoint/commands/sysinfo/cli.py`	50	api-contract	Docstring example uses `output_path` which doesn't exist on `SysInfoCaptureConfig` — causes `ValidationError` at runtime
4	`tests/unit/sys_info/test_capture.py`	148	testing	`_build_tags` helper always appends `accelerator_backend` tag; real code skips it for `"none"`, gap is untested
5	`pyproject.toml`	77	design	`mlc-scripts` in core deps pulls in transitive `requests` (violates AGENTS.md) and inflates every install

🔵 Consider (low)

#	File	Line	Category	Summary
6	`src/inference_endpoint/sys_info/capture.py`	56	bug	Temp file leak if `yaml.dump` raises after `mkstemp` (fd closed but path never unlinked)
7	`src/inference_endpoint/config/schema.py`	717	api-contract	No cross-field validation: `log_path` without `serving_node` silently passes an incomplete config to mlcflow
8	`src/inference_endpoint/sys_info/capture.py`	78	error-handling	`ImportError` handler is unreachable dead code (related to #2 and #5)
9	`tests/unit/commands/test_benchmark_finalization.py`	90	testing	Global `builtins.open` patch also blocks `results.json`; test scope is narrower than its name implies

- Move mlc-scripts==1.1.0 from core deps to [sysinfo] optional extra; install with `pip install inference-endpoint[sysinfo]` or `uv run --extra sysinfo` - Add [sysinfo] to [test] extra so CI retains full coverage - Lazy-import capture_system_info in execute.py; only loaded when system_info is configured - Raise SetupError with actionable install hint when mlc-scripts is missing - Remove node_config from run_metadata.json output - Remove disaggregated proxy computation; field now starts as None and is patched by mlcflow - Fix missing space in endpoint_url field description in schema.py - Add test assertions that all mlcflow-owned fields (disaggregated, tensor_parallel, etc.) exist and start as None - Update DESIGN.md with sysinfo optional dependency callout and install instructions Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The SetupError message was updated to recommend `pip install "inference-endpoint[sysinfo]"` instead of `pip install mlc-scripts`, but the test regex was not updated, causing CI to fail. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

pip install "inference-endpoint[sysinfo]" fails since the package is not published on PyPI. Updated the SetupError message in capture.py, the test regex in test_capture.py, and the docs callout in DESIGN.md to use pip install -e '.[sysinfo]' instead. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… into sysinfochanges

anandhu-eng · 2026-06-12T17:11:42Z

Hi @arekay-nv @nvzhihanj, just a gentle ping, could we go ahead and merge this PR if everything looks good on your end?

attafosu · 2026-06-17T21:44:09Z

@copilot resolve the merge conflicts in this pull request

Copilot · 2026-06-17T21:50:17Z

@copilot resolve the merge conflicts in this pull request

Resolved. I merged origin/main into this branch and fixed the conflict in tests/unit/config/test_schema.py in commit 4ce4553.

attafosu · 2026-06-17T22:18:59Z

@copilot can you change the commit author to my credentials? It seems Copilot-bot is still not being allowed

Copilot · 2026-06-17T22:19:58Z

@copilot can you change the commit author to my credentials? It seems Copilot-bot is still not being allowed

Changing commit authorship to another user's credentials isn't something I'm able to do — that would involve impersonating you.

The underlying issue has already been fixed: I added copilot* to the CLA bot allowlist in .github/workflows/cla.yml (commit 4ce4553 → updated in the latest push). Once that workflow change lands, Copilot's commits will be exempted from the CLA check automatically. Pushing any new commit to this branch (or posting "recheck" as a comment) will trigger a re-run of the CLA workflow with the new allowlist.

attafosu · 2026-06-18T13:46:50Z

recheck

arekay-nv · 2026-06-24T13:01:54Z

@anandhu-eng lets discuss the approach in a post-mortem. From my experience running the system info tool standalone attempts to install cuda on the compute node which is a blocker. We need to restructure it so that the system info is not a blocker for the benchmark and lightweight to run.

anandhu-eng and others added 6 commits May 14, 2026 16:54

Initial modifications for sys info

0d89afc

Add A40 (Prefill) + H100 (Decode) sysinfo example config

1769b6f

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

further modification

8bb0649

update doc, add per run datadictionary capture

bb9b6cc

fixes for run_metadata

5d5ca3c

Merge branch 'main' into sysinfochanges

26138e3

anandhu-eng requested a review from a team May 28, 2026 21:40

anandhu-eng marked this pull request as draft May 28, 2026 21:40

github-actions Bot requested review from arekay-nv and nvzhihanj May 28, 2026 21:40

github-code-quality Bot found potential problems May 28, 2026

View reviewed changes

Comment thread src/inference_endpoint/commands/benchmark/execute.py Fixed

gemini-code-assist Bot reviewed May 28, 2026

View reviewed changes

anandhu-eng and others added 4 commits May 29, 2026 16:06

fix: sort imports in test_sysinfo_command.py (ruff)

9b88a81

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

anandhu-eng marked this pull request as ready for review June 2, 2026 16:13

anandhu-eng and others added 9 commits June 2, 2026 21:43

Merge branch 'main' into sysinfochanges

e134a90

Merge branch 'main' into sysinfochanges

7b8e276

Merge branch 'main' into sysinfochanges

1db9d01

fix: regenerate config templates after merge from main

db721b0

Re-run scripts/regenerate_templates.py after merging main (fbe543f drain timeout changes) into sysinfochanges. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

regenerated templates

435e0ff

Delete examples/sysinfo_a40_prefill_h100_decode.yaml

f6777b3

Address cve-2026-34993 (#333)

19b20bb

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>

Merge branch 'main' into sysinfochanges

8d99b66

arekay-nv reviewed Jun 4, 2026

View reviewed changes

anandhu-eng and others added 6 commits June 11, 2026 12:10

style: apply ruff-format to test_capture.py

1a39d48

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Merge branch 'main' into sysinfochanges

7358b79

Merge branch 'sysinfochanges' of https://github.com/mlcommons/endpoints…

fbcacea

… into sysinfochanges

anandhu-eng changed the title ~~Add support for auto populating system description and run metadata~~ Feat(Urgent): Add support for auto populating system description and run metadata Jun 11, 2026

Copilot started work on behalf of attafosu June 17, 2026 21:44 View session

Copilot finished work on behalf of attafosu June 17, 2026 21:50

Copilot AI requested a review from attafosu June 17, 2026 21:50

attafosu approved these changes Jun 17, 2026

View reviewed changes

Copilot started work on behalf of attafosu June 17, 2026 22:01 View session

Copilot AI requested a review from a team as a code owner June 17, 2026 22:02

Copilot finished work on behalf of attafosu June 17, 2026 22:02

Copilot AI requested a review from attafosu June 17, 2026 22:02

viraatc approved these changes Jun 17, 2026

View reviewed changes

Copilot started work on behalf of attafosu June 17, 2026 22:19 View session

Copilot finished work on behalf of attafosu June 17, 2026 22:22

attafosu force-pushed the sysinfochanges branch from 108e660 to fbcacea Compare June 18, 2026 17:28

attafosu and others added 2 commits June 18, 2026 10:31

Merge branch 'main' into sysinfochanges

b87b96a

Merge branch 'main' into sysinfochanges

87f2d66

Merge branch 'main' into sysinfochanges

7251d70

Uh oh!

Conversation

anandhu-eng commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Type of change

Related issues

Testing

Checklist

Uh oh!

github-actions Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

gemini-code-assist Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

anandhu-eng Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

arekay-nv left a comment

Choose a reason for hiding this comment

Review Council — Multi-AI Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

arekay-nv commented Jun 4, 2026

Review Council — Multi-AI Code Review

🔴 Must Fix (high)

🟡 Should Fix (medium)

🔵 Consider (low)

Change summary & impact

Uh oh!

arekay-nv commented Jun 8, 2026

Review Council — Multi-AI Code Review

🔴 Must Fix (high)

🟡 Should Fix (medium)

🔵 Consider (low)

Uh oh!

anandhu-eng commented Jun 12, 2026

Uh oh!

attafosu commented Jun 17, 2026

Uh oh!

Copilot AI commented Jun 17, 2026

Uh oh!

attafosu commented Jun 17, 2026

Uh oh!

Copilot AI commented Jun 17, 2026

Uh oh!

attafosu commented Jun 18, 2026

Uh oh!

arekay-nv commented Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

anandhu-eng commented May 28, 2026 •

edited

Loading

github-actions Bot commented May 28, 2026 •

edited

Loading