Feat(Urgent): Add support for auto populating system description and run metadata #327
Feat(Urgent): Add support for auto populating system description and run metadata #327anandhu-eng wants to merge 33 commits into
Conversation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
There was a problem hiding this comment.
Code Review
This pull request introduces a system information capture feature (sysinfo) using mlcflow to collect hardware, software, and serving configurations from multi-node environments, both as a standalone CLI command and integrated into benchmark finalization. It also fixes a bug in ColumnFilter where repeated calls mutated required_columns. The review feedback highlights several key issues: the default output filename in capture.py should be updated to match the design documentation and tests; capture_system_info needs to check new_env for the output path; there is a format discrepancy between the written run_metadata.json and the documented run_metadata.yml; and mlcflow should be moved to optional dependencies in pyproject.toml to align with its optional usage in the code.
| "pytz==2026.1.post1", | ||
| "urllib3==2.7.0", | ||
| # MLCFlow for system info | ||
| "mlcflow", |
There was a problem hiding this comment.
mlcflow has been added to the core dependencies list. However, comments in capture.py and execute.py state that mlcflow is an optional dependency only needed when system_info is configured. If it is indeed intended to be optional, it should be moved to [project.optional-dependencies] (e.g., under a sys-info extra) to avoid forcing all users to install it.
There was a problem hiding this comment.
I think it would be better to have the mlc-script dependency in the core list, as it is one of the main tasks, with the flexibility to opt out. That said, I am happy to move it to an optional dependency as well.
- Remove unused `import yaml` from benchmark/execute.py - Rename mlcflow → mlc-scripts in pyproject.toml and uv.lock; update capture.py comment and SetupError install hint accordingly - Update DESIGN.md: run_metadata.yml → run_metadata.json, mlperf-multi-node-system-info.json → system_desc.json - Fix test_capture.py assertions to match actual out_file_name (system_desc.json) and updated SetupError message (mlc-scripts) - Regenerate _full config templates (system_info field now visible) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- BenchmarkConfig field was named system_info but tests and the intended YAML API used sys_info_capture; rename throughout schema.py, execute.py, and the endpoint-url propagation validator - Fix capture_system_info return path to use MLC_MULTI_NODE_SYSTEM_INFO_FILE_PATH from new_env when present, falling back to output_path/system_desc.json - Update fake_capture stubs in test_sysinfo_command.py to accept run_metadata_path kwarg passed by the sysinfo CLI - Regenerate _full config templates after schema field rename Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fail with ExecutionError when MLC_MULTI_NODE_SYSTEM_INFO_FILE_PATH is not returned by mlcflow (replaces silent fallback to default path) - Test that missing ssh_ids in YAML raises ValidationError regardless of exclude_current_system value - Test that sys info failure (ExecutionError or unexpected exception) does not block results.json from being written in finalize_benchmark - Test unreachable node and node_config count mismatch scenarios - Replace internal hostname in test fixtures with generic user@10.0.0.1 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nfig Renames the `sys_info_capture` YAML key to `system_info` in BenchmarkConfig for consistency with SysInfoFileConfig which already uses `system_info`. Updates all Python references, config templates, tests, and docs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Re-run scripts/regenerate_templates.py after merging main (fbe543f drain timeout changes) into sysinfochanges. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
arekay-nv
left a comment
There was a problem hiding this comment.
Review Council — Multi-AI Code Review
Reviewed by: Codex (gpt-5.4) + Claude | Depth: thorough
10 inline findings posted across the sysinfo / run-metadata changes. See the summary comment for the tiered breakdown and the design/performance/deployment impact analysis.
Review Council — Multi-AI Code ReviewReviewed by: Codex (gpt-5.4) + Claude | Depth: thorough Found 10 issues across 5 files. All line numbers verified against source at HEAD ( 🔴 Must Fix (high)
🟡 Should Fix (medium)
🔵 Consider (low)
Change summary & impactWhat the PR does. Adds auto-population of MLPerf system description + run metadata. New pieces: a standalone Design. Additive and largely clean. Performance. No hot-path impact. All new code runs in finalization or in the standalone Deployment / backward-compatibility (the explicit focus). Existing
The one real deployment regression is #2: |
- Fix percentile key mismatch in _build_run_metadata: registry stores
str(float) keys ("99.0", "50.0") but lookups used integer strings
("99", "50"), causing all p50/p90/p95/p99 fields to be null
- Guard run_metadata.json write in try/except and move _build_run_metadata
inside the guard so results.json is always written first
- Use open() instead of Path.open() for run_metadata.json write to make
builtins.open patching work correctly in tests
- Move _build_run_metadata call to just before run_metadata.json write
so any future raise does not abort finalization before results.json
- Add zero-guard to tps_per_user division (concurrency > 0)
- Use datetime.now(UTC).isoformat() instead of datetime.now().isoformat()
for unambiguous UTC timestamps in run_metadata.json
- Remove _propagate_endpoint_url_to_sysinfo from BenchmarkConfig:
endpoints[0] is the load target not necessarily the serving node
- Add port range validation (1-65535) to serving_node field validator,
matching the check already present in _validate_ssh_ids
- Pin mlc-scripts==1.1.0 in pyproject.toml
- Add tests covering all the above fixes
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Review Council — Multi-AI Code ReviewReviewed by: Claude (Codex unavailable — enterprise policy block) | Depth: thorough Found 9 issues across 7 files. 🔴 Must Fix (high)
🟡 Should Fix (medium)
🔵 Consider (low)
|
- Move mlc-scripts==1.1.0 from core deps to [sysinfo] optional extra; install with `pip install inference-endpoint[sysinfo]` or `uv run --extra sysinfo` - Add [sysinfo] to [test] extra so CI retains full coverage - Lazy-import capture_system_info in execute.py; only loaded when system_info is configured - Raise SetupError with actionable install hint when mlc-scripts is missing - Remove node_config from run_metadata.json output - Remove disaggregated proxy computation; field now starts as None and is patched by mlcflow - Fix missing space in endpoint_url field description in schema.py - Add test assertions that all mlcflow-owned fields (disaggregated, tensor_parallel, etc.) exist and start as None - Update DESIGN.md with sysinfo optional dependency callout and install instructions Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The SetupError message was updated to recommend `pip install "inference-endpoint[sysinfo]"` instead of `pip install mlc-scripts`, but the test regex was not updated, causing CI to fail. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pip install "inference-endpoint[sysinfo]" fails since the package is not published on PyPI. Updated the SetupError message in capture.py, the test regex in test_capture.py, and the docs callout in DESIGN.md to use pip install -e '.[sysinfo]' instead. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… into sysinfochanges
|
Hi @arekay-nv @nvzhihanj, just a gentle ping, could we go ahead and merge this PR if everything looks good on your end? |
|
@copilot resolve the merge conflicts in this pull request |
Resolved. I merged |
|
@copilot can you change the commit author to my credentials? It seems Copilot-bot is still not being allowed |
Changing commit authorship to another user's credentials isn't something I'm able to do — that would involve impersonating you. The underlying issue has already been fixed: I added |
|
recheck |
|
@anandhu-eng lets discuss the approach in a post-mortem. From my experience running the system info tool standalone attempts to install cuda on the compute node which is a blocker. We need to restructure it so that the system info is not a blocker for the benchmark and lightweight to run. |
What does this PR do?
This PR adds automated system info capture for MLPerf inference submissions. There are two ways to trigger it:
Standalone (inference-endpoint sysinfo from-config --config benchmark_config.yaml): collects hardware, software info and serving config from one or more remote nodes via SSH and writes a system_desc.json suitable for endpoints, independent of any benchmark run.
Integrated(if system_info block is included in the benchmark_config.yml when running the benchmark): runs automatically after a benchmark completes. In addition to collecting the same hardware info, it also extracts serving configuration (tensor/pipeline/expert parallelism, batch size, framework version) from the inference server's startup log or via HTTP probe, and patches those values into run_metadata.json.
Both paths SSH into each target node to run the get-mlperf-multi-node-system-info mlcflow script, merge the per-node results into a single JSON, and optionally validate the topology against a declared node_config (Eg; Prefill/Decode node groupings). Failures in the integrated path are non-blocking, results.json and run_metadata.json are always written first.
Type of change
Related issues
Testing
Checklist