feat: add E2E perf tests for mlperf-endpoints#328
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Code Review
This pull request introduces end-to-end performance and correctness tests for the benchmark CLI, including roofline tests, low-QPS correctness tests, and a pytest terminal summary reporter. The review feedback identifies several critical improvements: fixing a logic bug in the Poisson binary search where the lowest target might be skipped, using os.cpu_count() for cross-platform core detection instead of parsing /proc/cpuinfo, defensively handling type conversion errors in the summary formatter to prevent pytest crashes, and adding headroom to --num-samples in the Poisson load pattern test to prevent premature termination.
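Two of the review points above (portable core detection and a summary formatter that cannot crash pytest) can be illustrated with a minimal sketch. The helper names `host_info` and `fmt_cell` and the `kind` parameter are hypothetical stand-ins, not the PR's actual `conftest.py` code; only `os.cpu_count()` and the `/proc/cpuinfo` model-string lookup come from the review text.

```python
import os
import platform


def host_info() -> dict:
    """Collect host details for a summary header (hypothetical helper)."""
    model = platform.processor() or "unknown"
    try:
        # /proc/cpuinfo exists only on Linux; it is read purely for a nicer model string.
        with open("/proc/cpuinfo", encoding="utf-8") as fh:
            for line in fh:
                if line.lower().startswith("model name"):
                    model = line.split(":", 1)[1].strip()
                    break
    except OSError:
        # Not Linux, or /proc is unreadable: keep the portable fallback silently.
        pass
    return {
        "host": platform.node(),
        "cpu": model,
        "cores": os.cpu_count() or 0,  # cpu_count() may return None
    }


def fmt_cell(value, kind: str = "str") -> str:
    """Format one table cell without ever raising on a malformed recorded value."""
    if value is None:
        return "-"
    try:
        if kind == "int":
            return f"{int(value):,}"
        if kind == "float":
            return f"{float(value):,.1f}"
    except (TypeError, ValueError, OverflowError):
        # Fall through to the raw representation rather than crash the summary.
        pass
    return str(value)
```

With this shape, a bad value such as `"n/a"` in a numeric column is rendered as-is instead of aborting the end-of-session table.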
Drive the cyclopts `inference-endpoint` app in-process against the existing MaxThroughputServer and VariableResponseServer stubs. Two families:

* Roofline (`@pytest.mark.performance`, CI-skipped): measures peak QPS for max_throughput, runs a concurrency sweep, and binary-searches for the largest 10k-multiple target_qps that the Poisson pattern sustains. Reports numbers rather than asserting on them.
* Low-QPS correctness (`@pytest.mark.integration`, CI-included): 5 QPS Poisson against the realistic stub for 20s; asserts zero failed requests. Guards against keep-alive / idle-pool / slow-response regressions that only surface when connections sit idle longer than TCP_KEEPIDLE.

A conftest.py captures each parameterized case via a `record_result` fixture and renders a unified summary table with host + CPU info at end of session, so cross-machine roofline runs are easy to compare.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* poisson binary search: switch to while lo<=hi + best_sustained so the LO boundary is actually tested. Old loop could converge to lo==hi==LO/STEP without running LO and report max_sustained=0.
* low_qps: 2x num-samples headroom over TARGET_QPS*DURATION so wall time, not sample count, caps the run despite Poisson variance.
* conftest._host_info: use os.cpu_count() for cores (cross-platform); keep /proc/cpuinfo only for the CPU model string. Document why the OSError except is silent.
* conftest._fmt_cell: wrap int/float conversions in try/except so a bad recorded value can't crash the end-of-session summary table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
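The binary-search fix described in this commit (inclusive `lo <= hi` bounds plus a tracked best result) can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `max_sustained_qps` and the `sustains` callback are hypothetical names, the bounds are assumed to be multiples of the step, and sustainability is assumed to be monotonic in the target rate.

```python
from typing import Callable

STEP = 10_000  # search granularity: 10k-multiples of target_qps, per the PR description


def max_sustained_qps(
    sustains: Callable[[int], bool], lo: int, hi: int, step: int = STEP
) -> int:
    """Return the largest multiple of `step` in [lo, hi] accepted by `sustains`, or 0.

    Assumes monotonicity: if one target fails, every higher target fails too.
    """
    lo_i = -(-lo // step)  # ceil, so the search never probes below lo
    hi_i = hi // step      # floor, so the search never probes above hi
    best = 0
    while lo_i <= hi_i:    # inclusive bounds: the lowest target is probed as well
        mid = (lo_i + hi_i) // 2
        target = mid * step
        if sustains(target):
            best = target  # remember the last success instead of relying on lo/hi
            lo_i = mid + 1
        else:
            hi_i = mid - 1
    return best
```

Because the result is carried in `best` rather than read off the final bounds, a run where only the lowest target is sustainable reports that target rather than 0.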
Low-QPS correctness was marked integration, which would have it run in CI on every PR. These are long-running benchmark tests that aren't meant to gate merges; marking them all performance keeps CI fast and makes the file's policy uniform.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ed4f70a to b66712f
Opt-in (`-vvv`) per-request lifecycle tracing for the benchmark client, rendered as a live `rich` dashboard alongside the run. Off by default with no measurable overhead when off (emit is a no-op binding); the worker hot path stays lock-free and allocation-free.

Pipeline:

- utils/trace.py: lock-free SPSC ring emitter (~190 ns/event) into a per-process 512 MiB anonymous mmap (pages fault in on write). Per-pid POSIX FIFO transport: non-blocking open with bounded retry (raises rather than hanging if the dashboard died), O_NONBLOCK writes that drop on EAGAIN with an adaptive sampler + cumulative self-healing drop counter. bootstrap spawns the dashboard, tracks its Popen, and cleanup() reaps it (kills a wedged one as a backstop). 17-byte <BQQ frames.
- utils/trace_dashboard.py: pure aggregation + render (unit-tested in isolation): lifecycle fold into HDR-histogram stages, heat-graded %E2E table, client/server/backpressure verdict, e2e bar, backpressure cause tree (pickup-ipc heats independently; encode+tcp-acquire fused), per-proc loop-lag panel, and an ESTABLISHED tcp-conn gauge. Cross-process deltas floored at 0.
- scripts/trace_dashboard.py: TUI subprocess: FIFO reader thread, ZMQ SUB to the aggregator PUB (sidecar fallback), and an off-render-thread /proc tcp-conn sampler. Reads to true FIFO EOF with an authoritative final frame.
- Worker / session / agentic-inference emit sites; warmup excluded via a PERF_START reset; PERF_END freezes the lifecycle/verdict/tree.

Perf (B200 bare metal, #328 roofline, 3 reps): trace-off within run-to-run noise of main; -vvv ~5-8% at the sub-ms stub ceiling (worst case).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
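The 17-byte `<BQQ` frame and the drop-on-EAGAIN write described for utils/trace.py can be illustrated with a small sketch. Only the struct format and the drop behaviour come from the commit message; the field meanings (event kind, request id, timestamp in nanoseconds) and the names `pack_event` and `DroppingWriter` are assumptions made for illustration, and the adaptive sampler is not modelled.

```python
import os
import struct

# "<" means little-endian with no padding: 1 (B) + 8 (Q) + 8 (Q) = 17 bytes.
FRAME = struct.Struct("<BQQ")


def pack_event(kind: int, request_id: int, t_ns: int) -> bytes:
    """Pack one event; the three field meanings are assumed, not taken from the PR."""
    return FRAME.pack(kind, request_id, t_ns)


class DroppingWriter:
    """Write frames to a non-blocking fd, counting drops instead of blocking."""

    def __init__(self, fd: int) -> None:
        self.fd = fd          # must already be in O_NONBLOCK mode
        self.dropped = 0      # cumulative count of frames that were not written

    def write(self, frame: bytes) -> bool:
        try:
            # Pipe/FIFO writes of at most PIPE_BUF bytes are atomic, so a
            # 17-byte frame is either written whole or rejected with EAGAIN.
            os.write(self.fd, frame)
            return True
        except BlockingIOError:  # EAGAIN / EWOULDBLOCK: the reader is behind
            self.dropped += 1
            return False
```

A producer would open its FIFO with `os.O_WRONLY | os.O_NONBLOCK` and pass the descriptor to `DroppingWriter`; when the reader falls behind, the writer records the drop and returns immediately.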
arekay-nv
left a comment
Nice, let's get this in so we can iterate with perf tracking!
    port=0,
    num_workers=4,
    stream=request.param,
    stream_interval=10,
    quiet=True,
Should we make these configurable? Also, is stream_interval=10 a good value for a max-throughput server? We should probably document the choice.
Drive the cyclopts `inference-endpoint` app in-process against the existing `MaxThroughputServer` and `VariableResponseServer` stubs. Two families, both marked `@pytest.mark.performance` (CI-skipped; run on demand):

* Roofline: measures peak QPS for `max_throughput`, a `concurrency` sweep (1k / 4k / 16k), and binary-searches for the largest 10k-multiple `target_qps` that the Poisson pattern sustains. Reports numbers rather than asserting on them.
* Low-QPS correctness: 5 QPS Poisson against the realistic stub for ~20 s; asserts zero failed requests. Guards against keep-alive / idle-pool / slow-response regressions that only surface when connections sit idle longer than `TCP_KEEPIDLE`.

A `conftest.py` captures each parameterized case via a `record_result` fixture and renders a unified summary table with host / CPU / core info at end of session, so cross-machine roofline runs are easy to compare.

WARNING: full run takes ~8–10 min wall.
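For the low-QPS case, the follow-up commit sizes `--num-samples` at 2x `TARGET_QPS*DURATION` so that wall time, not sample count, ends the run. The sketch below shows why a budget equal to the expected count is too tight. It is a standalone simulation under stated assumptions (5 QPS, 20 s, exponential inter-arrival times drawn from Python's `random` module), not the benchmark's own scheduler, and `poisson_arrivals` is a hypothetical helper.

```python
import random

TARGET_QPS = 5   # from the PR description
DURATION_S = 20  # from the PR description
HEADROOM = 2     # 2x headroom, per the follow-up commit


def poisson_arrivals(rate: float, duration: float, rng: random.Random) -> int:
    """Count arrivals of a Poisson process with the given rate within `duration` seconds."""
    t, n = 0.0, 0
    while True:
        t += rng.expovariate(rate)  # exponential gap between consecutive arrivals
        if t > duration:
            return n
        n += 1


expected = TARGET_QPS * DURATION_S  # mean number of requests in the window
budget = HEADROOM * expected        # value that would be passed as --num-samples

rng = random.Random(0)
counts = [poisson_arrivals(TARGET_QPS, DURATION_S, rng) for _ in range(1000)]

# Runs that would exhaust a budget sized at exactly the expected count.
over_expected = sum(c > expected for c in counts)
# Runs that would exhaust the 2x budget.
over_budget = sum(c > budget for c in counts)
```

Since the arrival count is Poisson-distributed around the expected value, a sizeable share of simulated runs exceeds a budget of exactly `expected`, while exceeding the doubled budget would require a deviation of roughly ten standard deviations.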
How to run
Example output — local dev box
What does this PR do?
Adds the smallest set of E2E parameterized perf tests that exercise all three online load patterns plus the low-QPS keep-alive guard, behind the existing `performance` marker so CI is unaffected.
Type of change
Related issues
Testing
Checklist