Introduces a reproducible, CI-gated benchmark that measures py_binary cold-start (launcher + Python interpreter) overhead. The workflow runs on every PR that touches py/private/** or benchmark/startup/**, compares the PR against HEAD main, and posts a sticky comment with results. If the PR regresses startup time by more than 10 % vs main, the check fails.
Motivation
rules_py performance is heavily dependent on the Bash launcher emitted by py_binary. Small changes to environment-setup logic, runfiles resolution, or interpreter flags can have outsized impact on user-visible startup time. Until now we had no automated way to detect these regressions before merge.
How it works
Three isolated slots are measured in the same CI job:
BCR (aspect_rules_py from the Bazel Central Registry, pinned to 1.11.5) — shown as a historical baseline only.
HEAD main — current main branch checked out side-by-side.
This PR — the PR merge commit.
Build isolation — each slot uses its own --output_base (/tmp/bazel-{bcr,main,pr}) and an explicit empty --disk_cache= so there is zero cross-slot cache contamination.
Measurement — hyperfine --warmup 5 --runs 50 runs a no-op py_binary (main.py is just pass). Wall-clock time captures launcher overhead + Python startup; no custom instrumentation is injected into the launcher, which keeps the measurement representative of real user binaries.
Comparison — compare.py reads the three hyperfine JSON outputs plus optional *-build.json files (cold bazel build time) and emits a Markdown table.
Regression gate — the only gate that can block a PR is PR vs HEAD main (threshold: 10 %). Comparing against BCR is intentionally not used as a gate because transitive dependency versions drift between releases, which would attribute upstream changes to this project.
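The comparison and gate described above can be sketched roughly as follows. This is an illustrative sketch, not the actual compare.py: the function names and table layout are hypothetical, and the only assumptions about input are that each file follows hyperfine's JSON export layout (a top-level "results" list whose entries carry a "mean" in seconds) and that the threshold is the 10 % stated above.

```python
import json

THRESHOLD_PCT = 10.0  # gate threshold stated in the PR description


def load_mean(hyperfine_json: str) -> float:
    """Return the mean wall-clock time (seconds) of the first benchmarked command."""
    return float(json.loads(hyperfine_json)["results"][0]["mean"])


def delta_pct(baseline: float, candidate: float) -> float:
    """Relative change of candidate vs baseline in percent (positive means slower)."""
    return (candidate - baseline) / baseline * 100.0


def is_regression(main_mean: float, pr_mean: float,
                  threshold: float = THRESHOLD_PCT) -> bool:
    """Only PR vs HEAD main can fail the check; the BCR slot is never consulted."""
    return delta_pct(main_mean, pr_mean) > threshold


def render_table(means: dict) -> str:
    """Render a Markdown table of slot means, with deltas relative to 'main'."""
    main = means["main"]
    lines = ["| Slot | Mean (ms) | vs main |", "| --- | ---: | ---: |"]
    for slot, mean in means.items():
        lines.append(f"| {slot} | {mean * 1000:.1f} | {delta_pct(main, mean):+.1f}% |")
    return "\n".join(lines)
```

A caller would load one mean per slot, print the table for the sticky comment, and exit non-zero only when is_regression(main, pr) is true.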
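Files added
benchmark/startup/MODULE.bazel.template: template for generating the benchmark workspace.
benchmark/startup/generate_module.py: script that produces MODULE.bazel for either BCR or local_path_override mode.
benchmark/startup/BUILD.bazel / main.py: minimal no-op py_binary target.
benchmark/startup/compare.py: parses hyperfine JSON, computes deltas, prints a Markdown table, gates on regression, and writes GITHUB_OUTPUT.
.github/workflows/startup-benchmark.yml: CI workflow (installs hyperfine, runs the three slots, posts the PR comment).
docs/startup-benchmark.md: design doc and local run instructions.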
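To illustrate the two modes of generate_module.py, here is a minimal sketch of how a MODULE.bazel could be rendered for each. It is not the script from this PR: the render_module function, the module name startup_benchmark, and the default local path are hypothetical, and the real template may contain more. bazel_dep and local_path_override are standard Bzlmod directives; the pinned version is the one named in the description.

```python
BCR_VERSION = "1.11.5"  # BCR pin named in the PR description


def render_module(mode: str, local_path: str = "../..") -> str:
    """Return MODULE.bazel text for the benchmark workspace.

    mode "bcr"   resolves aspect_rules_py from the Bazel Central Registry.
    mode "local" points the same dependency at a checkout on disk.
    """
    if mode not in ("bcr", "local"):
        raise ValueError(f"unknown mode: {mode!r}")
    lines = [
        'module(name = "startup_benchmark")',  # hypothetical module name
        "",
        f'bazel_dep(name = "aspect_rules_py", version = "{BCR_VERSION}")',
    ]
    if mode == "local":
        # The override replaces the registry lookup with a local checkout,
        # which is how the main and PR slots would pick up their own sources.
        lines += [
            "local_path_override(",
            '    module_name = "aspect_rules_py",',
            f'    path = "{local_path}",',
            ")",
        ]
    return "\n".join(lines) + "\n"
```

Under this sketch the BCR slot would call render_module("bcr"), while the main and PR slots would call render_module("local", path_to_checkout).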
Design decisions
No _BENCH_T0_NS injection: we intentionally avoided modifying the launcher to capture an internal timestamp. hyperfine wall-clock on a no-op binary is simpler, survives refactoring, and measures exactly what users feel.
PR vs main gating: BCR is displayed for context but never blocks, because BCR and main resolve different transitive dependency graphs (e.g. rules_python@1.0.0 vs @1.7.0), making BCR an invalid regression reference.
Isolated output bases: each slot gets its own --output_base, which guarantees that the build graph and action cache of one slot cannot influence another.
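As a sketch of what per-slot isolation and measurement look like on the command line, the helpers below assemble the argument lists for one slot. The flags (--output_base, --disk_cache=, --warmup, --runs, --export-json) are real Bazel and hyperfine options and the //:bench target comes from the benchmark comment; the helper names, the JSON file name, and the binary path are assumptions made for illustration, not taken from the workflow.

```python
SLOTS = ("bcr", "main", "pr")


def build_command(slot: str, target: str = "//:bench") -> list:
    """Cold build for one slot with its own output base and no disk cache."""
    if slot not in SLOTS:
        raise ValueError(f"unknown slot: {slot!r}")
    return [
        "bazel",
        f"--output_base=/tmp/bazel-{slot}",  # startup option: goes before the command
        "build",
        "--disk_cache=",  # explicitly empty: disables any configured disk cache
        target,
    ]


def measure_command(slot: str, binary: str = "bazel-bin/bench") -> list:
    """hyperfine invocation for the built no-op binary (binary path is assumed)."""
    if slot not in SLOTS:
        raise ValueError(f"unknown slot: {slot!r}")
    return [
        "hyperfine",
        "--warmup", "5",
        "--runs", "50",
        "--export-json", f"{slot}.json",  # hypothetical output file name
        binary,
    ]
```

Each list can be passed directly to subprocess.run(cmd, check=True) from inside the slot's generated workspace.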
Measured with hyperfine --warmup 5 --runs 50 on Linux. Gate: PR vs HEAD main (threshold: 10%). BCR is shown only as a historical baseline. Build time: cold bazel build //:bench with isolated output base, no disk cache.
xangcastle changed the title from "feat: performance benchmarks" to "feat: add CI-gated py_binary startup benchmark" on May 19, 2026
Should we make this GHA manual and not do it on 100% of PRs? I wonder if most of the time it will just be noise and cause confusion for people and we'll only want it when we are interested?