
feat: add CI-gated py_binary startup benchmark #1029

Open: xangcastle wants to merge 1 commit into main from xangcastle/performance-benchmarks

xangcastle (Member) commented May 18, 2026

Summary

Introduces a reproducible, CI-gated benchmark that measures py_binary cold-start (launcher + Python interpreter) overhead. The workflow runs on every PR that touches py/private/** or benchmark/startup/**, compares the PR against HEAD main, and posts a sticky comment with results. If the PR regresses startup time by more than 10 % vs main, the check fails.

Motivation

rules_py performance is heavily dependent on the Bash launcher emitted by py_binary. Small changes to environment-setup logic, runfiles resolution, or interpreter flags can have outsized impact on user-visible startup time. Until now we had no automated way to detect these regressions before merge.

How it works

  1. Three isolated slots are measured in the same CI job:

    • BCR (aspect_rules_py from the Bazel Central Registry, pinned to 1.11.5) — shown as a historical baseline only.
    • HEAD main — current main branch checked out side-by-side.
    • This PR — the PR merge commit.
  2. Build isolation — each slot uses its own --output_base (/tmp/bazel-{bcr,main,pr}) and an explicit empty --disk_cache= so there is zero cross-slot cache contamination.

  3. Measurement: hyperfine --warmup 5 --runs 50 runs a no-op py_binary (main.py is just pass). Wall-clock time captures launcher overhead + Python startup; no custom instrumentation is injected into the launcher, which keeps the measurement representative of real user binaries.

  4. Comparison: compare.py reads the three hyperfine JSON outputs plus optional *-build.json files (cold bazel build time) and emits a Markdown table.

  5. Regression gate — the only gate that can block a PR is PR vs HEAD main (threshold: 10 %). Comparing against BCR is intentionally not used as a gate because transitive dependency versions drift between releases, which would attribute upstream changes to this project.
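
The comparison and gate in steps 4 and 5 can be sketched roughly as follows. This is a minimal illustration, not the PR's actual compare.py: the function names (`mean_ms`, `percent_delta`, `is_regression`) are hypothetical, and the only thing assumed about the input is hyperfine's `--export-json` layout, where `results[0].mean` is reported in seconds.

```python
import json

THRESHOLD_PERCENT = 10.0  # gate threshold stated in the PR description


def mean_ms(hyperfine_json: str) -> float:
    """Return the mean wall-clock time in ms from hyperfine --export-json text."""
    data = json.loads(hyperfine_json)
    # hyperfine reports times in seconds, one entry per benchmarked command.
    return data["results"][0]["mean"] * 1000.0


def percent_delta(baseline_ms: float, candidate_ms: float) -> float:
    """Relative change of candidate vs baseline in percent (positive = slower)."""
    return (candidate_ms - baseline_ms) / baseline_ms * 100.0


def is_regression(main_ms: float, pr_ms: float,
                  threshold: float = THRESHOLD_PERCENT) -> bool:
    """Gate on PR vs HEAD main only; the BCR slot is never consulted here."""
    return percent_delta(main_ms, pr_ms) > threshold


if __name__ == "__main__":
    # Hypothetical inputs shaped like hyperfine's JSON export.
    main_doc = json.dumps({"results": [{"mean": 0.128797}]})
    pr_doc = json.dumps({"results": [{"mean": 0.128938}]})
    main, pr = mean_ms(main_doc), mean_ms(pr_doc)
    print(f"PR vs main: {percent_delta(main, pr):+.1f}%")
    print("regression" if is_regression(main, pr) else "ok")
```

Keeping the gate as a pure function of two numbers makes it easy to unit-test the threshold separately from the JSON parsing.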

Files added

  • benchmark/startup/MODULE.bazel.template — template for generating the benchmark workspace.
  • benchmark/startup/generate_module.py — script that produces MODULE.bazel for either BCR or local_path_override mode.
  • benchmark/startup/BUILD.bazel / main.py — minimal no-op py_binary target.
  • benchmark/startup/compare.py — parses hyperfine JSON, computes deltas, prints Markdown table, gates on regression, and writes GITHUB_OUTPUT.
  • .github/workflows/startup-benchmark.yml — CI workflow (installs hyperfine, runs the three slots, posts PR comment).
  • docs/startup-benchmark.md — design doc and local run instructions.
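
To make the two generate_module.py modes concrete, here is a rough sketch of how such a generator might render MODULE.bazel. It is an assumption-laden illustration, not the script from this PR: `render_module` and the module name `startup_benchmark` are hypothetical, and the real template may declare more dependencies. Only `bazel_dep` and `local_path_override` are actual Bzlmod directives.

```python
def render_module(mode: str, version: str = "", path: str = "") -> str:
    """Return MODULE.bazel text for the benchmark workspace (hypothetical layout)."""
    lines = ['module(name = "startup_benchmark")', ""]
    if mode == "bcr":
        # Resolve aspect_rules_py from the Bazel Central Registry at a pinned version.
        lines.append(f'bazel_dep(name = "aspect_rules_py", version = "{version}")')
    elif mode == "local":
        # Same dependency, but redirected to a checkout on disk.
        lines.append('bazel_dep(name = "aspect_rules_py")')
        lines.append("local_path_override(")
        lines.append('    module_name = "aspect_rules_py",')
        lines.append(f'    path = "{path}",')
        lines.append(")")
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_module("bcr", version="1.11.5"))
    print(render_module("local", path="../.."))
```

Generating the file per slot keeps the benchmark target identical across slots, so only the source of aspect_rules_py differs.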

Running locally

cd benchmark/startup

# BCR baseline
python3 generate_module.py bcr --version 1.11.5
bazel build //:bench
BIN=$(bazel cquery //:bench --output=starlark --starlark:expr='target.files_to_run.executable.path' | tail -n1)
hyperfine --warmup 5 --runs 50 --export-json /tmp/bcr.json "$BIN"

# Local checkout (current tree)
python3 generate_module.py local --path ../..
bazel build //:bench
BIN=$(bazel cquery //:bench --output=starlark --starlark:expr='target.files_to_run.executable.path' | tail -n1)
hyperfine --warmup 5 --runs 50 --export-json /tmp/local.json "$BIN"

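# compare.py takes three result files; with a single local build, the same file is passed for the last two
python3 compare.py /tmp/bcr.json /tmp/local.json /tmp/local.json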

Design decisions

  • No _BENCH_T0_NS injection: we intentionally avoided modifying the launcher to capture an internal timestamp. hyperfine wall-clock on a no-op binary is simpler, survives refactoring, and measures exactly what users feel.
  • PR vs main gating: BCR is displayed for context but never blocks, because BCR and main resolve different transitive dependency graphs (e.g. rules_python@1.0.0 vs @1.7.0), making BCR an invalid regression reference.
  • Isolated output bases: guarantees that the build graph and action cache of one slot cannot influence another.
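
The isolation described above can be illustrated with a small helper that assembles the per-slot build command. This is a sketch under assumptions: `bazel_build_cmd` is a hypothetical name and the actual workflow may pass additional flags. The paths and the empty `--disk_cache=` follow the PR description.

```python
SLOTS = ("bcr", "main", "pr")


def bazel_build_cmd(slot: str, target: str = "//:bench") -> list[str]:
    """Build the argv for one slot's isolated cold build."""
    if slot not in SLOTS:
        raise ValueError(f"unknown slot: {slot!r}")
    return [
        "bazel",
        # Startup option: must come before the command name.
        f"--output_base=/tmp/bazel-{slot}",
        "build",
        # An explicitly empty value turns the disk cache off for this build.
        "--disk_cache=",
        target,
    ]


if __name__ == "__main__":
    for slot in SLOTS:
        print(" ".join(bazel_build_cmd(slot)))
```

The returned list could be handed to `subprocess.run(..., check=True)` from the corresponding workspace directory; because each slot has its own output base, each one also gets its own Bazel server and action cache.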

xangcastle force-pushed the xangcastle/performance-benchmarks branch from b85404a to da35f85 on May 18, 2026 14:26
aspect-workflows Bot commented May 18, 2026

Bazel 8 (Test)

All tests were cache hits

181 tests (100.0%) were fully cached saving 1m 5s.


Bazel 9 (Test)

All tests were cache hits

180 tests (100.0%) were fully cached saving 1m 7s.


Bazel 8 (Test)

e2e

All tests were cache hits

113 tests (100.0%) were fully cached saving 52s.


Bazel 9 (Test)

e2e

All tests were cache hits

107 tests (100.0%) were fully cached saving 58s.


Bazel 8 (Test)

examples/uv_pip_compile

All tests were cache hits

1 test (100.0%) was fully cached saving 444ms.



xangcastle force-pushed the xangcastle/performance-benchmarks branch 4 times, most recently from 8e060e5 to 5b6543e on May 18, 2026 16:45
github-actions Bot commented May 18, 2026

py_binary startup benchmark

| Version | Mean (ms) | Median (ms) | ± stddev | vs BCR | vs main | Build (s) |
|---|---|---|---|---|---|---|
| BCR 1.11.5 (baseline) | 306.746 | 305.620 | ±6.201 | | | 41.09 |
| HEAD main | 128.797 | 127.705 | ±3.767 | -58.0% | | 15.02 |
| This PR | 128.938 | 128.446 | ±1.667 | -58.0% | +0.1% | 11.74 |

Measured with hyperfine --warmup 5 --runs 50 on Linux
Gate: PR vs HEAD main (threshold: 10%). BCR is shown only as a historical baseline.
Build time: cold bazel build //:bench with isolated output base, no disk cache.

xangcastle changed the title from "feat: performance benchmarks" to "feat: add CI-gated py_binary startup benchmark" on May 19, 2026
xangcastle marked this pull request as ready for review on May 19, 2026 16:11

steps:
  - name: Checkout PR
    uses: actions/checkout@v4
Member:

isn't there a more modern version?

Member Author:

I don't think so

Member:

I see actions/checkout@v6 all over the place...

Comment thread benchmark/startup/MODULE.bazel Outdated
xangcastle force-pushed the xangcastle/performance-benchmarks branch from 5b6543e to 06429c3 on May 20, 2026 15:50
xangcastle force-pushed the xangcastle/performance-benchmarks branch from 06429c3 to fe38de1 on May 20, 2026 16:59
github-actions Bot commented

py_binary startup benchmark

| Version | Mean (ms) | Median (ms) | ± stddev | vs BCR | vs main | Build (s) |
|---|---|---|---|---|---|---|
| BCR 1.11.5 (baseline) | 330.399 | 328.437 | ±8.244 | | | 38.70 |
| HEAD main | 138.087 | 137.869 | ±2.121 | -58.2% | | 13.62 |
| This PR | 138.597 | 138.476 | ±2.880 | -58.1% | +0.4% | 11.20 |

Measured with hyperfine --warmup 5 --runs 50 on Linux
Gate: PR vs HEAD main (threshold: 10%). BCR is shown only as a historical baseline.
Build time: cold bazel build //:bench with isolated output base, no disk cache.
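
As a worked check on the two benchmark comments in this thread, the snippet below recomputes the PR vs main delta for each run and the shift of every slot between runs. The means are copied from the tables; the names `first_run`, `second_run`, and `pct` are only illustrative.

```python
# Mean startup times in ms, copied from the two benchmark comments.
first_run = {"bcr": 306.746, "main": 128.797, "pr": 128.938}
second_run = {"bcr": 330.399, "main": 138.087, "pr": 138.597}


def pct(baseline: float, value: float) -> float:
    """Relative change of value vs baseline, in percent."""
    return (value - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Within one job: the number the gate actually looks at.
    for name, run in (("first", first_run), ("second", second_run)):
        print(f"{name} run, PR vs main: {pct(run['main'], run['pr']):+.1f}%")

    # Across jobs: how far each slot moved between the two comments.
    for slot in first_run:
        print(f"{slot} between runs: {pct(first_run[slot], second_run[slot]):+.1f}%")
```

All three slots, including the pinned BCR version, move by roughly 7 to 8 percent between the two runs, while PR vs main stays under half a percent within each run. One plausible reading, offered as an inference rather than something the thread states, is that absolute timings vary with the CI runner, which is why the gate compares slots measured in the same job.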

jbedard (Member) commented May 20, 2026

Should we make this GHA manual and not do it on 100% of PRs? I wonder if most of the time it will just be noise and cause confusion for people and we'll only want it when we are interested?
