ci: split build job so post-build chain overlaps test+lint#40233
ci: split build job so post-build chain overlaps test+lint#40233bryangingechen wants to merge 1 commit into
Conversation
PR summary 3b6eb1a35cImport changes for modified filesNo significant changes to the import graph Import changes for all files
|
Peel `test` and `lint` out of the self-hosted `build` job into a new self-hosted `test_lint` job, so the GitHub-hosted post-build chain (`upload_cache` -> `post_steps`) no longer waits behind the ~300s test+lint tail of `build`. Why --- `build_template.yml` is a DAG: build -> upload_cache -> post_steps -> final (plus lint_style in parallel). `upload_cache`/`post_steps` `needs: [build]`, so they cannot start until the *entire* build job finishes -- including test+lint, which they do not depend on (they only need the staged oleans, which `build` uploads as the `cache-staging` artifact ~300s before it runs `test`). With the recent ~2 min "dump declarations" step added to `post_steps`, that serial GitHub-hosted chain is now ~349s sitting back-to-back with the ~308s test+lint tail, though the two are independent. Splitting `build` at the staging boundary lets `test_lint` and the post-build chain run in parallel. On the representative warm fork run this cuts the typical end-to-end contributor wait from ~790s to ~520-580s. Because the overlapped chain is on free GitHub-hosted runners, the only self-hosted cost is one re-paid setup+cache-fetch prefix (the fresh `test_lint` runner), and per-run PEAK pool concurrency is unchanged (build and test_lint run sequentially on the pool). See the analysis in the companion mathlib4_ci_parallel_simulation study. What changed ------------ - New `test_lint` job (self-hosted, `needs: [build, upload_cache]`): re-does the setup prefix, fetches this run's oleans with `cache get` + verifies the Mathlib cache is complete (mirrors `post_steps`), then runs test, lint, and the nightly-testing comment. Its PR-branch checkout is shallow (fetch-depth: 1): unlike `build` it does no parent-commit cache warmup, so it only needs HEAD. - `build` job: `test`/`lint` removed; `noisy` moved up to right after the Counterexamples build (it only needs the just-built oleans), so it stays in `build` and `test_lint` need not fetch Archive/Counterexamples. `test`/`lint` outputs move to `test_lint`. - `final`: now also `needs: test_lint` and requires its success. Reporting behavior is preserved ------------------------------- The split reproduces the monolith's per-step gating via the build job's outputs, so CI reporting is unchanged: - `test_lint` runs when `build` SUCCEEDED *or* FAILED (not cancelled), so `lint` still reports (best-effort, over whatever oleans the cache has) even on a failed build -- as it did under the monolith's `if: always()`. - `test` runs only on a clean build (build + mk_all + archive + counterexamples all success), exactly as before (it had no `if:` and was skipped otherwise). - The cache-completeness check only hard-fails on a successful build; on a failed build it is skipped so it does not block lint reporting. - `final` still requires `build` success, so a failed build fails CI regardless of the best-effort lint -- unchanged. - Minor: on a test/lint failure `post_steps` now still runs (build succeeded), so the decls-diff / import-graph artifacts are produced; `final` is still gated on `test_lint` success, so overall CI status is unaffected. To validate in CI ----------------- - Cache handoff: `test_lint` depends on `upload_cache` and re-fetches from Azure (the proven `post_steps` pattern). A follow-up could have it consume the `cache-staging` artifact directly (`cache unstage` + `unpack`) to drop the ~61s `upload_cache` wait and capture the rest of the latency win; that path needs validation before relying on it. - The duplicated setup prefix between `build` and `test_lint` is a candidate for extraction into a shared composite action in a follow-up. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
a8267bb to
3b6eb1a
Compare
|
Three test runs:
(Comment prepared with Claude code) |
Peel
testandlintout of the self-hostedbuildjob into a new self-hostedtest_lintjob, so the GitHub-hosted post-build chain (upload_cache->post_steps) no longer waits behind the ~300s test+lint tail ofbuild.test_lintjob (self-hosted,needs: [build, upload_cache]): re-does the setup prefix, fetches this run's oleans withcache get+ verifies the Mathlib cache is complete (mirrorspost_steps), then runs test, lint, and the nightly-testing comment. Its PR-branch checkout is shallow (fetch-depth: 1): unlikebuildit does no parent-commit cache warmup, so it only needs HEAD.buildjob:test/lintremoved;noisymoved up to right after the Counterexamples build (it only needs the just-built oleans), so it stays inbuildandtest_lintneed not fetch Archive/Counterexamples.test/lintoutputs move totest_lint.final: now alsoneeds: test_lintand requires its success.Prepared with Claude code
See also this (LLM-generated) report for more details on how a split like this could improve build times.