fix: forward OS shutdown signals to bazel subprocesses#1076
Conversation
828627e to
c6f9e3a
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c6f9e3a882
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
✨ Aspect Workflows Tasks📅 Thu May 21 19:53:09 UTC 2026 ✅ 16 successful tasks
⏱ Last updated Thu May 21 19:59:35 UTC 2026 · 📊 GitHub API quota 3,290/15,000 (22% used, resets in 22m) |
c6f9e3a to
e8c5296
Compare
…en Build is GC'd The live-bazel registry's guard was stored on the Starlark `Build` struct, so the PID was only unregistered when the Build object was garbage-collected — an arbitrary delay after `wait()` / `try_wait()` returned. If the OS reused the PID in that window (between child reap and Build GC), a CI cancel arriving during that window would SIGINT / SIGKILL an unrelated, freshly-spawned process. Wrap the guard in `RefCell<Option<LiveBazelGuard>>` and `.take()` it the moment the child is observed exited: - `wait()`: take immediately after `child.wait()` returns, before draining BES / execlog / sink threads (those can take a while). - `try_wait()`: take only when `try_wait()` returns `Some(_)`. A `None` return means the child is still alive, so the registration must stay. If neither is called and the Build is dropped, `Option::drop` still releases the guard — the old cleanup path is preserved as a fallback. Add a regression test in `live.rs` for the `RefCell<Option<…>>` + `.take()` pattern Build relies on. Spotted by Codex review on #1076. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
206c4a8 to
a60f2ed
Compare
…en Build is GC'd The live-bazel registry's guard was stored on the Starlark `Build` struct, so the PID was only unregistered when the Build object was garbage-collected — an arbitrary delay after `wait()` / `try_wait()` returned. If the OS reused the PID in that window (between child reap and Build GC), a CI cancel arriving during that window would SIGINT / SIGKILL an unrelated, freshly-spawned process. Wrap the guard in `RefCell<Option<LiveBazelGuard>>` and `.take()` it the moment the child is observed exited: - `wait()`: take immediately after `child.wait()` returns, before draining BES / execlog / sink threads (those can take a while). - `try_wait()`: take only when `try_wait()` returns `Some(_)`. A `None` return means the child is still alive, so the registration must stay. If neither is called and the Build is dropped, `Option::drop` still releases the guard — the old cleanup path is preserved as a fallback. Add a regression test in `live.rs` for the `RefCell<Option<…>>` + `.take()` pattern Build relies on. Spotted by Codex review on #1076. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
a60f2ed to
456a63b
Compare
456a63b to
9255621
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9255621f4d
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Per tokio's docs, dropping a `Signal` stream does NOT uninstall the OS-level signal handler that `signal()` registered. Returning early after SIGINT was successfully installed (but SIGTERM install errored) therefore leaves SIGINT swallowed by tokio with no listener: Ctrl-C appears to do nothing, and aspect-cli has to be SIGKILL'd externally — exactly the unkillable hang this module exists to prevent. Continue with SIGINT-only when SIGTERM install fails instead. The common path is unchanged; only the rare double-failure (SIGINT installs, SIGTERM doesn't) is fixed. Caught in PR review by Codex. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…akes Match the PR description's framing: list the two failure modes this handler prevents as "potential" flakes (sandbox-state corruption first since it's less rare; orphaned client second), both rare per invocation but with bad blast radius on a warm runner. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cf10290 to
75671fe
Compare
Problem
Two rare CI flakes, both triggered by the same precondition: a CI cancel reaches the bazel process at a moment bazel can't gracefully recover from. Listed by frequency, most common first.
Potential sandbox-state corruption (bazelbuild/bazel#23880). If the bazel server gets killed at the wrong moment during sandbox cleanup (e.g. SIGKILL'd by the CI runner during the grace window, or killed by aspect-cli's health-check probe), it can strand
_moved_trash_dirandsandbox_stashin the sandbox base. Every subsequent invocation then crashes inafterCommandwith:That state lives on the runner's persistent disk, so once a runner enters it, every job on that runner fails the same way until a human cleans up. Rare per invocation, but the worst symptom — the runner is dead until someone notices.
Potential orphaned bazel client. aspect-cli can die on the signal before it cleanly shuts down its bazel child, or stay alive briefly (blocked in a
spawn_blockingStarlark handler that doesn't yield to async cancellation) while the child keeps running. In the unlucky case where the orphan still holds the JVM-server lock when the nextaspect build/aspect teststarts on that runner, the next invocation hangs atRunning Bazel server needs to be killed. Rarer than (1) and usually self-clears as the orphan exits on its own, but a flake when it happens.Both modes share the same upstream cause: pre-cancellation, aspect-cli has no way to nudge bazel onto its documented graceful-cancel path before the CI's SIGKILL deadline expires.
Fix
Three coordinated pieces. None of them change
bazel; they all just make sure cancellation reaches bazel via the path bazel is designed to handle.1.
bazel_live— process-wide registry of in-flight bazel clientscrates/axl-runtime/src/engine/bazel/live.rs— aMutex<Vec<u32>>plus aLiveBazelGuard(RAII) and aspawn_registered(cmd) -> (Child, Guard)helper. Every bazelCommand::spawn()in the runtime goes throughspawn_registered:Build::spawn(long-runningbazel build/test)info::server_info,info::client_pid,info::is_server_busy,info::server_pid_nonblockingmod.rsbazel info(the Starlarkctx.bazel.info()API)query::Query::raw(largebazel queryruns)health_check::check_bazel_server,health_check::get_output_baseThe guard
.take()s out ofwait()/try_wait()inBuild(build.rs) to release the registration the moment the child is reaped, so a PID-reuse race can't have the shutdown handler signal an unrelated process. A unit test covers this exactRefCell<Option<Guard>>::take()path.2. tokio SIGINT / SIGTERM handler in
aspect-cli/main.rsinstall_shutdown_handlerruns at the top ofrun()— before any Starlark evaluation, before any bazel spawn. It usestokio::signal::unix::signalso it intercepts the signal cooperatively (no default-action terminate) and runs an async shutdown sequence on the tokio runtime.Required tokio features added to
crates/aspect-cli/Cargo.toml:signal,time.3. 3-SIGINT escalation that mirrors bazel's documented cancel state machine
Bazel's cancellation docs define an interactive-Ctrl-C state machine:
run_shutdown_sequence(main.rs:300) drives that protocol against every registered client:Total wall-clock budget: ~6.5s. Fits comfortably inside GHA's ~7.5s SIGTERM-to-SIGKILL grace; ample margin against bazel's typical cancel time.
Exit code follows the shell convention (
128 + signal_number):130for SIGINT,143for SIGTERM.Why force-exit rather than unwind: the AXL drain loop runs on
spawn_blockingand may be doing synchronous network I/O, Starlark eval, etc. There's no cooperative cancellation path; without an explicitprocess::exita single hung handler could pin aspect-cli past the CI cancel deadline — the exact hang this PR exists to prevent.Why this avoids the bazel #23880 corruption: the shutdown handler signals only registered clients, never the server. When the client exits (cleanly via SIGINT, or as a last resort via SIGKILL), the orphaned server detects the client's death and aborts the build through its normal in-process abort path — the same path bazel runs for any client-side cancel. The server stays alive (subject to
RUNNER_TRACKING_ID="", which aspect-cli already sets on GHA) so its sandbox cleanup runs to completion under its own control. We never SIGKILL the server.Relationship to
ctx.bazel.cancel_invocation()(engine/bazel/cancel.rs)That AXL-level cancellation API is unchanged by this PR. Files diff stat confirms
cancel.rsis not touched. The two paths coexist:cancel_invocation()path (cancel.rs)ctx.bazel.cancel_invocation()Mutex<Vec<u32>>)bazel --noblock_for_lock infoprobeprocess::exit(130|143)SIGINT_GRACE = 5s(matchesFORCE_KILL_TIMEOUT_MS)If both fire (CI cancels while AXL is mid
cancel_invocation().wait()), the OS path wins: it sweeps the registry, escalates, andprocess::exits — terminating the AXL.wait()loop along with everything else. No deadlock, no double-kill problem; bazel just receives a flurry of SIGINTs from both senders and deduplicates them per its own state machine.The one subtle interaction: cancel.rs's
force_killdiscovers the client PID viainfo::client_pid→ which now goes throughspawn_registered. If the OS handler is sweeping at exactly the moment that probe is running, the probe gets SIGINT'd, returns no PID, and cancel.rs falls through to the "client gone, SIGKILL server" branch. But the OS handler'sprocess::exitis imminent at that point, so any work cancel.rs does in that window is moot.Trade-offs