fix: forward OS shutdown signals to bazel subprocesses by gregmagolan · Pull Request #1076 · aspect-build/aspect-cli

gregmagolan · 2026-05-12T22:41:06Z

Problem

Two rare CI flakes, both triggered by the same precondition: a CI cancel reaches the bazel process at a moment bazel can't gracefully recover from. Listed by frequency, most common first.

Potential sandbox-state corruption (bazelbuild/bazel#23880). If the bazel server gets killed at the wrong moment during sandbox cleanup (e.g. SIGKILL'd by the CI runner during the grace window, or killed by aspect-cli's health-check probe), it can strand _moved_trash_dir and sandbox_stash in the sandbox base. Every subsequent invocation then crashes in afterCommand with:
```
java.lang.IllegalStateException: Found unexpected entries in sandbox base.
  The entries are: linux-sandbox, _moved_trash_dir, sandbox_stash
```
That state lives on the runner's persistent disk, so once a runner enters it, every job on that runner fails the same way until a human cleans up. Rare per invocation, but the worst symptom — the runner is dead until someone notices.
Potential orphaned bazel client. aspect-cli can die on the signal before it cleanly shuts down its bazel child, or stay alive briefly (blocked in a spawn_blocking Starlark handler that doesn't yield to async cancellation) while the child keeps running. In the unlucky case where the orphan still holds the JVM-server lock when the next aspect build / aspect test starts on that runner, the next invocation hangs at Running Bazel server needs to be killed. Rarer than (1) and usually self-clears as the orphan exits on its own, but a flake when it happens.

Both modes share the same upstream cause: pre-cancellation, aspect-cli has no way to nudge bazel onto its documented graceful-cancel path before the CI's SIGKILL deadline expires.

Fix

Three coordinated pieces. None of them change bazel; they all just make sure cancellation reaches bazel via the path bazel is designed to handle.

1. `bazel_live` — process-wide registry of in-flight bazel clients

crates/axl-runtime/src/engine/bazel/live.rs — a Mutex<Vec<u32>> plus a LiveBazelGuard (RAII) and a spawn_registered(cmd) -> (Child, Guard) helper. Every bazel Command::spawn() in the runtime goes through spawn_registered:

Call site	Why it must be reachable on cancel
`Build::spawn` (long-running `bazel build`/`test`)	The hang we're trying to prevent.
`info::server_info`, `info::client_pid`, `info::is_server_busy`, `info::server_pid_nonblocking`	All four can hang on a busy/wedged JVM server.
`mod.rs` `bazel info` (the Starlark `ctx.bazel.info()` API)	User-driven; can run during a hung server.
`query::Query::raw` (large `bazel query` runs)	Can take many seconds; would outlive an aborted aspect-cli without registration.
`health_check::check_bazel_server`, `health_check::get_output_base`	Health-check probes during pre-flight.

The guard .take()s out of wait() / try_wait() in Build (build.rs) to release the registration the moment the child is reaped, so a PID-reuse race can't have the shutdown handler signal an unrelated process. A unit test covers this exact RefCell<Option<Guard>>::take() path.

2. tokio SIGINT / SIGTERM handler in `aspect-cli/main.rs`

install_shutdown_handler runs at the top of run() — before any Starlark evaluation, before any bazel spawn. It uses tokio::signal::unix::signal so it intercepts the signal cooperatively (no default-action terminate) and runs an async shutdown sequence on the tokio runtime.

Required tokio features added to crates/aspect-cli/Cargo.toml: signal, time.

3. 3-SIGINT escalation that mirrors bazel's documented cancel state machine

Bazel's cancellation docs define an interactive-Ctrl-C state machine:

SIGINT (1st) → graceful cancel of the in-flight command
SIGINT (2nd) → still graceful; allows a brief cleanup window
SIGINT (3rd) → calls KillServerProcess and hard-exits the client

run_shutdown_sequence (main.rs:300) drives that protocol against every registered client:

SIGINT_TICK = 150ms  — short; just needs the signal delivered
SIGINT_GRACE = 5s    — matches FORCE_KILL_TIMEOUT_MS in engine/bazel/cancel.rs
POST_KILL_GRACE = 1s — kernel-settle margin after SIGKILL

3 × SIGINT (150ms apart) → 5s grace → SIGKILL survivors → 1s settle → exit(130|143)

Total wall-clock budget: ~6.5s. Fits comfortably inside GHA's ~7.5s SIGTERM-to-SIGKILL grace; ample margin against bazel's typical cancel time.

Exit code follows the shell convention (128 + signal_number): 130 for SIGINT, 143 for SIGTERM.

Why force-exit rather than unwind: the AXL drain loop runs on spawn_blocking and may be doing synchronous network I/O, Starlark eval, etc. There's no cooperative cancellation path; without an explicit process::exit a single hung handler could pin aspect-cli past the CI cancel deadline — the exact hang this PR exists to prevent.

Why this avoids the bazel #23880 corruption: the shutdown handler signals only registered clients, never the server. When the client exits (cleanly via SIGINT, or as a last resort via SIGKILL), the orphaned server detects the client's death and aborts the build through its normal in-process abort path — the same path bazel runs for any client-side cancel. The server stays alive (subject to RUNNER_TRACKING_ID="", which aspect-cli already sets on GHA) so its sandbox cleanup runs to completion under its own control. We never SIGKILL the server.

Relationship to `ctx.bazel.cancel_invocation()` (engine/bazel/cancel.rs)

That AXL-level cancellation API is unchanged by this PR. Files diff stat confirms cancel.rs is not touched. The two paths coexist:

	OS-signal path (this PR)	`cancel_invocation()` path (cancel.rs)
Trigger	CI cancel — SIGINT/SIGTERM to aspect-cli	AXL code calls `ctx.bazel.cancel_invocation()`
Target discovery	Live registry (in-process `Mutex<Vec<u32>>`)	`bazel --noblock_for_lock info` probe
Scope	All registered clients	The single client holding the server lock
Last-resort SIGKILL	Client only (server preserved)	Client and server
Terminal action	`process::exit(130\|143)`	Returns to AXL caller
Shared timeout	`SIGINT_GRACE = 5s` (matches `FORCE_KILL_TIMEOUT_MS`)

If both fire (CI cancels while AXL is mid cancel_invocation().wait()), the OS path wins: it sweeps the registry, escalates, and process::exits — terminating the AXL .wait() loop along with everything else. No deadlock, no double-kill problem; bazel just receives a flurry of SIGINTs from both senders and deduplicates them per its own state machine.

The one subtle interaction: cancel.rs's force_kill discovers the client PID via info::client_pid → which now goes through spawn_registered. If the OS handler is sweeping at exactly the moment that probe is running, the probe gets SIGINT'd, returns no PID, and cancel.rs falls through to the "client gone, SIGKILL server" branch. But the OS handler's process::exit is imminent at that point, so any work cancel.rs does in that window is moot.

Trade-offs

6.5s grace is fixed. Hard-coded to balance "long enough for bazel's documented graceful cancel" against "short enough that a hung handler can't pin a CI job."
Force-exit skips Drop / unwind. Anything that relies on Drop running during shutdown (e.g. async-flushed telemetry) needs to be set up to flush opportunistically before this point, not via destructors.
The shutdown handler runs only once per signal — repeated Ctrl-C presses don't accelerate the sequence. Intentional: bazel's state machine handles the 3-SIGINT cadence itself; piling on extra signals just makes diagnostics noisier.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c6f9e3a882

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

aspect-workflows · 2026-05-12T22:52:18Z

✨ Aspect Workflows Tasks

📅 Thu May 21 19:53:09 UTC 2026

✅ 16 successful tasks

✅ build (build-gha) · ⏱ 31.1s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
💬 Bazel build complete (161 built)
✅ build (build-gha-debug) · ⏱ 3m 51s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
💬 Bazel build complete (161 built)
✅ format (buildifier-gha) · ⏱ 13.9s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
💬 Format complete (clean)
✅ format (buildifier-gha-debug) · ⏱ 17.5s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
💬 Format complete (clean)
✅ delivery (delivery-gha) · ⏱ 54.8s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
💬 Delivery complete (1 delivered)
✅ delivery (delivery-gha-debug) · ⏱ 5m 9s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
💬 Delivery complete (1 delivered)
✅ format (format-gha) · ⏱ 19s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
💬 Format complete (clean)
✅ format (format-gha-debug) · ⏱ 22.9s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
💬 Format complete (clean)
✅ gazelle (gazelle-from-source-gha) · ⏱ 17.8s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
💬 Gazelle complete (clean)
✅ gazelle (gazelle-from-source-gha-debug) · ⏱ 22s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
💬 Gazelle complete (clean)
✅ gazelle (gazelle-gha) · ⏱ 14.3s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
💬 Gazelle complete (clean)
✅ gazelle (gazelle-gha-debug) · ⏱ 14.3s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
💬 Gazelle complete (clean)
✅ lint (lint-gha) · ⏱ 17s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
💬 Lint complete (clean)
✅ lint (lint-gha-debug) · ⏱ 18.6s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
💬 Lint complete (clean)
✅ test (test-gha) · ⏱ 30.9s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
💬 Bazel test complete (25/25 passed · 25 cached)
✅ test (test-gha-debug) · ⏱ 3m 50s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
💬 Bazel test complete (25/25 passed · 25 cached)

_{⏱ Last updated Thu May 21 19:59:35 UTC 2026 · 📊 GitHub API quota 3,290/15,000 (22% used, resets in 22m)}
_{🚀 Powered by Aspect CLI | Aspect Build · X · LinkedIn · YouTube}

…en Build is GC'd The live-bazel registry's guard was stored on the Starlark `Build` struct, so the PID was only unregistered when the Build object was garbage-collected — an arbitrary delay after `wait()` / `try_wait()` returned. If the OS reused the PID in that window (between child reap and Build GC), a CI cancel arriving during that window would SIGINT / SIGKILL an unrelated, freshly-spawned process. Wrap the guard in `RefCell<Option<LiveBazelGuard>>` and `.take()` it the moment the child is observed exited: - `wait()`: take immediately after `child.wait()` returns, before draining BES / execlog / sink threads (those can take a while). - `try_wait()`: take only when `try_wait()` returns `Some(_)`. A `None` return means the child is still alive, so the registration must stay. If neither is called and the Build is dropped, `Option::drop` still releases the guard — the old cleanup path is preserved as a fallback. Add a regression test in `live.rs` for the `RefCell<Option<…>>` + `.take()` pattern Build relies on. Spotted by Codex review on #1076. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9255621f4d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Per tokio's docs, dropping a `Signal` stream does NOT uninstall the OS-level signal handler that `signal()` registered. Returning early after SIGINT was successfully installed (but SIGTERM install errored) therefore leaves SIGINT swallowed by tokio with no listener: Ctrl-C appears to do nothing, and aspect-cli has to be SIGKILL'd externally — exactly the unkillable hang this module exists to prevent. Continue with SIGINT-only when SIGTERM install fails instead. The common path is unchanged; only the rare double-failure (SIGINT installs, SIGTERM doesn't) is fixed. Caught in PR review by Codex. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…akes Match the PR description's framing: list the two failure modes this handler prevents as "potential" flakes (sandbox-state corruption first since it's less rare; orphaned client second), both rare per invocation but with bad blast radius on a warm runner. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gregmagolan force-pushed the cli_shutdown_fixes branch from 828627e to c6f9e3a Compare May 12, 2026 22:46

gregmagolan marked this pull request as ready for review May 12, 2026 22:48

gregmagolan requested a review from thesayyn May 12, 2026 22:48

chatgpt-codex-connector Bot reviewed May 12, 2026

View reviewed changes

Comment thread crates/axl-runtime/src/engine/bazel/build.rs Outdated

gregmagolan force-pushed the cli_shutdown_fixes branch from c6f9e3a to e8c5296 Compare May 13, 2026 01:45

gregmagolan force-pushed the cli_shutdown_fixes branch from 206c4a8 to a60f2ed Compare May 15, 2026 01:29

gregmagolan force-pushed the cli_shutdown_fixes branch from a60f2ed to 456a63b Compare May 15, 2026 06:44

gregmagolan marked this pull request as draft May 20, 2026 01:37

fix: forward OS shutdown signals to bazel subprocesses

9255621

gregmagolan force-pushed the cli_shutdown_fixes branch from 456a63b to 9255621 Compare May 21, 2026 08:53

gregmagolan marked this pull request as ready for review May 21, 2026 19:16

chatgpt-codex-connector Bot reviewed May 21, 2026

View reviewed changes

Comment thread crates/aspect-cli/src/main.rs Outdated

gregmagolan and others added 2 commits May 21, 2026 12:41

gregmagolan force-pushed the cli_shutdown_fixes branch from cf10290 to 75671fe Compare May 21, 2026 19:50

gregmagolan merged commit 4843960 into main May 21, 2026
45 checks passed

gregmagolan deleted the cli_shutdown_fixes branch May 21, 2026 20:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: forward OS shutdown signals to bazel subprocesses#1076

fix: forward OS shutdown signals to bazel subprocesses#1076
gregmagolan merged 3 commits into
mainfrom
cli_shutdown_fixes

gregmagolan commented May 12, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

aspect-workflows Bot commented May 12, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

gregmagolan commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

1. bazel_live — process-wide registry of in-flight bazel clients

2. tokio SIGINT / SIGTERM handler in aspect-cli/main.rs

3. 3-SIGINT escalation that mirrors bazel's documented cancel state machine

Relationship to ctx.bazel.cancel_invocation() (engine/bazel/cancel.rs)

Trade-offs

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

aspect-workflows Bot commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✨ Aspect Workflows Tasks

✅ 16 successful tasks

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

gregmagolan commented May 12, 2026 •

edited

Loading

1. `bazel_live` — process-wide registry of in-flight bazel clients

2. tokio SIGINT / SIGTERM handler in `aspect-cli/main.rs`

Relationship to `ctx.bazel.cancel_invocation()` (engine/bazel/cancel.rs)

aspect-workflows Bot commented May 12, 2026 •

edited

Loading