Skip to content

fix: forward OS shutdown signals to bazel subprocesses#1076

Merged
gregmagolan merged 3 commits into
mainfrom
cli_shutdown_fixes
May 21, 2026
Merged

fix: forward OS shutdown signals to bazel subprocesses#1076
gregmagolan merged 3 commits into
mainfrom
cli_shutdown_fixes

Conversation

@gregmagolan
Copy link
Copy Markdown
Member

@gregmagolan gregmagolan commented May 12, 2026

Problem

Two rare CI flakes, both triggered by the same precondition: a CI cancel reaches the bazel process at a moment bazel can't gracefully recover from. Listed by frequency, most common first.

  1. Potential sandbox-state corruption (bazelbuild/bazel#23880). If the bazel server gets killed at the wrong moment during sandbox cleanup (e.g. SIGKILL'd by the CI runner during the grace window, or killed by aspect-cli's health-check probe), it can strand _moved_trash_dir and sandbox_stash in the sandbox base. Every subsequent invocation then crashes in afterCommand with:

    java.lang.IllegalStateException: Found unexpected entries in sandbox base.
      The entries are: linux-sandbox, _moved_trash_dir, sandbox_stash
    

    That state lives on the runner's persistent disk, so once a runner enters it, every job on that runner fails the same way until a human cleans up. Rare per invocation, but the worst symptom — the runner is dead until someone notices.

  2. Potential orphaned bazel client. aspect-cli can die on the signal before it cleanly shuts down its bazel child, or stay alive briefly (blocked in a spawn_blocking Starlark handler that doesn't yield to async cancellation) while the child keeps running. In the unlucky case where the orphan still holds the JVM-server lock when the next aspect build / aspect test starts on that runner, the next invocation hangs at Running Bazel server needs to be killed. Rarer than (1) and usually self-clears as the orphan exits on its own, but a flake when it happens.

Both modes share the same upstream cause: pre-cancellation, aspect-cli has no way to nudge bazel onto its documented graceful-cancel path before the CI's SIGKILL deadline expires.

Fix

Three coordinated pieces. None of them change bazel; they all just make sure cancellation reaches bazel via the path bazel is designed to handle.

1. bazel_live — process-wide registry of in-flight bazel clients

crates/axl-runtime/src/engine/bazel/live.rs — a Mutex<Vec<u32>> plus a LiveBazelGuard (RAII) and a spawn_registered(cmd) -> (Child, Guard) helper. Every bazel Command::spawn() in the runtime goes through spawn_registered:

Call site Why it must be reachable on cancel
Build::spawn (long-running bazel build/test) The hang we're trying to prevent.
info::server_info, info::client_pid, info::is_server_busy, info::server_pid_nonblocking All four can hang on a busy/wedged JVM server.
mod.rs bazel info (the Starlark ctx.bazel.info() API) User-driven; can run during a hung server.
query::Query::raw (large bazel query runs) Can take many seconds; would outlive an aborted aspect-cli without registration.
health_check::check_bazel_server, health_check::get_output_base Health-check probes during pre-flight.

The guard .take()s out of wait() / try_wait() in Build (build.rs) to release the registration the moment the child is reaped, so a PID-reuse race can't have the shutdown handler signal an unrelated process. A unit test covers this exact RefCell<Option<Guard>>::take() path.

2. tokio SIGINT / SIGTERM handler in aspect-cli/main.rs

install_shutdown_handler runs at the top of run() — before any Starlark evaluation, before any bazel spawn. It uses tokio::signal::unix::signal so it intercepts the signal cooperatively (no default-action terminate) and runs an async shutdown sequence on the tokio runtime.

Required tokio features added to crates/aspect-cli/Cargo.toml: signal, time.

3. 3-SIGINT escalation that mirrors bazel's documented cancel state machine

Bazel's cancellation docs define an interactive-Ctrl-C state machine:

SIGINT (1st) → graceful cancel of the in-flight command
SIGINT (2nd) → still graceful; allows a brief cleanup window
SIGINT (3rd) → calls KillServerProcess and hard-exits the client

run_shutdown_sequence (main.rs:300) drives that protocol against every registered client:

SIGINT_TICK = 150ms  — short; just needs the signal delivered
SIGINT_GRACE = 5s    — matches FORCE_KILL_TIMEOUT_MS in engine/bazel/cancel.rs
POST_KILL_GRACE = 1s — kernel-settle margin after SIGKILL

3 × SIGINT (150ms apart) → 5s grace → SIGKILL survivors → 1s settle → exit(130|143)

Total wall-clock budget: ~6.5s. Fits comfortably inside GHA's ~7.5s SIGTERM-to-SIGKILL grace; ample margin against bazel's typical cancel time.

Exit code follows the shell convention (128 + signal_number): 130 for SIGINT, 143 for SIGTERM.

Why force-exit rather than unwind: the AXL drain loop runs on spawn_blocking and may be doing synchronous network I/O, Starlark eval, etc. There's no cooperative cancellation path; without an explicit process::exit a single hung handler could pin aspect-cli past the CI cancel deadline — the exact hang this PR exists to prevent.

Why this avoids the bazel #23880 corruption: the shutdown handler signals only registered clients, never the server. When the client exits (cleanly via SIGINT, or as a last resort via SIGKILL), the orphaned server detects the client's death and aborts the build through its normal in-process abort path — the same path bazel runs for any client-side cancel. The server stays alive (subject to RUNNER_TRACKING_ID="", which aspect-cli already sets on GHA) so its sandbox cleanup runs to completion under its own control. We never SIGKILL the server.

Relationship to ctx.bazel.cancel_invocation() (engine/bazel/cancel.rs)

That AXL-level cancellation API is unchanged by this PR. Files diff stat confirms cancel.rs is not touched. The two paths coexist:

OS-signal path (this PR) cancel_invocation() path (cancel.rs)
Trigger CI cancel — SIGINT/SIGTERM to aspect-cli AXL code calls ctx.bazel.cancel_invocation()
Target discovery Live registry (in-process Mutex<Vec<u32>>) bazel --noblock_for_lock info probe
Scope All registered clients The single client holding the server lock
Last-resort SIGKILL Client only (server preserved) Client and server
Terminal action process::exit(130|143) Returns to AXL caller
Shared timeout SIGINT_GRACE = 5s (matches FORCE_KILL_TIMEOUT_MS)

If both fire (CI cancels while AXL is mid cancel_invocation().wait()), the OS path wins: it sweeps the registry, escalates, and process::exits — terminating the AXL .wait() loop along with everything else. No deadlock, no double-kill problem; bazel just receives a flurry of SIGINTs from both senders and deduplicates them per its own state machine.

The one subtle interaction: cancel.rs's force_kill discovers the client PID via info::client_pid → which now goes through spawn_registered. If the OS handler is sweeping at exactly the moment that probe is running, the probe gets SIGINT'd, returns no PID, and cancel.rs falls through to the "client gone, SIGKILL server" branch. But the OS handler's process::exit is imminent at that point, so any work cancel.rs does in that window is moot.

Trade-offs

  • 6.5s grace is fixed. Hard-coded to balance "long enough for bazel's documented graceful cancel" against "short enough that a hung handler can't pin a CI job."
  • Force-exit skips Drop / unwind. Anything that relies on Drop running during shutdown (e.g. async-flushed telemetry) needs to be set up to flush opportunistically before this point, not via destructors.
  • The shutdown handler runs only once per signal — repeated Ctrl-C presses don't accelerate the sequence. Intentional: bazel's state machine handles the 3-SIGINT cadence itself; piling on extra signals just makes diagnostics noisier.

@gregmagolan gregmagolan force-pushed the cli_shutdown_fixes branch from 828627e to c6f9e3a Compare May 12, 2026 22:46
@gregmagolan gregmagolan marked this pull request as ready for review May 12, 2026 22:48
@gregmagolan gregmagolan requested a review from thesayyn May 12, 2026 22:48
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c6f9e3a882

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread crates/axl-runtime/src/engine/bazel/build.rs Outdated
@aspect-workflows
Copy link
Copy Markdown

aspect-workflows Bot commented May 12, 2026

✨ Aspect Workflows Tasks

📅 Thu May 21 19:53:09 UTC 2026

✅ 16 successful tasks

  • ✅ build (build-gha) · ⏱ 31.1s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
    💬 Bazel build complete (161 built)
  • ✅ build (build-gha-debug) · ⏱ 3m 51s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
    💬 Bazel build complete (161 built)
  • ✅ format (buildifier-gha) · ⏱ 13.9s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
    💬 Format complete (clean)
  • ✅ format (buildifier-gha-debug) · ⏱ 17.5s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
    💬 Format complete (clean)
  • ✅ delivery (delivery-gha) · ⏱ 54.8s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
    💬 Delivery complete (1 delivered)
  • ✅ delivery (delivery-gha-debug) · ⏱ 5m 9s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
    💬 Delivery complete (1 delivered)
  • ✅ format (format-gha) · ⏱ 19s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
    💬 Format complete (clean)
  • ✅ format (format-gha-debug) · ⏱ 22.9s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
    💬 Format complete (clean)
  • ✅ gazelle (gazelle-from-source-gha) · ⏱ 17.8s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
    💬 Gazelle complete (clean)
  • ✅ gazelle (gazelle-from-source-gha-debug) · ⏱ 22s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
    💬 Gazelle complete (clean)
  • ✅ gazelle (gazelle-gha) · ⏱ 14.3s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
    💬 Gazelle complete (clean)
  • ✅ gazelle (gazelle-gha-debug) · ⏱ 14.3s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
    💬 Gazelle complete (clean)
  • ✅ lint (lint-gha) · ⏱ 17s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
    💬 Lint complete (clean)
  • ✅ lint (lint-gha-debug) · ⏱ 18.6s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
    💬 Lint complete (clean)
  • ✅ test (test-gha) · ⏱ 30.9s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
    💬 Bazel test complete (25/25 passed · 25 cached)
  • ✅ test (test-gha-debug) · ⏱ 3m 50s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
    💬 Bazel test complete (25/25 passed · 25 cached)

⏱ Last updated Thu May 21 19:59:35 UTC 2026 · 📊 GitHub API quota 3,290/15,000 (22% used, resets in 22m)
🚀 Powered by Aspect CLI  |  Aspect Build · X · LinkedIn · YouTube

@gregmagolan gregmagolan force-pushed the cli_shutdown_fixes branch from c6f9e3a to e8c5296 Compare May 13, 2026 01:45
gregmagolan added a commit that referenced this pull request May 13, 2026
…en Build is GC'd

The live-bazel registry's guard was stored on the Starlark `Build`
struct, so the PID was only unregistered when the Build object was
garbage-collected — an arbitrary delay after `wait()` / `try_wait()`
returned. If the OS reused the PID in that window (between child
reap and Build GC), a CI cancel arriving during that window would
SIGINT / SIGKILL an unrelated, freshly-spawned process.

Wrap the guard in `RefCell<Option<LiveBazelGuard>>` and `.take()` it
the moment the child is observed exited:

- `wait()`: take immediately after `child.wait()` returns, before
  draining BES / execlog / sink threads (those can take a while).
- `try_wait()`: take only when `try_wait()` returns `Some(_)`. A
  `None` return means the child is still alive, so the registration
  must stay.

If neither is called and the Build is dropped, `Option::drop` still
releases the guard — the old cleanup path is preserved as a fallback.

Add a regression test in `live.rs` for the `RefCell<Option<…>>` +
`.take()` pattern Build relies on.

Spotted by Codex review on #1076.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gregmagolan gregmagolan force-pushed the cli_shutdown_fixes branch from 206c4a8 to a60f2ed Compare May 15, 2026 01:29
gregmagolan added a commit that referenced this pull request May 15, 2026
…en Build is GC'd

The live-bazel registry's guard was stored on the Starlark `Build`
struct, so the PID was only unregistered when the Build object was
garbage-collected — an arbitrary delay after `wait()` / `try_wait()`
returned. If the OS reused the PID in that window (between child
reap and Build GC), a CI cancel arriving during that window would
SIGINT / SIGKILL an unrelated, freshly-spawned process.

Wrap the guard in `RefCell<Option<LiveBazelGuard>>` and `.take()` it
the moment the child is observed exited:

- `wait()`: take immediately after `child.wait()` returns, before
  draining BES / execlog / sink threads (those can take a while).
- `try_wait()`: take only when `try_wait()` returns `Some(_)`. A
  `None` return means the child is still alive, so the registration
  must stay.

If neither is called and the Build is dropped, `Option::drop` still
releases the guard — the old cleanup path is preserved as a fallback.

Add a regression test in `live.rs` for the `RefCell<Option<…>>` +
`.take()` pattern Build relies on.

Spotted by Codex review on #1076.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gregmagolan gregmagolan force-pushed the cli_shutdown_fixes branch from a60f2ed to 456a63b Compare May 15, 2026 06:44
@gregmagolan gregmagolan marked this pull request as draft May 20, 2026 01:37
@gregmagolan gregmagolan force-pushed the cli_shutdown_fixes branch from 456a63b to 9255621 Compare May 21, 2026 08:53
@gregmagolan gregmagolan marked this pull request as ready for review May 21, 2026 19:16
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9255621f4d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread crates/aspect-cli/src/main.rs Outdated
gregmagolan and others added 2 commits May 21, 2026 12:41
Per tokio's docs, dropping a `Signal` stream does NOT uninstall the
OS-level signal handler that `signal()` registered. Returning early
after SIGINT was successfully installed (but SIGTERM install errored)
therefore leaves SIGINT swallowed by tokio with no listener: Ctrl-C
appears to do nothing, and aspect-cli has to be SIGKILL'd externally
— exactly the unkillable hang this module exists to prevent.

Continue with SIGINT-only when SIGTERM install fails instead. The
common path is unchanged; only the rare double-failure (SIGINT
installs, SIGTERM doesn't) is fixed.

Caught in PR review by Codex.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…akes

Match the PR description's framing: list the two failure modes this
handler prevents as "potential" flakes (sandbox-state corruption first
since it's less rare; orphaned client second), both rare per
invocation but with bad blast radius on a warm runner.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gregmagolan gregmagolan force-pushed the cli_shutdown_fixes branch from cf10290 to 75671fe Compare May 21, 2026 19:50
@gregmagolan gregmagolan merged commit 4843960 into main May 21, 2026
45 checks passed
@gregmagolan gregmagolan deleted the cli_shutdown_fixes branch May 21, 2026 20:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant