(fix) retry transient Ray ActorUnavailableError during rollout engine bringup#2059

Open
EazyReal wants to merge 1 commit into THUDM:main from EazyReal:upstream-retry-bringup

@EazyReal EazyReal commented Jun 12, 2026

Contributor

Problem

In colocate mode, start_rollout_servers brings up the SGLang engine Ray actors. Each engine process comes up healthy ("server is fired up and ready to roll!", KV cache allocated), but a momentary Ray control-plane gRPC heartbeat miss — under the resource-saturated multi-node bootstrap (simultaneous large-model weight load + cuda-graph capture + offload) — can mark a peer engine actor temporarily unavailable:

ray.exceptions.ActorUnavailableError: The actor ... is unavailable: ...
RpcError: ... Failed to connect to remote host: FD Shutdown rpc_code: 14 (UNAVAILABLE).

Because the bringup runs inside RolloutManager.__init__ (a @ray.remote actor) with no retry, the transient error escapes actor construction as ActorDiedError, the raylet mass-kills the leased workers, and the whole job fails at bootstrap. This was observed to kill 4 consecutive bootstraps of a 4-node colocate run; the physical network/IB was healthy, and the failures were not caused by OOM, a crash, or config. It is the actor-layer analogue of #2024 (unbounded waits at startup).

Fix

Wrap the engine-init wait (ray.get on the init handles) in a bounded, jittered-backoff retry.

  • Retry the wait, not the engines. The engines have already fired up before the heartbeat miss, so re-ray.get-ing the same already-completed object refs is idempotent — no engines are recreated and nothing leaks. (Recreating on retry would be unsafe here: already-init'd engines stay registered with the router, which runs with disable_health_check=True and never prunes them, and Ray's actor kill is async, so re-requesting the same placement-group GPU bundle could hang.)
  • Does not mask permanent failures. Only ActorUnavailableError is retried. ActorDiedError (permanent actor death) and real init errors (CUDA OOM, config/assert, model-load) are not ActorUnavailableError, so they propagate on the first attempt. Bounded to 3 attempts (~6-8 s), so a genuinely-down cluster fails fast rather than hanging.
  • Generic mechanism. slime/utils/retry.py adds a small, backend-agnostic retry_with_backoff(thunk, *, should_retry, ...) — the synchronous counterpart to the existing async should_retry + jittered-backoff idiom in rollout/openai_workflow/client.py and rollout/rm_hub. The only Ray-specific code is a one-line isinstance(e, ActorUnavailableError) predicate at the call site.
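
For reviewers who want the shape of the mechanism without opening the diff, here is a minimal sketch of what a helper with the `retry_with_backoff(thunk, *, should_retry, ...)` signature could look like. It is not the code in slime/utils/retry.py: the parameter names `max_attempts`, `base_delay`, `max_delay`, and `sleep`, their default values, and the jitter formula are all assumptions made for illustration. `ActorUnavailableError` is a local stand-in so the sketch runs without Ray; at the real call site the thunk would wrap the `ray.get` on the engine-init handles.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    thunk: Callable[[], T],
    *,
    should_retry: Callable[[Exception], bool],
    max_attempts: int = 3,        # assumed name; the PR bounds the retry to 3 attempts
    base_delay: float = 2.0,      # assumed value, in seconds
    max_delay: float = 8.0,       # assumed value, in seconds
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests need not wait
) -> T:
    """Call thunk(), retrying only the errors that should_retry accepts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return thunk()
        except Exception as exc:
            # KeyboardInterrupt and SystemExit derive from BaseException,
            # so they never reach this handler or the predicate.
            if not should_retry(exc) or attempt == max_attempts:
                raise
            # Exponential backoff capped at max_delay, jittered to 50-100%.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0.5 * delay, delay))
    raise AssertionError("unreachable")


class ActorUnavailableError(Exception):
    """Stand-in for ray.exceptions.ActorUnavailableError (keeps the sketch Ray-free)."""


def is_transient(exc: Exception) -> bool:
    # The only backend-specific piece: classify the one retryable error type.
    return isinstance(exc, ActorUnavailableError)


calls = {"n": 0}


def flaky_wait():
    # Simulates re-awaiting the same init handles: first call hits the
    # transient error, second call returns the already-completed results.
    calls["n"] += 1
    if calls["n"] == 1:
        raise ActorUnavailableError("actor temporarily unavailable")
    return ["engine-0 ready", "engine-1 ready"]


result = retry_with_backoff(flaky_wait, should_retry=is_transient, sleep=lambda _: None)
```

Keeping the predicate outside the helper is what lets the helper stay backend-agnostic: widening or narrowing what counts as transient is a call-site decision, and anything the predicate rejects re-raises on the first attempt.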

Tests

Two CPU-only test files (no GPU / Ray cluster / sglang), both registered in the cpu-unittest matrix:

  • tests/test_retry.py — unit tests for the generic helper: recovery after transient failures, exhaustion re-raises the last error, predicate-rejected errors (a permanent backend error, a generic RuntimeError) propagate on the first attempt, and KeyboardInterrupt/SystemExit are never intercepted — the helper catches Exception only, so they propagate without the predicate even being consulted.
  • tests/test_rollout_bringup_retry.py — wiring test that runs the real start_rollout_servers with ray/sglang/torch stubbed at import (same style as the other CPU rollout tests): injecting an ActorUnavailableError on the first ray.get of the engine-init handles must still bring the server up, re-awaiting the same handles rather than recreating engines (fails on the unwrapped ray.get); a non-transient error (RuntimeError: CUDA out of memory) must propagate on the first ray.get with no retry (fails if the retry predicate is widened).
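
The two wiring assertions can be illustrated with a stripped-down sketch. This is not the contents of tests/test_rollout_bringup_retry.py: `FakeGet` and `wait_for_engines` are hypothetical names, the exception class is a local stand-in for the Ray one, and the backoff sleep is omitted to keep the example short. It only shows the pattern of scripting a fake `ray.get` and checking which handles it was called with.

```python
class ActorUnavailableError(Exception):
    """Stand-in for the Ray exception; the real test stubs ray at import."""


class FakeGet:
    """Hypothetical fake for ray.get: raises the scripted errors, then succeeds."""

    def __init__(self, errors):
        self.errors = list(errors)
        self.seen = []  # the handles object passed on each call

    def __call__(self, handles):
        self.seen.append(handles)
        if self.errors:
            raise self.errors.pop(0)
        return ["ok" for _ in handles]


def wait_for_engines(get, handles, max_attempts=3):
    """Hypothetical stand-in for the wrapped wait: re-await the same handles."""
    for attempt in range(1, max_attempts + 1):
        try:
            return get(handles)
        except ActorUnavailableError:
            # Only the transient error is retried; everything else propagates.
            if attempt == max_attempts:
                raise


def test_transient_is_retried_on_same_handles():
    handles = [object(), object()]
    get = FakeGet([ActorUnavailableError("heartbeat miss")])
    assert wait_for_engines(get, handles) == ["ok", "ok"]
    assert len(get.seen) == 2
    # The same handles are re-awaited, so no engines were recreated.
    assert all(h is handles for h in get.seen)


def test_oom_propagates_on_first_get():
    get = FakeGet([RuntimeError("CUDA out of memory")])
    try:
        wait_for_engines(get, [object()])
    except RuntimeError as exc:
        assert "out of memory" in str(exc)
    else:
        raise AssertionError("expected RuntimeError")
    assert len(get.seen) == 1  # no retry for a non-transient error


test_transient_is_retried_on_same_handles()
test_oom_propagates_on_first_get()
```

The identity check on the handles is what would fail if a retry recreated engines, and the call-count check is what would fail if the retry predicate were widened to cover RuntimeError.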

ruff clean.

🤖 Generated with Claude Code

@EazyReal EazyReal changed the title Retry transient Ray ActorUnavailableError during rollout engine bringup (fix) retry transient Ray ActorUnavailableError during rollout engine bringup Jun 12, 2026
@EazyReal EazyReal force-pushed the upstream-retry-bringup branch from 99af9c1 to f456935 Compare June 12, 2026 08:12
In colocate mode, bringing up the SGLang rollout engines races against a
resource-saturated multi-node bootstrap (simultaneous large-model weight load,
cuda-graph capture and offload). A momentary Ray control-plane gRPC heartbeat
miss can mark an already-"fired up" peer engine actor temporarily unavailable
(ActorUnavailableError, gRPC status 14 UNAVAILABLE). Because the bringup runs
inside RolloutManager.__init__ (a @ray.remote actor), the unretried transient
escapes actor construction as ActorDiedError and the raylet mass-kills the
leased workers, failing the whole job at bootstrap.

Wrap the engine-init wait (ray.get on the init handles) in a bounded,
jittered-backoff retry. Re-getting already-completed object refs is idempotent,
so engines are not recreated and nothing leaks. ActorDiedError and real init
errors (OOM, config, assert) are not ActorUnavailableError, so they propagate
on the first attempt with no masking.

The retry mechanism is a small generic, backend-agnostic helper
(slime/utils/retry.py, mirroring the async should_retry+loop idiom in
rollout/openai_workflow/client.py and rollout/rm_hub) that takes a should_retry
predicate; the only Ray-specific bit is the one-line isinstance classification
at the rollout.py call site.

Tests (both CPU-only, registered in the cpu-unittest CI job):
- tests/test_retry.py unit-tests the generic helper (recovery, exhaustion,
  immediate propagation of predicate-rejected errors).
- tests/test_rollout_bringup_retry.py pins the call-site wiring: it runs the
  real start_rollout_servers with ray/sglang/torch stubbed, injects an
  ActorUnavailableError on the first ray.get of the engine-init handles, and
  asserts bringup still succeeds with the same handles re-awaited (fails on
  the unwrapped ray.get), and that a non-transient error propagates on the
  first attempt (fails if the retry predicate is widened).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@EazyReal EazyReal force-pushed the upstream-retry-bringup branch from f456935 to 52d247d Compare June 12, 2026 08:51