(fix) retry transient Ray ActorUnavailableError during rollout engine bringup by EazyReal · Pull Request #2059 · THUDM/slime

EazyReal · 2026-06-12T01:14:30Z

Problem

In colocate mode, start_rollout_servers brings up the SGLang engine Ray actors. Each engine process comes up healthy ("server is fired up and ready to roll!", KV cache allocated), but a momentary Ray control-plane gRPC heartbeat miss — under the resource-saturated multi-node bootstrap (simultaneous large-model weight load + cuda-graph capture + offload) — can mark a peer engine actor temporarily unavailable:

ray.exceptions.ActorUnavailableError: The actor ... is unavailable: ...
RpcError: ... Failed to connect to remote host: FD Shutdown rpc_code: 14 (UNAVAILABLE).

Because the bringup runs inside RolloutManager.__init__ (a @ray.remote actor) with no retry, the transient escapes actor construction as ActorDiedError, the raylet mass-kills the leased workers, and the whole job fails at bootstrap. This was observed to kill 4 consecutive bootstraps of a 4-node colocate run (physical network/IB healthy; not an OOM, crash, or config error). It is the actor-layer analogue of #2024 (unbounded waits at startup).

Fix

Wrap the engine-init wait (ray.get on the init handles) in a bounded, jittered-backoff retry.

- Retry the wait, not the engines. The engines have already fired up before the heartbeat miss, so calling ray.get again on the same already-completed object refs is idempotent: no engines are recreated and nothing leaks. (Recreating on retry would be unsafe here: already-initialized engines stay registered with the router, which runs with disable_health_check=True and never prunes them, and Ray's actor kill is async, so re-requesting the same placement-group GPU bundle could hang.)
- Does not mask permanent failures. Only ActorUnavailableError is retried. ActorDiedError (permanent actor death) and real init errors (CUDA OOM, config/assert, model-load) are not ActorUnavailableError, so they propagate on the first attempt. The retry is bounded to 3 attempts (~6-8 s), so a genuinely down cluster fails fast rather than hanging.
- Generic mechanism. slime/utils/retry.py adds a small, backend-agnostic retry_with_backoff(thunk, *, should_retry, ...), the synchronous counterpart to the existing async should_retry + jittered-backoff idiom in rollout/openai_workflow/client.py and rollout/rm_hub. The only Ray-specific code is a one-line isinstance(e, ActorUnavailableError) predicate at the call site.
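For readers who want a concrete picture of the mechanism, here is a minimal sketch of what such a helper and its call site could look like. It is not the code in slime/utils/retry.py: only the name retry_with_backoff and the thunk / should_retry parameters come from the description above. The other parameters (max_attempts, base_delay_s, jitter_s, sleep) and their defaults are assumptions, picked so that two backoff sleeps land in the stated ~6-8 s window. ActorUnavailableError is a local stand-in for ray.exceptions.ActorUnavailableError so the sketch runs without Ray, and flaky_wait stands in for the ray.get on the init handles.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ActorUnavailableError(Exception):
    """Stand-in for ray.exceptions.ActorUnavailableError (assumption for the sketch)."""


def retry_with_backoff(
    thunk: Callable[[], T],
    *,
    should_retry: Callable[[Exception], bool],
    max_attempts: int = 3,       # the PR states 3 attempts; parameter name is assumed
    base_delay_s: float = 2.0,   # assumed: 2 s + 4 s of backoff across two sleeps
    jitter_s: float = 1.0,       # assumed: up to 1 s of jitter per sleep
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call thunk(), retrying only the errors that should_retry accepts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return thunk()
        except Exception as e:  # KeyboardInterrupt/SystemExit are not caught here
            # Re-raise on the last attempt, or at once if the error is not transient.
            if attempt == max_attempts or not should_retry(e):
                raise
            sleep(base_delay_s * 2 ** (attempt - 1) + random.uniform(0, jitter_s))
    raise AssertionError("unreachable")


# Hypothetical call site: the first wait hits a transient error, the second succeeds.
calls = {"n": 0}


def flaky_wait():
    calls["n"] += 1
    if calls["n"] == 1:
        raise ActorUnavailableError("transient heartbeat miss")
    return ["engine-0 ready", "engine-1 ready"]


result = retry_with_backoff(
    flaky_wait,
    should_retry=lambda e: isinstance(e, ActorUnavailableError),
    sleep=lambda _s: None,  # skip real sleeping in this demo
)
```

Keeping the classification in a predicate, rather than hard-coding an exception type in the helper, is what lets the helper stay backend-agnostic while the call site decides what counts as transient.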

Tests

Two CPU-only test files (no GPU / Ray cluster / sglang), both registered in the cpu-unittest matrix:

- tests/test_retry.py: unit tests for the generic helper. They cover recovery after transient failures, exhaustion re-raising the last error, predicate-rejected errors (a permanent backend error, a generic RuntimeError) propagating on the first attempt, and KeyboardInterrupt/SystemExit never being intercepted: the helper catches Exception only, so they propagate without the predicate even being consulted.
- tests/test_rollout_bringup_retry.py: wiring test that runs the real start_rollout_servers with ray/sglang/torch stubbed at import (same style as the other CPU rollout tests). Injecting an ActorUnavailableError on the first ray.get of the engine-init handles must still bring the server up, re-awaiting the same handles rather than recreating engines (fails on the unwrapped ray.get). A non-transient error (RuntimeError: CUDA out of memory) must propagate on the first ray.get with no retry (fails if the retry predicate is widened).
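The two wiring assertions can be illustrated with a small, self-contained sketch. It does not reproduce tests/test_rollout_bringup_retry.py, which runs the real start_rollout_servers. Here wait_for_engines is a hypothetical stand-in for the wrapped wait, fake_get stands in for a stubbed ray.get, and ActorUnavailableError is a local stand-in for the Ray exception. Backoff is left out because only the call pattern is being checked.

```python
from unittest import mock


class ActorUnavailableError(Exception):
    """Stand-in for ray.exceptions.ActorUnavailableError (assumption for the sketch)."""


def wait_for_engines(get, init_handles, max_attempts=3):
    """Hypothetical wrapped wait: retry `get` on the SAME handles, transient errors only."""
    for attempt in range(1, max_attempts + 1):
        try:
            return get(init_handles)
        except ActorUnavailableError:
            if attempt == max_attempts:
                raise
            # A real implementation would sleep with jittered backoff here.


def test_transient_reawaits_same_handles():
    handles = [object(), object()]
    # First call raises the transient error, second call returns results.
    fake_get = mock.Mock(side_effect=[ActorUnavailableError("transient"), ["ok", "ok"]])
    assert wait_for_engines(fake_get, handles) == ["ok", "ok"]
    assert fake_get.call_count == 2
    # Both calls received the identical handle list: nothing was recreated.
    assert all(call.args[0] is handles for call in fake_get.call_args_list)


def test_non_transient_propagates_on_first_attempt():
    fake_get = mock.Mock(side_effect=RuntimeError("CUDA out of memory"))
    try:
        wait_for_engines(fake_get, [object()])
    except RuntimeError:
        pass
    else:
        raise AssertionError("expected RuntimeError to propagate")
    assert fake_get.call_count == 1
```

The first test would fail against an unwrapped get (the transient error would escape), and the second would fail if the except clause were widened to catch RuntimeError, which mirrors the failure modes the PR's tests are described as guarding against.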

ruff clean.

🤖 Generated with Claude Code

In colocate mode, bringing up the SGLang rollout engines races against a resource-saturated multi-node bootstrap (simultaneous large-model weight load, cuda-graph capture and offload). A momentary Ray control-plane gRPC heartbeat miss can mark an already-"fired up" peer engine actor temporarily unavailable (ActorUnavailableError, gRPC status 14 UNAVAILABLE). Because the bringup runs inside RolloutManager.__init__ (a @ray.remote actor), the unretried transient escapes actor construction as ActorDiedError and the raylet mass-kills the leased workers, failing the whole job at bootstrap.

Wrap the engine-init wait (ray.get on the init handles) in a bounded, jittered-backoff retry. Re-getting already-completed object refs is idempotent, so engines are not recreated and nothing leaks. ActorDiedError and real init errors (OOM, config, assert) are not ActorUnavailableError, so they propagate on the first attempt with no masking.

The retry mechanism is a small generic, backend-agnostic helper (slime/utils/retry.py, mirroring the async should_retry+loop idiom in rollout/openai_workflow/client.py and rollout/rm_hub) that takes a should_retry predicate; the only Ray-specific bit is the one-line isinstance classification at the rollout.py call site.

Tests (both CPU-only, registered in the cpu-unittest CI job):
- tests/test_retry.py unit-tests the generic helper (recovery, exhaustion, immediate propagation of predicate-rejected errors).
- tests/test_rollout_bringup_retry.py pins the call-site wiring: it runs the real start_rollout_servers with ray/sglang/torch stubbed, injects an ActorUnavailableError on the first ray.get of the engine-init handles, and asserts bringup still succeeds with the same handles re-awaited (fails on the unwrapped ray.get), and that a non-transient error propagates on the first attempt (fails if the retry predicate is widened).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

EazyReal changed the title ~~Retry transient Ray ActorUnavailableError during rollout engine bringup~~ (fix) retry transient Ray ActorUnavailableError during rollout engine bringup Jun 12, 2026

EazyReal force-pushed the upstream-retry-bringup branch from 99af9c1 to f456935 Compare June 12, 2026 08:12

EazyReal force-pushed the upstream-retry-bringup branch from f456935 to 52d247d Compare June 12, 2026 08:51

EazyReal wants to merge 1 commit into THUDM:main from EazyReal:upstream-retry-bringup
