(fix) retry transient Ray ActorUnavailableError during rollout engine bringup #2059
Open
EazyReal wants to merge 1 commit into
In colocate mode, bringing up the SGLang rollout engines races against a resource-saturated multi-node bootstrap (simultaneous large-model weight load, cuda-graph capture and offload). A momentary Ray control-plane gRPC heartbeat miss can mark an already-"fired up" peer engine actor temporarily unavailable (ActorUnavailableError, gRPC status 14 UNAVAILABLE). Because the bringup runs inside RolloutManager.__init__ (a @ray.remote actor), the unretried transient escapes actor construction as ActorDiedError and the raylet mass-kills the leased workers, failing the whole job at bootstrap.

Wrap the engine-init wait (ray.get on the init handles) in a bounded, jittered-backoff retry. Re-getting already-completed object refs is idempotent, so engines are not recreated and nothing leaks. ActorDiedError and real init errors (OOM, config, assert) are not ActorUnavailableError, so they propagate on the first attempt with no masking.

The retry mechanism is a small generic, backend-agnostic helper (slime/utils/retry.py, mirroring the async should_retry+loop idiom in rollout/openai_workflow/client.py and rollout/rm_hub) that takes a should_retry predicate; the only Ray-specific bit is the one-line isinstance classification at the rollout.py call site.

Tests (both CPU-only, registered in the cpu-unittest CI job):
- tests/test_retry.py unit-tests the generic helper (recovery, exhaustion, immediate propagation of predicate-rejected errors).
- tests/test_rollout_bringup_retry.py pins the call-site wiring: it runs the real start_rollout_servers with ray/sglang/torch stubbed, injects an ActorUnavailableError on the first ray.get of the engine-init handles, and asserts bringup still succeeds with the same handles re-awaited (fails on the unwrapped ray.get), and that a non-transient error propagates on the first attempt (fails if the retry predicate is widened).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
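The bounded, jittered-backoff retry described in this commit message could look roughly like the sketch below. It is not the contents of slime/utils/retry.py, which this page does not show. Only the names retry_with_backoff, thunk and should_retry come from the PR; max_attempts, base_delay, max_delay, the sleep hook and all default values are assumptions made for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    thunk: Callable[[], T],
    *,
    should_retry: Callable[[Exception], bool],
    max_attempts: int = 3,  # the PR bounds the wait to 3 attempts; the name is assumed
    base_delay: float = 2.0,  # assumed value, seconds
    max_delay: float = 8.0,  # assumed value, seconds
    sleep: Callable[[float], None] = time.sleep,  # assumed hook so tests can skip real sleeps
) -> T:
    """Call thunk() until it succeeds, the predicate rejects the error, or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return thunk()
        except Exception as exc:
            # Only Exception is caught, so KeyboardInterrupt and SystemExit
            # pass through without should_retry being consulted.
            if attempt == max_attempts or not should_retry(exc):
                raise  # exhaustion re-raises the last error; rejected errors propagate at once
            # Exponential backoff capped at max_delay, scaled by random jitter.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))
    raise AssertionError("unreachable: the loop always returns or raises")


if __name__ == "__main__":
    class TransientError(Exception):
        """Hypothetical stand-in for a retryable backend error."""

    attempts = []

    def flaky():
        attempts.append(1)
        if len(attempts) < 2:
            raise TransientError("temporarily unavailable")
        return "ready"

    result = retry_with_backoff(
        flaky,
        should_retry=lambda e: isinstance(e, TransientError),
        sleep=lambda _seconds: None,  # skip real waiting in the demo
    )
    print(result, len(attempts))  # prints "ready 2"
```

At the Ray call site, the predicate would be the one-line isinstance check against ActorUnavailableError that the commit message mentions, and the thunk would wrap the ray.get on the existing init handles so that a retry re-awaits them instead of recreating engines.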
Problem
In colocate mode, start_rollout_servers brings up the SGLang engine Ray actors. Each engine process comes up healthy ("server is fired up and ready to roll!", KV cache allocated), but a momentary Ray control-plane gRPC heartbeat miss, under the resource-saturated multi-node bootstrap (simultaneous large-model weight load + cuda-graph capture + offload), can mark a peer engine actor temporarily unavailable (ActorUnavailableError, gRPC status 14 UNAVAILABLE).

Because the bringup runs inside RolloutManager.__init__ (a @ray.remote actor) with no retry, the transient escapes actor construction as ActorDiedError, the raylet mass-kills the leased workers, and the whole job fails at bootstrap. This was observed killing 4 consecutive bootstraps of a 4-node colocate run (physical network/IB healthy; not OOM/crash/config). It is the actor-layer analogue of #2024 (unbounded waits at startup).

Fix
Wrap the engine-init wait (ray.get on the init handles) in a bounded, jittered-backoff retry. ray.get-ing the same already-completed object refs is idempotent: no engines are recreated and nothing leaks. (Recreating on retry would be unsafe here: already-init'd engines stay registered with the router, which runs with disable_health_check=True and never prunes them, and Ray's actor kill is async, so re-requesting the same placement-group GPU bundle could hang.)

ActorUnavailableError is retried. ActorDiedError (permanent actor death) and real init errors (CUDA OOM, config/assert, model-load) are not ActorUnavailableError, so they propagate on the first attempt. Bounded to 3 attempts (~6-8 s), so a genuinely-down cluster fails fast rather than hanging.

slime/utils/retry.py adds a small, backend-agnostic retry_with_backoff(thunk, *, should_retry, ...), the synchronous counterpart to the existing async should_retry + jittered-backoff idiom in rollout/openai_workflow/client.py and rollout/rm_hub. The only Ray-specific code is a one-line isinstance(e, ActorUnavailableError) predicate at the call site.

Tests
Two CPU-only test files (no GPU / Ray cluster / sglang), both registered in the cpu-unittest matrix:
- tests/test_retry.py: unit tests for the generic helper: recovery after transient failures, exhaustion re-raises the last error, predicate-rejected errors (a permanent backend error, a generic RuntimeError) propagate on the first attempt, and KeyboardInterrupt/SystemExit are never intercepted. The helper catches Exception only, so they propagate without the predicate even being consulted.
- tests/test_rollout_bringup_retry.py: wiring test that runs the real start_rollout_servers with ray/sglang/torch stubbed at import (same style as the other CPU rollout tests). Injecting an ActorUnavailableError on the first ray.get of the engine-init handles must still bring the server up, re-awaiting the same handles rather than recreating engines (fails on the unwrapped ray.get); a non-transient error (RuntimeError: CUDA out of memory) must propagate on the first ray.get with no retry (fails if the retry predicate is widened).

ruff clean.

🤖 Generated with Claude Code
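To make the two wiring assertions concrete, here is a minimal sketch of their shape. It is not the code in tests/test_rollout_bringup_retry.py, which is not shown on this page and which exercises the real start_rollout_servers. Everything here is a hypothetical stand-in: the local ActorUnavailableError class replaces Ray's exception so the sketch runs without Ray, FakeGet replaces ray.get, and wait_for_engines is a simplified call site with the backoff sleep omitted.

```python
class ActorUnavailableError(Exception):
    """Stand-in for Ray's ActorUnavailableError (assumption: lets the sketch run without Ray)."""


class FakeGet:
    """Stand-in for ray.get: records each call and raises queued errors before succeeding."""

    def __init__(self, errors):
        self.errors = list(errors)
        self.calls = []

    def __call__(self, handles):
        self.calls.append(handles)
        if self.errors:
            raise self.errors.pop(0)
        return ["ready"] * len(handles)


def wait_for_engines(handles, get, max_attempts=3):
    """Hypothetical call site: retry only the transient error, re-awaiting the same handles."""
    for attempt in range(1, max_attempts + 1):
        try:
            return get(handles)
        except ActorUnavailableError:
            if attempt == max_attempts:
                raise
            # A real helper would sleep with jittered backoff here.


def test_transient_error_is_retried_with_same_handles():
    handles = ["init-ref-0", "init-ref-1"]
    get = FakeGet([ActorUnavailableError("gRPC status 14 UNAVAILABLE")])
    assert wait_for_engines(handles, get) == ["ready", "ready"]
    # Two awaits of the very same handle list: nothing was recreated.
    assert len(get.calls) == 2
    assert all(call is handles for call in get.calls)


def test_non_transient_error_propagates_on_first_attempt():
    get = FakeGet([RuntimeError("CUDA out of memory")])
    try:
        wait_for_engines(["init-ref-0"], get)
    except RuntimeError:
        pass
    else:
        raise AssertionError("expected RuntimeError to propagate")
    # Exactly one call: the predicate did not widen to cover real init errors.
    assert len(get.calls) == 1


if __name__ == "__main__":
    test_transient_error_is_retried_with_same_handles()
    test_non_transient_error_propagates_on_first_attempt()
    print("ok")
```

The first test would fail against an unwrapped get, since the injected error would escape, and the second would fail if the except clause were widened to catch RuntimeError. These are the same two failure modes the PR's wiring test is said to pin.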