fix(rollout): isolate per-trajectory exceptions in generate_and_rm_group #2078
Open
aoshen02 wants to merge 1 commit into
`generate_and_rm_group` gathers per-trajectory tasks with a bare `asyncio.gather(*tasks)`. If any single trajectory raises an unhandled exception, gather cancels the siblings and propagates, crashing the entire rollout via CancelledError, which also swallows the root exception (logs show only CancelledError, not what actually failed).

This is benign for plain RLVR rollouts but agentic rollouts can raise after the custom generate() returns (e.g. trajectory token-merge / prefix-drift edge cases). Observed on a 500-instance SWE-bench eval: a single bad trajectory (~1 in 350-400) reproducibly took down all 500.

Fix:
- `return_exceptions=True` so one failure no longer cancels the batch.
- `logger.error(..., exc_info=res)` to surface the real traceback.
- Substitute an ABORTED placeholder with the same fan-out list shape, reusing the existing abort contract (tokens=[0,0], loss_mask=[0], status=ABORTED, reward=0.0).

Mirror of vllm-project/vime#200.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: aoshen <aoshen@inferact.ai>
Problem
`generate_and_rm_group` gathers per-trajectory tasks with a bare `asyncio.gather(*tasks)` (no `return_exceptions=True`). If any single trajectory raises an unhandled exception, gather cancels the siblings and propagates, crashing the entire rollout via `CancelledError`, which also swallows the root exception (logs show only `CancelledError`, not what actually failed).

This is benign for plain RLVR rollouts (where `generate_and_rm` reliably catches its own errors), which is why it has not surfaced before. But agentic rollouts can raise after the custom `generate()` returns, e.g. trajectory token-merge / prefix-drift edge cases that run outside `generate()`'s own try/except. Observed on a 500-instance SWE-bench eval: a single bad trajectory (~1 in 350-400) reproducibly took down all 500 (crashed ~321 and ~373 on two runs), with only a `CancelledError` in the logs.

Fix
Catch per-trajectory exceptions at the group gather:
- `return_exceptions=True` so one failure no longer cancels the batch.
- `logger.error(..., exc_info=res)` to surface the real traceback (currently swallowed by `CancelledError`).
- Substitute an `ABORTED` / `resolved=False` placeholder with the same fan-out list shape, reusing the existing `_abort()` sample contract (`tokens=[0,0]`, `loss_mask=[0]`, `status=ABORTED`, `reward=0.0`). `ABORTED` is already a first-class status that downstream short-circuits (reward-model skip at `generate_and_rm`, routing-replay skip), so the placeholder introduces no new sample shape.

Notes
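For reviewers who want to see the pattern in isolation, here is a minimal sketch of the approach described under Fix. It is not the code in this PR: `Sample`, `Status`, `run_trajectory`, `aborted_placeholder`, and `gather_group` are hypothetical stand-ins invented for illustration, and only `asyncio.gather(..., return_exceptions=True)` and `logger.error(..., exc_info=res)` are real library calls.

```python
import asyncio
import logging
from dataclasses import dataclass, field
from enum import Enum

logger = logging.getLogger(__name__)


class Status(Enum):  # hypothetical stand-in for the project's status enum
    COMPLETED = "completed"
    ABORTED = "aborted"


@dataclass
class Sample:  # hypothetical stand-in for the project's sample type
    tokens: list[int] = field(default_factory=list)
    loss_mask: list[int] = field(default_factory=list)
    status: Status = Status.COMPLETED
    reward: float = 0.0


def aborted_placeholder() -> list[Sample]:
    """Build a one-element list using the abort values quoted in the description."""
    return [Sample(tokens=[0, 0], loss_mask=[0], status=Status.ABORTED, reward=0.0)]


async def run_trajectory(i: int) -> list[Sample]:
    """Toy trajectory: index 2 raises, the others return one completed sample."""
    await asyncio.sleep(0)
    if i == 2:
        raise ValueError(f"simulated failure in trajectory {i}")
    return [Sample(tokens=[i, i + 1], loss_mask=[1], reward=1.0)]


async def gather_group(n: int) -> list[list[Sample]]:
    tasks = [asyncio.create_task(run_trajectory(i)) for i in range(n)]
    # With return_exceptions=True, a failure is returned as a value in
    # `results` (in task order) instead of being raised out of gather().
    results = await asyncio.gather(*tasks, return_exceptions=True)

    group: list[list[Sample]] = []
    for i, res in enumerate(results):
        # BaseException also covers a CancelledError returned for a child task;
        # whether to treat that as an abort is a design choice assumed here.
        if isinstance(res, BaseException):
            # Passing the exception instance as exc_info logs its traceback.
            logger.error("trajectory %d failed", i, exc_info=res)
            group.append(aborted_placeholder())
        else:
            group.append(res)
    return group


group = asyncio.run(gather_group(4))
```

In this toy run the failing trajectory is replaced by a placeholder in its original position, so the result keeps one entry per task and callers that already skip `ABORTED` samples need no extra branch.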
🤖 Generated with Claude Code