fix(rollout): isolate per-trajectory exceptions in generate_and_rm_group #2078

Open
aoshen02 wants to merge 1 commit into THUDM:main from aoshen02:fix/isolate-trajectory-exceptions

@aoshen02

Problem

generate_and_rm_group gathers per-trajectory tasks with a bare asyncio.gather(*tasks) (no return_exceptions=True). If any single trajectory raises an unhandled exception, gather propagates it to the caller and the entire rollout crashes; the siblings end up cancelled, and the resulting CancelledError masks the root exception (logs show only CancelledError, not what actually failed).

This is benign for plain RLVR rollouts (where generate_and_rm reliably catches its own errors), which is why it has not surfaced before. But agentic rollouts can raise after the custom generate() returns, e.g. in trajectory token-merge / prefix-drift edge cases that run outside generate()'s own try/except. Observed on a 500-instance SWE-bench eval: a single bad trajectory (~1 in 350-400) reproducibly took down all 500 (crashing at ~321 and ~373 on two runs), with only a CancelledError in the logs.
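
For reviewers who want to see the failure mode in isolation, here is a minimal, self-contained sketch of how a bare gather differs from `return_exceptions=True`. The `trajectory` coroutine and its error message are hypothetical stand-ins, not the project's code, and the toy does not reproduce the CancelledError seen in the real rollout; it only shows that one failing task takes the whole group's result with it.

```python
import asyncio


async def trajectory(idx: int) -> str:
    """Hypothetical stand-in for one per-trajectory task."""
    await asyncio.sleep(0)
    if idx == 1:
        # Simulates an error raised after generate() has already returned.
        raise ValueError(f"trajectory {idx} failed after generate()")
    return f"sample-{idx}"


async def bare_gather() -> str:
    tasks = [asyncio.create_task(trajectory(i)) for i in range(3)]
    try:
        await asyncio.gather(*tasks)
    except ValueError as exc:
        # The first exception propagates; the caller never sees the
        # results of the trajectories that succeeded.
        return f"group failed: {exc}"
    return "group ok"


async def isolating_gather() -> list:
    tasks = [asyncio.create_task(trajectory(i)) for i in range(3)]
    # Exceptions are returned in place, in task order, instead of raised.
    return await asyncio.gather(*tasks, return_exceptions=True)


bare_result = asyncio.run(bare_gather())
isolated_result = asyncio.run(isolating_gather())
```

With `return_exceptions=True` the failing slot holds the exception object itself, so the caller can log it and decide what to substitute.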

Fix

Catch per-trajectory exceptions at the group gather:

  • return_exceptions=True so one failure no longer cancels the batch.
  • logger.error(..., exc_info=res) to surface the real traceback (currently swallowed by CancelledError).
  • Substitute an ABORTED / resolved=False placeholder with the same fan-out list shape, reusing the existing _abort() sample contract (tokens=[0,0], loss_mask=[0], status=ABORTED, reward=0.0).

ABORTED is already a first-class status that downstream short-circuits (reward-model skip at generate_and_rm, routing-replay skip), so the placeholder introduces no new sample shape.
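
The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual diff: `Status`, `Sample`, `aborted_placeholder`, `run_trajectory`, and `gather_group` are hypothetical names, the sample is reduced to the four fields named in the abort contract, and each trajectory returns a single sample rather than a fan-out list.

```python
import asyncio
import logging
from dataclasses import dataclass, field
from enum import Enum

logger = logging.getLogger(__name__)


class Status(Enum):  # hypothetical stand-in for the project's status type
    COMPLETED = "completed"
    ABORTED = "aborted"


@dataclass
class Sample:  # hypothetical; only the fields named in the abort contract
    tokens: list = field(default_factory=list)
    loss_mask: list = field(default_factory=list)
    status: Status = Status.COMPLETED
    reward: float = 0.0


def aborted_placeholder() -> Sample:
    # Values taken from the abort contract quoted in the description.
    return Sample(tokens=[0, 0], loss_mask=[0], status=Status.ABORTED, reward=0.0)


async def run_trajectory(idx: int) -> Sample:
    await asyncio.sleep(0)
    if idx == 1:
        raise RuntimeError("prefix drift during token merge")  # simulated failure
    return Sample(tokens=[1, 2, 3], loss_mask=[1, 1], reward=1.0)


async def gather_group(n: int) -> list:
    tasks = [asyncio.create_task(run_trajectory(i)) for i in range(n)]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    group = []
    for idx, res in enumerate(results):
        # Treating every BaseException (including cancellation) this way is
        # an assumption of this sketch, not something the source settles.
        if isinstance(res, BaseException):
            # exc_info accepts an exception instance, so the real traceback is logged.
            logger.error("trajectory %d failed; substituting ABORTED", idx, exc_info=res)
            res = aborted_placeholder()
        group.append(res)
    return group


group = asyncio.run(gather_group(3))
```

The group keeps its length and ordering, so downstream code that already short-circuits on ABORTED needs no change in this sketch.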

Notes

  • Mirror of vllm-project/vime#200.
  • AI-assisted (Claude). Test: ran 500-instance SWE-bench eval; pre-fix crashed at ~321/~373, post-fix completed 500/500 clean.

🤖 Generated with Claude Code

`generate_and_rm_group` gathers per-trajectory tasks with a bare
`asyncio.gather(*tasks)`. If any single trajectory raises an unhandled
exception, gather propagates it to the caller and the entire rollout
crashes; the siblings end up cancelled, and the resulting
CancelledError masks the root exception (logs show only
CancelledError, not what actually failed).

This is benign for plain RLVR rollouts but agentic rollouts can raise
after the custom generate() returns (e.g. trajectory token-merge /
prefix-drift edge cases). Observed on a 500-instance SWE-bench eval:
a single bad trajectory (~1 in 350-400) reproducibly took down all 500.

Fix:
- `return_exceptions=True` so one failure no longer cancels the batch.
- `logger.error(..., exc_info=res)` to surface the real traceback.
- Substitute an ABORTED placeholder with the same fan-out list shape,
  reusing the existing abort contract (tokens=[0,0], loss_mask=[0],
  status=ABORTED, reward=0.0).

Mirror of vllm-project/vime#200.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: aoshen <aoshen@inferact.ai>