feat(rollout): add rollout-side torch profiler trigger via sglang #2038

Open
niu0421 wants to merge 1 commit into THUDM:main from niu0421:feat/rollout-sglang-profiler
niu0421 commented Jun 9, 2026

Add CLI-driven torch profiler support for the rollout (sglang) side, complementing the existing train-side --profile-target.

Currently, profiling sglang engines during rollout requires manually calling the /start_profile HTTP endpoint. This PR exposes the
existing SGLangEngine.start_profile() / stop_profile() API through RolloutManager, triggered at a user-specified rollout_id.
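
For context, the manual workflow this PR replaces looks roughly like the sketch below. It only builds the request and does not send it. The host, port, and JSON field names (`output_dir`, `num_steps`) are assumptions for illustration and should be checked against the sglang server documentation.

```python
import json
import urllib.request


def build_start_profile_request(base_url, output_dir, num_steps):
    """Build (but do not send) a POST request to an engine's /start_profile endpoint."""
    # Assumption: the endpoint accepts a JSON body with these field names.
    payload = {"output_dir": output_dir, "num_steps": num_steps}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/start_profile",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical engine address; one such call is needed per rollout engine.
req = build_start_profile_request("http://127.0.0.1:30000", "/tmp/rollout_traces", 10)
print(req.get_method(), req.full_url)
# Sending it would be: urllib.request.urlopen(req, timeout=30)
```

Repeating this by hand for every engine, at the right moment in training, is the step that the new CLI arguments are meant to remove.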

Changes:

  • slime/ray/rollout.py — add _maybe_start_rollout_profile() / _stop_rollout_profile() in generate(), calling
    engine.start_profile.remote() / engine.stop_profile.remote() on all rollout engines
  • slime/utils/arguments.py — add --rollout-profile-* arguments:
    • --rollout-profile-start-rollout-id: rollout id to trigger profiling (disabled when unset)
    • --rollout-profile-num-steps: number of sglang forward steps to profile
    • --rollout-profile-start-step: warmup steps to skip before recording
    • --rollout-profile-activities: CPU/GPU/MEM activities to capture
    • --rollout-profile-output-dir: trace output directory
    • --rollout-profile-by-stage: split prefill/decode into separate traces
    • --rollout-profile-with-stack: record Python stacks
    • --rollout-profile-record-shapes: record tensor shapes
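
The trigger logic described above can be sketched as follows. This is a simplified illustration, not the code in this PR: `ProfileConfig`, `FakeEngine`, and `RolloutManagerSketch` are hypothetical names, the engines are plain objects rather than Ray actors (the real calls go through `.remote()`), and stopping the profiler at the end of the same `generate()` call is an assumption.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProfileConfig:
    # Hypothetical container mirroring a few of the --rollout-profile-* arguments.
    start_rollout_id: Optional[int] = None  # None means profiling is disabled
    num_steps: Optional[int] = None
    output_dir: Optional[str] = None


@dataclass
class FakeEngine:
    # Stand-in for a rollout engine; it only records the calls it receives.
    calls: list = field(default_factory=list)

    def start_profile(self, **kwargs):
        self.calls.append(("start", kwargs))

    def stop_profile(self):
        self.calls.append(("stop", {}))


class RolloutManagerSketch:
    def __init__(self, engines, cfg):
        self.engines = engines
        self.cfg = cfg
        self._profiling = False

    def _maybe_start_rollout_profile(self, rollout_id):
        # Do nothing unless profiling is enabled and this is the target rollout.
        if self.cfg.start_rollout_id is None:
            return
        if rollout_id != self.cfg.start_rollout_id:
            return
        # Forward only the options that were actually set.
        options = {"num_steps": self.cfg.num_steps, "output_dir": self.cfg.output_dir}
        kwargs = {k: v for k, v in options.items() if v is not None}
        for engine in self.engines:
            engine.start_profile(**kwargs)
        self._profiling = True

    def _stop_rollout_profile(self):
        if not self._profiling:
            return
        for engine in self.engines:
            engine.stop_profile()
        self._profiling = False

    def generate(self, rollout_id):
        self._maybe_start_rollout_profile(rollout_id)
        try:
            return f"rollout-{rollout_id}"  # placeholder for the real generation
        finally:
            # Stop even if generation raises, so no engine is left profiling.
            self._stop_rollout_profile()


engines = [FakeEngine(), FakeEngine()]
manager = RolloutManagerSketch(engines, ProfileConfig(start_rollout_id=1, num_steps=10))
manager.generate(0)  # not the target rollout: no profiler calls
manager.generate(1)  # target rollout: start, then stop, on every engine
```

Keeping the "is this the target rollout" check in one helper means the default path (argument unset) returns immediately and never touches the engines.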

Usage:
--rollout-profile-start-rollout-id 1
--rollout-profile-num-steps 10
--rollout-profile-output-dir /tmp/rollout_traces
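
As a rough sketch of how these flags could be declared and parsed with `argparse`: the flag names come from this PR, while the types, defaults, and the `add_rollout_profile_args` helper are assumptions and may differ from what `slime/utils/arguments.py` actually does.

```python
import argparse


def add_rollout_profile_args(parser):
    """Register the --rollout-profile-* flags (types and defaults are assumed)."""
    group = parser.add_argument_group("rollout profiler (sketch)")
    # Profiling stays disabled while the start rollout id is unset.
    group.add_argument("--rollout-profile-start-rollout-id", type=int, default=None)
    group.add_argument("--rollout-profile-num-steps", type=int, default=None)
    group.add_argument("--rollout-profile-start-step", type=int, default=None)
    group.add_argument(
        "--rollout-profile-activities",
        nargs="+",
        choices=["CPU", "GPU", "MEM"],
        default=None,
    )
    group.add_argument("--rollout-profile-output-dir", type=str, default=None)
    group.add_argument("--rollout-profile-by-stage", action="store_true")
    group.add_argument("--rollout-profile-with-stack", action="store_true")
    group.add_argument("--rollout-profile-record-shapes", action="store_true")
    return parser


parser = add_rollout_profile_args(argparse.ArgumentParser())
args = parser.parse_args(
    [
        "--rollout-profile-start-rollout-id", "1",
        "--rollout-profile-num-steps", "10",
        "--rollout-profile-output-dir", "/tmp/rollout_traces",
    ]
)
print(args.rollout_profile_start_rollout_id, args.rollout_profile_num_steps)
```

With no flags given, `rollout_profile_start_rollout_id` stays `None` in this sketch, which corresponds to the "disabled when unset" behavior.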

Validation:
Tested with tests/test_qwen2.5_0.5B_async_short.py on 4×H20:

  • Default disabled: no profile logs, training completes normally, no performance regression
  • Enabled at rollout_id=1: profiler starts at correct rollout, each engine generates a trace.json.gz, profiler stops cleanly
  • All engines log Start profile / Profiling done as expected
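
A small helper along these lines can confirm that trace files were written. It is only a sketch: the `find_traces` name is hypothetical, and it assumes trace file names end in `trace.json.gz` somewhere under the output directory. The demo runs against a temporary directory with dummy files rather than real traces.

```python
import tempfile
from pathlib import Path


def find_traces(output_dir):
    """Return all files under output_dir whose names end in 'trace.json.gz'."""
    return sorted(Path(output_dir).rglob("*trace.json.gz"))


# Demo with dummy files standing in for per-engine traces.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("engine0.trace.json.gz", "engine1.trace.json.gz", "notes.txt"):
        (Path(tmp) / name).write_bytes(b"")
    found = [p.name for p in find_traces(tmp)]
    print(found)
```

On a real run, the number of files found can be compared against the number of rollout engines.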
