feat(rl): add REINFORCE advantage estimator by EazyReal · Pull Request #2083 · THUDM/slime

EazyReal · 2026-06-15T20:44:22Z

What

Adds a plain reinforce advantage estimator: GRPO-style group-normalized advantages with the additive -A·logπ surrogate (compute_reinforce_loss) — no PPO/IS ratio, no clipping, gradient only through log_probs. Reuses the existing GRPO returns / group-normalization plumbing; only the surrogate differs from grpo.
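A minimal sketch of the surrogate described above may help reviewers, assuming one log-probability per sample for brevity (the real code presumably works per token). The names `group_normalize_sketch` and `reinforce_surrogate_sketch` are hypothetical and are not the functions in `ppo_utils.py`; the `eps` value and the use of the unbiased standard deviation are also assumptions.

```python
import torch


def group_normalize_sketch(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style normalization (hypothetical helper).

    rewards has shape [num_groups, group_size]; each row is normalized
    by its own mean and standard deviation.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)


def reinforce_surrogate_sketch(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Additive surrogate -A * log pi (hypothetical helper).

    No ratio and no clipping: the advantage is detached, so the only
    gradient path is through log_probs.
    """
    return -advantages.detach() * log_probs


# One group of four samples with binary rewards (illustrative values).
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0]])
adv = group_normalize_sketch(rewards)

log_probs = torch.tensor([-0.5, -1.0, -2.0, -0.1], requires_grad=True)
loss = reinforce_surrogate_sketch(log_probs, adv[0]).mean()
loss.backward()

# The gradient w.r.t. each log-prob is -A / N, independent of the log-prob value.
print(log_probs.grad)
```

Because the gradient is `-A / N` regardless of the current log-probabilities, there is no implicit importance ratio in this surrogate, which is what distinguishes it from the `grpo` path.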

This is the on-policy REINFORCE base. Off-policy importance-sampling corrections layer on top via the existing TIS hook; see the companion PR (current-policy IS correction), which, composed with reinforce, reproduces the CISPO surrogate.

Changes

Adds compute_reinforce_loss (ppo_utils.py). Routes reinforce through the GRPO returns path and group normalization (loss.py, ray/rollout.py). Adds reinforce to the --advantage-estimator choices/help (arguments.py) and to the policy_loss_function dispatch (loss.py), next to the other reinforce variants.

Tests — `tests/test_reinforce.py` (CPU, `NUM_GPUS = 0`)

Pure-torch tests (like test_chunked_gae.py). They check the REINFORCE loss against its closed form and verify that the gradient flows only through log_probs.
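The two properties named above could be checked roughly as follows. This is a sketch under assumptions, not the contents of `tests/test_reinforce.py`: `reinforce_loss_sketch` is a hypothetical stand-in for `compute_reinforce_loss`, whose real signature is not shown in this PR description.

```python
import torch


def reinforce_loss_sketch(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    # Hypothetical stand-in: per-element -A * log pi with detached advantages.
    return -advantages.detach() * log_probs


def check_closed_form() -> None:
    log_probs = torch.tensor([-0.2, -1.5, -0.7])
    advantages = torch.tensor([1.0, -0.5, 2.0])
    loss = reinforce_loss_sketch(log_probs, advantages)
    # Computed by hand: -(1.0)(-0.2), -(-0.5)(-1.5), -(2.0)(-0.7).
    expected = torch.tensor([0.2, -0.75, 1.4])
    assert torch.allclose(loss, expected)


def check_gradient_path() -> None:
    advantages = torch.tensor([1.0, -0.5, 2.0], requires_grad=True)
    # Two different log-prob settings must give the same gradient, -A,
    # because the surrogate contains no ratio term.
    for values in ([-0.2, -1.5, -0.7], [-3.0, -0.01, -1.0]):
        log_probs = torch.tensor(values, requires_grad=True)
        reinforce_loss_sketch(log_probs, advantages).sum().backward()
        assert torch.allclose(log_probs.grad, -advantages.detach())
        # Nothing flows back into the advantages.
        assert advantages.grad is None


check_closed_form()
check_gradient_path()
```

Both checks run on CPU with no model or rollout state, which matches the `NUM_GPUS = 0` setting in the heading.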

Relation to the CISPO estimator

CISPO already ships as a dedicated --advantage-estimator cispo (#2067). This PR plus the companion current-policy IS correction express the same surrogate compositionally — reinforce + that correction reproduces CISPO exactly (verified) — while the hook also enables other off-policy corrections. Both paths can coexist; if you'd prefer a single one, the composable form can subsume the dedicated estimator. Raising it for your preference. (miles counterpart: radixark/miles#1343.)
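For reference, the composition claim can be written out as follows. This is a sketch under assumptions: it takes the companion correction to be a stop-gradient, clipped current-policy ratio that multiplies the REINFORCE term, and it uses the CISPO objective as it is commonly stated. The symbols $\hat{w}_t$, $\epsilon_{\text{low}}$, $\epsilon_{\text{high}}$ and $\operatorname{sg}$ are notation introduced here, not names from either PR.

$$
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)}, \qquad
\hat{w}_t = \operatorname{sg}\!\big[\operatorname{clip}\big(r_t(\theta),\, 1-\epsilon_{\text{low}},\, 1+\epsilon_{\text{high}}\big)\big]
$$

$$
\mathcal{L}_{\text{reinforce}}(t) = -A_t \log \pi_\theta(a_t \mid s_t), \qquad
\hat{w}_t \cdot \mathcal{L}_{\text{reinforce}}(t) = -\hat{w}_t\, A_t \log \pi_\theta(a_t \mid s_t)
$$

Under these assumptions the right-hand side has the same form as the CISPO per-token loss: the weight is treated as a constant, so the gradient still flows only through $\log \pi_\theta$.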

🤖 Generated with Claude Code

Add a plain `reinforce` advantage estimator: GRPO-style group-normalized advantages with the additive `-A*logpi` surrogate (`compute_reinforce_loss` in ppo_utils) -- no PPO/IS ratio, no clipping, gradient only through `log_probs`. Reuses the existing GRPO returns / group-normalization plumbing; only the surrogate differs. This is the on-policy base for layering off-policy importance-sampling corrections on top via the TIS hook (added separately).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

EazyReal mentioned this pull request Jun 15, 2026

feat(rl): composable current-policy importance-sampling correction (TIS hook) #2084

Open

EazyReal force-pushed the reinforce-estimator branch from 0a6ea75 to 8f1c408 on June 15, 2026 20:47

EazyReal wants to merge 1 commit into THUDM:main from EazyReal:reinforce-estimator
