
feat(rl): add REINFORCE advantage estimator #2083

Open
EazyReal wants to merge 1 commit into THUDM:main from EazyReal:reinforce-estimator


@EazyReal

Contributor

What

Adds a plain reinforce advantage estimator: GRPO-style group-normalized advantages with the additive -A·logπ surrogate (compute_reinforce_loss) — no PPO/IS ratio, no clipping, gradient only through log_probs. Reuses the existing GRPO returns / group-normalization plumbing; only the surrogate differs from grpo.
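For readers who want the surrogate spelled out, here is a minimal sketch of the additive -A·logπ loss described above. The function name, the mask argument, and the masked-mean reduction are assumptions made for illustration; this is not the signature or body of compute_reinforce_loss in ppo_utils.py.

```python
import torch


def reinforce_surrogate_sketch(log_probs, advantages, mask):
    """Hypothetical stand-in for the -A * log(pi) surrogate (not the PR's code)."""
    # Advantages are treated as constants, so the gradient flows only
    # through log_probs. No ratio and no clipping are involved.
    per_token = -advantages.detach() * log_probs
    # Assumption: reduce with a masked mean over response tokens.
    return (per_token * mask).sum() / mask.sum().clamp(min=1)


log_probs = torch.tensor([-0.5, -1.0, -2.0], requires_grad=True)
advantages = torch.tensor([1.0, -1.0, 0.5])
mask = torch.ones(3)

loss = reinforce_surrogate_sketch(log_probs, advantages, mask)
loss.backward()
# d(loss)/d(log_probs) is -A / N per token, independent of log_probs itself.
```

With three unmasked tokens the gradient on log_probs is simply -advantages / 3, which is the closed form a unit test can check against.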

This is the on-policy REINFORCE base. Off-policy importance-sampling corrections layer on top via the existing TIS hook — see the companion PR (current-policy IS correction), which composed with reinforce reproduces the CISPO surrogate.
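A quick sanity check on the "on-policy base" framing: when the sampling policy equals the current policy, the importance ratio is exactly 1, so a ratio-based surrogate and the plain -A·logπ surrogate produce the same gradient. The sketch below illustrates this with toy tensors; the values are made up and nothing here is taken from the PR's code.

```python
import torch

advantages = torch.tensor([1.0, -0.5, 2.0])
values = [-0.3, -1.2, -0.7]

# Plain REINFORCE surrogate: -A * log(pi).
lp_reinforce = torch.tensor(values, requires_grad=True)
(-advantages * lp_reinforce).sum().backward()

# Ratio surrogate evaluated on-policy: the "old" log-probs are the current
# ones with the gradient stopped, so the ratio equals 1 everywhere.
lp_ratio = torch.tensor(values, requires_grad=True)
ratio = torch.exp(lp_ratio - lp_ratio.detach())
(-advantages * ratio).sum().backward()

# Both gradients equal -A.
same_gradient = torch.allclose(lp_reinforce.grad, lp_ratio.grad)
```

The two only diverge once the samples come from a different policy than the one being updated, which is the case an importance-sampling correction is meant to handle.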

Changes

compute_reinforce_loss (ppo_utils.py); reinforce routed through the GRPO returns path + group-normalization (loss.py, ray/rollout.py); added to --advantage-estimator choices/help (arguments.py) and the policy_loss_function dispatch (loss.py), next to the other reinforce variants.
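For context on the group-normalization step that reinforce reuses, a GRPO-style normalization can be sketched as below. The function name, the eps value, the tensor layout (one row per prompt, one column per sampled response), and the use of torch's default unbiased standard deviation are all assumptions; the repository's own plumbing in loss.py and ray/rollout.py may differ in these details.

```python
import torch


def group_normalize_sketch(rewards, eps=1e-6):
    """Hypothetical GRPO-style normalization within each prompt's group."""
    # rewards: (num_prompts, group_size), one scalar reward per response.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    # Center and scale within the group; eps guards against zero variance.
    return (rewards - mean) / (std + eps)


rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 0.0, 2.0]])
adv = group_normalize_sketch(rewards)
# Each row now has zero mean; above-average responses get positive advantage.
```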

Tests — tests/test_reinforce.py (CPU, NUM_GPUS = 0)

Pure-torch (like test_chunked_gae.py); checks the REINFORCE closed-form loss and that the gradient flows only through log_probs.

Relation to the CISPO estimator

CISPO already ships as a dedicated --advantage-estimator cispo (#2067). This PR and the companion current-policy IS correction express the same surrogate compositionally: reinforce + that correction reproduces CISPO exactly (verified), while the hook also enables other off-policy corrections. Both paths can coexist; if you'd prefer a single one, the composable form can subsume the dedicated estimator. Raising it for your preference. (miles counterpart: radixark/miles#1343.)
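To make the composition concrete, here is a rough sketch of how a stop-gradient, clipped importance weight multiplied into the REINFORCE surrogate yields a CISPO-shaped loss, -sg(clip(r))·A·logπ. The function name, the clip bounds, and the mean reduction are hypothetical; the actual correction lives in the companion PR and is not shown here, so treat this as one reading of the idea rather than its implementation.

```python
import torch


def reinforce_with_is_weight_sketch(log_probs, behavior_log_probs, advantages,
                                    eps_low=0.2, eps_high=0.2):
    """Hypothetical: REINFORCE surrogate times a detached, clipped IS weight."""
    # Importance ratio of the current policy over the sampling policy.
    ratio = torch.exp(log_probs - behavior_log_probs)
    # Clip, then stop the gradient: the weight only rescales each token.
    weight = ratio.clamp(min=1 - eps_low, max=1 + eps_high).detach()
    # Gradient still flows only through log_probs, as in plain REINFORCE.
    return (-weight * advantages * log_probs).mean()


log_probs = torch.tensor([-1.0, -1.0], requires_grad=True)
behavior_log_probs = torch.tensor([-1.0, -2.0])
advantages = torch.tensor([1.0, 1.0])

loss = reinforce_with_is_weight_sketch(log_probs, behavior_log_probs, advantages)
loss.backward()
# Token 0 has ratio 1 (weight 1); token 1 has ratio e, clipped to 1.2.
```

Unlike PPO-style clipping, a clipped token here still contributes a gradient, scaled by the clipped weight, which is the property the dedicated cispo estimator and the composed form would share.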

🤖 Generated with Claude Code

Add a plain `reinforce` advantage estimator: GRPO-style group-normalized
advantages with the additive `-A*logpi` surrogate (`compute_reinforce_loss` in
ppo_utils) -- no PPO/IS ratio, no clipping, gradient only through `log_probs`.
Reuses the existing GRPO returns / group-normalization plumbing; only the
surrogate differs.

This is the on-policy base for layering off-policy importance-sampling
corrections on top via the TIS hook (added separately).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@EazyReal force-pushed the reinforce-estimator branch from 0a6ea75 to 8f1c408 (June 15, 2026 20:47)
