feat(rl): composable current-policy importance-sampling correction (TIS hook) by EazyReal · Pull Request #2084 · THUDM/slime

EazyReal · 2026-06-15T20:44:44Z

What

Makes off-policy importance sampling composable through the existing TIS correction hook, by exposing the current policy log-probs to corrections.

Hook contract (loss.py): policy_loss_function now passes cur_log_probs (current grad-carrying π_θ, per-sample list) into the correction kwargs, alongside the frozen train_log_probs (π_θ_old) and rollout_log_probs (π_rollout). Existing corrections (vanilla_tis_function, icepop_function) ignore it via **kwargs.
off_policy_is_function (ppo_utils.py): truncated IS between the current policy and the actual rollout generator — the (detached) weight is clip(π_θ / π_rollout) against the real rollout/inference-engine logprob, so a single weight corrects both the train/inference mismatch and async (multi-version) staleness. The existing TIS only had π_θ_old / π_rollout (equals this only in the single-update-per-rollout limit).

CISPO via composition

On the plain REINFORCE base (#2083), this reproduces CISPO (MiniMax-M1) exactly:

--advantage-estimator reinforce --use-tis \
  --custom-tis-function-path slime.utils.ppo_utils.off_policy_is_function

→ L = -clip(π_θ/π_rollout, 1-eps_clip, 1+eps_clip_high).detach() · A · logπ, gradient only through logπ. Single-sided CISPO: --eps-clip 1.0.

Depends on #2083 for the CISPO use-case; code here is independent and the test uses a manual REINFORCE base so it stands alone.

Tests — `tests/test_off_policy_is.py` (CPU, `NUM_GPUS = 0`)

Clip/clipfrac/mask pass-through; single-sided at eps_clip=1.0; equivalence to the closed-form CISPO surrogate on a manual REINFORCE base (loss + gradient). (The cur_log_probs wiring lives in loss.py, which imports megatron, so it's exercised in the GPU CI suites.)

Relation to the dedicated CISPO estimator

CISPO already ships as --advantage-estimator cispo (compute_cispo_loss, #2067). This PR (+ #2083) offers the same surrogate compositionally, plus a general current-policy IS hook for other corrections. Both can coexist; if you'd prefer a single path, the composable form can subsume the dedicated estimator. Raising it for your preference — happy to align either way. (miles counterpart: radixark/miles#1344.)

🤖 Generated with Claude Code

Expose the current grad-carrying log-probs to the policy-loss TIS hook as `cur_log_probs`, and add `off_policy_is_function` (in ppo_utils, next to compute_policy_loss/compute_cispo_loss) -- a truncated-IS correction between the *current* policy and the *actual rollout generator*: the (detached) weight is `clip(pi_theta / pi_rollout)` against the real rollout logprob, so one weight corrects both the train/inference mismatch and async (multi-version) staleness. The existing TIS hook only had pi_theta_old / pi_rollout, which equals this only in the single-update-per-rollout limit. On a plain REINFORCE base (`--advantage-estimator reinforce`) this reproduces the CISPO surrogate expressed as a correction rather than the dedicated `compute_cispo_loss` estimator. Existing corrections ignore the new kwarg via **kwargs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(rl): composable current-policy importance-sampling correction (TIS hook)#2084

feat(rl): composable current-policy importance-sampling correction (TIS hook)#2084
EazyReal wants to merge 1 commit into
THUDM:mainfrom
EazyReal:off-policy-is

EazyReal commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

EazyReal commented Jun 15, 2026

What

CISPO via composition

Tests — tests/test_off_policy_is.py (CPU, NUM_GPUS = 0)

Relation to the dedicated CISPO estimator

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Tests — `tests/test_off_policy_is.py` (CPU, `NUM_GPUS = 0`)