Skip to content

feat(rl): composable current-policy importance-sampling correction (TIS hook)#2084

Open
EazyReal wants to merge 1 commit into
THUDM:mainfrom
EazyReal:off-policy-is
Open

feat(rl): composable current-policy importance-sampling correction (TIS hook)#2084
EazyReal wants to merge 1 commit into
THUDM:mainfrom
EazyReal:off-policy-is

Conversation

@EazyReal

Copy link
Copy Markdown
Contributor

What

Makes off-policy importance sampling composable through the existing TIS correction hook, by exposing the current policy log-probs to corrections.

  • Hook contract (loss.py): policy_loss_function now passes cur_log_probs (current grad-carrying π_θ, per-sample list) into the correction kwargs, alongside the frozen train_log_probs (π_θ_old) and rollout_log_probs (π_rollout). Existing corrections (vanilla_tis_function, icepop_function) ignore it via **kwargs.
  • off_policy_is_function (ppo_utils.py): truncated IS between the current policy and the actual rollout generator — the (detached) weight is clip(π_θ / π_rollout) against the real rollout/inference-engine logprob, so a single weight corrects both the train/inference mismatch and async (multi-version) staleness. The existing TIS only had π_θ_old / π_rollout (equals this only in the single-update-per-rollout limit).

CISPO via composition

On the plain REINFORCE base (#2083), this reproduces CISPO (MiniMax-M1) exactly:

--advantage-estimator reinforce --use-tis \
  --custom-tis-function-path slime.utils.ppo_utils.off_policy_is_function

L = -clip(π_θ/π_rollout, 1-eps_clip, 1+eps_clip_high).detach() · A · logπ, gradient only through logπ. Single-sided CISPO: --eps-clip 1.0.

Depends on #2083 for the CISPO use-case; code here is independent and the test uses a manual REINFORCE base so it stands alone.

Tests — tests/test_off_policy_is.py (CPU, NUM_GPUS = 0)

Clip/clipfrac/mask pass-through; single-sided at eps_clip=1.0; equivalence to the closed-form CISPO surrogate on a manual REINFORCE base (loss + gradient). (The cur_log_probs wiring lives in loss.py, which imports megatron, so it's exercised in the GPU CI suites.)

Relation to the dedicated CISPO estimator

CISPO already ships as --advantage-estimator cispo (compute_cispo_loss, #2067). This PR (+ #2083) offers the same surrogate compositionally, plus a general current-policy IS hook for other corrections. Both can coexist; if you'd prefer a single path, the composable form can subsume the dedicated estimator. Raising it for your preference — happy to align either way. (miles counterpart: radixark/miles#1344.)

🤖 Generated with Claude Code

Expose the current grad-carrying log-probs to the policy-loss TIS hook as
`cur_log_probs`, and add `off_policy_is_function` (in ppo_utils, next to
compute_policy_loss/compute_cispo_loss) -- a truncated-IS correction between the
*current* policy and the *actual rollout generator*: the (detached) weight is
`clip(pi_theta / pi_rollout)` against the real rollout logprob, so one weight
corrects both the train/inference mismatch and async (multi-version) staleness.
The existing TIS hook only had pi_theta_old / pi_rollout, which equals this only
in the single-update-per-rollout limit.

On a plain REINFORCE base (`--advantage-estimator reinforce`) this reproduces the
CISPO surrogate expressed as a correction rather than the dedicated
`compute_cispo_loss` estimator. Existing corrections ignore the new kwarg via **kwargs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant