feat(rl): composable current-policy importance-sampling correction (TIS hook)#2084
Open
EazyReal wants to merge 1 commit into
Open
feat(rl): composable current-policy importance-sampling correction (TIS hook)#2084EazyReal wants to merge 1 commit into
EazyReal wants to merge 1 commit into
Conversation
Expose the current grad-carrying log-probs to the policy-loss TIS hook as `cur_log_probs`, and add `off_policy_is_function` (in ppo_utils, next to compute_policy_loss/compute_cispo_loss) -- a truncated-IS correction between the *current* policy and the *actual rollout generator*: the (detached) weight is `clip(pi_theta / pi_rollout)` against the real rollout logprob, so one weight corrects both the train/inference mismatch and async (multi-version) staleness. The existing TIS hook only had pi_theta_old / pi_rollout, which equals this only in the single-update-per-rollout limit. On a plain REINFORCE base (`--advantage-estimator reinforce`) this reproduces the CISPO surrogate expressed as a correction rather than the dedicated `compute_cispo_loss` estimator. Existing corrections ignore the new kwarg via **kwargs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Makes off-policy importance sampling composable through the existing TIS correction hook, by exposing the current policy log-probs to corrections.
loss.py):policy_loss_functionnow passescur_log_probs(current grad-carrying π_θ, per-sample list) into the correction kwargs, alongside the frozentrain_log_probs(π_θ_old) androllout_log_probs(π_rollout). Existing corrections (vanilla_tis_function,icepop_function) ignore it via**kwargs.off_policy_is_function(ppo_utils.py): truncated IS between the current policy and the actual rollout generator — the (detached) weight isclip(π_θ / π_rollout)against the real rollout/inference-engine logprob, so a single weight corrects both the train/inference mismatch and async (multi-version) staleness. The existing TIS only hadπ_θ_old / π_rollout(equals this only in the single-update-per-rollout limit).CISPO via composition
On the plain REINFORCE base (#2083), this reproduces CISPO (MiniMax-M1) exactly:
→
L = -clip(π_θ/π_rollout, 1-eps_clip, 1+eps_clip_high).detach() · A · logπ, gradient only throughlogπ. Single-sided CISPO:--eps-clip 1.0.Tests —
tests/test_off_policy_is.py(CPU,NUM_GPUS = 0)Clip/clipfrac/mask pass-through; single-sided at
eps_clip=1.0; equivalence to the closed-form CISPO surrogate on a manual REINFORCE base (loss + gradient). (Thecur_log_probswiring lives inloss.py, which imports megatron, so it's exercised in the GPU CI suites.)Relation to the dedicated CISPO estimator
CISPO already ships as
--advantage-estimator cispo(compute_cispo_loss, #2067). This PR (+ #2083) offers the same surrogate compositionally, plus a general current-policy IS hook for other corrections. Both can coexist; if you'd prefer a single path, the composable form can subsume the dedicated estimator. Raising it for your preference — happy to align either way. (miles counterpart: radixark/miles#1344.)🤖 Generated with Claude Code