[feat:] Add CISPO loss by pengdurice · Pull Request #2531 · NVIDIA-NeMo/RL

pengdurice · 2026-05-19T17:24:04Z

What does this PR do ?

Continues the work started in the existing CISPO PR #2187.

Adds CISPO support to the GRPO loss path and provides matched GRPO/CISPO Qwen3-30B-A3B high-off-policy recipes to validate the clipped importance-sampling objective under repeated updates per rollout.

Notes:

This PR currently includes the YAML and shell files needed to fully reproduce the results below; they can be removed later.
PR with test files cleaned.

Issues

N/A

Usage

# GRPO baseline
bash tests/test_suites/llm/cispo-mm1-highoffpolicy-qwen3-30ba3b-2n8g-megatron-grpo.sh

# CISPO treatment
bash tests/test_suites/llm/cispo-mm1-highoffpolicy-qwen3-30ba3b-2n8g-megatron-cispo.sh

The recipes use Qwen3-30B-A3B with Megatron policy training and colocated vLLM generation. They are matched except for the loss configuration:

GRPO: standard clipped ratio objective.
CISPO: detached clipped IS-weight objective with a wider upper bound.
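
As a rough illustration of how the two objectives differ, the per-token gradient with respect to the new log-probability can be sketched in plain Python. This is a minimal sketch, not the code in this PR. The function names are hypothetical, the upper bounds 1.2 and 6.0 are taken from the `train/probs_ratio_clamped_max` values reported under Validation Results, and the lower bounds (0.8 for GRPO, 0.0 for CISPO) are assumptions.

```python
import math


def clip(x, lo, hi):
    """Clamp x into [lo, hi]."""
    return max(lo, min(x, hi))


def grpo_token_grad(logp_new, logp_old, adv, clip_lo=0.8, clip_hi=1.2):
    """d(loss)/d(logp_new) for loss = -min(r * A, clip(r) * A).

    Bounds are assumed values, not read from the PR's configs.
    """
    r = math.exp(logp_new - logp_old)
    unclipped = r * adv
    clipped = clip(r, clip_lo, clip_hi) * adv
    if clipped < unclipped:
        # min() selects the clipped branch, which is constant in logp_new:
        # the token contributes no gradient.
        return 0.0
    return -r * adv  # d(r)/d(logp_new) = r


def cispo_token_grad(logp_new, logp_old, adv, clip_lo=0.0, clip_hi=6.0):
    """d(loss)/d(logp_new) for loss = -stopgrad(clip(r)) * A * logp_new.

    The clipped ratio is treated as a constant (detached) weight, so every
    token keeps a gradient; clipping only bounds its magnitude.
    """
    r = math.exp(logp_new - logp_old)
    w = clip(r, clip_lo, clip_hi)
    return -w * adv
```

With a ratio of 2.0 and a positive advantage, the GRPO sketch returns a zero gradient because the token is clipped, while the CISPO sketch still returns a gradient scaled by the detached weight.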

Validation Results

High-off-policy validation was run with Qwen3-30B-A3B on 2 nodes x 8 GPUs. The setup uses 32 prompts x 16 generations = 512 trajectories per rollout and train_global_batch_size=32, giving 16 policy updates per rollout.
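
The updates-per-rollout figure follows directly from the numbers above. A small sketch of the arithmetic, where every variable name other than `train_global_batch_size` is made up for illustration:

```python
num_prompts_per_rollout = 32       # prompts sampled per rollout
generations_per_prompt = 16        # responses generated per prompt
train_global_batch_size = 32       # trajectories consumed per policy update

trajectories_per_rollout = num_prompts_per_rollout * generations_per_prompt
updates_per_rollout = trajectories_per_rollout // train_global_batch_size

print(trajectories_per_rollout, updates_per_rollout)
```

All updates after the first in each rollout train on data generated by an older policy, which is what makes this setup high-off-policy.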

| Metric | GRPO | CISPO | Notes |
| --- | --- | --- | --- |
| Job ID | 11703 | 11721 | Both completed successfully |
| Average train reward, 100 steps | 0.5399 | 0.5406 | Roughly tied |
| Average train reward, last 10 steps | 0.5383 | 0.5422 | CISPO slightly higher |
| Final validation accuracy | 0.5391 | 0.5703 | CISPO higher |
| Average validation accuracy | 0.5357 | 0.5368 | Roughly tied |
| Final generation KL | 0.0049 | 0.0023 | CISPO lower |
| Average generation KL | 0.0033 | 0.0020 | CISPO lower |
| Average step time | 204.0s | 205.3s | Similar |
| Last-10 average step time | 201.0s | 199.6s | Similar |

Mechanistic checks:

| Metric | GRPO | CISPO | Notes |
| --- | --- | --- | --- |
| `train/probs_ratio_clamped_max` | 1.2 | 6.0 | CISPO uses the intended wide clipped IS-weight path |
| `train/cispo_diag/grpo_would_clip_frac` average | 0.00354 | 0.00319 | Similar rate of tokens that standard GRPO would hard-clip |
| `train/cispo_diag/would_clip_and_low_prob_frac` average | 0.00051 | 0.00071 | CISPO saw more low-probability clipped-token mass |
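
The exact definition of `grpo_would_clip_frac` is not shown in this description. One plausible reading, the fraction of tokens whose gradient the standard GRPO clip would zero out, can be sketched as follows. The function name, argument names, and the 0.8/1.2 bounds are assumptions for illustration only.

```python
import math


def would_clip_frac(logp_new, logp_old, advantages, clip_lo=0.8, clip_hi=1.2):
    """Fraction of tokens for which a GRPO-style clip would drop the gradient.

    A token is counted when its ratio has already moved past the bound in the
    direction the advantage pushes it (assumed definition).
    """
    if not advantages:
        return 0.0
    clipped = 0
    for new, old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(new - old)
        if (adv > 0 and ratio > clip_hi) or (adv < 0 and ratio < clip_lo):
            clipped += 1
    return clipped / len(advantages)
```

Under this reading, a value near 0.003 means roughly 3 tokens in 1,000 would lose their gradient under the GRPO clip.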

Async lag-1 high-off-policy validation was also run with the same model and rollout/update ratio, but with non-colocated async vLLM generation and max_trajectory_age_steps=1.

| Metric | GRPO | CISPO | Notes |
| --- | --- | --- | --- |
| Job ID | 11833 | 11834 | Both completed successfully |
| Average train reward, 100 steps | 0.5352 | 0.5414 | CISPO higher |
| Validation accuracy @ step 0 | 0.5234 | 0.5234 | Matched start |
| Validation accuracy @ step 20 | 0.5312 | 0.5312 | Matched |
| Validation accuracy @ step 40 | 0.5391 | 0.5781 | CISPO higher |
| Validation accuracy @ step 60 | 0.5391 | 0.5391 | Matched |
| Validation accuracy @ step 80 | 0.5938 | 0.5234 | GRPO higher |
| Validation accuracy @ step 100 | 0.5156 | 0.5703 | CISPO higher |
| Average validation accuracy | 0.5404 | 0.5443 | CISPO slightly higher |
| Average validation response length | 2962.3 | 2971.4 | Similar |
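
The average validation accuracy row can be cross-checked against the six checkpoints listed above. A quick sketch, assuming the reported average is the plain mean of those checkpoints:

```python
# Validation accuracy at steps 0, 20, 40, 60, 80, 100 (copied from the table).
grpo = [0.5234, 0.5312, 0.5391, 0.5391, 0.5938, 0.5156]
cispo = [0.5234, 0.5312, 0.5781, 0.5391, 0.5234, 0.5703]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


grpo_avg = mean(grpo)
cispo_avg = mean(cispo)

# Count the checkpoints at which each method is strictly ahead.
cispo_ahead = sum(c > g for g, c in zip(grpo, cispo))
grpo_ahead = sum(g > c for g, c in zip(grpo, cispo))

print(f"GRPO avg={grpo_avg:.4f}, CISPO avg={cispo_avg:.4f}")
print(f"CISPO ahead at {cispo_ahead} checkpoints, GRPO ahead at {grpo_ahead}")
```

The means agree with the reported 0.5404 and 0.5443 to within rounding, and the per-checkpoint counts show that the gap comes from a few checkpoints rather than a consistent lead.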

Async lag-1 mechanistic checks:

| Metric | GRPO | CISPO | Notes |
| --- | --- | --- | --- |
| `train/probs_ratio_clamped_max` | 1.2 | 6.0 | CISPO uses the intended wide clipped IS-weight path |
| `train/reward` average | 0.5352 | 0.5414 | Same as average train reward |
| `train/policy_kl_error` average | 2294.65 | 2.81 | GRPO had much larger outliers in this run |
| `train/token_mult_prob_error` average | 2295.66 | 3.83 | GRPO had much larger outliers in this run |

Summary: Across the synchronous high-off-policy and async lag-1 validations, CISPO shows modest positive evidence. Average train reward ranges from roughly tied (0.5406 vs. 0.5399, synchronous) to slightly higher (0.5414 vs. 0.5352, async lag-1), average validation accuracy is marginally higher in both runs, and both runs finish with higher final validation accuracy (0.5703 vs. 0.5391 synchronous; 0.5703 vs. 0.5156 async lag-1). The effect is not a clean sweep at every checkpoint (GRPO leads at async step 80, 0.5938 vs. 0.5234), but CISPO is at least competitive with GRPO and appears more stable on the ratio/KL outlier diagnostics in the async lag-1 setting.

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
- Functional validation: 100-step GRPO/CISPO high-off-policy runs on Qwen3-30B-A3B completed.
- Functional validation: 100-step GRPO/CISPO async lag-1 high-off-policy runs on Qwen3-30B-A3B completed.
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

CISPO validation used the final high-off-policy recipes:

examples/configs/recipes/llm/cispo-mm1-highoffpolicy-qwen3-30ba3b-2n8g-megatron-grpo.yaml
examples/configs/recipes/llm/cispo-mm1-highoffpolicy-qwen3-30ba3b-2n8g-megatron-cispo.yaml
examples/configs/recipes/llm/cispo-mm1-async-lag1-highoffpolicy-qwen3-30ba3b-2n8g-megatron-grpo.yaml
examples/configs/recipes/llm/cispo-mm1-async-lag1-highoffpolicy-qwen3-30ba3b-2n8g-megatron-cispo.yaml

The corresponding launch scripts are:

tests/test_suites/llm/cispo-mm1-highoffpolicy-qwen3-30ba3b-2n8g-megatron-grpo.sh
tests/test_suites/llm/cispo-mm1-highoffpolicy-qwen3-30ba3b-2n8g-megatron-cispo.sh
tests/test_suites/llm/cispo-mm1-async-lag1-highoffpolicy-qwen3-30ba3b-2n8g-megatron-grpo.sh
tests/test_suites/llm/cispo-mm1-async-lag1-highoffpolicy-qwen3-30ba3b-2n8g-megatron-cispo.sh

copy-pr-bot · 2026-05-19T17:24:12Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


github-actions Bot added the community-request label May 19, 2026

slikhite-1 and others added 7 commits May 20, 2026 18:30

CISPO implementation

49e6b55

Signed-off-by: slikhite-1 <slikhite@nvidia.com>

≈test cases

b3cc275

Signed-off-by: slikhite-1 <slikhite@nvidia.com>

docs

32af27d

Signed-off-by: slikhite-1 <slikhite@nvidia.com>

removed assertion

617bc93

Signed-off-by: slikhite-1 <slikhite@nvidia.com>

assert removed

96f9b2e

Signed-off-by: slikhite-1 <slikhite@nvidia.com>

initial fix of the previous PR, add many test cases now, and will rem…

bfa408f

…ove / check later Signed-off-by: pengdurice <pengduhit@gmail.com>

only include used tests and will do more clean up later

fa3b68a

Signed-off-by: pengdurice <pengduhit@gmail.com>

pengdurice force-pushed the peng-cispo-v1 branch from 2494632 to fa3b68a Compare May 20, 2026 19:00

pengdurice added 4 commits May 20, 2026 21:38

Fix CISPO rebase cleanup issues

333759e

Signed-off-by: pengdurice <pengduhit@gmail.com>

add async yaml and sh files

b081112

Signed-off-by: pengdurice <pengduhit@gmail.com>

clean up some sh and yaml files, add one nightly and to doc

9d44d97

Signed-off-by: pengdurice <pengduhit@gmail.com>

remove diagnositic

4907872

Signed-off-by: pengdurice <pengduhit@gmail.com>

github-actions Bot added the Documentation Improvements or additions to documentation label May 21, 2026

add cispo.md

06c4fee

Signed-off-by: pengdurice <pengduhit@gmail.com>

pengdurice changed the title ~~[feat:] [Draft] Initial fix of the CISPO PR~~ [feat:] Add CISPO loss May 21, 2026

pengdurice marked this pull request as ready for review May 21, 2026 23:35

pengdurice requested review from a team as code owners May 21, 2026 23:35

svcnvidia-nemo-ci added the waiting-on-maintainers Waiting on maintainers to respond label May 22, 2026

pengdurice wants to merge 12 commits into NVIDIA-NeMo:main from pengdurice:peng-cispo-v1

3 participants
