feat: PPO with MCore by bg51717 · Pull Request #2530 · NVIDIA-NeMo/RL

bg51717 · 2026-05-19T14:43:35Z

What does this PR do?

Adds full Proximal Policy Optimization (PPO) support to NeMo-RL with an actor-critic architecture, using the Megatron-Core (mcore) backend for both the policy and value models. The policy (actor) and value function (critic) are trained together, with advantages and value targets computed by Generalized Advantage Estimation (GAE). Both models use GPU/CPU offloading so they can be colocated on the same set of GPUs as vLLM generation.

Backend support: This PR implements PPO on the Megatron-Core backend only. The DTensor/FSDP2 backend is not yet supported for the value model. All recipes and tests use megatron_cfg.enabled: true.

Issues

close #2048

Summary of Changes

PPO Training Algorithm

Complete PPO training loop with critic-before-actor update order
Multiple training steps per rollout (steps_per_epoch)
Configurable critic warmup (policy_training_start_step) — trains value model alone before policy updates begin
Dynamic sampling support
Colocated architecture with GPU memory management via model offloading

Generalized Advantage Estimation (GAE)

Token-level GAE with carry-forward masking for correct multi-turn/padding handling
Token-level KL penalty in rewards (configurable coefficient and KL type: k1/k3)
VAPO decoupled GAE: separate lambda for value returns vs. policy advantages
Length-adaptive lambda: lambda_policy = 1 - 1/(alpha * response_length)
Reward whitening
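
To make the GAE items above concrete, here is a minimal pure-Python sketch of token-level GAE over a single sequence. It is not the code in this PR: the function names, argument names, and the exact masking rule (skip masked positions and carry the running next-value and advantage across them) are assumptions chosen for illustration. Only the formula lambda_policy = 1 - 1/(alpha * response_length) is taken from the summary.

```python
from typing import List, Tuple


def length_adaptive_lambda(response_length: int, alpha: float) -> float:
    """lambda_policy = 1 - 1/(alpha * response_length), as stated in the summary."""
    return 1.0 - 1.0 / (alpha * response_length)


def masked_gae(
    rewards: List[float],
    values: List[float],
    mask: List[int],
    gamma: float = 1.0,
    lam: float = 1.0,
) -> Tuple[List[float], List[float]]:
    """Token-level GAE for one sequence (hypothetical helper, not the PR's API).

    mask[t] == 1 marks a trainable response token; 0 marks padding or
    tokens that should not be trained on (for example, environment turns).
    At masked positions the running next-value and running advantage are
    carried forward unchanged, so the recursion skips over them.
    """
    num_tokens = len(rewards)
    advantages = [0.0] * num_tokens
    next_value = 0.0  # bootstrap value after the final token is assumed to be 0
    last_adv = 0.0
    for t in reversed(range(num_tokens)):
        if mask[t]:
            delta = rewards[t] + gamma * next_value - values[t]
            last_adv = delta + gamma * lam * last_adv
            next_value = values[t]
            advantages[t] = last_adv
        # masked position: leave next_value and last_adv untouched
    returns = [(a + v) if m else 0.0 for a, v, m in zip(advantages, values, mask)]
    return advantages, returns


def decoupled_gae(
    rewards: List[float],
    values: List[float],
    mask: List[int],
    lam_value: float,
    alpha: float,
    gamma: float = 1.0,
) -> Tuple[List[float], List[float]]:
    """Decoupled GAE: one lambda for value returns, another for policy advantages."""
    response_length = sum(mask)
    lam_policy = length_adaptive_lambda(response_length, alpha)
    policy_adv, _ = masked_gae(rewards, values, mask, gamma, lam_policy)
    _, value_returns = masked_gae(rewards, values, mask, gamma, lam_value)
    return policy_adv, value_returns
```

In this sketch the critic is regressed onto `value_returns` while the actor uses `policy_adv`, which is one way to read "separate lambda for value returns vs. policy advantages". Note that with a small `alpha` and a short response the length-adaptive formula can produce a lambda below zero, so a real implementation would presumably guard that case.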

Value Model (Megatron-Core backend)

LM backbone + scalar value head, reusing Megatron-Core policy infrastructure
Supports TP/PP/DP parallelism, distributed optimizer, sequence packing
GPU/CPU offloading for colocated execution with policy and vLLM
Full checkpoint save/load including value head weights
Clipped MSE value loss with configurable loss scale and clip range
VAPO NLL auxiliary loss on correct samples
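
As an illustration of the "clipped MSE value loss" item, the sketch below follows the standard PPO value-clipping recipe: the new value prediction is clipped to stay within a range of the rollout-time prediction, and the larger of the clipped and unclipped squared errors is used. The function name, argument names, default values, and the token-mean reduction are assumptions, not the PR's actual implementation.

```python
from typing import List


def clipped_value_loss(
    values: List[float],
    old_values: List[float],
    returns: List[float],
    mask: List[int],
    clip_range: float = 0.2,   # illustrative default, not taken from the PR
    loss_scale: float = 0.5,   # illustrative default, not taken from the PR
) -> float:
    """Clipped MSE value loss averaged over unmasked tokens (hypothetical helper)."""
    total = 0.0
    count = 0
    for v, v_old, ret, m in zip(values, old_values, returns, mask):
        if not m:
            continue
        # Keep the new prediction within clip_range of the rollout-time value.
        v_clipped = v_old + max(-clip_range, min(clip_range, v - v_old))
        loss_unclipped = (v - ret) ** 2
        loss_clipped = (v_clipped - ret) ** 2
        # Pessimistic choice: take the larger of the two errors.
        total += max(loss_unclipped, loss_clipped)
        count += 1
    return loss_scale * total / max(count, 1)
```

Taking the maximum means the critic gets no credit for moving further than `clip_range` from its old prediction in a single update, which mirrors the role ratio clipping plays for the actor.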

Shared Algorithm Improvements

Refactored clipped PG loss to support both GRPO and PPO
Added Reinforce++ and raw-reward advantage estimators
GSM8K answer extraction and verification environment
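
The clipped policy-gradient loss shared by GRPO and PPO can be sketched as follows, including the asymmetric clipping and dual-clip options that the base config description mentions. This is a hedged, minimal version: the function signature, the default clip values, and the token-mean reduction are assumptions for illustration and may differ from the refactored loss in this PR.

```python
import math
from typing import List, Optional


def clipped_pg_loss(
    logprobs: List[float],
    old_logprobs: List[float],
    advantages: List[float],
    mask: List[int],
    clip_low: float = 0.2,     # illustrative default
    clip_high: float = 0.28,   # illustrative default; clip_high > clip_low is "asymmetric"
    dual_clip_c: Optional[float] = None,  # if set, must be > 1
) -> float:
    """Token-mean clipped surrogate loss (hypothetical helper)."""
    total = 0.0
    count = 0
    for lp, lp_old, adv, m in zip(logprobs, old_logprobs, advantages, mask):
        if not m:
            continue
        ratio = math.exp(lp - lp_old)
        clipped_ratio = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
        # Standard PPO: pessimistic (larger) of the unclipped and clipped losses.
        loss = max(-adv * ratio, -adv * clipped_ratio)
        # Dual-clip: bound the loss for negative advantages with very large ratios.
        if dual_clip_c is not None and adv < 0:
            loss = min(loss, -adv * dual_clip_c)
        total += loss
        count += 1
    return total / max(count, 1)
```

With `dual_clip_c=None` and `clip_low == clip_high`, this reduces to the familiar symmetric PPO clipped objective.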

Configuration and Recipes

Base config: examples/configs/ppo_math_1B_megatron.yaml (DAPO-style PPO: no KL penalty, asymmetric clipping, dual-clip, reward scaling)
ppo-dsr1-7b-math-8n8g-megatron — DeepSeek-R1-7B on DAPOMath-17K, 8 nodes, KL penalty + importance sampling
ppo-qwen2.5-1.5b-gsm8k-1n8g-megatron — Qwen2.5-1.5B-Instruct on GSM8K, 1 node, VAPO decoupled GAE

Tests

Unit tests: 17 tests for GAE computation, value loss, advantage estimator factory; 78 tests for Megatron model setup
Functional test: End-to-end PPO training on 2 GPUs with metric assertions on ratio clipping
Nightly tests (2 recipes):
- 8-node DeepSeek-R1-7B on DAPOMath, 40 steps, checks reward > 0.3 and accuracy > 0.42 at step 40
- 1-node Qwen2.5-1.5B on GSM8K, 100 steps, checks reward > 0.85 and accuracy > 0.7 at step 100
Reference config snapshot tests for all algorithms

Documentation

Algorithm overview: key differences from GRPO
In-depth guide: value model, GAE, VAPO decoupled GAE, training loop, loss functions, configuration, and metrics

Architecture

PPO Training Loop (mcore)

1. Generate responses (vLLM, colocated)
2. Score with environment (math verification)
3. Value inference → per-token V(s_t)
4. Policy logprobs → π_θ(a|s)
5. GAE → advantages A_t, returns R_t
6. Train critic (MSE on returns)
7. Train actor (clipped surrogate objective)
8. Steps 6-7 repeat steps_per_epoch times
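
The ordering of the loop above, together with the critic warmup described in the summary, can be sketched as a small driver function. Everything here is hypothetical scaffolding: the stage names, the `stages` dictionary, and the rule that the actor starts training once `step >= policy_training_start_step` are assumptions made for illustration; only `steps_per_epoch` and `policy_training_start_step` are names that appear in the PR description.

```python
from typing import Callable, Dict, List


def ppo_iteration(
    step: int,
    stages: Dict[str, Callable[[], None]],
    steps_per_epoch: int,
    policy_training_start_step: int,
) -> List[str]:
    """Run one PPO iteration and return the names of the stages executed, in order."""
    trace: List[str] = []

    def run(name: str) -> None:
        stages[name]()
        trace.append(name)

    # Rollout and preparation phase: each stage runs once per iteration.
    for name in ("generate", "score", "value_inference", "policy_logprobs", "gae"):
        run(name)

    # Optimization phase: critic before actor, repeated steps_per_epoch times.
    for _ in range(steps_per_epoch):
        run("train_critic")
        # Critic warmup: skip actor updates until the configured start step.
        if step >= policy_training_start_step:
            run("train_actor")
    return trace


# Usage with no-op stages, just to show the call order.
noop_stages = {
    name: (lambda: None)
    for name in (
        "generate", "score", "value_inference", "policy_logprobs",
        "gae", "train_critic", "train_actor",
    )
}
warmup_trace = ppo_iteration(0, noop_stages, steps_per_epoch=2, policy_training_start_step=1)
```

In a colocated setup, each stage would also be responsible for offloading and reloading the model it uses, which this sketch leaves out.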

Experimental Results

GSM8K: Qwen2.5-1.5B-Instruct, 1 node x 8 GPUs

val:accuracy over steps — shows convergence on GSM8K test set
train/reward over steps — shows reward progression

DAPOMath-17K: DeepSeek-R1-7B, 8 nodes x 8 GPUs

Metrics to screenshot from wandb (project: nemo-rl, run: ppo-dsr1-7b-math-8n8g-megatron):

val:accuracy (AIME 2024) over steps — shows convergence on competition math
train/reward over steps — shows reward progression

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Signed-off-by: Gerald Shen <geshen@nvidia.com>

Signed-off-by: bg51717 <biguo@nvidia.com>

copy-pr-bot · 2026-05-19T14:43:42Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: bg51717 <biguo@nvidia.com>

bg51717 · 2026-05-21T10:42:37Z

/ok to test 50e878e

hXl3s and others added 30 commits May 19, 2026 07:03

feat(ppo): Implementation scaffolding

f9dc82d

feat(ppo): advantage estimator fix

6e3f9c2

fix(ppo): fix advantage computation

65c2a0f

feat(ppo): add support for value model and reward whitening loss

2fed353

feat: update automodel repo

c808882

revert: accidentaly removed reference model

c656db1

fix: advantage computation and better logging

bfb66f4

chore: random tests for convergence

97115fc

fix reward bug

44c9a43

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix value

9ef3fdd

Signed-off-by: Gerald Shen <geshen@nvidia.com>

mcore

4d5811a

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix

3d50752

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix

85c8d2a

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix

1d4f40e

Signed-off-by: Gerald Shen <geshen@nvidia.com>

check

95bd4e8

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix21

191ec73

Signed-off-by: Gerald Shen <geshen@nvidia.com>

add

199b110

Signed-off-by: Gerald Shen <geshen@nvidia.com>

offload

a83a383

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix

d4870f2

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix

d23b1eb

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix config

141737a

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix

5b24aa5

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix

fb206ba

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix

201737a

Signed-off-by: Gerald Shen <geshen@nvidia.com>

match more things to verl

a087df5

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix metric and lr step is per optim step

3607a59

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix

9d38dd7

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix metric

08373cb

Signed-off-by: Gerald Shen <geshen@nvidia.com>

add vapo stuff

fe173c9

Signed-off-by: Gerald Shen <geshen@nvidia.com>

add metric

11fa90c

Signed-off-by: Gerald Shen <geshen@nvidia.com>

gshennvm and others added 5 commits May 19, 2026 07:06

metric

fc83183

Signed-off-by: Gerald Shen <geshen@nvidia.com>

sync geshen's latest: step-based critic warmup & optional value model

ed46f24

Signed-off-by: bg51717 <biguo@nvidia.com>

add gsm8k dataset

9d9768c

Signed-off-by: bg51717 <biguo@nvidia.com>

fix: update stale imports and submodule after rebase

6bacbde

Signed-off-by: bg51717 <biguo@nvidia.com>

code clean up and rebase

0cccea5

Signed-off-by: bg51717 <biguo@nvidia.com>

github-actions Bot added the Documentation label May 19, 2026

metrics check

50e878e

Signed-off-by: bg51717 <biguo@nvidia.com>

bg51717 force-pushed the ppo branch from 70e6054 to 50e878e Compare May 21, 2026 10:29

bg51717 added the CI:Lfast label and removed the Documentation label May 21, 2026

bg51717 marked this pull request as ready for review May 21, 2026 10:41

bg51717 requested review from a team as code owners May 21, 2026 10:41

copy-pr-bot Bot temporarily deployed to public May 21, 2026 10:42 Inactive

copy-pr-bot Bot had a problem deploying to nemo-ci May 21, 2026 10:43 Failure

copy-pr-bot Bot temporarily deployed to nemo-ci May 21, 2026 10:43 Inactive

copy-pr-bot Bot temporarily deployed to public May 21, 2026 10:43 Inactive

copy-pr-bot Bot temporarily deployed to public May 21, 2026 10:47 Inactive

bg51717 wants to merge 36 commits into NVIDIA-NeMo:main from bg51717:ppo

3 participants
