
feat: PPO with MCore #2530

Open
bg51717 wants to merge 36 commits into NVIDIA-NeMo:main from bg51717:ppo


@bg51717 bg51717 commented May 19, 2026

What does this PR do?

Adds full Proximal Policy Optimization (PPO) support to NeMo-RL with an actor-critic architecture, using the Megatron-Core (mcore) backend for both the policy and value models. The policy (actor) and value function (critic) are trained jointly, with advantages and returns computed by Generalized Advantage Estimation (GAE). Both models use GPU/CPU offloading so they can be colocated on the same set of GPUs as vLLM generation.

Backend support: This PR implements PPO on the Megatron-Core backend only. The DTensor/FSDP2 backend is not yet supported for the value model. All recipes and tests use megatron_cfg.enabled: true.

Issues

Closes #2048

Summary of Changes

PPO Training Algorithm

  • Complete PPO training loop with critic-before-actor update order
  • Multiple training steps per rollout (steps_per_epoch)
  • Configurable critic warmup (policy_training_start_step) — trains value model alone before policy updates begin
  • Dynamic sampling support
  • Colocated architecture with GPU memory management via model offloading

Generalized Advantage Estimation (GAE)

  • Token-level GAE with carry-forward masking for correct multi-turn/padding handling
  • Token-level KL penalty in rewards (configurable coefficient and KL type: k1/k3)
  • VAPO decoupled GAE: separate lambda for value returns vs. policy advantages
  • Length-adaptive lambda: lambda_policy = 1 - 1/(alpha * response_length)
  • Reward whitening
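
The GAE bullets above can be illustrated with a minimal pure-Python sketch. It is not the PR's implementation: the function names `masked_gae` and `adaptive_lambda`, the default `gamma`, and the reading of "carry-forward masking" (masked positions are skipped, and the running advantage and next-token value are carried across them) are all assumptions made for illustration. Only the length-adaptive formula is taken directly from the description.

```python
from typing import List, Tuple


def adaptive_lambda(response_length: int, alpha: float) -> float:
    """Length-adaptive lambda as stated in the PR: 1 - 1/(alpha * response_length)."""
    return 1.0 - 1.0 / (alpha * response_length)


def masked_gae(
    rewards: List[float],
    values: List[float],
    mask: List[int],
    gamma: float = 1.0,
    lam: float = 0.95,
) -> Tuple[List[float], List[float]]:
    """Token-level GAE over a single sequence (hypothetical helper).

    mask[t] == 1 marks a trainable response token. Masked positions
    (padding, or non-model turns in a multi-turn rollout) get zero
    advantage, and the running advantage and next value are carried
    forward across them unchanged.
    """
    n = len(rewards)
    advantages = [0.0] * n
    returns = [0.0] * n
    last_adv = 0.0    # running GAE accumulator
    next_value = 0.0  # value after the final token is assumed to be 0
    for t in reversed(range(n)):
        if not mask[t]:
            continue  # carry last_adv and next_value forward
        delta = rewards[t] + gamma * next_value - values[t]
        last_adv = delta + gamma * lam * last_adv
        advantages[t] = last_adv
        returns[t] = last_adv + values[t]
        next_value = values[t]
    return advantages, returns


# Decoupled GAE: one lambda for policy advantages, another for value returns.
rewards = [0.0, 0.0, 1.0, 0.0]
values = [0.5, 0.5, 0.5, 9.9]  # last position is padding
mask = [1, 1, 1, 0]
lam_policy = adaptive_lambda(response_length=3, alpha=1.0)  # illustrative alpha
policy_adv, _ = masked_gae(rewards, values, mask, lam=lam_policy)
_, value_returns = masked_gae(rewards, values, mask, lam=1.0)  # illustrative value lambda
```

Running GAE twice keeps the sketch simple; a real implementation would likely share the delta computation between the two passes.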

Value Model (Megatron-Core backend)

  • LM backbone + scalar value head, reusing Megatron-Core policy infrastructure
  • Supports TP/PP/DP parallelism, distributed optimizer, sequence packing
  • GPU/CPU offloading for colocated execution with policy and vLLM
  • Full checkpoint save/load including value head weights
  • Clipped MSE value loss with configurable loss scale and clip range
  • VAPO NLL auxiliary loss on correct samples
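
As a rough illustration of the "clipped MSE value loss" bullet, the sketch below uses the value-clipping form common in PPO implementations. The PR description does not spell out its exact formula, so the 0.5 factor, the masked-mean reduction, and the names `clipped_value_loss`, `clip_range`, and `loss_scale` are assumptions rather than the PR's API.

```python
from typing import List


def clipped_value_loss(
    values: List[float],
    old_values: List[float],
    returns: List[float],
    mask: List[int],
    clip_range: float = 0.2,
    loss_scale: float = 1.0,
) -> float:
    """Clipped MSE value loss averaged over unmasked tokens (hypothetical helper)."""
    total, count = 0.0, 0
    for v, v_old, ret, m in zip(values, old_values, returns, mask):
        if not m:
            continue
        # Limit how far the new prediction may move from the rollout-time value.
        v_clipped = v_old + max(-clip_range, min(clip_range, v - v_old))
        # Take the pessimistic (larger) of the clipped and unclipped squared errors.
        total += max((v - ret) ** 2, (v_clipped - ret) ** 2)
        count += 1
    return loss_scale * 0.5 * total / max(count, 1)
```

With `values=[1.0]`, `old_values=[0.0]`, `returns=[1.0]`, and `clip_range=0.2`, the unclipped error is zero but the clipped prediction is 0.2, so the loss is driven by the clipped term.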

Shared Algorithm Improvements

  • Refactored clipped PG loss to support both GRPO and PPO
  • Added Reinforce++ and raw-reward advantage estimators
  • GSM8K answer extraction and verification environment
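
One way to see why a single clipped PG loss can serve both GRPO and PPO is that the token-level clipped surrogate is the same in both; only the source of the advantages differs (group-normalized rewards versus GAE with a critic). The sketch below is a hedged illustration, not the refactored code from this PR: `clipped_pg_loss`, `eps_low`, and `eps_high` are hypothetical names, and dual-clip and importance-sampling corrections are omitted.

```python
import math
from typing import List


def clipped_pg_loss(
    logprobs: List[float],
    old_logprobs: List[float],
    advantages: List[float],
    mask: List[int],
    eps_low: float = 0.2,
    eps_high: float = 0.2,
) -> float:
    """Clipped surrogate loss averaged over unmasked tokens (hypothetical helper).

    Separate eps_low / eps_high allow asymmetric clipping; setting them
    equal recovers the standard PPO clip.
    """
    total, count = 0.0, 0
    for lp, lp_old, adv, m in zip(logprobs, old_logprobs, advantages, mask):
        if not m:
            continue
        ratio = math.exp(lp - lp_old)  # pi_theta / pi_old for this token
        clipped = max(1.0 - eps_low, min(1.0 + eps_high, ratio))
        # Negated because the surrogate objective is maximized.
        total += -min(ratio * adv, clipped * adv)
        count += 1
    return total / max(count, 1)
```

The advantages passed in can come from any estimator, which is what lets the same loss be shared across algorithms.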

Configuration and Recipes

  • Base config: examples/configs/ppo_math_1B_megatron.yaml (DAPO-style PPO: no KL penalty, asymmetric clipping, dual-clip, reward scaling)
  • ppo-dsr1-7b-math-8n8g-megatron — DeepSeek-R1-7B on DAPOMath-17K, 8 nodes, KL penalty + importance sampling
  • ppo-qwen2.5-1.5b-gsm8k-1n8g-megatron — Qwen2.5-1.5B-Instruct on GSM8K, 1 node, VAPO decoupled GAE

Tests

  • Unit tests: 17 tests for GAE computation, value loss, advantage estimator factory; 78 tests for Megatron model setup
  • Functional test: End-to-end PPO training on 2 GPUs with metric assertions on ratio clipping
  • Nightly tests (2 recipes):
    • 8-node DeepSeek-R1-7B on DAPOMath, 40 steps, checks reward > 0.3 and accuracy > 0.42 at step 40
    • 1-node Qwen2.5-1.5B on GSM8K, 100 steps, checks reward > 0.85 and accuracy > 0.7 at step 100
  • Reference config snapshot tests for all algorithms

Documentation

  • Algorithm overview: key differences from GRPO
  • In-depth guide: value model, GAE, VAPO decoupled GAE, training loop, loss functions, configuration, and metrics

Architecture

PPO Training Loop (mcore)

  1. Generate responses (vLLM, colocated)
  2. Score with environment (math verification)
  3. Value inference → per-token V(s_t)
  4. Policy logprobs → π_θ(a|s)
  5. GAE → advantages A_t, returns R_t
  6. Train critic (MSE on returns)
  7. Train actor (clipped surrogate objective)
  Steps 6-7 repeat steps_per_epoch times per rollout.
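
The update schedule in steps 6-7, combined with the critic warmup described earlier, can be sketched as a small planning function. This is an assumption-laden illustration: `plan_updates` is a hypothetical helper, and whether the PR compares the step with `>=` or `>` against `policy_training_start_step` is not stated in the description.

```python
from typing import List


def plan_updates(
    step: int,
    steps_per_epoch: int,
    policy_training_start_step: int,
) -> List[str]:
    """Return the ordered model updates for one rollout (hypothetical helper).

    The critic is always trained first. The actor is trained only once
    the critic warmup has finished (assumed here to mean
    step >= policy_training_start_step).
    """
    plan: List[str] = []
    for _ in range(steps_per_epoch):
        plan.append("critic")
        if step >= policy_training_start_step:
            plan.append("actor")
    return plan


# During warmup only the critic trains; afterwards critic and actor alternate.
warmup_plan = plan_updates(step=0, steps_per_epoch=2, policy_training_start_step=1)
normal_plan = plan_updates(step=1, steps_per_epoch=2, policy_training_start_step=1)
```

Under these assumptions, `warmup_plan` contains two critic updates and `normal_plan` alternates critic and actor updates twice.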

Experimental Results

GSM8K: Qwen2.5-1.5B-Instruct, 1 node x 8 GPUs

[Image: Screenshot 2026-05-21 18 36 47]
  1. val:accuracy over steps — shows convergence on GSM8K test set
  2. train/reward over steps — shows reward progression

DAPOMath-17K: DeepSeek-R1-7B, 8 nodes x 8 GPUs

[Image: Screenshot 2026-05-21 18 38 29]

Metrics shown, from wandb (project: nemo-rl, run: ppo-dsr1-7b-math-8n8g-megatron):

  1. val:accuracy (AIME 2024) over steps — shows convergence on competition math
  2. train/reward over steps — shows reward progression

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

hXl3s and others added 30 commits May 19, 2026 07:03
Signed-off-by: Gerald Shen <geshen@nvidia.com>
gshennvm and others added 5 commits May 19, 2026 07:06
Signed-off-by: Gerald Shen <geshen@nvidia.com>
Signed-off-by: bg51717 <biguo@nvidia.com>

copy-pr-bot Bot commented May 19, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the Documentation Improvements or additions to documentation label May 19, 2026
Signed-off-by: bg51717 <biguo@nvidia.com>
@bg51717 bg51717 added CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) and removed Documentation Improvements or additions to documentation labels May 21, 2026
@bg51717 bg51717 marked this pull request as ready for review May 21, 2026 10:41
@bg51717 bg51717 requested review from a team as code owners May 21, 2026 10:41

bg51717 commented May 21, 2026

/ok to test 50e878e


Labels

CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version)

Development

Successfully merging this pull request may close these issues.

[mcore] PPO

3 participants