feat: PPO with MCore #2530
Open
bg51717 wants to merge 36 commits into
Signed-off-by: Gerald Shen <geshen@nvidia.com>
Signed-off-by: bg51717 <biguo@nvidia.com>
/ok to test 50e878e
What does this PR do?
Adds full Proximal Policy Optimization (PPO) support to NeMo-RL with an actor-critic architecture, using the Megatron-Core (mcore) backend for both the policy and value models. The policy (actor) and value function (critic) are trained jointly, with advantages computed by Generalized Advantage Estimation (GAE). GPU/CPU offloading lets both models be colocated on the same set of GPUs as vLLM generation.
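For readers unfamiliar with GAE, here is a minimal, framework-free sketch of the textbook recursion for a single response. It is not the code in this PR: the function name `compute_gae`, the argument layout (a `values` list with one extra bootstrap entry), and the default `gamma`/`lam` values are assumptions made for illustration.

```python
from typing import List, Tuple


def compute_gae(
    rewards: List[float],
    values: List[float],
    gamma: float = 1.0,
    lam: float = 0.95,
) -> Tuple[List[float], List[float]]:
    """Textbook GAE for one trajectory (hypothetical helper, not the PR's API).

    `values` must hold len(rewards) + 1 entries; the last one is the
    bootstrap value after the final token (0.0 for a finished response).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have one more entry than rewards")

    num_steps = len(rewards)
    advantages = [0.0] * num_steps
    running = 0.0
    # Walk backwards so each step reuses the discounted advantage of the next.
    for t in reversed(range(num_steps)):
        td_error = rewards[t] + gamma * values[t + 1] - values[t]
        running = td_error + gamma * lam * running
        advantages[t] = running

    # Value-function regression targets: advantage plus the critic's baseline.
    returns = [adv + values[t] for t, adv in enumerate(advantages)]
    return advantages, returns
```

With `lam = 1.0` the advantages reduce to Monte Carlo returns minus the critic's values; with `lam = 0.0` they reduce to one-step TD errors.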
Issues
close #2048
Summary of Changes
PPO Training Algorithm
steps_per_epoch)
policy_training_start_step): trains the value model alone before policy updates begin

Generalized Advantage Estimation (GAE)
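The length-adaptive formula listed in this section, lambda_policy = 1 - 1/(alpha * response_length), can be written out directly. The sketch below is illustrative only: the function name `length_adaptive_lambda` and the example `alpha` value are assumptions, and the PR description does not say whether the result is clamped for very short responses.

```python
def length_adaptive_lambda(response_length: int, alpha: float) -> float:
    """Evaluate lambda_policy = 1 - 1 / (alpha * response_length).

    Hypothetical helper, not the PR's API. No clamping is applied, so the
    result is negative whenever alpha * response_length < 1.
    """
    if response_length <= 0:
        raise ValueError("response_length must be positive")
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return 1.0 - 1.0 / (alpha * response_length)


# Longer responses get a lambda closer to 1 (illustrative alpha = 0.05).
short_lambda = length_adaptive_lambda(response_length=100, alpha=0.05)
long_lambda = length_adaptive_lambda(response_length=1000, alpha=0.05)
```

Under these assumed inputs, a 100-token response gives a lambda of 0.8 and a 1000-token response gives 0.98, so the policy's advantage estimate decays more slowly over long responses.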
lambda_policy = 1 - 1/(alpha * response_length)

Value Model (Megatron-Core backend)
Shared Algorithm Improvements
Configuration and Recipes
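The recipes in this section mention asymmetric clipping and dual-clip. As a rough guide to what those options usually mean, here is a per-token sketch of the standard formulation. It is not taken from this PR: the function name `ppo_token_loss`, the parameter names, and the default thresholds are all assumptions chosen for illustration.

```python
import math


def ppo_token_loss(
    logp_new: float,
    logp_old: float,
    advantage: float,
    clip_low: float = 0.2,    # assumed lower clip range
    clip_high: float = 0.28,  # assumed upper clip range (asymmetric)
    dual_clip_c: float = 3.0,  # assumed dual-clip constant, must be > 1
) -> float:
    """Per-token clipped PPO loss with asymmetric and dual clipping (sketch)."""
    ratio = math.exp(logp_new - logp_old)
    # Asymmetric clipping: separate bounds below and above 1.
    clipped_ratio = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
    # Pessimistic (max) choice between unclipped and clipped surrogate losses.
    loss = max(-advantage * ratio, -advantage * clipped_ratio)
    # Dual clip: cap the loss for negative advantages so that a very large
    # ratio cannot produce an unbounded update.
    if advantage < 0.0:
        loss = min(loss, -advantage * dual_clip_c)
    return loss
```

Setting `clip_low` equal to `clip_high` recovers symmetric PPO clipping. The dual-clip branch only affects tokens with negative advantage.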
examples/configs/ppo_math_1B_megatron.yaml (DAPO-style PPO: no KL penalty, asymmetric clipping, dual-clip, reward scaling)
ppo-dsr1-7b-math-8n8g-megatron: DeepSeek-R1-7B on DAPOMath-17K, 8 nodes, KL penalty + importance sampling
ppo-qwen2.5-1.5b-gsm8k-1n8g-megatron: Qwen2.5-1.5B-Instruct on GSM8K, 1 node, VAPO decoupled GAE

Tests
Documentation
Architecture
PPO Training Loop (mcore)
8. Steps 6-7 repeat steps_per_epoch times

Experimental Results
GSM8K: Qwen2.5-1.5B-Instruct, 1 node x 8 GPUs
val:accuracy over steps: shows convergence on the GSM8K test set
train/reward over steps: shows reward progression

DAPOMath-17K: DeepSeek-R1-7B, 8 nodes x 8 GPUs
Metrics to screenshot from wandb (project: nemo-rl, run: ppo-dsr1-7b-math-8n8g-megatron):
val:accuracy (AIME 2024) over steps: shows convergence on competition math
train/reward over steps: shows reward progression

Before your PR is "Ready for review"
Pre checks: