refactor(grpo): migrate TypedDict configs to pydantic BaseModel by NolenLiang · Pull Request #2518 · NVIDIA-NeMo/RL

NolenLiang · 2026-05-18T07:34:18Z

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Results before / after the changes

copy-pr-bot · 2026-05-18T07:34:21Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

NolenLiang · 2026-05-18T07:44:46Z

/ok to test b37bbb9

NolenLiang · 2026-05-18T07:52:21Z

/ok to test 77aec62768f3f5e17bba09dfea25e8b58b4df7b1

copy-pr-bot · 2026-05-18T07:52:24Z

/ok to test 77aec62768f3f5e17bba09dfea25e8b58b4df7b1

@NolenLiang, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

NolenLiang · 2026-05-18T07:53:47Z

/ok to test 77aec62

NolenLiang · 2026-05-18T08:01:56Z

/ok to test 77aec62

NolenLiang · 2026-05-18T08:51:39Z

/ok to test 36dc855

NolenLiang · 2026-05-18T08:59:24Z

/ok to test 85be127

NolenLiang · 2026-05-18T14:08:36Z

/ok to test 363aace

NolenLiang · 2026-05-19T02:50:26Z

/ok to test 7bfac42

Convert 6 TypedDict classes to pydantic BaseModel with extra="allow":
- GRPOConfig, AsyncGRPOConfig, AdvEstimatorConfig, RewardScalingConfig, GRPOSaveState (grpo.py)
- RewardShapingConfig (reward_functions.py)

Update all dict-style access (config["key"]) to attribute access (config.key) across grpo.py, reward_functions.py, utils.py, trajectory_collector.py, and related tests.

Signed-off-by: nliang <nliang@nvidia.com>
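As a rough illustration of the migration described in this commit, the sketch below shows a pydantic v2 model declared with extra="allow" and read through attributes instead of keys. The class name RewardScalingSketch and its fields are hypothetical stand-ins, not the definitions in grpo.py.

```python
from pydantic import BaseModel, ConfigDict


class RewardScalingSketch(BaseModel):
    """Hypothetical stand-in for a config class migrated from TypedDict."""

    # extra="allow" keeps unknown keys instead of rejecting them.
    model_config = ConfigDict(extra="allow")

    # Illustrative fields only; the real field names may differ.
    enabled: bool = False
    source_min: float = 0.0
    source_max: float = 1.0


cfg = RewardScalingSketch(enabled=True, custom_flag="kept")

# Attribute access replaces the old cfg["enabled"] lookup.
enabled = cfg.enabled

# Keys that are not declared fields are stored as extras and are also
# reachable as attributes.
extra_value = cfg.custom_flag
extras = cfg.model_extra

# Dict-style indexing no longer works on a BaseModel instance.
try:
    cfg["enabled"]
    subscriptable = True
except TypeError:
    subscriptable = False
```

Any call site that still indexes the config with square brackets fails at runtime after such a change, which is presumably why the commit updates every access in one pass.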

Tests use MasterConfig.model_construct(), which skips pydantic validation and does not auto-convert nested dicts to BaseModel instances. Wrap the grpo dict with GRPOConfig.model_construct(), and wrap the nested reward_scaling, reward_shaping, and async_grpo dicts with their respective BaseModel.model_construct() calls.

Signed-off-by: nliang <nliang@nvidia.com>
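The behavior this commit works around can be reproduced in a few lines. The sketch below uses two hypothetical models (InnerSketch, OuterSketch) rather than MasterConfig and GRPOConfig; it only assumes pydantic v2.

```python
from pydantic import BaseModel


class InnerSketch(BaseModel):
    """Hypothetical nested config."""

    enabled: bool = False


class OuterSketch(BaseModel):
    """Hypothetical parent config."""

    inner: InnerSketch = InnerSketch()


# Normal construction validates the input and converts the nested dict.
validated = OuterSketch(inner={"enabled": True})

# model_construct() skips validation, so the nested dict is stored untouched
# and raw.inner.enabled would raise AttributeError.
raw = OuterSketch.model_construct(inner={"enabled": True})

# Wrapping the nested value explicitly restores attribute access.
wrapped = OuterSketch.model_construct(
    inner=InnerSketch.model_construct(enabled=True)
)
```

Here validated.inner and wrapped.inner are InnerSketch instances, while raw.inner is still a plain dict.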

Signed-off-by: nliang <nliang@nvidia.com>

- Run ruff format on grpo.py and trajectory_collector.py
- Add skip_reference_policy_logprobs_calculation and calculate_advantages_on_gpu to grpo_math_1B.yaml reference config (new BaseModel defaults now serialize these keys)

Signed-off-by: nliang <nliang@nvidia.com>

Signed-off-by: nliang <nliang@nvidia.com>

…construct

- advantage_estimator.py: change estimator_config["key"] to estimator_config.key (AdvEstimatorConfig is now BaseModel)
- test_async_utils.py: wrap grpo dict in GRPOConfig.model_construct() and async_grpo in AsyncGRPOConfig.model_construct()

Signed-off-by: nliang <nliang@nvidia.com>

Signed-off-by: nliang <nliang@nvidia.com>

…construct

8 occurrences of plain dict estimator_config in test_grpo.py converted to AdvEstimatorConfig.model_construct().

Signed-off-by: nliang <nliang@nvidia.com>

…_functions

RewardShapingConfig is now BaseModel; config["key"] = val must use config.key = val.

Signed-off-by: nliang <nliang@nvidia.com>

run_grpo.py, run_grpo_nemo_gym.py, and run_grpo_sliding_puzzle.py still used the config.grpo["key"] and "async_grpo" in config.grpo patterns. An "in" check on a BaseModel does not behave like dict membership, so async routing fell through to the sync grpo_train.

Signed-off-by: nliang <nliang@nvidia.com>
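The silent fall-through is easy to reproduce. In pydantic v2, BaseModel defines no __contains__, so "in" falls back to iterating the model, and iteration yields (field_name, value) tuples. The sketch below uses hypothetical models and a hypothetical enabled flag; the actual check used in the run scripts may differ.

```python
from typing import Optional

from pydantic import BaseModel


class AsyncSketch(BaseModel):
    """Hypothetical async settings block."""

    enabled: bool = False


class GrpoSketch(BaseModel):
    """Hypothetical stand-in for the grpo config section."""

    async_grpo: Optional[AsyncSketch] = None


cfg = GrpoSketch(async_grpo=AsyncSketch(enabled=True))

# Iterating a BaseModel yields (name, value) tuples, so a bare key never
# compares equal to any item and the membership test is always False.
dict_style = "async_grpo" in cfg

# An attribute-based check gives the intended answer.
attr_style = cfg.async_grpo is not None and cfg.async_grpo.enabled
```

Because the dict-style test raises no exception and simply evaluates to False, the bug shows up as a wrong code path rather than as a crash.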

checkpointer.load_training_info() returns a plain dict from JSON deserialization. With GRPOSaveState as a BaseModel, cast() alone does not convert it. Add an isinstance check to construct the BaseModel from the dict on checkpoint resume.

Signed-off-by: nliang <nliang@nvidia.com>
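A minimal sketch of the pattern this commit describes, under stated assumptions: SaveStateSketch, its fields, load_training_info_stub, and to_save_state are all hypothetical, and whether the PR builds the model through validation or through model_construct() is not stated in the commit message.

```python
from typing import Any, Optional, cast

from pydantic import BaseModel, ConfigDict


class SaveStateSketch(BaseModel):
    """Hypothetical stand-in for a save-state model."""

    model_config = ConfigDict(extra="allow")

    # Illustrative fields only.
    current_step: int = 0
    total_steps: int = 0


def load_training_info_stub() -> Optional[dict]:
    """Stand-in for a loader that returns JSON-deserialized data."""
    return {"current_step": 12, "total_steps": 40}


loaded = load_training_info_stub()

# typing.cast() is a hint for the type checker only; at runtime it returns
# its argument unchanged, so this is still a plain dict.
still_dict = cast(SaveStateSketch, loaded)


def to_save_state(info: Any) -> SaveStateSketch:
    """Return a model whether the input is None, a dict, or a model."""
    if info is None:
        return SaveStateSketch()
    if isinstance(info, dict):
        return SaveStateSketch(**info)
    return info


state = to_save_state(loaded)
```

After the conversion, state.current_step works, whereas still_dict.current_step would raise AttributeError.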

All fields in GRPOConfig, RewardScalingConfig, AsyncGRPOConfig, AdvEstimatorConfig, and RewardShapingConfig now have default values matching the reference config (grpo_math_1B.yaml). This allows constructing configs with just the overrides needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: nliang <nliang@nvidia.com>
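To show what constructing a config with just the overrides looks like, here is a small sketch. The class GrpoDefaultsSketch is hypothetical, and the default values are placeholders chosen for the example, not the values in grpo_math_1B.yaml.

```python
from pydantic import BaseModel


class GrpoDefaultsSketch(BaseModel):
    """Hypothetical config where every field carries a default."""

    # Placeholder defaults, not taken from the reference config.
    num_prompts_per_step: int = 32
    num_generations_per_prompt: int = 16
    normalize_rewards: bool = True


# Only the override is passed; every other field falls back to its default.
cfg = GrpoDefaultsSketch(num_prompts_per_step=8)

# Defaults are included when the model is serialized.
dumped = cfg.model_dump()

# pydantic tracks which fields were set explicitly.
explicitly_set = cfg.model_fields_set
```

One consequence of defaults appearing in model_dump() is that a serialized config can contain keys that were never written in the source YAML.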

NolenLiang · 2026-05-19T06:25:38Z

/ok to test 72c0892

NolenLiang requested review from a team as code owners May 18, 2026 07:34

copy-pr-bot Bot temporarily deployed to public May 18, 2026 07:35 Inactive

NolenLiang added the CI:Lfast label ("Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version)") May 18, 2026

copy-pr-bot Bot temporarily deployed to public May 18, 2026 07:36 Inactive

copy-pr-bot Bot temporarily deployed to public May 18, 2026 07:40 Inactive

NolenLiang changed the title ~~Nliang/typeddict to basemodel grpo~~ refactor(grpo): migrate TypedDict configs to pydantic BaseModel May 18, 2026

copy-pr-bot Bot temporarily deployed to public May 18, 2026 08:02 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci May 18, 2026 08:02 Inactive

copy-pr-bot Bot had a problem deploying to nemo-ci May 18, 2026 08:02 Failure

copy-pr-bot Bot temporarily deployed to public May 18, 2026 08:02 Inactive

copy-pr-bot Bot temporarily deployed to public May 18, 2026 08:06 Inactive

copy-pr-bot Bot temporarily deployed to public May 18, 2026 08:51 Inactive

copy-pr-bot Bot had a problem deploying to nemo-ci May 18, 2026 08:52 Error

copy-pr-bot Bot temporarily deployed to nemo-ci May 18, 2026 08:52 Inactive

copy-pr-bot Bot temporarily deployed to public May 18, 2026 08:52 Inactive

copy-pr-bot Bot temporarily deployed to public May 18, 2026 08:56 Inactive

copy-pr-bot Bot temporarily deployed to public May 18, 2026 08:59 Inactive

copy-pr-bot Bot temporarily deployed to public May 18, 2026 12:53 Inactive

NolenLiang added the CI:L1 label ("Run doctests, unit tests, and functional tests") and removed the CI:Lfast label ("Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version)") May 18, 2026

copy-pr-bot Bot temporarily deployed to nemo-ci May 18, 2026 15:33 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci May 18, 2026 15:54 Inactive

copy-pr-bot Bot had a problem deploying to nemo-ci May 18, 2026 17:38 Failure

copy-pr-bot Bot temporarily deployed to public May 19, 2026 02:50 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci May 19, 2026 02:50 Inactive

copy-pr-bot Bot temporarily deployed to public May 19, 2026 02:51 Inactive

copy-pr-bot Bot temporarily deployed to public May 19, 2026 02:54 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci May 19, 2026 03:48 Inactive

copy-pr-bot Bot had a problem deploying to nemo-ci May 19, 2026 05:27 Error

NolenLiang and others added 13 commits May 18, 2026 23:18

style: fix ruff format for long seq_logprob_error_threshold lines

bc01670

Signed-off-by: nliang <nliang@nvidia.com>

style: run ruff format on test files

9373559

Signed-off-by: nliang <nliang@nvidia.com>

style: fix import order in test_grpo.py (isort)

cb0124e

Signed-off-by: nliang <nliang@nvidia.com>

fix(test): wrap adv_estimator dict in AdvEstimatorConfig.model_construct

126e0f9

Signed-off-by: nliang <nliang@nvidia.com>

fix(test): wrap estimator_config dicts with AdvEstimatorConfig.model_…

585958f

…construct

8 occurrences of plain dict estimator_config in test_grpo.py converted to AdvEstimatorConfig.model_construct().

Signed-off-by: nliang <nliang@nvidia.com>

fix(test): convert dict assignment to attribute access in test_reward…

ae469af

…_functions

RewardShapingConfig is now BaseModel; config["key"] = val must use config.key = val.

Signed-off-by: nliang <nliang@nvidia.com>

yuki-97 mentioned this pull request May 21, 2026

feat: data plane transfer queue integration #2439

Open

4 tasks

NolenLiang wants to merge 13 commits into main from nliang/typeddict-to-basemodel-grpo
