Skip to content

refactor(grpo): migrate TypedDict configs to pydantic BaseModel#2518

Open
NolenLiang wants to merge 13 commits into
mainfrom
nliang/typeddict-to-basemodel-grpo
Open

refactor(grpo): migrate TypedDict configs to pydantic BaseModel#2518
NolenLiang wants to merge 13 commits into
mainfrom
nliang/typeddict-to-basemodel-grpo

Conversation

@NolenLiang
Copy link
Copy Markdown
Contributor

@NolenLiang NolenLiang commented May 18, 2026

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Results before / after the changes

image image image

@NolenLiang NolenLiang requested review from a team as code owners May 18, 2026 07:34
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@NolenLiang NolenLiang added the CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) label May 18, 2026
@NolenLiang NolenLiang changed the title Nliang/typeddict to basemodel grpo refactor(grpo): migrate TypedDict configs to pydantic BaseModel May 18, 2026
@NolenLiang
Copy link
Copy Markdown
Contributor Author

/ok to test b37bbb9

@NolenLiang
Copy link
Copy Markdown
Contributor Author

/ok to test 77aec62768f3f5e17bba09dfea25e8b58b4df7b1

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 18, 2026

/ok to test 77aec62768f3f5e17bba09dfea25e8b58b4df7b1

@NolenLiang, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@NolenLiang
Copy link
Copy Markdown
Contributor Author

/ok to test 77aec62

@NolenLiang
Copy link
Copy Markdown
Contributor Author

/ok to test 77aec62

@NolenLiang
Copy link
Copy Markdown
Contributor Author

/ok to test 36dc855

@NolenLiang
Copy link
Copy Markdown
Contributor Author

/ok to test 85be127

@NolenLiang NolenLiang added CI:L1 Run doctests, unit tests, and functional tests and removed CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) labels May 18, 2026
@NolenLiang
Copy link
Copy Markdown
Contributor Author

/ok to test 363aace

@NolenLiang
Copy link
Copy Markdown
Contributor Author

/ok to test 7bfac42

NolenLiang and others added 13 commits May 18, 2026 23:18
Convert 6 TypedDict classes to pydantic BaseModel with extra="allow":
- GRPOConfig, AsyncGRPOConfig, AdvEstimatorConfig, RewardScalingConfig,
  GRPOSaveState (grpo.py)
- RewardShapingConfig (reward_functions.py)

Update all dict-style access (config["key"]) to attribute access
(config.key) across grpo.py, reward_functions.py, utils.py,
trajectory_collector.py, and related tests.

Signed-off-by: nliang <nliang@nvidia.com>
Tests use MasterConfig.model_construct() which skips pydantic
validation and does not auto-convert nested dicts to BaseModel
instances. Wrap grpo dict with GRPOConfig.model_construct(), and
nested reward_scaling/reward_shaping/async_grpo with their
respective BaseModel.model_construct() calls.

Signed-off-by: nliang <nliang@nvidia.com>
- Run ruff format on grpo.py and trajectory_collector.py
- Add skip_reference_policy_logprobs_calculation and
  calculate_advantages_on_gpu to grpo_math_1B.yaml reference config
  (new BaseModel defaults now serialize these keys)

Signed-off-by: nliang <nliang@nvidia.com>
Signed-off-by: nliang <nliang@nvidia.com>
Signed-off-by: nliang <nliang@nvidia.com>
…construct

- advantage_estimator.py: change estimator_config["key"] to
  estimator_config.key (AdvEstimatorConfig is now BaseModel)
- test_async_utils.py: wrap grpo dict in GRPOConfig.model_construct()
  and async_grpo in AsyncGRPOConfig.model_construct()

Signed-off-by: nliang <nliang@nvidia.com>
…construct

8 occurrences of plain dict estimator_config in test_grpo.py
converted to AdvEstimatorConfig.model_construct().

Signed-off-by: nliang <nliang@nvidia.com>
…_functions

RewardShapingConfig is now BaseModel; config["key"] = val must
use config.key = val.

Signed-off-by: nliang <nliang@nvidia.com>
run_grpo.py, run_grpo_nemo_gym.py, run_grpo_sliding_puzzle.py still
used config.grpo["key"] and "async_grpo" in config.grpo patterns.
The "in" check on a BaseModel doesn't work like dict membership,
causing async routing to fall through to sync grpo_train.

Signed-off-by: nliang <nliang@nvidia.com>
checkpointer.load_training_info() returns a plain dict from JSON
deserialization. With GRPOSaveState as BaseModel, cast() alone
doesn't convert it. Add isinstance check to construct BaseModel
from dict on checkpoint resume.

Signed-off-by: nliang <nliang@nvidia.com>
All fields in GRPOConfig, RewardScalingConfig, AsyncGRPOConfig,
AdvEstimatorConfig, and RewardShapingConfig now have default values
matching the reference config (grpo_math_1B.yaml). This allows
constructing configs with just the overrides needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: nliang <nliang@nvidia.com>
@NolenLiang
Copy link
Copy Markdown
Contributor Author

/ok to test 72c0892

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:L1 Run doctests, unit tests, and functional tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant