
Implement DPPO Algorithm #1037

Open

jonahsamost wants to merge 4 commits into NovaSky-AI:main from jonahsamost:jonah_dppo

Conversation

@jonahsamost

Referencing #1028

Implements DPPO, which replaces PPO's ratio-based clipping with the paper's divergence-based binary masking.
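
Roughly, the masked token-level loss looks like the sketch below (PyTorch). The function name, the |log-ratio| divergence proxy, the default threshold, and the mask normalization here are illustrative only, not the exact implementation in this diff:

    import torch

    def dppo_policy_loss(logprobs, old_logprobs, advantages, div_threshold=0.05):
        # Per-token importance ratio between current and behaviour policy.
        log_ratio = logprobs - old_logprobs
        ratio = torch.exp(log_ratio)

        # Binary divergence mask: tokens whose divergence exceeds the
        # threshold are dropped from the update entirely, instead of
        # having their ratio clipped as in PPO. |log-ratio| is used as
        # the divergence proxy here for illustration.
        mask = (log_ratio.abs() <= div_threshold).float().detach()

        # Unclipped importance-weighted surrogate over surviving tokens.
        surrogate = ratio * advantages
        return -(mask * surrogate).sum() / mask.sum().clamp(min=1.0)

Because the mask is binary and detached, out-of-range tokens contribute zero gradient rather than a clipped one.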

@jonahsamost jonahsamost marked this pull request as ready for review February 7, 2026 15:56
Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request implements the Divergence Proximal Policy Optimization (DPPO) algorithm, including the configuration, the loss function implementation, and corresponding tests. The changes are well-structured and the implementation appears correct. I've only found a minor typo in the reference link to the DPPO paper in both the configuration file and the docstring, for which I've suggested corrections.

cispo_eps_clip_high: 5 # offset for upper bound of importance sampling ratio clipping (as opposed to PPO token update clipping)

# DPPO parameters (only used when policy_loss_type: "dppo")
# See: https://arxiv.org/abs/2602.04879

medium

There appears to be a typo in the arXiv link. The ID 2602.04879 seems to be incorrect. The correct ID for 'Divergence Proximal Policy Optimization' is likely 2402.04879.

    # See: https://arxiv.org/abs/2402.04879

In Section G.2 the authors find that Top-K masking provides no significant benefit
over the simpler binary approximation, so we only implement the binary variant here.

See: https://arxiv.org/abs/2602.04879

medium

There appears to be a typo in the arXiv link. The ID 2602.04879 seems to be incorrect. The correct ID for 'Divergence Proximal Policy Optimization' is likely 2402.04879.

Suggested change
See: https://arxiv.org/abs/2602.04879
See: https://arxiv.org/abs/2402.04879
