
Add MaxRL mean normalization over advantages #1126

Open
tamoghnokandar wants to merge 8 commits into NovaSky-AI:main from tamoghnokandar:main

Conversation

Contributor

tamoghnokandar commented Feb 15, 2026

Fixes #1030



devin-ai-integration[bot]

This comment was marked as resolved.

Contributor

devin-ai-integration bot left a comment


Devin Review found 2 new potential issues.

View 6 additional findings in Devin Review.


        raise ValueError(f"no score in prompt index: {idx}")
for i in range(bsz):
    if len(id2score[index[i]]) > 1:
        scores[i] = (scores[i] - id2mean[index[i]]) / (id2mean[index[i]] + epsilon)

🔴 MAXRL divides by negative mean, inverting advantage signs when group mean reward is negative

The MAXRL advantage formula divides by (id2mean[index[i]] + epsilon) without taking the absolute value of the mean. When the group mean reward is negative, this flips the sign of the advantage, causing the model to reinforce bad responses and penalize good ones.

Detailed Explanation

Consider a group with scores [-3, -5] (mean = -4):

  • Score -3 (better): advantage = (-3 - (-4)) / (-4 + 1e-6) = 1 / -4 ≈ -0.25
  • Score -5 (worse): advantage = (-5 - (-4)) / (-4 + 1e-6) = -1 / -4 ≈ 0.25

The better response (-3) gets a negative advantage and the worse response (-5) gets a positive advantage. This is inverted — the policy gradient will push the model toward worse responses.

The fix should use abs(id2mean) in the denominator to ensure the normalization preserves the correct sign of the centered scores:

scores[i] = (scores[i] - id2mean[index[i]]) / (abs(id2mean[index[i]]) + epsilon)

Impact: Training with MAXRL on any task where group mean rewards can be negative will produce inverted policy gradients, actively degrading model performance.
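
To make the failure mode concrete, here is a minimal standalone sketch of the group-wise normalization (the function name maxrl_normalize, the prompt index "p0", and the sample scores are illustrative, not taken from this PR). With the current denominator the two advantages come out inverted; with the absolute value of the mean they preserve the ordering of the raw scores:

import torch

def maxrl_normalize(scores, index, epsilon=1e-6, use_abs=False):
    # Group scores by prompt index, mirroring the PR's id2score/id2mean bookkeeping.
    id2score = {}
    for i, idx in enumerate(index):
        id2score.setdefault(idx, []).append(scores[i])
    id2mean = {idx: torch.stack(vals).mean() for idx, vals in id2score.items()}
    out = scores.clone()
    for i, idx in enumerate(index):
        if len(id2score[idx]) > 1:
            # use_abs=False mirrors the current PR code; use_abs=True mirrors the suggested fix.
            denom = (id2mean[idx].abs() if use_abs else id2mean[idx]) + epsilon
            out[i] = (scores[i] - id2mean[idx]) / denom
    return out

scores = torch.tensor([-3.0, -5.0])                         # group mean = -4
print(maxrl_normalize(scores, ["p0", "p0"]))                # ≈ [-0.25,  0.25]  (inverted)
print(maxrl_normalize(scores, ["p0", "p0"], use_abs=True))  # ≈ [ 0.25, -0.25]  (correct ordering)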

Suggested change
- scores[i] = (scores[i] - id2mean[index[i]]) / (id2mean[index[i]] + epsilon)
+ scores[i] = (scores[i] - id2mean[index[i]]) / (id2mean[index[i]].abs() + epsilon)

    else:
        raise ValueError(f"no score in prompt index: {idx}")
for i in range(bsz):
    scores[i] = (scores[i] - id2mean[index[i]]) / (id2mean[index[i]] + epsilon)

🔴 Same negative-mean division bug in skyrl/ copy of MAXRL

The skyrl/skyrl/backends/skyrl_train/utils/ppo_utils.py copy has the same negative-mean sign-inversion bug as BUG-0001, dividing by (id2mean + epsilon) instead of (abs(id2mean) + epsilon). See BUG-0001 for the detailed explanation of how this inverts advantages when group mean rewards are negative.

Root Cause

At skyrl/skyrl/backends/skyrl_train/utils/ppo_utils.py:1213:

scores[i] = (scores[i] - id2mean[index[i]]) / (id2mean[index[i]] + epsilon)

This should use abs(id2mean[index[i]]) in the denominator to avoid sign inversion when the mean is negative.
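
A hypothetical regression test (not part of this PR) that would catch the inversion in either copy: with the abs() fix applied, the best response in a group must keep the largest advantage even when the group mean reward is negative.

import torch

def test_maxrl_ordering_with_negative_mean():
    scores = torch.tensor([-3.0, -5.0])              # group mean = -4
    mean, epsilon = scores.mean(), 1e-6
    adv = (scores - mean) / (mean.abs() + epsilon)   # suggested fix: abs() in the denominator
    # The better response (-3) must keep the higher advantage.
    assert torch.argmax(adv) == torch.argmax(scores)

test_maxrl_ordering_with_negative_mean()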


Development

Successfully merging this pull request may close these issues.

[algorithm] Add MaxRL mean normalization over advantages
