
Add MaxRL mean normalization over advantages #1126

Open
tamoghnokandar wants to merge 8 commits into NovaSky-AI:main from tamoghnokandar:main

Conversation

Contributor

tamoghnokandar commented Feb 15, 2026

Fixes #1030



devin-ai-integration[bot]

This comment was marked as resolved.

Contributor

devin-ai-integration bot left a comment


Devin Review found 2 new potential issues.

View 6 additional findings in Devin Review.


        raise ValueError(f"no score in prompt index: {idx}")
for i in range(bsz):
    if len(id2score[index[i]]) > 1:
        scores[i] = (scores[i] - id2mean[index[i]]) / (id2mean[index[i]] + epsilon)

🔴 MAXRL divides by negative mean, inverting advantage signs when group mean reward is negative

The MAXRL advantage formula divides by (id2mean[index[i]] + epsilon) without taking the absolute value of the mean. When the group mean reward is negative, this flips the sign of the advantage, causing the model to reinforce bad responses and penalize good ones.

Detailed Explanation

Consider a group with scores [-3, -5] (mean = -4):

  • Score -3 (better): advantage = (-3 - (-4)) / (-4 + 1e-6) = 1 / -4 ≈ -0.25
  • Score -5 (worse): advantage = (-5 - (-4)) / (-4 + 1e-6) = -1 / -4 ≈ 0.25

The better response (-3) gets a negative advantage and the worse response (-5) gets a positive advantage. This is inverted — the policy gradient will push the model toward worse responses.

The fix should use abs(id2mean) in the denominator to ensure the normalization preserves the correct sign of the centered scores:

scores[i] = (scores[i] - id2mean[index[i]]) / (abs(id2mean[index[i]]) + epsilon)

Impact: Training with MAXRL on any task where group mean rewards can be negative will produce inverted policy gradients, actively degrading model performance.
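
To make the failure mode concrete, here is a minimal standalone sketch of the group-wise normalization (the function name maxrl_normalize, the prompt index "p0", and the sample scores are illustrative, not taken from this PR). With the current denominator the two advantages come out inverted; with the absolute value of the mean they preserve the ordering of the raw scores:

import torch

def maxrl_normalize(scores, index, epsilon=1e-6, use_abs=False):
    # Group scores by prompt index, mirroring the PR's id2score/id2mean bookkeeping.
    id2score = {}
    for i, idx in enumerate(index):
        id2score.setdefault(idx, []).append(scores[i])
    id2mean = {idx: torch.stack(vals).mean() for idx, vals in id2score.items()}
    out = scores.clone()
    for i, idx in enumerate(index):
        if len(id2score[idx]) > 1:
            # use_abs=False mirrors the current PR code; use_abs=True mirrors the suggested fix.
            denom = (id2mean[idx].abs() if use_abs else id2mean[idx]) + epsilon
            out[i] = (scores[i] - id2mean[idx]) / denom
    return out

scores = torch.tensor([-3.0, -5.0])                         # group mean = -4
print(maxrl_normalize(scores, ["p0", "p0"]))                # ≈ [-0.25,  0.25]  (inverted)
print(maxrl_normalize(scores, ["p0", "p0"], use_abs=True))  # ≈ [ 0.25, -0.25]  (correct ordering)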

Suggested change
- scores[i] = (scores[i] - id2mean[index[i]]) / (id2mean[index[i]] + epsilon)
+ scores[i] = (scores[i] - id2mean[index[i]]) / (id2mean[index[i]].abs() + epsilon)

    else:
        raise ValueError(f"no score in prompt index: {idx}")
for i in range(bsz):
    scores[i] = (scores[i] - id2mean[index[i]]) / (id2mean[index[i]] + epsilon)

🔴 Same negative-mean division bug in skyrl/ copy of MAXRL

The skyrl/skyrl/backends/skyrl_train/utils/ppo_utils.py copy has the same negative-mean sign-inversion bug as BUG-0001, dividing by (id2mean + epsilon) instead of (abs(id2mean) + epsilon). See BUG-0001 for the detailed explanation of how this inverts advantages when group mean rewards are negative.

Root Cause

At skyrl/skyrl/backends/skyrl_train/utils/ppo_utils.py:1213:

scores[i] = (scores[i] - id2mean[index[i]]) / (id2mean[index[i]] + epsilon)

This should use abs(id2mean[index[i]]) in the denominator to avoid sign inversion when the mean is negative.
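
A hypothetical regression test (not part of this PR) that would catch the inversion in either copy: with the abs() fix applied, the best response in a group must keep the largest advantage even when the group mean reward is negative.

import torch

def test_maxrl_ordering_with_negative_mean():
    scores = torch.tensor([-3.0, -5.0])              # group mean = -4
    mean, epsilon = scores.mean(), 1e-6
    adv = (scores - mean) / (mean.abs() + epsilon)   # suggested fix: abs() in the denominator
    # The better response (-3) must keep the higher advantage.
    assert torch.argmax(adv) == torch.argmax(scores)

test_maxrl_ordering_with_negative_mean()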


Development

Successfully merging this pull request may close these issues.

[algorithm] Add MaxRL mean normalization over advantages
