Add MaxRL mean normalization over advantages #1126
tamoghnokandar wants to merge 8 commits into NovaSky-AI:main from
Conversation
```python
                raise ValueError(f"no score in prompt index: {idx}")
        for i in range(bsz):
            if len(id2score[index[i]]) > 1:
                scores[i] = (scores[i] - id2mean[index[i]]) / (id2mean[index[i]] + epsilon)
```
🔴 MAXRL divides by negative mean, inverting advantage signs when group mean reward is negative
The MAXRL advantage formula divides by `(id2mean[index[i]] + epsilon)` without taking the absolute value of the mean. When the group mean reward is negative, this flips the sign of the advantage, causing the model to reinforce bad responses and penalize good ones.
Detailed Explanation
Consider a group with scores [-3, -5] (mean = -4):
- Score -3 (better): advantage = (-3 - (-4)) / (-4 + 1e-6) = 1 / -4 ≈ -0.25
- Score -5 (worse): advantage = (-5 - (-4)) / (-4 + 1e-6) = -1 / -4 ≈ 0.25
The better response (-3) gets a negative advantage and the worse response (-5) gets a positive advantage. This is inverted — the policy gradient will push the model toward worse responses.
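As a concrete check, here is a minimal sketch that reproduces the inversion outside the PR's actual code path; `scores`, `group_mean`, and `epsilon` are stand-ins for the per-group quantities in the loop above, not the PR's batch-level variables.

```python
import torch

epsilon = 1e-6
scores = torch.tensor([-3.0, -5.0])   # one prompt group, two responses
group_mean = scores.mean()            # -4.0

# Current denominator: signed group mean + epsilon
advantages = (scores - group_mean) / (group_mean + epsilon)
print(advantages)  # tensor([-0.2500,  0.2500]) -- the better response gets the negative advantage
```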
The fix should use `abs(id2mean)` in the denominator to ensure the normalization preserves the correct sign of the centered scores:

```python
scores[i] = (scores[i] - id2mean[index[i]]) / (abs(id2mean[index[i]]) + epsilon)
```

Impact: Training with MAXRL on any task where group mean rewards can be negative will produce inverted policy gradients, actively degrading model performance.
Suggested change:

```diff
-                scores[i] = (scores[i] - id2mean[index[i]]) / (id2mean[index[i]] + epsilon)
+                scores[i] = (scores[i] - id2mean[index[i]]) / (id2mean[index[i]].abs() + epsilon)
```
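Applying the suggested `.abs()` denominator to the same toy group restores the expected ordering (again with stand-in tensors rather than the PR's actual variables):

```python
import torch

epsilon = 1e-6
scores = torch.tensor([-3.0, -5.0])
group_mean = scores.mean()            # -4.0

# Suggested fix: normalize by the magnitude of the group mean
advantages = (scores - group_mean) / (group_mean.abs() + epsilon)
print(advantages)  # tensor([ 0.2500, -0.2500]) -- the better response now gets the positive advantage
```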
```python
            else:
                raise ValueError(f"no score in prompt index: {idx}")
        for i in range(bsz):
            scores[i] = (scores[i] - id2mean[index[i]]) / (id2mean[index[i]] + epsilon)
```
🔴 Same negative-mean division bug in skyrl/ copy of MAXRL
The `skyrl/skyrl/backends/skyrl_train/utils/ppo_utils.py` copy has the same negative-mean sign-inversion bug as BUG-0001, dividing by `(id2mean + epsilon)` instead of `(abs(id2mean) + epsilon)`. See BUG-0001 for the detailed explanation of how this inverts advantages when group mean rewards are negative.
Root Cause
At `skyrl/skyrl/backends/skyrl_train/utils/ppo_utils.py:1213`:

```python
scores[i] = (scores[i] - id2mean[index[i]]) / (id2mean[index[i]] + epsilon)
```

This should use `abs(id2mean[index[i]])` in the denominator to avoid sign inversion when the mean is negative.
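Since the bug appears in two copies, a small regression test along these lines could guard both once they are fixed; `maxrl_normalize` is a hypothetical helper wrapping the corrected per-group normalization, not a function in this repo:

```python
import torch

def maxrl_normalize(scores: torch.Tensor, epsilon: float = 1e-6) -> torch.Tensor:
    # Hypothetical helper: center by the group mean, scale by its magnitude.
    mean = scores.mean()
    return (scores - mean) / (mean.abs() + epsilon)

def test_negative_mean_preserves_ordering():
    scores = torch.tensor([-3.0, -5.0])   # group mean is negative (-4.0)
    adv = maxrl_normalize(scores)
    assert adv[0] > adv[1]                # better response keeps the larger advantage
    assert adv[0] > 0 and adv[1] < 0      # signs match the centered scores

test_negative_mean_preserves_ordering()
```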
Fixes #1030