fix(opd): score teacher logprobs at rollout temperature, not 0 #2085

Open
EazyReal wants to merge 1 commit into THUDM:main from EazyReal:opd-teacher-temperature

Conversation

@EazyReal

Contributor

Problem

The on-policy-distillation (OPD) teacher reward_func (slime/rollout/on_policy_distillation.py) scores teacher log-probs via SGLang with a hardcoded temperature: 0, then uses them in the OPD reverse-KL term (student log-prob minus teacher log-prob).

SGLang computes input_token_logprobs with temperature scaling (compute_temp_top_p_normalized_logprobs), and the student log-probs are temperature-scaled by rollout_temperature (get_responses). So when rollout_temperature != 1, the two sides of the OPD KL are computed at different effective temperatures, which biases the distillation signal.
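To make the bias concrete, here is a minimal, self-contained sketch (not code from slime or SGLang). It assumes the student and teacher produce identical logits, and that the mismatched teacher log-probs behave as if scored at temperature 1.0, which is what the "no change at rollout_temperature=1.0" note suggests. The logits and the helper names log_softmax and reverse_kl are illustrative only.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Log-probabilities of softmax(logits / temperature), computed stably."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - lse for s in scaled]


def reverse_kl(student_logp, teacher_logp):
    """Exact KL(student || teacher) over the full vocabulary."""
    return sum(math.exp(s) * (s - t) for s, t in zip(student_logp, teacher_logp))


# Illustrative logits, shared by student and teacher (same model).
logits = [2.0, 1.0, 0.0, -1.0]
rollout_temperature = 0.7

student = log_softmax(logits, rollout_temperature)

# Assumption: the mismatched teacher is effectively scored at temperature 1.0.
teacher_mismatched = log_softmax(logits, 1.0)
# Teacher scored at the same temperature as the student.
teacher_matched = log_softmax(logits, rollout_temperature)

kl_mismatched = reverse_kl(student, teacher_mismatched)
kl_matched = reverse_kl(student, teacher_matched)

print(f"mismatched temperatures: KL = {kl_mismatched:.4f}")
print(f"matched temperatures:    KL = {kl_matched:.4f}")
```

Even with identical models, the mismatched case reports a strictly positive KL, so the student would be penalized for a difference that comes only from the temperature setting. With matched temperatures the KL is zero, as expected.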

Fix

Score the teacher at rollout_temperature (same as the student), so the KL is between same-temperature distributions. No change at the default rollout_temperature=1.0. (miles counterpart: radixark/miles#1345.)
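As a rough illustration of the shape of the change (a sketch, not the actual diff): the helper name build_teacher_payload is hypothetical, and the field names follow SGLang's native generate request format as I understand it, so the real reward_func may build its request differently.

```python
def build_teacher_payload(input_ids, rollout_temperature):
    """Hypothetical helper: build a scoring-only request for the teacher.

    No new tokens are generated; log-probs are requested for the input tokens.
    """
    return {
        "input_ids": list(input_ids),
        "sampling_params": {
            # Previously hardcoded to 0; now matches the student's temperature.
            "temperature": rollout_temperature,
            "max_new_tokens": 0,
        },
        "return_logprob": True,
        "logprob_start_len": 0,
    }


# Example token IDs are made up for illustration.
payload = build_teacher_payload([101, 2023, 2003], rollout_temperature=0.7)
print(payload["sampling_params"])
```

Passing the same rollout_temperature that scales the student log-probs keeps both sides of the KL on the same distribution family.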

🤖 Generated with Claude Code

The on-policy-distillation teacher reward_func scored teacher log-probs via SGLang
with a hardcoded `temperature: 0`. SGLang computes input_token_logprobs WITH
temperature scaling (compute_temp_top_p_normalized_logprobs), and the student
log-probs are temperature-scaled by rollout_temperature (get_responses). So when
rollout_temperature != 1 the OPD reverse-KL (student - teacher) compares log-probs
at different effective temperatures and is biased.

Score the teacher at rollout_temperature so both sides of the KL match. No change
at the default rollout_temperature=1.0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>