fix(opd): score teacher logprobs at rollout temperature, not 0 by EazyReal · Pull Request #2085 · THUDM/slime

EazyReal · 2026-06-15T20:44:46Z

Problem

The on-policy-distillation (OPD) teacher reward_func (slime/rollout/on_policy_distillation.py) scores teacher log-probs via SGLang with a hardcoded temperature: 0. Those log-probs are then used in the OPD reverse-KL term, computed as student log-prob minus teacher log-prob.

SGLang computes input_token_logprobs with temperature scaling (compute_temp_top_p_normalized_logprobs), and the student log-probs are temperature-scaled by rollout_temperature (get_responses). So when rollout_temperature != 1, the two sides of the OPD KL are computed at different effective temperatures, which biases the distillation signal.
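To make the bias concrete, here is a minimal standalone sketch (not code from slime or SGLang) of the per-token student minus teacher gap when both sides are scored from the same logits. The helper names `log_softmax` and `per_token_gap` are hypothetical, and the mismatched teacher temperature of 1.0 is only an illustrative stand-in; it does not model how SGLang treats temperature 0.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Temperature-scaled log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - lse for s in scaled]


def per_token_gap(student_logits, teacher_logits, token, student_temp, teacher_temp):
    """Student log-prob minus teacher log-prob for one sampled token."""
    s = log_softmax(student_logits, student_temp)[token]
    t = log_softmax(teacher_logits, teacher_temp)[token]
    return s - t


# Student and teacher share identical logits, so an unbiased gap should be 0.
logits = [2.0, 1.0, 0.0]
matched = per_token_gap(logits, logits, token=0, student_temp=0.5, teacher_temp=0.5)
mismatched = per_token_gap(logits, logits, token=0, student_temp=0.5, teacher_temp=1.0)

print(f"matched temperatures:    {matched:.4f}")
print(f"mismatched temperatures: {mismatched:.4f}")
```

With matched temperatures the gap vanishes for identical models; with mismatched temperatures it is nonzero even though the two models agree exactly, which is the spurious signal described above.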

Fix

Score the teacher at rollout_temperature, the same temperature as the student, so the KL compares distributions at the same temperature. Behavior is unchanged at the default rollout_temperature=1.0. (miles counterpart: radixark/miles#1345.)
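A rough sketch of what the change amounts to on the request side follows. `build_teacher_payload` is a hypothetical helper, and the payload shape (`input_ids`, `sampling_params`, `return_logprob`, `logprob_start_len`) is an assumption based on SGLang's native generate request format; the actual payload in on_policy_distillation.py may differ.

```python
from typing import Any


def build_teacher_payload(tokens: list[int], rollout_temperature: float) -> dict[str, Any]:
    """Hypothetical helper: build a teacher scoring request.

    The field names are assumed, not copied from slime.
    """
    return {
        "input_ids": tokens,
        "sampling_params": {
            # Previously hardcoded to 0; now matches the student's temperature.
            "temperature": rollout_temperature,
            # Scoring only: no new tokens are generated.
            "max_new_tokens": 0,
        },
        "return_logprob": True,
        "logprob_start_len": 0,
    }


payload = build_teacher_payload([101, 2023, 2003], rollout_temperature=0.7)
print(payload["sampling_params"]["temperature"])
```

Passing the temperature in as an argument, rather than reading a constant inside the helper, keeps the teacher and student tied to a single setting.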

🤖 Generated with Claude Code

The on-policy-distillation teacher reward_func scored teacher log-probs via SGLang with a hardcoded `temperature: 0`. SGLang computes input_token_logprobs WITH temperature scaling (compute_temp_top_p_normalized_logprobs), and the student log-probs are temperature-scaled by rollout_temperature (get_responses). So when rollout_temperature != 1 the OPD reverse-KL (student - teacher) compares log-probs at different effective temperatures and is biased.

Score the teacher at rollout_temperature so both sides of the KL match. No change at the default rollout_temperature=1.0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

EazyReal wants to merge 1 commit into THUDM:main from EazyReal:opd-teacher-temperature
