[Distillation] Fix eod masking + strategy refactoring by vlad-karp · Pull Request #3478 · AI-Hypercomputer/maxtext

vlad-karp · 2026-03-20T23:43:40Z

Description

The current code didn't correctly mask eod tokens when they are segment delimiters - sft mode was fine, but the pretraining incorrectly included that token into prediction.
The last token in the sample still need to predict the eod token as expected, it is the oed input token which need to be excluded from the loss.
The kl divergence averaging wasn't correct as well.
Since this logic was in anonymous labels_fn function, it is hard to unit test it.
Since we are not longer inheriting from tunix distillation strategy, we can add it directly to our custom strategy class.
The base strategy class has been intrduced.

A new test test_strategy_ignores_segmentation_zero_tokens() has been added to tests/post_training/unit/train_distill_test.py

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov · 2026-03-20T23:47:33Z

❌ Patch coverage is 76.47059% with 8 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...ners/post_train/distillation/distillation_utils.py	74.19%	7 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

vlad-karp added 2 commits March 20, 2026 23:31

fixed eod masking + refactored distill strategy

ce6fce8

fix merge

699ca4f