fix(datasets): align reasoning + assistant loss masks for left padding#2314
Open
khazic wants to merge 1 commit into
Open
fix(datasets): align reasoning + assistant loss masks for left padding#2314khazic wants to merge 1 commit into
khazic wants to merge 1 commit into
Conversation
Contributor
|
/ok to test 861fa9d |
Contributor
|
/claude review |
Contributor
|
/claude review |
_build_multiturn_assistant_mask and _build_reasoning_mask compute span indices from unpadded (padding=False) tokenizations, but the mask arrays are sized to the padded input_ids. When the tokenizer pads on the left, content is right-aligned and the mask positions are off by pad_len. Extract a shared _shift_mask_for_left_padding helper that shifts any token-level mask right by the padding offset when padding_side="left". Apply it to both the assistant mask and the reasoning mask in format_chat_template. For right-padding tokenizers (the vast majority) the helper is a no-op. Signed-off-by: khazic <khazzz1c@gmail.com>
861fa9d to
f35f088
Compare
HuiyingLi
approved these changes
May 26, 2026
Contributor
|
/ok to test f35f088 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
Fix loss-mask misalignment for left-padding tokenizers in
format_chat_template. Affects both the assistant mask and thereasoning mask paths.
Background
_build_multiturn_assistant_maskand_build_reasoning_maskcomputetheir span indices from unpadded (
padding=False) tokenizationsbecause they walk the formatted text to find role boundaries. The mask
arrays they return are sized to the padded
input_ids, however --and when
tokenizer.padding_side == "left"(e.g. Phi-3,some Qwen-VL configs) the actual content is right-aligned in
input_ids. The mask positions then fall in the left-padding region,the subsequent
attention_mask-based zeroing wipes them, and trainingsilently learns from an all-zero loss mask for ~80% of samples.
Relationship to #2312
#2312 fixes this for the assistant mask path with an inline shift.
This PR is a strict superset of that fix:
_maybe_shift_mask_for_left_paddinghelper.
and the reasoning mask path (which feat(speculative): add Phi-3 support for EAGLE-1/2/3 + correctness fixes #2312 missed).
edge paths).
#2312 has since landed; the rebase conflict at the assistant-mask site has
been resolved by replacing the inline shift with the helper call.
Changelog
components/datasets/llm/formatting_utils.py:_maybe_shift_mask_for_left_padding(mask, tokenizer, attention_mask)helper. Three short-circuit guards make it a no-op for
right-padding tokenizers, missing
attention_mask, andpad_len == 0._build_multiturn_assistant_mask._build_reasoning_mask(the bug feat(speculative): add Phi-3 support for EAGLE-1/2/3 + correctness fixes #2312 missed).tests/unit_tests/datasets/llm/test_shift_mask_left_padding.py:8 unit tests using a
SimpleNamespacefake tokenizer (no HFdependency), covering right-padding no-op, left-padding shift,
zero
pad_lenno-op,attention_mask=None, missingpadding_sideattribute (default right), all-padding,single-content-token.
Verification
Unit tests pass locally:
For right-padding tokenizers (Llama / Qwen / Mistral / ...) the helper
returns the original list object unchanged -- no behavioural change.
For left-padding tokenizers (Phi-3, ...) the previously-silent
all-zero loss_mask is now correctly aligned with content positions.
Before your PR is "Ready for review"
Pre checks:
test_shift_mask_left_padding.py)