Skip to content

fix(datasets): align reasoning + assistant loss masks for left padding#2314

Open
khazic wants to merge 1 commit into
NVIDIA-NeMo:mainfrom
khazic:khazic/fix/reasoning-mask-left-padding
Open

fix(datasets): align reasoning + assistant loss masks for left padding#2314
khazic wants to merge 1 commit into
NVIDIA-NeMo:mainfrom
khazic:khazic/fix/reasoning-mask-left-padding

Conversation

@khazic
Copy link
Copy Markdown
Contributor

@khazic khazic commented May 25, 2026

What does this PR do?

Fix loss-mask misalignment for left-padding tokenizers in
format_chat_template. Affects both the assistant mask and the
reasoning mask paths.

Background

_build_multiturn_assistant_mask and _build_reasoning_mask compute
their span indices from unpadded (padding=False) tokenizations
because they walk the formatted text to find role boundaries. The mask
arrays they return are sized to the padded input_ids, however --
and when tokenizer.padding_side == "left" (e.g. Phi-3,
some Qwen-VL configs) the actual content is right-aligned in
input_ids. The mask positions then fall in the left-padding region,
the subsequent attention_mask-based zeroing wipes them, and training
silently learns from an all-zero loss mask for ~80% of samples.

Relationship to #2312

#2312 fixes this for the assistant mask path with an inline shift.
This PR is a strict superset of that fix:

  1. Extracts the shift into a reusable _maybe_shift_mask_for_left_padding
    helper.
  2. Applies it to both the assistant mask path (same site as feat(speculative): add Phi-3 support for EAGLE-1/2/3 + correctness fixes #2312)
    and the reasoning mask path (which feat(speculative): add Phi-3 support for EAGLE-1/2/3 + correctness fixes #2312 missed).
  3. Adds unit tests for the helper (8 cases covering happy / no-op /
    edge paths).

#2312 has since landed; the rebase conflict at the assistant-mask site has
been resolved by replacing the inline shift with the helper call.

Changelog

  • components/datasets/llm/formatting_utils.py:
  • tests/unit_tests/datasets/llm/test_shift_mask_left_padding.py:
    8 unit tests using a SimpleNamespace fake tokenizer (no HF
    dependency), covering right-padding no-op, left-padding shift,
    zero pad_len no-op, attention_mask=None, missing
    padding_side attribute (default right), all-padding,
    single-content-token.

Verification

Unit tests pass locally:

pytest tests/unit_tests/datasets/llm/test_shift_mask_left_padding.py -v

For right-padding tokenizers (Llama / Qwen / Mistral / ...) the helper
returns the original list object unchanged -- no behavioural change.

For left-padding tokenizers (Phi-3, ...) the previously-silent
all-zero loss_mask is now correctly aligned with content positions.

Before your PR is "Ready for review"

Pre checks:

  • Contributor guidelines followed
  • Unit tests added (test_shift_mask_left_padding.py)
  • No documentation changes needed -- internal helper.

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 25, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@HuiyingLi
Copy link
Copy Markdown
Contributor

/ok to test 861fa9d

@HuiyingLi
Copy link
Copy Markdown
Contributor

/claude review

@HuiyingLi
Copy link
Copy Markdown
Contributor

/claude review

_build_multiturn_assistant_mask and _build_reasoning_mask compute span
indices from unpadded (padding=False) tokenizations, but the mask arrays
are sized to the padded input_ids. When the tokenizer pads on the left,
content is right-aligned and the mask positions are off by pad_len.

Extract a shared _shift_mask_for_left_padding helper that shifts any
token-level mask right by the padding offset when padding_side="left".
Apply it to both the assistant mask and the reasoning mask in
format_chat_template. For right-padding tokenizers (the vast majority)
the helper is a no-op.

Signed-off-by: khazic <khazzz1c@gmail.com>
@khazic khazic force-pushed the khazic/fix/reasoning-mask-left-padding branch from 861fa9d to f35f088 Compare May 26, 2026 09:14
@HuiyingLi HuiyingLi enabled auto-merge (squash) May 26, 2026 09:19
@HuiyingLi
Copy link
Copy Markdown
Contributor

/ok to test f35f088

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants