feat: adding PP and CP for nemotron v3 models by adil-a · Pull Request #2316 · NVIDIA-NeMo/Automodel

adil-a · 2026-05-25T23:30:55Z

No description provided.

… fix Source-only snapshot for adil/adil-test (YAMLs intentionally excluded). - NemotronV3 pipeline-parallel + MTP wiring (model.py, mtp.py, layers.py, state_dict_adapter.py, common/mtp/mtp.py). - Loss path: PipelineCausalLMLoss, calculate_mtp_loss with cu_seqlens / seq_idx threading, MoEAuxLossAutoScaler-aware grad scaling in train_ft. - Attention preprocess threading for THD + CP (attention/utils.py). - MoE parallelizer: iterate transformer and MTP blocks together for CP hook attach and FSDP wrapping. THD collator fix: - process_input_for_thd: gate trailing-pack-pad absorption on trailing_pad <= max_seqlen; for larger pads, extend cu_seqlens with dummy max_seqlen-sized slots and regenerate position_ids per-slot. - split_batch_into_thd_chunks: emit cu_seqlens_padded whenever any chunk needs it (any-vs-all), absorbed chunks fall back to cu_seqlens. - NemotronV3Model.forward and NemotronHForCausalLM MTP-kwargs: strip -1000 sentinel right-pad from cu_seqlens / cu_seqlens_padded before they reach TE attention / mamba searchsorted. Tests: - tests/unit_tests/distributed/test_thd_utils.py: dummy-slot extension coverage, mixed-chunk short+full coverage; updated existing tests to reflect cu_seqlens emit semantics (real cumsum, not padded). - tests/unit_tests/models/nemotron_v3/test_nemotron_v3_mtp.py + test_nemotron_v3_pp_mtp.py: list-form mtp_layers_block_type resolution, PP-stage metas, MTP placement on last stage, sentinel filter on trimmed mid-stages. Dataset: - nemo_automodel/components/datasets/llm/pre_rendered_chat_dataset.py: new dataset for pre-rendered chat data. Signed-off-by: adil-a <adil.asif2000@hotmail.com>

Not used by the THD/PP/MTP changes on this branch; keep it out of the remote until it ships with its own dedicated PR. Signed-off-by: adil-a <adil.asif2000@hotmail.com>

copy-pr-bot · 2026-05-25T23:30:59Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

…contract The THD collator's trailing-pad absorption grows ``cu_seqlens[-1]`` to ``total_tokens`` so TE can skip the ``pad_between_seqs=True`` path. But ``max_seqlen`` was computed BEFORE the absorption from the unpadded ``seq_lens`` — so we handed TE a ``cu_seqlens`` containing a slot wider than the ``max_seqlen`` value we claimed. That is explicit UB per TE's documented contract (``fused_attn.py:152-159``, ``fused_attn.h:548-551``: ``max_seqlen_q ... may be >= max(seqlen_q_i)``; smaller-than-actual is undefined). cuDNN-fused-attn-bwd happened not to crash on the small-overshoot case (128 vs 112) but did crash on the larger captured "short microbatch" (576 vs 112). Fix: defer the ``max_seqlen`` computation to after the absorption block and source it from the FINAL ``cu_seqlens`` slot widths. With the value now truthful, absorption is contract-clean for any trailing-pad size, so the previous defensive dummy-slot extension is no longer needed. Verified with a single-GPU TE probe (te_thd_repro_MINIMAL.py): feeding ``cu_seqlens=[0,112,224,336,448,1024]`` with truthful ``max_seqlen=576`` runs cleanly; the same layout with the lied ``max_seqlen=112`` still crashes. Smoke tests pass at PP=4 EP=2 CP=1 and PP=4 EP=2 CP=2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>

``cu_seqlens`` and ``cu_seqlens_padded`` are constructed inside nested ``if seq_lens is not None``/``if seq_lens_padded is not None`` blocks, so ``cu_seqlens_padded`` can only be non-None when ``cu_seqlens`` is also non-None — the ``else cu_seqlens_padded`` fallback is unreachable. Falling back to the padded variant would also be semantically wrong (memory offsets vs. compute extent, per TE's docs). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>

…tics After the trailing-pad absorption + max_seqlen fix, ``cu_seqlens`` is emitted from ``seq_lens`` (real lengths), not ``seq_lens_padded``. The docstrings still said the opposite and the example output value was wrong. Also document the optional ``cu_seqlens_padded`` and ``max_seqlen`` keys that the function emits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>

Keep the comments that explain WHY (TE contract, the absorption gate, the pad_between_seqs flip on cu_seqlens_padded presence). Drop the ones that just restate WHAT the surrounding code already shows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

adil-a added 2 commits May 25, 2026 16:06

chore(datasets): drop pre_rendered_chat_dataset.py from branch

cf8f3de

Not used by the THD/PP/MTP changes on this branch; keep it out of the remote until it ships with its own dedicated PR. Signed-off-by: adil-a <adil.asif2000@hotmail.com>

adil-a and others added 5 commits May 25, 2026 19:15

comments

aa9238e

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: adding PP and CP for nemotron v3 models#2316

feat: adding PP and CP for nemotron v3 models#2316
adil-a wants to merge 7 commits into
mainfrom
adil/adil-test

adil-a commented May 25, 2026

Uh oh!

copy-pr-bot Bot commented May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

adil-a commented May 25, 2026

Uh oh!

copy-pr-bot Bot commented May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant