Commit 2a6fee4

jasont314, nazar-ospanov, zimo0110, and sanjay-adhikesaven committed
Enable NemotronH PP/EP SFT path and fix fixed-length SQuAD supervision
Integrates NemotronH PP/EP execution and safety guards, fixes fixed-length SQuAD label/tokenization masking so num_label_tokens stays > 0, and adds PP schedule/device/logging robustness updates. Includes optimized PP+EP SQuAD config, patch notes, training artifacts (baseline/optimized JSONL), and unit-test updates (79 passed, 5 skipped).

Co-authored-by: Nazar Ospanov <aimogenius@berkeley.edu>
Co-authored-by: Zoir Imomaliev <91550816+zimo0110@users.noreply.github.com>
Co-authored-by: Sanjay Adhikesaven <sanjay.adhikesaven1@gmail.com>
Signed-off-by: Jason Trinh <jasontrinh@berkeley.edu>
1 parent b001c72 commit 2a6fee4

20 files changed

Lines changed: 2055 additions & 196 deletions

# Nemotron PP/EP + SQuAD Patch Notes

This document summarizes the code-level changes prepared for a PR to make Nemotron-Nano-v3 PP/EP training and fixed-length SQuAD SFT stable, debuggable, and reproducible.
## Scope

The patch set keeps core behavior unchanged for existing non-PP/non-EP paths while addressing:
- PP schedule robustness and diagnostics,
- EP mesh/dispatch safety,
- fixed-length SQuAD supervision correctness,
- PR-level cleanup and configuration handoff.
## Major Functional Changes

### 1) Nemotron PP compatibility and PP runtime safeguards

Files:
- `nemo_automodel/components/distributed/pipelining/functional.py`
- `nemo_automodel/recipes/llm/train_ft.py`

Key changes (sketched below):
- Added explicit invalid-style handling in `stage_ids_this_rank(...)` (`ValueError` on unknown style).
- Isolated the `NEMOAUTOMODEL_PP_SKIP_OUTPUT_MERGE` behavior behind a guarded helper (`_enable_skip_output_merge_if_supported`) with compatibility checks before patching private schedule internals.
- Kept skip-output-merge behavior available for train/benchmark runs where schedule outputs are not consumed.

Why:
- Prevent implicit `None` returns and hard-to-debug failures.
- Make the PP skip-merge behavior safer against drift in PyTorch internals.
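As a rough illustration only (the `stage_ids_this_rank` signature, the `"loop"`/`"v"` style names, and the patched schedule attribute are assumptions, not the actual implementation):

```python
import os

def stage_ids_this_rank(pp_rank: int, pp_size: int, num_stages: int, style: str = "loop"):
    """Return the stage indices owned by this pipeline rank."""
    if style == "loop":
        return tuple(pp_rank + s * pp_size for s in range(num_stages // pp_size))
    if style == "v":
        return (pp_rank, num_stages - 1 - pp_rank)
    # Explicit failure instead of an implicit None return on unknown styles.
    raise ValueError(f"Unknown stage placement style: {style!r} (expected 'loop' or 'v')")

def _enable_skip_output_merge_if_supported(schedule) -> bool:
    """Patch private schedule internals only when the toggle is set and the
    expected attribute still exists on this PyTorch version."""
    if os.environ.get("NEMOAUTOMODEL_PP_SKIP_OUTPUT_MERGE", "0") != "1":
        return False
    if not hasattr(schedule, "_merge_outputs"):  # placeholder attribute name
        return False  # internals drifted; keep the default merge behavior
    schedule._merge_outputs = lambda *args, **kwargs: None  # outputs unused in train/benchmark
    return True
```

Returning `False` instead of raising keeps the default merge path available whenever the patch cannot be applied safely.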
### 2) Nemotron EP safety and guardrails

Files:
- `nemo_automodel/components/moe/parallelizer.py`
- `nemo_automodel/recipes/llm/train_ft.py`

Key changes (sketched below):
- Added null guard for `ep_shard_axis_names` when `moe_mesh` is not available.
- In LLM setup, `ep_axis_name` / `ep_shard_axis_names` are only passed when corresponding mesh dims exist.

Why:
- Avoid null dereference and confusing startup crashes in mixed EP/non-EP code paths.
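A minimal sketch of the guard pattern, assuming a `DeviceMesh`-style `moe_mesh` with a `mesh_dim_names` attribute; the surrounding setup function is hypothetical:

```python
def build_ep_kwargs(moe_mesh, ep_axis_name="ep", ep_shard_axis_names=("ep_shard",)):
    """Only forward EP axis names when the MoE mesh and its dims actually exist."""
    kwargs = {}
    if moe_mesh is None:
        # Null guard: non-EP runs never touch ep_shard_axis_names.
        return kwargs
    dim_names = set(moe_mesh.mesh_dim_names or ())
    if ep_axis_name in dim_names:
        kwargs["ep_axis_name"] = ep_axis_name
    if all(name in dim_names for name in ep_shard_axis_names):
        kwargs["ep_shard_axis_names"] = ep_shard_axis_names
    return kwargs
```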
### 3) AutoPipeline device typing cleanup

Files:
- `nemo_automodel/components/distributed/pipelining/autopipeline.py`
- `nemo_automodel/recipes/llm/train_ft.py`
- `nemo_automodel/recipes/biencoder/train_biencoder.py`

Key changes (sketched below):
- AutoPipeline now normalizes `device` input (`torch.device | int | str`) to `torch.device` at construction.
- Call sites now pass `torch.device("cuda", torch.cuda.current_device())` instead of raw `int`.

Why:
- Remove type mismatch and prevent API ambiguity/drift.
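A sketch of the normalization, assuming a small helper inside AutoPipeline's constructor (the helper name is hypothetical):

```python
import torch

def _normalize_device(device) -> torch.device:
    """Accept torch.device | int | str and always return a torch.device."""
    if isinstance(device, torch.device):
        return device
    if isinstance(device, int):
        # Treat a bare ordinal as a CUDA device index.
        return torch.device("cuda", device)
    return torch.device(device)

# Call sites then pass an explicit device instead of a raw int, e.g.
# AutoPipeline(..., device=torch.device("cuda", torch.cuda.current_device()))
```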
### 4) FSDP2 diagnostics clarity

File:
- `nemo_automodel/components/distributed/fsdp2.py`

Key change (sketched below):
- Corrected divisibility error message to match logic using `tp_size * cp_size * pp_size`.

Why:
- Better debugging clarity in distributed setup failures.
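A sketch of the corrected check/message pairing, with assumed variable names rather than the actual `fsdp2.py` code:

```python
def validate_mesh_sizes(world_size: int, tp_size: int, cp_size: int, pp_size: int) -> int:
    """Raise with a message that names the same product the check actually uses."""
    denom = tp_size * cp_size * pp_size
    if world_size % denom != 0:
        raise ValueError(
            f"world_size ({world_size}) must be divisible by "
            f"tp_size * cp_size * pp_size = {tp_size} * {cp_size} * {pp_size} = {denom}"
        )
    return world_size // denom  # remaining data-parallel dimension
```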
### 5) Logging observability default

File:
- `nemo_automodel/components/loggers/log_utils.py`

Key changes (sketched below):
- `setup_logging(..., filter_warning=False)` by default.
- Added env override: `NEMOAUTOMODEL_FILTER_WARNINGS=1` to re-enable global warning filtering.

Why:
- Avoid hiding warnings by default during PP/EP debugging and PR validation.
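A sketch of the new default and the env override; the real `setup_logging` takes more parameters:

```python
import logging
import os
import warnings

def setup_logging(level: int = logging.INFO, filter_warning: bool = False) -> None:
    """Warnings stay visible unless filtering is explicitly requested."""
    logging.basicConfig(level=level)
    if filter_warning or os.environ.get("NEMOAUTOMODEL_FILTER_WARNINGS") == "1":
        warnings.filterwarnings("ignore")
```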
## Fixed-Length SQuAD Supervision (NaN-loss root cause)

Files:
- `nemo_automodel/components/datasets/llm/formatting_utils.py`
- `nemo_automodel/components/datasets/llm/squad.py`

Key changes (sketched below):
- Made prompt-completion mask generation truncation-aware.
- For fixed-length SQuAD (`seq_length`, `padding=max_length`, `truncation=true`), forced truncation settings that preserve supervised answer tokens.
- Disabled chat-template path for this fixed-length SQuAD mode to avoid all-masked labels.

Observed effect:
- `num_label_tokens` moved from `0` to large nonzero values on optimized SFT runs.
- `loss` and `grad_norm` became finite/nonzero.
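A simplified sketch of truncation-aware supervision; the real formatting code operates on tokenizer output and handles more cases, so treat the helper and its names as assumptions:

```python
IGNORE_INDEX = -100

def build_fixed_length_example(prompt_ids, answer_ids, seq_length, pad_id):
    """Truncate the prompt first so supervised answer tokens survive, then pad."""
    answer_ids = list(answer_ids)[:seq_length]
    prompt_ids = list(prompt_ids)[: seq_length - len(answer_ids)]
    input_ids = prompt_ids + answer_ids
    # Prompt tokens are masked; only answer tokens contribute to the loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + answer_ids
    pad = seq_length - len(input_ids)
    input_ids += [pad_id] * pad
    labels += [IGNORE_INDEX] * pad
    return input_ids, labels
```

Naive right-truncation of the concatenated sequence can drop every answer token, which is exactly the `num_label_tokens == 0` / NaN-loss symptom described above.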
## Observed SFT Throughput (from training JSONL)

From:
- `checkpoints/baseline_training.jsonl`
- `checkpoints/optimized_training.jsonl`

Measured `tps` (see the recomputation sketch below):
- Baseline mean `tps`: ~326.10
- Optimized mean `tps`: ~12109.64
- Mean throughput uplift: ~37.1x

For reference, last logged step:
- Baseline last-step `tps`: ~284.51
- Optimized last-step `tps`: ~12104.76
- Last-step throughput uplift: ~42.5x
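The uplift figures can be recomputed from the two JSONL files with a few lines; this assumes each log line is a JSON object with a numeric `tps` field:

```python
import json
from statistics import mean

def mean_tps(path: str) -> float:
    with open(path) as f:
        return mean(float(json.loads(line)["tps"]) for line in f if line.strip())

baseline = mean_tps("checkpoints/baseline_training.jsonl")
optimized = mean_tps("checkpoints/optimized_training.jsonl")
print(f"mean uplift: {optimized / baseline:.1f}x")  # ~37.1x for the runs above
```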
## New Example Config

Added:
- `examples/llm_finetune/nemotron/nemotron_nano_v3_pp_ep_squad.yaml`

This is the optimized PP+EP SQuAD SFT recipe used for reproducible runs with:
- `pp_size=4`, `ep_size=2`,
- manual PP module mapping,
- fixed-length SQuAD.
## Recommended Runtime Settings (PP+EP)

Use YAML variables under `dist_env` (in `examples/llm_finetune/nemotron/nemotron_nano_v3_pp_ep_squad.yaml`):

```yaml
dist_env:
  torch_nccl_use_comm_nonblocking: true
  pytorch_alloc_conf: "expandable_segments:True"
  nemotronh_ep_use_deepep_dispatch: true
  nemotronh_ep_require_deepep: true
  nemotronh_ep_physical_partition: true
  nemotronh_ep_sync_inactive_experts: true
  nemotronh_ep_expert_reshard_after_forward: false
  nemoautomodel_pp_skip_output_merge: true
```

Meaning of each variable:
- `torch_nccl_use_comm_nonblocking: true`: enables NCCL non-blocking error handling to reduce hard hangs.
- `pytorch_alloc_conf: "expandable_segments:True"`: reduces allocator fragmentation under large transient GPU allocations.
- `nemotronh_ep_use_deepep_dispatch: true`: uses DeepEP token dispatch path for EP.
- `nemotronh_ep_require_deepep: true`: fails fast if DeepEP is unavailable (prevents silent fallback).
- `nemotronh_ep_physical_partition: true`: uses physical expert partition ownership across EP ranks.
- `nemotronh_ep_sync_inactive_experts: true`: keeps EP/FSDP collectives synchronized even for inactive experts.
- `nemotronh_ep_expert_reshard_after_forward: false`: avoids immediate post-forward expert reshard to reduce short-run overhead.
- `nemoautomodel_pp_skip_output_merge: true`: skips last-stage output merge/concat when schedule outputs are unused, lowering PP memory pressure.

Implementation note:
- The recipe applies these `dist_env` values before CUDA initialization and maps them to their corresponding runtime env vars (see the sketch below).
- Existing externally set env vars still take precedence.

Caveats:
- Env precedence is intentional: if a variable is already set externally, YAML will not override it.
- The YAML-to-env hook is currently applied in the `train_ft` setup path; other recipe entrypoints need the same hook for identical behavior.
- Avoid `null` entries in `dist_env.runtime_env`; they would be converted to the string `"None"` when mapped to environment variables.
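A minimal sketch of the YAML-to-env hook described in the implementation note; the function name, the boolean encoding, and the assumption that keys simply uppercase to their env-var names are illustrative, not the actual mapping table:

```python
import os

def apply_dist_env(dist_env: dict) -> None:
    """Export dist_env entries as env vars before CUDA/NCCL initialization."""
    for key, value in dist_env.items():
        if value is None:
            # Skip nulls: str(None) would otherwise export the literal "None".
            continue
        if isinstance(value, bool):
            value = "1" if value else "0"
        env_name = key.upper()  # e.g. nemoautomodel_pp_skip_output_merge -> NEMOAUTOMODEL_PP_SKIP_OUTPUT_MERGE
        # setdefault keeps externally set env vars taking precedence over YAML.
        os.environ.setdefault(env_name, str(value))
```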
## Minimal Validation Checklist for PR

1. Fixed-length SQuAD train smoke (few steps):
   - `num_label_tokens > 0`
   - finite `loss`, finite `grad_norm`

2. PP+EP startup:
   - no mesh null dereference in EP shard setup
   - PP schedule builds with/without skip-merge patch

3. Lint/syntax (see the sketch below):
   - no duplicate `nn` imports in MoE parallelizer
   - all edited files compile (`py_compile`)
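For item 3, the compile check can be scripted along these lines (the `HEAD~1` diff range is an assumption about how the edited files are enumerated):

```python
import py_compile
import subprocess

changed = subprocess.run(
    ["git", "diff", "--name-only", "HEAD~1", "--", "*.py"],
    capture_output=True, text=True, check=True,
).stdout.split()
for path in changed:
    py_compile.compile(path, doraise=True)  # raises PyCompileError on syntax errors
print(f"py_compile OK for {len(changed)} files")
```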
