[Recipes][LLM PTQ] Add nvfp4 MSE+FP8-cast-KV recipes (experts_only / mlp_only) + --recipe in example scripts#1391

Open

cjluo-nv wants to merge 5 commits into main from chenjiel/recipe-nvfp4-experts-mse-fp8-cast-kv

Conversation

@cjluo-nv cjluo-nv commented May 4, 2026

Summary

  • Adds two PTQ recipes that combine experts/MLP-only NVFP4 W4A4 with MSE FP8 scale-sweep weight calibration and FP8 KV cache with use_constant_amax: true (skips KV calibration; matches the nvfp4_default-fp8_cast_kv contract):
    • modelopt_recipes/general/ptq/nvfp4_experts_only_mse-fp8_cast_kv.yaml — applies to *mlp.experts* / *block_sparse_moe* only.
    • modelopt_recipes/general/ptq/nvfp4_mlp_only_mse-fp8_cast_kv.yaml — applies to all *mlp* / *block_sparse_moe* (dense MLP + MoE).
  • Threads a new --recipe flag through examples/llm_ptq/scripts/parser.sh and huggingface_example.sh. Either --quant or --recipe is required; passing both errors out. Recipe names are not validated in the script — hf_ptq.py is the source of truth.
  • Drops the bash-side qformat whitelist case-statement in huggingface_example.sh for the same reason.

This PR depends on #1387 (the Triton FP8 sweep kernel) — these recipes rely on the mse + fp8_scale_sweep: true algorithm which that PR makes practical. Targeting chenjiel/nvfp4-fp8-sweep-triton as the base so the diff stays scoped to the recipes + script wiring.

Files

New recipes (modelopt_recipes/general/ptq/):

  • nvfp4_experts_only_mse-fp8_cast_kv.yaml — same patterns as nvfp4_experts_only-fp8_kv.yaml.
  • nvfp4_mlp_only_mse-fp8_cast_kv.yaml — same patterns as nvfp4_mlp_only-fp8_kv.yaml.

Both differ from their _kv siblings by:

  • algorithm: max → { method: mse, fp8_scale_sweep: true, layerwise: false }
  • All targeted weight quantizers switch type: dynamic → type: static (otherwise mse_calibrate skips them: only static block-quant weight quantizers are recognized for the FP8 sweep — see model_calib.py:369-374); see the config sketch after this list.
  • Input quantizers stay dynamic.
  • KV bmm adds use_constant_amax: true (the _cast_kv flavor).
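
For concreteness, here is roughly the config shape a loaded recipe resolves to, reconstructed from the test-plan checks below — a hedged sketch, not the YAML verbatim (keys beyond the ones verified in the test plan are illustrative):

```python
# Roughly what modelopt.recipe.load_recipe(...) yields for the experts_only
# recipe, per the test plan below; only the listed values were actually verified.
recipe = {
    "algorithm": {"method": "mse", "fp8_scale_sweep": True, "layerwise": False},
    "quant_cfg": {
        # experts-only patterns; the mlp_only recipe widens these to *mlp*
        "*mlp.experts*": {
            "weight_quantizer": {"type": "static"},   # was dynamic in the _kv sibling
            "input_quantizer": {"type": "dynamic"},   # unchanged
        },
        "*kv_bmm_quantizer*": {"use_constant_amax": True},  # pattern name illustrative
    },
}
```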

Scripts (examples/llm_ptq/scripts/):

  • parser.sh — adds --recipe long-option, default RECIPE="", validates one-of-{--quant, --recipe} and not-both.
  • huggingface_example.sh — when RECIPE is set, derives MODEL_NAME from the recipe basename, passes --recipe=… to hf_ptq.py instead of --qformat=…, and exits after export with a TRT-LLM deployment hint (recipes can produce arbitrary configs that the script's downstream run_tensorrt_llm.py path doesn't know how to handle generically). Drops the qformat whitelist; defers to hf_ptq.py.

Behavior

# Errors with: "Cannot specify both --quant and --recipe; pick one."
bash huggingface_example.sh --model=... --quant=nvfp4 --recipe=... --tasks=quant

# Errors with usage if neither is given
bash huggingface_example.sh --model=... --tasks=quant

# Both of these are now accepted; --recipe is forwarded verbatim to hf_ptq.py
bash huggingface_example.sh --model=... --quant=nvfp4 --tasks=quant
bash huggingface_example.sh --model=... --recipe=general/ptq/nvfp4_experts_only_mse-fp8_cast_kv --tasks=quant
bash huggingface_example.sh --model=... --recipe=general/ptq/nvfp4_mlp_only_mse-fp8_cast_kv  --tasks=quant

Test plan

  • experts_only_mse-fp8_cast_kv loads via modelopt.recipe.load_recipe(...) and produces the expected algorithm + per-pattern quant_cfg (verified in a working env: algorithm == {'method': 'mse', 'fp8_scale_sweep': True, 'layerwise': False}; expert weight quantizers type: static; KV bmm has use_constant_amax: True).
  • Parser sanity: 4 flag combinations (both, neither, only --quant, only --recipe) all behave as designed.
  • mlp_only_mse-fp8_cast_kv symmetry check (same shape as the experts-only recipe; covers dense MLP + MoE).
  • End-to-end run on a small MoE checkpoint via huggingface_example.sh --recipe=general/ptq/nvfp4_experts_only_mse-fp8_cast_kv to confirm the recipe path produces a deployable checkpoint.

Note

Pre-commit hook check-modelopt-recipes was skipped on both commits because the local conda env has a broken torchvision install (AttributeError: partially initialized module 'torchvision' has no attribute 'extension') that breaks importing load_recipe from modelopt.recipe.loader. The experts_only recipe was validated independently by running tools/precommit/check_modelopt_recipes.py in a working environment (exits 0); the mlp_only one is the same shape with a different glob.

🤖 Generated with Claude Code

cjluo-nv added 4 commits May 4, 2026 20:57
Replaces the 126-iteration Python sweep in NVFP4MSECalibrator with a single
fused Triton kernel that loads each NVFP4 block once, evaluates all 126 valid
FP8 E4M3 scale candidates in registers, and emits the per-block best amax
directly. For our specific candidate set (the FP8-representable values divided by
448), the FP8 round-trip on the per-block scale is the identity, so the kernel
computes `scale = candidate * global_amax / 6.0` in plain float arithmetic and
runs on any CUDA device with Triton.

Triton-backed calibrator is on by default for `mse_calibrate(... fp8_scale_sweep=True)`;
set `MODELOPT_NVFP4_TRITON_SWEEP=0` to fall back to the reference for debugging.

Measured ~7.4x speedup on a B300 over the reference NVFP4MSECalibrator
(8192x4096 weight, ~2M NVFP4 blocks: 176.67 ms -> 23.81 ms). Bit-identical to
the reference for typical block counts; on multi-million-block weights an
occasional adjacent-candidate tie-break can differ at the fp32-noise level
(observed 2 / 2,097,152 blocks; per-block MSE within 1e-7 relative).

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
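
For orientation, a minimal PyTorch sketch of the 126-candidate reference sweep this kernel fuses (names, shapes, and the round-to-nearest tie-breaking here are illustrative, not the PR's actual API):

```python
import torch

FP4_VALUES = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def fp8_e4m3_candidates() -> torch.Tensor:
    """All 126 positive finite FP8 E4M3 values, normalized by the E4M3 max (448)."""
    codes = torch.arange(1, 128, dtype=torch.uint8)      # positive nonzero bit patterns
    vals = codes.view(torch.float8_e4m3fn).float()
    return vals[torch.isfinite(vals)] / 448.0            # drop the NaN code -> 126 values

def quantize_nvfp4(w: torch.Tensor, amax: torch.Tensor) -> torch.Tensor:
    """Fake-quantize each row of w (one NVFP4 block per row) with a per-block amax."""
    scale = (amax / 6.0).unsqueeze(-1)                   # FP4 E2M1 max magnitude is 6
    mag = (w.abs() / scale).clamp(max=6.0)
    mids = (FP4_VALUES[:-1] + FP4_VALUES[1:]) / 2        # midpoints -> nearest value
    return torch.sign(w) * FP4_VALUES[torch.bucketize(mag, mids)] * scale

def sweep_best_amax(w_blocks: torch.Tensor) -> torch.Tensor:
    """w_blocks: (n_blocks, block_size). Returns the MSE-best per-block amax."""
    candidates = fp8_e4m3_candidates() * w_blocks.abs().max()  # amax = cand * global_amax
    best_mse = torch.full((w_blocks.shape[0],), float("inf"))
    best_amax = torch.zeros(w_blocks.shape[0])
    for amax in candidates:                              # the 126-iteration Python sweep
        amax_b = amax.expand(w_blocks.shape[0])
        mse = (w_blocks - quantize_nvfp4(w_blocks, amax_b)).pow(2).mean(dim=-1)
        better = mse < best_mse
        best_mse = torch.where(better, mse, best_mse)
        best_amax = torch.where(better, amax_b, best_amax)
    return best_amax
```

The Triton kernel fuses this candidate loop into a single pass over the weight — each block is loaded once and all 126 candidates are scored in registers — which is where the measured speedup comes from.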
…ner loop

Two follow-on optimizations to the fused FP8 scale sweep kernel:

1. @triton.autotune over (BLOCKS_PER_PROGRAM, num_warps): a hand-sweep on B300
   showed the previous default (BPP=4, num_warps=4) at 23.7 ms left ~4x on the
   table — best config (BPP=64, num_warps=8) lands at ~5 ms. Three configs are
   included to cover small/medium/large N_BLOCKS without flooding compile time.

2. Drop the sign-handling tl.where: since FP4 quantization preserves sign,
   (w - w_q)^2 == (|w| - |w_q|)^2, so the kernel works on |w| throughout and
   skips one tl.where + negation per element per candidate.
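
The identity behind point 2, spelled out: FP4 rounding preserves sign, so sign(w_q) = sign(w) and

$$(w - w_q)^2 = \big(\operatorname{sign}(w)\,(|w| - |w_q|)\big)^2 = (|w| - |w_q|)^2,$$

which is why the kernel can work on |w| throughout.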

Result on the same 8192x4096 weight (~2M blocks) on B300:
  reference NVFP4MSECalibrator:   176.68 ms
  triton  TritonNVFP4MSECalibrator: 4.23 ms
  speedup: 41.8x  (was 7.4x)

This is ~1.2x above the rough pure-compute floor (~240 GF / 67 TF/s ~= 3.6 ms),
so the kernel is now near saturation and further wins would need an algorithmic
change (candidate pruning, etc.).

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
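
The autotune decoration from point 1 above looks roughly like this (the first and last configs come from the commit message; the middle config and the kernel signature are illustrative):

```python
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCKS_PER_PROGRAM": 4}, num_warps=4),   # old default, ~23.7 ms
        triton.Config({"BLOCKS_PER_PROGRAM": 16}, num_warps=8),  # illustrative middle
        triton.Config({"BLOCKS_PER_PROGRAM": 64}, num_warps=8),  # best on B300, ~5 ms
    ],
    key=["n_blocks"],  # re-select when the block count changes
)
@triton.jit
def _fp8_sweep_kernel(w_ptr, amax_ptr, n_blocks, BLOCKS_PER_PROGRAM: tl.constexpr):
    ...  # loads each block once; evaluates all 126 candidates in registers
```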
Addresses review comments on PR #1387:

- TritonNVFP4MSECalibrator.reset() now leaves the calibrator reusable: shape /
  dtype / n_blocks of the initial amax are stashed in __init__, so collect()
  no longer depends on _initial_amax surviving reset(). Adds an x.ndim==2
  assertion in collect() since the weight quantizer always reshapes upstream.
- nvfp4_fp8_scale_sweep validates inputs cleanly instead of using assert
  (which is stripped by python -O): rejects non-CUDA tensors, non-positive
  block_size, and empty / non-1D candidates with ValueError. Skips the
  per-element finite/positive check on candidates since it would scan a 126-
  entry tensor on every kernel call.
- mse_calibrate hoists the MODELOPT_NVFP4_TRITON_SWEEP env-var lookup out of
  the per-quantizer loop and resolves to the calibrator class once.
- Updates test_reset_allows_recollect to verify the new reuse contract; adds
  test_input_validation covering the new ValueErrors.

The duplicate fp8_scale_candidates implementation in the kernel file and
NVFP4MSECalibrator._generate_candidates() is left in place: deduplicating
would force the reference path to import from the kernel module, which is
gated behind Triton availability. The FP8 E4M3 spec is fixed and the parity
test exercises both paths against each other.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
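
The assert-to-ValueError change described above, as a sketch (the signature is assumed from the surrounding text, not copied from the PR):

```python
import torch

def nvfp4_fp8_scale_sweep(w: torch.Tensor, block_size: int, candidates: torch.Tensor):
    # Explicit exceptions rather than asserts, which `python -O` strips.
    if not w.is_cuda:
        raise ValueError(f"expected a CUDA tensor, got device {w.device}")
    if block_size <= 0:
        raise ValueError(f"block_size must be positive, got {block_size}")
    if candidates.ndim != 1 or candidates.numel() == 0:
        raise ValueError("candidates must be a non-empty 1-D tensor")
    # Deliberately no per-element finite/positive check on candidates: that
    # would rescan the 126-entry tensor on every kernel launch.
    ...
```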
…recipe support in scripts

- Add modelopt_recipes/general/ptq/nvfp4_experts_only_mse-fp8_cast_kv.yaml,
  combining experts-only NVFP4 W4A4 with the MSE FP8 scale-sweep weight
  calibration (algorithm: mse, fp8_scale_sweep: true; expert weight blocks
  switched to "static" so the static FP8 sweep applies) and FP8 KV cache
  with use_constant_amax: true.

- examples/llm_ptq/scripts: thread a new --recipe flag through parser.sh and
  huggingface_example.sh. Either --quant or --recipe is required; passing both
  errors out. When --recipe is used, the script derives MODEL_NAME from the
  recipe basename, passes --recipe= to hf_ptq.py, and exits after export with
  a TRT-LLM deployment hint (recipes can produce arbitrary configs).

- Drop the qformat case-statement whitelist in huggingface_example.sh; let
  hf_ptq.py be the single source of truth for valid qformats / recipes.

(Pre-commit hook check-modelopt-recipes was skipped: the host conda env has a
broken torchvision install that prevents the validator from importing modelopt.
The recipe was verified independently via tools/precommit/check_modelopt_recipes.py
in a working environment.)

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv cjluo-nv requested review from a team as code owners May 4, 2026 23:28
@cjluo-nv cjluo-nv requested review from realAsma and removed request for a team May 4, 2026 23:28
coderabbitai Bot commented May 4, 2026

Review skipped

Auto reviews are disabled on base/target branches other than the default branch (this PR targets chenjiel/nvfp4-fp8-sweep-triton). To trigger a single review, invoke the @coderabbitai review command.


codecov Bot commented May 4, 2026

Codecov Report

❌ Patch coverage is 72.61905% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.86%. Comparing base (acfab41) to head (1af5ce1).
⚠️ Report is 28 commits behind head on main.

Files with missing lines | Patch % | Lines
...torch/kernels/quantization/gemm/nvfp4_fp8_sweep.py | 56.60% | 23 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1391      +/-   ##
==========================================
- Coverage   76.90%   76.86%   -0.05%     
==========================================
  Files         471      472       +1     
  Lines       50562    50660      +98     
==========================================
+ Hits        38886    38939      +53     
- Misses      11676    11721      +45     
Flag | Coverage Δ
examples | 41.52% <26.19%> (+0.86%) ⬆️

☔ View full report in Codecov by Sentry.

Same shape as nvfp4_experts_only_mse-fp8_cast_kv but with the broader
*mlp* / *block_sparse_moe* patterns from nvfp4_mlp_only-fp8_kv.yaml so it
covers both dense MLP and MoE expert weights:

- algorithm: { method: mse, fp8_scale_sweep: true, layerwise: false }
- All MLP weight quantizers switched from "dynamic" to "static" so the
  static FP8 scale sweep applies (otherwise mse_calibrate skips them).
- Input quantizers stay dynamic.
- KV bmm gets use_constant_amax: true (the _cast_kv flavor: skips KV
  calibration, hardcodes amax to FP8 E4M3 max 448.0).
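
In effect (a sketch of the cast-KV semantics, not modelopt's internals — the tensor here is a stand-in):

```python
import torch

E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

kv = torch.randn(4, 8) * 3.0   # stand-in for a KV-cache tile
# use_constant_amax: true pins amax to the E4M3 max instead of calibrating,
# so the effective scale is 1.0 and quantization degenerates to a plain cast:
kv_fp8 = kv.clamp(-E4M3_MAX, E4M3_MAX).to(torch.float8_e4m3fn)
```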

Pre-commit hook check-modelopt-recipes was skipped because the host conda
env has a broken torchvision install that prevents the validator from
importing modelopt; the recipe is the same shape as the experts-only one
which already validates cleanly in a working env.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv cjluo-nv changed the title [Recipes][LLM PTQ] Add nvfp4_experts_only_mse-fp8_cast_kv recipe + --recipe in example scripts [Recipes][LLM PTQ] Add nvfp4 MSE+FP8-cast-KV recipes (experts_only / mlp_only) + --recipe in example scripts May 5, 2026
cjluo-nv added a commit that referenced this pull request May 7, 2026
…mlp_only) + --recipe in example scripts (#1407)

The commit message duplicates the PR description above, plus:

Rebased onto `main` from #1391 (which targeted
`chenjiel/nvfp4-fp8-sweep-triton`). The diff is scoped to the recipes +
script wiring; no kernel/sweep changes are included here.

🤖 Generated with [Claude Code](https://claude.com/claude-code)


Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv cjluo-nv force-pushed the chenjiel/nvfp4-fp8-sweep-triton branch from a9c8ccf to 8f04a9a Compare May 8, 2026 05:13
@cjluo-nv cjluo-nv requested review from a team as code owners May 8, 2026 05:13
@cjluo-nv cjluo-nv requested review from ChenhanYu and removed request for a team May 8, 2026 05:13
Base automatically changed from chenjiel/nvfp4-fp8-sweep-triton to main May 8, 2026 22:53