
Support force tokens to % of total experts during calibration #910

Open

cjluo-nv wants to merge 9 commits into main from chenjiel/update_moe_calib_algo

Conversation

@cjluo-nv cjluo-nv (Collaborator) commented Feb 20, 2026

What does this PR do?

Type of change: New feature

Overview: Adds a configurable moe_calib_experts_ratio parameter that controls the fraction of experts that tokens are force-routed to during the calibration forward pass in MoE (Mixture of Experts) models. Previously, the calibration forward always routed tokens to all experts, which is expensive. This PR lets the user specify a ratio (default: 1/4 of all experts) that preserves good expert calibration coverage without the cost of a full-expert forward. The token counting for the expert coverage table now tracks the calibration routing and runs on CUDA for efficiency.
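As a rough illustration of the routing math (a minimal sketch under the default ratio, not the PR's exact implementation; the real code temporarily adjusts the router's top_k in place during the calibration forward):

import math

def calibration_top_k(num_experts: int, inference_top_k: int, ratio: float = 0.25) -> int:
    """Number of experts each token is force-routed to during the calibration forward.

    ratio=1.0 reproduces the previous route-to-all-experts behavior; the result is
    clamped so calibration never routes to fewer experts than normal inference.
    """
    if not 0.0 < ratio <= 1.0:
        raise ValueError(f"ratio must be in (0, 1], got {ratio}")
    return max(inference_top_k, math.ceil(num_experts * ratio))

# Example: an MoE layer with 128 experts and top-8 routing at inference.
print(calibration_top_k(128, 8, 0.25))  # 32 experts per token during calibration
print(calibration_top_k(128, 8, 1.0))   # 128, the previous all-experts behavior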

!!! This PR changes the MoE model calibration behavior:
Before: for some MoE models, tokens were force-routed to all experts during calibration.
Now: by default, tokens are force-routed to 25% of the total experts during calibration.

Changes include:

  • New moe_calib_experts_ratio field in QuantizeAlgorithmConfig (config.py)
  • Propagation of the ratio from the algorithm config to MoE modules during calibration (mode.py)
  • Updated _QuantSparseMoe.forward to use the configurable ratio instead of hard-coding all experts (huggingface.py)
  • New --moe_calib_experts_ratio CLI flag in hf_ptq.py (default 0.25)
  • Moved expert_token_count tensor to CUDA and updated the HTML table title in moe_utils.py

Usage

Via hf_ptq.py CLI — calibrate 50% of experts during MoE calibration
python hf_ptq.py --model --qformat int4_awq --moe_calib_experts_ratio 0.5

Via Python API — pass the ratio through the algorithm config
import modelopt.torch.quantization as mtq

quant_cfg = {
    "quant_cfg": {...},
    "algorithm": {
        "method": "awq_lite",
        "moe_calib_experts_ratio": 0.25,  # calibrate 1/4 of experts
    },
}
mtq.quantize(model, quant_cfg, forward_loop=calib_loop)

Testing

Test with Qwen3 30B A3B calibration and check the tokens per expert.

Summary by CodeRabbit

Release Notes

  • New Features
    • Added support for configurable expert calibration during Mixture of Experts (MOE) model quantization. Users can now specify the percentage of experts to include during calibration, enabling better expert coverage and improved quantization accuracy for MOE models. Default: 25% of all experts.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv cjluo-nv requested review from a team as code owners February 20, 2026 03:21
@coderabbitai coderabbitai bot (Contributor) commented Feb 20, 2026

📝 Walkthrough

This PR introduces a new --moe_calib_experts_ratio parameter to control the percentage of experts calibrated during MOE quantization forward passes. The parameter flows from CLI through configuration layers and is propagated to model modules for dynamic expert selection during calibration.
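As a rough sketch of the propagation step described here (the function and iteration logic are illustrative assumptions, not the repo's exact API):

import torch.nn as nn

def propagate_moe_calib_ratio(model: nn.Module, ratio: float | None) -> None:
    # Push the configured ratio onto every module that opts in by exposing the
    # _moe_calib_experts_ratio attribute (e.g., the quantized sparse-MoE wrappers).
    if ratio is None:
        return
    for module in model.modules():
        if hasattr(module, "_moe_calib_experts_ratio"):
            module._moe_calib_experts_ratio = ratio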

Changes

  • Changelog (CHANGELOG.rst): Documents the new --moe_calib_experts_ratio flag under version 0.43 features.
  • Quantization Configuration (modelopt/torch/quantization/config.py): Adds a moe_calib_experts_ratio field to QuantizeAlgorithmConfig as an optional float, allowing specification of the percentage of experts to calibrate.
  • CLI and Examples (examples/llm_ptq/hf_ptq.py, examples/llm_ptq/example_utils.py): Introduces the --moe_calib_experts_ratio command-line argument and threads it through build_quant_cfg to populate the quantization configuration.
  • Quantization Pipeline (modelopt/torch/quantization/mode.py): Adds logic to extract moe_calib_experts_ratio from calibration kwargs and propagate it to model modules that expose a _moe_calib_experts_ratio attribute.
  • HuggingFace MoE Plugin (modelopt/torch/quantization/plugins/huggingface.py): Implements ratio-based expert calibration by adjusting top_k during forward passes based on _moe_calib_experts_ratio, with device-aware expert token counting and conditional behavior for calibration vs. inference paths.
  • Export Documentation (modelopt/torch/export/moe_utils.py): Updates the HTML table heading from "Expert Token Counts (per MoE layer)" to "Expert Calib Token Counts (per MoE layer)".

Sequence Diagram

sequenceDiagram
    participant User
    participant CLI as hf_ptq.py<br/>(CLI/Main)
    participant Config as build_quant_cfg<br/>(Config Builder)
    participant QMode as quantization/mode.py<br/>(Calibration)
    participant Model as Model Modules<br/>(HF Plugin)
    
    User->>CLI: --moe_calib_experts_ratio 0.5
    CLI->>Config: pass moe_calib_experts_ratio
    Config->>Config: inject into quant_cfg["algorithm"]
    
    Note over CLI,Config: Configuration Stage
    
    CLI->>QMode: trigger calibration<br/>with kwargs
    QMode->>QMode: pop moe_calib_experts_ratio<br/>from kwargs
    QMode->>Model: set _moe_calib_experts_ratio<br/>on modules
    
    Note over QMode,Model: Propagation Stage
    
    Model->>Model: Forward pass (calibration mode)
    Model->>Model: Adjust top_k using ratio
    Model->>Model: Count expert tokens<br/>during forward
    
    Note over Model: Calibration Execution

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 25.00%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title accurately describes the main feature: allowing users to specify a percentage of experts to use during MOE calibration, which is the core change across all modified files.



@cjluo-nv cjluo-nv changed the title from "Chenjiel/update moe calib algo" to "Support force tokens to % of total experts during calibration" on Feb 20, 2026
@cjluo-nv cjluo-nv (Collaborator, Author) commented:

Data for Qwen3 30B:

moe_before.html -- original HF forward, no force routing
moe_after_2.html -- 1/8 routing ratio
moe_after_3.html -- 1/4 routing ratio

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 6

🧹 Nitpick comments (1)
modelopt/torch/quantization/config.py (1)

1073-1081: Description says "%" but the value is a ratio (0–1), not a percentage (0–100).

The title says "% of experts" which implies a percentage (e.g., 25), but the actual value is a ratio in (0, 1] (e.g., 0.25). Consider clarifying to avoid user confusion:

-        title="% of experts to calibrate during forward pass.",
+        title="Ratio of experts to calibrate during forward pass (0, 1].",
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@CHANGELOG.rst`:
- Line 11: Update the changelog entry to use the actual CLI flag and config
field names: replace `--moe_calib_experts_percentage` with
`--moe_calib_experts_ratio` and mention the matching `moe_calib_experts_ratio`
config field; verify consistency with the flag defined in hf_ptq.py (the
`--moe_calib_experts_ratio` argument) and the config variable in config.py
(`moe_calib_experts_ratio`) so the changelog matches the real names.

In `@examples/llm_ptq/example_utils.py`:
- Around line 236-243: The code crashes when quant_cfg["algorithm"] is None
because the else branch assumes a dict; fix by handling None explicitly: when
moe_calib_experts_ratio is set, if quant_cfg["algorithm"] is a str wrap it as
before, elif it's a dict set the "moe_calib_experts_ratio" key, else (covers
None or other types) assign quant_cfg["algorithm"] = {"moe_calib_experts_ratio":
moe_calib_experts_ratio}; update the logic around quant_cfg["algorithm"] and
moe_calib_experts_ratio to avoid subscripting None.

In `@examples/llm_ptq/hf_ptq.py`:
- Around line 1130-1138: The parser is currently adding
--moe_calib_experts_ratio with a default of 1.0/4 which causes the field to be
injected for all models; change the add_argument in hf_ptq.py to default=None
(and allow float values) so the flag is only set when the user provides it, and
update the downstream logic that injects this into the algorithm config (where
algorithm options are assembled in example_utils.py) to only add
moe_calib_experts_ratio if args.moe_calib_experts_ratio is not None; keep the
argument help text but note it’s optional now.

In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 504-509: The assertion that self.gate.top_k (computed as
round(self.gate.num_experts * self._moe_calib_experts_ratio)) must be >=
original_top_k is unsafe for small num_experts or low ratios; replace the assert
with logic that clamps the calibrated top_k to at least original_top_k (e.g.,
compute calib_top_k = round(...); set self.gate.top_k = max(calib_top_k,
original_top_k)), and apply the same change to the transformers < 5.0 code path
(the block handling top_k at lines ~516–525) so both code paths guarantee top_k
>= original_top_k instead of asserting.
- Around line 490-533: The forward method currently only expands experts when
_moe_calib_experts_ratio is set; change the logic so that when is_calib is True
and _moe_calib_experts_ratio is None you default it to 1.0 (i.e., all experts)
to match the class docstring; update forward to treat is_calib branches as: if
is_calib: if self._moe_calib_experts_ratio is None: use ratio = 1.0 (or set
self._moe_calib_experts_ratio = 1.0 temporarily), then perform the gate/top_k or
top_k adjustments (refer to forward, _moe_calib_experts_ratio, gate.top_k,
top_k, num_experts, experts) and ensure _count_expert_tokens is True only during
calibration and False for normal inference (remove the current else that sets
_count_expert_tokens=True for non-calibration).
- Line 461: The allocation of expert_token_count is hardcoded to cuda and should
instead use the gate module's device (or defer to first forward); update the
allocation of self.expert_token_count in the class that defines it to infer
device from the gate parameters (e.g., device =
next(self.gate.parameters()).device) and allocate torch.zeros(num_experts,
dtype=torch.long, device=device) or move the allocation into the first forward
so it uses the actual module device; ensure any subsequent references (e.g.,
where .to() was used at line 488) are removed or adjusted accordingly to avoid
redundant device transfers.

---

Nitpick comments:
In `@modelopt/torch/quantization/config.py`:
- Around line 1073-1081: The title/description for moe_calib_experts_ratio is
misleading: it currently reads "% of experts" but the field expects a ratio in
(0,1]. Update the ModeloptField metadata for moe_calib_experts_ratio (title
and/or description) to explicitly state it is a ratio between 0 and 1 (e.g.,
"Fraction of experts to calibrate (0–1)") or alternatively accept a percentage
and convert to a ratio internally; ensure the ModeloptField default/description
reflects the chosen semantics so users aren't confused.

Comment on lines +236 to +243
    if moe_calib_experts_ratio:
        if isinstance(quant_cfg["algorithm"], str):
            quant_cfg["algorithm"] = {
                "method": quant_cfg["algorithm"],
                "moe_calib_experts_ratio": moe_calib_experts_ratio,
            }
        else:
            quant_cfg["algorithm"]["moe_calib_experts_ratio"] = moe_calib_experts_ratio
@coderabbitai coderabbitai bot commented Feb 20, 2026

⚠️ Potential issue | 🔴 Critical



Crash when algorithm is None.

The code will crash when moe_calib_experts_ratio is truthy (the CLI default is 0.25) and the quantization config has "algorithm": None (e.g., mxfp8, mxfp6, mxfp4, mxint8, w4a8_mxfp4_fp8). At line 243, the else branch attempts None["moe_calib_experts_ratio"] = ..., raising a TypeError: 'NoneType' object is not subscriptable.

Any user running with a None-algorithm format (e.g., --qformat mxfp8) using the CLI default will immediately hit this crash.

Proposed fix
    if moe_calib_experts_ratio:
+       if quant_cfg["algorithm"] is None:
+           quant_cfg["algorithm"] = {
+               "method": None,
+               "moe_calib_experts_ratio": moe_calib_experts_ratio,
+           }
-       if isinstance(quant_cfg["algorithm"], str):
+       elif isinstance(quant_cfg["algorithm"], str):
            quant_cfg["algorithm"] = {
                "method": quant_cfg["algorithm"],
                "moe_calib_experts_ratio": moe_calib_experts_ratio,
            }
        else:
            quant_cfg["algorithm"]["moe_calib_experts_ratio"] = moe_calib_experts_ratio

Alternatively, only inject the ratio when the model is actually an MoE model, or change the CLI default to None and only inject when explicitly provided.


@realAsma (Contributor) replied:

+1

@coderabbitai coderabbitai bot (Contributor) replied:

@realAsma, thanks for confirming! This should prevent crashes for users running MoE calibration with formats like mxfp8, mxfp6, mxfp4, mxint8, or w4a8_mxfp4_fp8 that have algorithm: None in their configs. 🐰

Comment on lines +1130 to +1138
    parser.add_argument(
        "--moe_calib_experts_ratio",
        type=float,
        default=1.0 / 4,
        help=(
            "Percentage of experts to calibrate during forward pass. Only used for MOE models. "
            "This is used to reduce the number of experts to calibrate during forward pass. "
        ),
    )

⚠️ Potential issue | 🔴 Critical

Default 0.25 is unconditionally applied to all models, including non-MoE ones.

Since the default is 1.0 / 4 (always truthy), every invocation of hf_ptq.py will inject moe_calib_experts_ratio into the algorithm config—even for non-MoE models. Combined with the crash when algorithm is None (flagged in example_utils.py), this makes --qformat mxfp8 (and similar) unusable out of the box.

Consider defaulting to None so the ratio is only injected when the user explicitly requests it:

     parser.add_argument(
         "--moe_calib_experts_ratio",
         type=float,
-        default=1.0 / 4,
+        default=None,
         help=(
-            "Percentage of experts to calibrate during forward pass. Only used for MOE models. "
-            "This is used to reduce the number of experts to calibrate during forward pass. "
+            "Ratio of experts to calibrate during forward pass (0, 1]. Only used for MOE models. "
+            "Default behavior routes to all experts if not specified. "
+            "Example: 0.25 calibrates 25%% of experts. "
         ),
     )

        num_experts = self.experts.num_experts

-        self.expert_token_count = torch.zeros(num_experts, dtype=torch.long, device="cpu")
+        self.expert_token_count = torch.zeros(num_experts, dtype=torch.long, device="cuda")
⚠️ Potential issue | 🟠 Major


Hardcoded device="cuda" will fail on CPU and multi-GPU setups.

Line 461 allocates expert_token_count on cuda:0 regardless of where the module resides. This breaks:

  • CPU-only testing environments
  • Multi-GPU configurations where the module is on cuda:1 or higher (allocates on wrong device, then requires inefficient .to() movement at line 488)

Infer device from the gate module's parameters instead:

Proposed fix
-        self.expert_token_count = torch.zeros(num_experts, dtype=torch.long, device="cuda")
+        device = next(self.gate.parameters()).device if hasattr(self, "gate") else "cuda"
+        self.expert_token_count = torch.zeros(num_experts, dtype=torch.long, device=device)

Alternatively, defer allocation to the first forward pass to avoid device placement assumptions.
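A minimal sketch of that deferred-allocation alternative (the hook shape, self.experts.num_experts, and the token-count update are assumptions mirroring the snippets above, not the plugin's actual code):

import torch

def _count_expert_tokens_lazily(self, selected_experts: torch.Tensor) -> None:
    # Allocate (or re-allocate) the counter on the device the routing indices live on,
    # so the module works on CPU and on any CUDA rank without explicit .to() calls.
    num_experts = self.experts.num_experts
    if (
        getattr(self, "expert_token_count", None) is None
        or self.expert_token_count.device != selected_experts.device
    ):
        self.expert_token_count = torch.zeros(
            num_experts, dtype=torch.long, device=selected_experts.device
        )
    self.expert_token_count += torch.bincount(
        selected_experts.flatten(), minlength=num_experts
    )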


Comment on lines 490 to 533
    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        is_calib = any(getattr(m, "_if_calib", False) for m in self.experts.modules())
-        if is_calib:
-            self._count_expert_tokens = is_calib
+        if is_calib and self._moe_calib_experts_ratio:
+            self._count_expert_tokens = True
+            assert 0 < self._moe_calib_experts_ratio <= 1, (
+                "moe_calib_experts_ratio must be between 0 and 1"
+            )
            # If any of the experts are in calibration mode, we will forward all tokens to all experts
            # This is used only for calibration, we need to re-calculate the actual outputs again using
            # the original top_k
            if TRANSFORMERS_VERSION_GE_5_0:
                assert hasattr(self, "gate") and hasattr(self.gate, "top_k")
                original_top_k = self.gate.top_k
-                self.gate.top_k = self.gate.num_experts
+                self.gate.top_k = round(self.gate.num_experts * self._moe_calib_experts_ratio)
+                assert self.gate.top_k >= original_top_k, (
+                    f"moe_calib_experts_ratio {self._moe_calib_experts_ratio},"
+                    f" calib top_k {self.gate.top_k} smaller than original"
+                    f" top_k {original_top_k}"
+                )
                super().forward(hidden_states)
                self.gate.top_k = original_top_k
            else:
                # Path for transformers < 5.0
                original_top_k = self.top_k
                if hasattr(self, "num_experts"):
-                    self.top_k = self.num_experts
+                    self.top_k = round(self.num_experts * self._moe_calib_experts_ratio)
                elif hasattr(self, "experts"):
-                    self.top_k = self.experts.num_experts
+                    self.top_k = round(self.experts.num_experts * self._moe_calib_experts_ratio)
                else:
                    raise ValueError(f"Could not find num_experts in module {self}")
+                assert self.top_k >= original_top_k, (
+                    f"moe_calib_experts_ratio {self._moe_calib_experts_ratio},"
+                    f" calib top_k {self.top_k} smaller than original"
+                    f" top_k {original_top_k}"
+                )
                super().forward(hidden_states)
                self.top_k = original_top_k
-            # Enable counting only for the real-routing forward during calibration
-            self._count_expert_tokens = is_calib
+            self._count_expert_tokens = False
        else:
            self._count_expert_tokens = True
        output = super().forward(hidden_states)
        self._count_expert_tokens = False
        return output

⚠️ Potential issue | 🟠 Major


Clarify whether all-experts calibration should be the default during quantization.

The class docstring promises "During calibration, we forward all tokens to all experts so that all experts see sufficient tokens to calibrate" (line 445), but this behavior only activates when _moe_calib_experts_ratio is explicitly set in the quantization config. Since it defaults to None, users relying on the documented behavior will not get the expanded-expert forward pass.

Additionally, the else block at lines 529-530 enables token counting for both inference (is_calib=False) and calibration with unset ratio (is_calib=True, ratio=None), creating unnecessary overhead during inference when tokens should not be counted.

Either set a default ratio (e.g., 1.0 for all experts) when entering calibration mode, or update the docstring to clarify that expanded-expert forwarding requires explicit configuration.
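If the all-experts default is the desired resolution, a small sketch of the fallback (a standalone helper for illustration only; the real change would live inside the calibration branch of forward):

def effective_calib_ratio(configured: float | None) -> float:
    # None keeps the documented behavior: force-route tokens to all experts during calibration.
    ratio = 1.0 if configured is None else configured
    if not 0.0 < ratio <= 1.0:
        raise ValueError(f"moe_calib_experts_ratio must be in (0, 1], got {ratio}")
    return ratio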


Comment on lines 504 to 509
                self.gate.top_k = round(self.gate.num_experts * self._moe_calib_experts_ratio)
                assert self.gate.top_k >= original_top_k, (
                    f"moe_calib_experts_ratio {self._moe_calib_experts_ratio},"
                    f" calib top_k {self.gate.top_k} smaller than original"
                    f" top_k {original_top_k}"
                )
@coderabbitai coderabbitai bot (Contributor) commented Feb 20, 2026

⚠️ Potential issue | 🟠 Major

The assertion calib top_k >= original_top_k may fail for small expert counts.

If num_experts = 2 and original_top_k = 2 and ratio = 0.25, then round(2 * 0.25) = round(0.5) = 0 (Python's banker's rounding), which is less than original_top_k = 2, triggering the assertion. Even with round(2 * 0.25) = 1 (if rounding up), it's still less than 2.

The assertion message says the ratio-based top_k must be >= original_top_k, but this constraint can easily be violated for models with few experts or low ratios. Consider clamping top_k to max(calib_top_k, original_top_k) instead of asserting.

Proposed fix (transformers >= 5.0 path)
-                self.gate.top_k = round(self.gate.num_experts * self._moe_calib_experts_ratio)
-                assert self.gate.top_k >= original_top_k, (
-                    f"moe_calib_experts_ratio {self._moe_calib_experts_ratio},"
-                    f" calib top_k {self.gate.top_k} smaller than original"
-                    f" top_k {original_top_k}"
-                )
+                self.gate.top_k = max(
+                    round(self.gate.num_experts * self._moe_calib_experts_ratio),
+                    original_top_k,
+                )

The same applies to the transformers < 5.0 path at lines 516–525.


@realAsma (Contributor) replied:

+1

@coderabbitai coderabbitai bot (Contributor) replied:

@realAsma, thank you for acknowledging the suggestion! I'll look forward to seeing the fix applied. Feel free to tag me if you have any questions when implementing the change.

cjluo-nv and others added 3 commits February 19, 2026 19:29
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Chenjie Luo <108829653+cjluo-nv@users.noreply.github.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
…del-Optimizer into chenjiel/update_moe_calib_algo
title="This field specifies the name of the calibration algorithm. If None, no calibration is performed.",
)

moe_calib_experts_ratio: float | None = ModeloptField(

Yea this is good idea to put this here.

@realAsma realAsma (Contributor) left a comment

Overall looks great! I left my minor comments.

cjluo-nv and others added 4 commits February 20, 2026 13:16
Co-authored-by: realAsma <86726418+realAsma@users.noreply.github.com>
Signed-off-by: Chenjie Luo <108829653+cjluo-nv@users.noreply.github.com>
…del-Optimizer into chenjiel/update_moe_calib_algo
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv cjluo-nv requested a review from Edwardf0t1 February 20, 2026 21:18
"(sensitivity scores, costs, etc.). Only used when auto_quantize_bits is specified."
),
)
parser.add_argument(

Should we add something similar to mcore PTQ as well? cc @realAsma @jenchen13 @ChenhanYu

@Edwardf0t1 Edwardf0t1 (Contributor) commented Feb 20, 2026

@cjluo-nv could we try a larger MoE model, like Kimi-k2.5 or Qwen3.5-MoE with this change? I feel the accuracy impact might be bigger for large MoE models with less MoE calibrated.

Also, if 25% experts is calibrated, can we ensure all experts have a valid weight_scale and input_scale in the exported checkpoint?
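One way to sanity-check that concern after calibration is a quick scan over the quantizer modules; a hedged sketch (the quantizer naming convention and the amax attribute are assumptions about ModelOpt's quantizer modules, to be verified against the library):

def check_all_experts_calibrated(model) -> list[str]:
    # After the calibration forward_loop, every enabled quantizer should have its
    # calibration statistics (amax) populated; collect the names of any that do not.
    missing = []
    for name, module in model.named_modules():
        if name.endswith(("input_quantizer", "weight_quantizer")):
            if getattr(module, "amax", None) is None:
                missing.append(name)
    return missing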

    parser.add_argument(
        "--moe_calib_experts_ratio",
        type=float,
        default=1.0 / 4,

How was the 25% selected?

Copilot AI left a comment

Pull request overview

Adds a configurable expert-calibration routing ratio for MoE calibration to avoid the expensive “route to all experts” calibration forward, and updates token-count tracking to follow the calibration routing.

Changes:

  • Introduces moe_calib_experts_ratio in the quantization algorithm config and propagates it into MoE modules during calibration.
  • Updates _QuantSparseMoe.forward to force routing to a fraction of experts (instead of all) and moves expert token counting to CUDA.
  • Adds --moe_calib_experts_ratio CLI flag and updates the exported HTML token-count table labeling.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 12 comments.

Summary per file:

  • modelopt/torch/quantization/plugins/huggingface.py: Implements ratio-based calibration routing and CUDA-side expert token counting for HF MoE modules.
  • modelopt/torch/quantization/mode.py: Propagates moe_calib_experts_ratio from the algorithm config into quantized MoE modules during calibration.
  • modelopt/torch/quantization/config.py: Adds moe_calib_experts_ratio to QuantizeAlgorithmConfig.
  • modelopt/torch/export/moe_utils.py: Renames the HTML table title to reflect calibration token counts.
  • examples/llm_ptq/hf_ptq.py: Adds the CLI flag and passes the ratio into the quant config builder.
  • examples/llm_ptq/example_utils.py: Plumbs the ratio into the quantization config structure.
  • CHANGELOG.rst: Documents the new CLI flag and default behavior.


model_type,
quant_cfg_choices,
kv_quant_cfg_choices,
moe_calib_experts_ratio,
Copilot AI commented Feb 20, 2026

build_quant_cfg gained the moe_calib_experts_ratio parameter, but not all call sites appear to have been updated (e.g. examples/llm_ptq/multinode_ptq.py still calls it with the old signature). This will raise TypeError at runtime; update all callers or provide a default for the new arg.

Suggested change
-    moe_calib_experts_ratio,
+    moe_calib_experts_ratio: float | None = None,

Comment on lines +236 to +240
    if moe_calib_experts_ratio:
        if isinstance(quant_cfg["algorithm"], str):
            quant_cfg["algorithm"] = {
                "method": quant_cfg["algorithm"],
                "moe_calib_experts_ratio": moe_calib_experts_ratio,
Copilot AI commented Feb 20, 2026

if moe_calib_experts_ratio: uses truthiness and will silently ignore 0.0 (and will treat negative values as enabled). Prefer if moe_calib_experts_ratio is not None: and validate 0 < ratio <= 1 so invalid values fail fast with a clear error.

Comment on lines +1135 to 1140
"Percentage of experts to calibrate during forward pass. Only used for MOE models. "
"This is used to reduce the number of experts to calibrate during forward pass. "
),
)

return parser.parse_args()
Copilot AI commented Feb 20, 2026

The flag help says "Percentage" but the code/examples treat this as a ratio in (0, 1]. Consider clarifying the help text and validating the allowed range in parse_args so invalid values are rejected early.

Suggested change
-            "Percentage of experts to calibrate during forward pass. Only used for MOE models. "
-            "This is used to reduce the number of experts to calibrate during forward pass. "
-        ),
-    )
-    return parser.parse_args()
+            "Fraction of experts to calibrate during forward pass (ratio in (0.0, 1.0]). "
+            "Only used for MOE models; used to reduce the number of experts calibrated during the forward pass."
+        ),
+    )
+    args = parser.parse_args()
+    if not (0.0 < args.moe_calib_experts_ratio <= 1.0):
+        parser.error("--moe_calib_experts_ratio must be in the range (0.0, 1.0].")
+    return args

kwargs["algorithm"] = method

moe_calib_experts_ratio = kwargs.pop("moe_calib_experts_ratio", None)
if moe_calib_experts_ratio is not None:
Copilot AI commented Feb 20, 2026

moe_calib_experts_ratio is propagated into modules without validation. Since it directly impacts routing, validate 0 < ratio <= 1 here and raise a user-facing error early instead of deferring to downstream asserts in the forward path.

Suggested change
-    if moe_calib_experts_ratio is not None:
+    if moe_calib_experts_ratio is not None:
+        # Validate early to avoid downstream assertion failures in the forward path.
+        if not isinstance(moe_calib_experts_ratio, (int, float)):
+            raise ValueError(
+                f"Invalid moe_calib_experts_ratio {moe_calib_experts_ratio!r}: "
+                "expected a numeric value in the range (0, 1]."
+            )
+        if not (0 < moe_calib_experts_ratio <= 1):
+            raise ValueError(
+                f"Invalid moe_calib_experts_ratio {moe_calib_experts_ratio!r}: "
+                "expected 0 < ratio <= 1."
+            )

Comment on lines +461 to +462
        self.expert_token_count = torch.zeros(
            num_experts, dtype=torch.long, device=next(self.parameters()).device
Copilot AI commented Feb 20, 2026

expert_token_count is created as a plain tensor attribute. Because it is not registered as a buffer, it will not follow .to(...) / .cuda() device moves and may be dropped from state_dict, which can lead to device-mismatch errors in _gate_forward_hook after moving the model. Register it as a (non-persistent) buffer instead of a bare attribute (and initialize it on a known device, e.g., self.gate.weight.device / a parameter device).

Suggested change
-        self.expert_token_count = torch.zeros(
-            num_experts, dtype=torch.long, device=next(self.parameters()).device
+        self.register_buffer(
+            "expert_token_count",
+            torch.zeros(num_experts, dtype=torch.long, device=next(self.parameters()).device),
+            persistent=False,
Comment on lines 513 to +517
                original_top_k = self.top_k
                if hasattr(self, "num_experts"):
-                    self.top_k = self.num_experts
+                    self.top_k = max(
+                        original_top_k, round(self.num_experts * self._moe_calib_experts_ratio)
+                    )
Copilot AI commented Feb 20, 2026

Same rounding issue here: round(self.num_experts * ratio) can round down at .5 boundaries. Prefer ceil(...) (and then max(original_top_k, ...)) so the requested coverage fraction is not accidentally reduced.

Comment on lines +1094 to +1098
    moe_calib_experts_ratio: float | None = ModeloptField(
        default=None,
        title="% of experts to calibrate during forward pass.",
        description=(
            "If specified, we force forward tokens to % of experts during the calibration"
Copilot AI commented Feb 20, 2026

This new config field doesn't enforce bounds. Since valid values are in (0, 1], add pydantic constraints (e.g., gt=0, le=1) so invalid configs are rejected during parsing rather than failing later at runtime.
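For illustration, the kind of bound this asks for, expressed with plain pydantic (whether ModeloptField forwards these constraint keywords is an assumption to verify against the repo):

from typing import Annotated

from pydantic import BaseModel, Field, ValidationError

class _AlgoCfgSketch(BaseModel):
    # Ratio constrained to (0, 1]; None means "not set".
    moe_calib_experts_ratio: Annotated[float, Field(gt=0, le=1)] | None = None

_AlgoCfgSketch(moe_calib_experts_ratio=0.25)  # accepted
try:
    _AlgoCfgSketch(moe_calib_experts_ratio=1.5)
except ValidationError as err:
    print(err)  # rejected at parse time instead of failing later in the forward path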


- User does not need to manually register MOE modules to cover experts calibration coverage in PTQ workflow.
- ``hf_ptq.py`` now saves the quantization summary and moe expert token count table to the export directory.
- Add ``--moe_calib_experts_ratio`` flag in ``hf_ptq.py`` to specify the ratio of experts to calibrate during forward pass to improve expert coverage during calibration. Default to 1/4 of all the experts.
Copilot AI commented Feb 20, 2026

Changelog says the default is "1/4 of all the experts". In the library config the default is currently None (meaning behavior depends on whether the caller sets it, e.g. the CLI). Please clarify in the changelog entry whether the default applies only to hf_ptq.py or to the core API as well, to avoid misleading users.

Comment on lines 500 to 502
# If any of the experts are in calibration mode, we will forward all tokens to all experts
# This is used only for calibration, we need to re-calculate the actual outputs again using
# the original top_k
Copilot AI commented Feb 20, 2026

The comment says calibration forwards tokens to all experts, but this code now forwards to a configurable fraction via _moe_calib_experts_ratio. Please update the comment to match the new behavior to avoid confusion.

Comment on lines +495 to +499
        if is_calib and self._moe_calib_experts_ratio:
            self._count_expert_tokens = True
            assert 0 < self._moe_calib_experts_ratio <= 1, (
                "moe_calib_experts_ratio must be between 0 and 1"
            )
Copilot AI commented Feb 20, 2026

Input validation uses assert and relies on truthiness (if is_calib and self._moe_calib_experts_ratio:). assert is stripped with python -O, and a ratio of 0.0 would be silently treated as "unset". Prefer an explicit if ratio is not None: check and raise ValueError (or similar) when the value is out of range.


codecov bot commented Feb 20, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.09%. Comparing base (7c4c9fd) to head (76b9765).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
modelopt/torch/quantization/mode.py 40.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #910      +/-   ##
==========================================
- Coverage   73.10%   73.09%   -0.01%     
==========================================
  Files         205      205              
  Lines       22281    22287       +6     
==========================================
+ Hits        16288    16291       +3     
- Misses       5993     5996       +3     
