1 change: 1 addition & 0 deletions CHANGELOG.rst
@@ -8,6 +8,7 @@ NVIDIA Model Optimizer Changelog (Linux)

- Users no longer need to manually register MoE modules to ensure expert calibration coverage in the PTQ workflow.
- ``hf_ptq.py`` now saves the quantization summary and the MoE expert token count table to the export directory.
- Add a ``--moe_calib_experts_ratio`` flag to ``hf_ptq.py`` to specify the ratio of experts to calibrate during the forward pass, improving expert coverage during calibration. Defaults to 1/4 of all experts.
Copilot AI Feb 20, 2026

Changelog says the default is "1/4 of all the experts". In the library config the default is currently None (meaning behavior depends on whether the caller sets it, e.g. the CLI). Please clarify in the changelog entry whether the default applies only to hf_ptq.py or to the core API as well, to avoid misleading users.

- Add sparse attention optimization for transformer models (``modelopt.torch.sparsity.attention_sparsity``). This reduces computational cost by skipping attention computation. Supports calibration for threshold selection on HuggingFace models. See `examples/llm_sparsity/attention_sparsity/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_sparsity/attention_sparsity>`_ for usage.

0.42 (2026-02-xx)
10 changes: 10 additions & 0 deletions examples/llm_ptq/example_utils.py
@@ -201,6 +201,7 @@ def build_quant_cfg(
model_type,
quant_cfg_choices,
kv_quant_cfg_choices,
moe_calib_experts_ratio,
Copilot AI Feb 20, 2026

build_quant_cfg gained the moe_calib_experts_ratio parameter, but not all call sites appear to have been updated (e.g. examples/llm_ptq/multinode_ptq.py still calls it with the old signature). This will raise TypeError at runtime; update all callers or provide a default for the new arg.

Suggested change
moe_calib_experts_ratio,
moe_calib_experts_ratio: float | None = None,

) -> dict[str, Any]:
quant_cfg = {}
assert qformat in quant_cfg_choices, (
@@ -232,6 +233,15 @@ def build_quant_cfg(
getattr(mtq, kv_quant_cfg_choices[kv_cache_qformat])["quant_cfg"],
)

if moe_calib_experts_ratio:
if isinstance(quant_cfg["algorithm"], str):
quant_cfg["algorithm"] = {
"method": quant_cfg["algorithm"],
"moe_calib_experts_ratio": moe_calib_experts_ratio,
Comment on lines +236 to +240
Copilot AI Feb 20, 2026

if moe_calib_experts_ratio: uses truthiness and will silently ignore 0.0 (and will treat negative values as enabled). Prefer if moe_calib_experts_ratio is not None: and validate 0 < ratio <= 1 so invalid values fail fast with a clear error.

}
else:
quant_cfg["algorithm"]["moe_calib_experts_ratio"] = moe_calib_experts_ratio
Comment on lines +236 to +243
Contributor

@coderabbitai coderabbitai bot Feb 20, 2026

⚠️ Potential issue | 🔴 Critical



Crash when algorithm is None.

The code will crash when moe_calib_experts_ratio is truthy (the CLI default is 0.25) and the quantization config has "algorithm": None (e.g., mxfp8, mxfp6, mxfp4, mxint8, w4a8_mxfp4_fp8). At line 243, the else branch attempts None["moe_calib_experts_ratio"] = ..., raising a TypeError: 'NoneType' object is not subscriptable.

Any user running with a None-algorithm format (e.g., --qformat mxfp8) using the CLI default will immediately hit this crash.

Proposed fix
    if moe_calib_experts_ratio:
+       if quant_cfg["algorithm"] is None:
+           quant_cfg["algorithm"] = {
+               "method": None,
+               "moe_calib_experts_ratio": moe_calib_experts_ratio,
+           }
-       if isinstance(quant_cfg["algorithm"], str):
+       elif isinstance(quant_cfg["algorithm"], str):
            quant_cfg["algorithm"] = {
                "method": quant_cfg["algorithm"],
                "moe_calib_experts_ratio": moe_calib_experts_ratio,
            }
        else:
            quant_cfg["algorithm"]["moe_calib_experts_ratio"] = moe_calib_experts_ratio

Alternatively, only inject the ratio when the model is actually an MoE model, or change the CLI default to None and only inject when explicitly provided.


Contributor

+1

Contributor

@realAsma, thanks for confirming! This should prevent crashes for users running MoE calibration with formats like mxfp8, mxfp6, mxfp4, mxint8, or w4a8_mxfp4_fp8 that have algorithm: None in their configs. 🐰


# Gemma 7B has accuracy regression using alpha 1. We set 0.5 instead.
if model_type == "gemma" and "int8_sq" in qformat:
quant_cfg["algorithm"] = {"method": "smoothquant", "alpha": 0.5}
10 changes: 10 additions & 0 deletions examples/llm_ptq/hf_ptq.py
@@ -906,6 +906,7 @@ def quantize_main(
model_type,
QUANT_CFG_CHOICES,
KV_QUANT_CFG_CHOICES,
args.moe_calib_experts_ratio,
)

# Exclude MTP layers from quantization if detected (e.g., GLM-4.7's layer 92)
@@ -1126,6 +1127,15 @@ def parse_args() -> argparse.Namespace:
"(sensitivity scores, costs, etc.). Only used when auto_quantize_bits is specified."
),
)
parser.add_argument(
Contributor

Should we add something similar to mcore PTQ as well? cc @realAsma @jenchen13 @ChenhanYu

"--moe_calib_experts_ratio",
type=float,
default=1.0 / 4,
Contributor

How was the 25% default selected?

help=(
"Percentage of experts to calibrate during forward pass. Only used for MOE models. "
"This is used to reduce the number of experts to calibrate during forward pass. "
),
)
Comment on lines +1130 to +1138
Contributor

⚠️ Potential issue | 🔴 Critical

Default 0.25 is unconditionally applied to all models, including non-MoE ones.

Since the default is 1.0 / 4 (always truthy), every invocation of hf_ptq.py will inject moe_calib_experts_ratio into the algorithm config—even for non-MoE models. Combined with the crash when algorithm is None (flagged in example_utils.py), this makes --qformat mxfp8 (and similar) unusable out of the box.

Consider defaulting to None so the ratio is only injected when the user explicitly requests it:

     parser.add_argument(
         "--moe_calib_experts_ratio",
         type=float,
-        default=1.0 / 4,
+        default=None,
         help=(
-            "Percentage of experts to calibrate during forward pass. Only used for MOE models. "
-            "This is used to reduce the number of experts to calibrate during forward pass. "
+            "Ratio of experts to calibrate during forward pass (0, 1]. Only used for MOE models. "
+            "Default behavior routes to all experts if not specified. "
+            "Example: 0.25 calibrates 25%% of experts. "
         ),
     )


return parser.parse_args()
Comment on lines +1135 to 1140
Copilot AI Feb 20, 2026

The flag help says "Percentage" but the code/examples treat this as a ratio in (0, 1]. Consider clarifying the help text and validating the allowed range in parse_args so invalid values are rejected early.

Suggested change
"Percentage of experts to calibrate during forward pass. Only used for MOE models. "
"This is used to reduce the number of experts to calibrate during forward pass. "
),
)
return parser.parse_args()
"Fraction of experts to calibrate during forward pass (ratio in (0.0, 1.0]). "
"Only used for MOE models; used to reduce the number of experts calibrated during the forward pass."
),
)
args = parser.parse_args()
if not (0.0 < args.moe_calib_experts_ratio <= 1.0):
parser.error("--moe_calib_experts_ratio must be in the range (0.0, 1.0].")
return args


2 changes: 1 addition & 1 deletion modelopt/torch/export/moe_utils.py
@@ -48,7 +48,7 @@ def save_expert_token_count_table(model: nn.Module, output_dir: str | Path | Non
"th, td { border: 1px solid #ccc; padding: 4px 8px; text-align: right; }",
"th { background: #f0f0f0; }",
"</style></head><body>",
"<h2>Expert Token Counts (per MoE layer)</h2>",
"<h2>Expert Calib Token Counts (per MoE layer)</h2>",
"<table><tr><th>Layer/Expert</th>",
]
html_parts.extend(f"<th>{i}</th>" for i in range(num_experts))
10 changes: 10 additions & 0 deletions modelopt/torch/quantization/config.py
@@ -1091,6 +1091,16 @@ class QuantizeAlgorithmConfig(ModeloptBaseConfig):
title="This field specifies the name of the calibration algorithm. If None, no calibration is performed.",
)

moe_calib_experts_ratio: float | None = ModeloptField(
Contributor

Yeah, this is a good idea to put here.

default=None,
title="% of experts to calibrate during forward pass.",
description=(
"If specified, we force forward tokens to % of experts during the calibration"
Comment on lines +1094 to +1098
Copilot AI Feb 20, 2026

This new config field doesn't enforce bounds. Since valid values are in (0, 1], add pydantic constraints (e.g., gt=0, le=1) so invalid configs are rejected during parsing rather than failing later at runtime.

" pass. This forward is for calibration purpose only and will not affect the"
" actual inference."
),
)
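The bounds suggested above can be enforced at parse time. A sketch using plain pydantic (ModeloptField wraps pydantic in the real config; the class name here is hypothetical):

```python
from typing import Annotated, Optional

from pydantic import BaseModel, Field, ValidationError

class MoeCalibSketch(BaseModel):
    # The constraint rides on the float member, so None stays allowed while
    # any provided value must satisfy 0 < ratio <= 1.
    moe_calib_experts_ratio: Optional[Annotated[float, Field(gt=0, le=1)]] = None

MoeCalibSketch()                                # ok: defaults to None
MoeCalibSketch(moe_calib_experts_ratio=0.25)    # ok: in range
try:
    MoeCalibSketch(moe_calib_experts_ratio=1.5)  # rejected during parsing
except ValidationError:
    pass
```

Rejecting invalid configs here avoids deferring the failure to downstream asserts in the forward path.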


class MaxCalibConfig(QuantizeAlgorithmConfig):
"""The config for max calibration algorithm.
6 changes: 6 additions & 0 deletions modelopt/torch/quantization/mode.py
@@ -225,6 +225,12 @@ def wrapped_calib_func(
# For backward compatibility
kwargs["algorithm"] = method

moe_calib_experts_ratio = kwargs.pop("moe_calib_experts_ratio", None)
if moe_calib_experts_ratio is not None:
Copilot AI Feb 20, 2026

moe_calib_experts_ratio is propagated into modules without validation. Since it directly impacts routing, validate 0 < ratio <= 1 here and raise a user-facing error early instead of deferring to downstream asserts in the forward path.

Suggested change
if moe_calib_experts_ratio is not None:
if moe_calib_experts_ratio is not None:
# Validate early to avoid downstream assertion failures in the forward path.
if not isinstance(moe_calib_experts_ratio, (int, float)):
raise ValueError(
f"Invalid moe_calib_experts_ratio {moe_calib_experts_ratio!r}: "
"expected a numeric value in the range (0, 1]."
)
if not (0 < moe_calib_experts_ratio <= 1):
raise ValueError(
f"Invalid moe_calib_experts_ratio {moe_calib_experts_ratio!r}: "
"expected 0 < ratio <= 1."
)

for module in model.modules():
if hasattr(module, "_moe_calib_experts_ratio"):
module._moe_calib_experts_ratio = moe_calib_experts_ratio

if func is not None:
# Call the function with forward_loop as a separate argument
func(model, forward_loop=forward_loop, **kwargs)
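The propagation step above can be illustrated with stand-in objects in place of torch modules (all names below are illustrative, not project code):

```python
class FakeMoeBlock:
    """Stands in for a quantized MoE module that exposes the calibration knob."""
    def __init__(self):
        self._moe_calib_experts_ratio = None

class FakeModel:
    def __init__(self, submodules):
        self._submodules = submodules
    def modules(self):
        return iter(self._submodules)

def propagate_moe_ratio(model, kwargs):
    # Pop the kwarg so downstream calib functions never see it, mirroring
    # kwargs.pop("moe_calib_experts_ratio", None) in wrapped_calib_func.
    ratio = kwargs.pop("moe_calib_experts_ratio", None)
    if ratio is not None:
        if not 0 < ratio <= 1:  # early validation, as the comment suggests
            raise ValueError(f"expected 0 < ratio <= 1, got {ratio}")
        for module in model.modules():
            if hasattr(module, "_moe_calib_experts_ratio"):
                module._moe_calib_experts_ratio = ratio

moe = FakeMoeBlock()
model = FakeModel([moe, object()])  # the second module lacks the attribute
propagate_moe_ratio(model, {"moe_calib_experts_ratio": 0.25})
```

Only modules that define the attribute are touched, so unrelated submodules pass through unchanged.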
36 changes: 25 additions & 11 deletions modelopt/torch/quantization/plugins/huggingface.py
@@ -458,8 +458,11 @@ def _setup(self):
elif hasattr(self, "experts") and hasattr(self.experts, "num_experts"):
num_experts = self.experts.num_experts

self.expert_token_count = torch.zeros(num_experts, dtype=torch.long, device="cpu")
self.expert_token_count = torch.zeros(
num_experts, dtype=torch.long, device=next(self.parameters()).device
Comment on lines +461 to +462
Copilot AI Feb 20, 2026

expert_token_count is created as a plain tensor attribute. Because it is not registered as a buffer, it will not follow .to(...) / .cuda() device moves and may be dropped from state_dict, which can lead to device-mismatch errors in _gate_forward_hook after moving the model. Register it as a (non-persistent) buffer instead of a bare attribute (and initialize it on a known device, e.g., self.gate.weight.device / a parameter device).

Suggested change
self.expert_token_count = torch.zeros(
num_experts, dtype=torch.long, device=next(self.parameters()).device
self.register_buffer(
"expert_token_count",
torch.zeros(num_experts, dtype=torch.long, device=next(self.parameters()).device),
persistent=False,

)
self._count_expert_tokens = False
self._moe_calib_experts_ratio = None

if num_experts == 0:
warnings.warn(
@@ -483,36 +486,47 @@ def _gate_forward_hook(self, module, input, output):
logits = output if not isinstance(output, tuple) else output[0]
top_k = self.gate.top_k if hasattr(self.gate, "top_k") else self.top_k
_, indices = torch.topk(logits.float(), top_k, dim=-1)
counts = torch.bincount(
indices.reshape(-1).cpu(), minlength=len(self.expert_token_count)
)
self.expert_token_count += counts
counts = torch.bincount(indices.reshape(-1), minlength=self.expert_token_count.shape[0])
self.expert_token_count += counts.to(self.expert_token_count.device)

def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
is_calib = any(getattr(m, "_if_calib", False) for m in self.experts.modules())
if is_calib:
self._count_expert_tokens = is_calib
if is_calib and self._moe_calib_experts_ratio:
self._count_expert_tokens = True
assert 0 < self._moe_calib_experts_ratio <= 1, (
"moe_calib_experts_ratio must be between 0 and 1"
)
Comment on lines +495 to +499
Copilot AI Feb 20, 2026

Input validation uses assert and relies on truthiness (if is_calib and self._moe_calib_experts_ratio:). assert is stripped with python -O, and a ratio of 0.0 would be silently treated as "unset". Prefer an explicit if ratio is not None: check and raise ValueError (or similar) when the value is out of range.

# If any of the experts are in calibration mode, we will forward all tokens to all experts
# This is used only for calibration, we need to re-calculate the actual outputs again using
# the original top_k
Comment on lines 500 to 502
Copilot AI Feb 20, 2026

The comment says calibration forwards tokens to all experts, but this code now forwards to a configurable fraction via _moe_calib_experts_ratio. Please update the comment to match the new behavior to avoid confusion.

if TRANSFORMERS_VERSION_GE_5_0:
assert hasattr(self, "gate") and hasattr(self.gate, "top_k")
original_top_k = self.gate.top_k
self.gate.top_k = self.gate.num_experts
self.gate.top_k = max(
original_top_k, round(self.gate.num_experts * self._moe_calib_experts_ratio)
)
Comment on lines 505 to +508
Copilot AI Feb 20, 2026

round(self.gate.num_experts * ratio) can under-select experts because Python uses banker's rounding (e.g., 2.5 -> 2). To ensure you calibrate at least the requested fraction, use math.ceil(...) (still clamped by max(original_top_k, ...)).

super().forward(hidden_states)
self.gate.top_k = original_top_k
else:
# Path for transformers < 5.0
original_top_k = self.top_k
if hasattr(self, "num_experts"):
self.top_k = self.num_experts
self.top_k = max(
original_top_k, round(self.num_experts * self._moe_calib_experts_ratio)
)
Comment on lines 513 to +517
Copilot AI Feb 20, 2026

Same rounding issue here: round(self.num_experts * ratio) can round down at .5 boundaries. Prefer ceil(...) (and then max(original_top_k, ...)) so the requested coverage fraction is not accidentally reduced.

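The rounding pitfall flagged in these two comments is easy to reproduce in plain Python: round() uses banker's rounding on .5 ties, while math.ceil never under-selects (a standalone illustration, not project code; the original top_k of 2 is hypothetical):

```python
import math

num_experts = 10
ratio = 0.25                                 # requests 2.5 experts
assert round(num_experts * ratio) == 2       # banker's rounding: 2.5 -> 2
assert math.ceil(num_experts * ratio) == 3   # ceil guarantees the coverage
# Clamped as in the diff, with a hypothetical original top_k of 2:
top_k = max(2, math.ceil(num_experts * ratio))
assert top_k == 3
```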
elif hasattr(self, "experts"):
self.top_k = self.experts.num_experts
self.top_k = max(
original_top_k,
round(self.experts.num_experts * self._moe_calib_experts_ratio),
)
else:
raise ValueError(f"Could not find num_experts in module {self}")
super().forward(hidden_states)
self.top_k = original_top_k
# Enable counting only for the real-routing forward during calibration
self._count_expert_tokens = is_calib
self._count_expert_tokens = False
else:
self._count_expert_tokens = True
Copilot AI Feb 20, 2026

The else branch enables _count_expert_tokens unconditionally, so counting runs on every forward (including inference) and expert_token_count will accumulate outside calibration, adding overhead and producing misleading "calib" tables. _count_expert_tokens should stay False when is_calib is False and only be enabled for the intended calibration-routing forward(s).

Suggested change
self._count_expert_tokens = True
self._count_expert_tokens = is_calib

output = super().forward(hidden_states)
self._count_expert_tokens = False
return output
Comment on lines 492 to 532
Contributor

⚠️ Potential issue | 🟠 Major



Clarify whether all-experts calibration should be the default during quantization.

The class docstring promises "During calibration, we forward all tokens to all experts so that all experts see sufficient tokens to calibrate" (line 445), but this behavior only activates when _moe_calib_experts_ratio is explicitly set in the quantization config. Since it defaults to None, users relying on the documented behavior will not get the expanded-expert forward pass.

Additionally, the else block at lines 529-530 enables token counting for both inference (is_calib=False) and calibration with unset ratio (is_calib=True, ratio=None), creating unnecessary overhead during inference when tokens should not be counted.

Either set a default ratio (e.g., 1.0 for all experts) when entering calibration mode, or update the docstring to clarify that expanded-expert forwarding requires explicit configuration.
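One way to reconcile the docstring with the config, per the first option above, is to treat an unset ratio as 1.0 when entering calibration. A sketch (the helper name and the fallback are hypothetical, not the library's API):

```python
import math

def effective_calib_top_k(original_top_k, num_experts, ratio=None):
    """Compute the top_k used for the calibration forward pass."""
    if ratio is None:
        ratio = 1.0  # docstring behavior: forward tokens to all experts
    if not 0 < ratio <= 1:
        raise ValueError(f"expected 0 < ratio <= 1, got {ratio}")
    # ceil avoids under-selecting at .5 ties; clamp to the routing top_k
    return max(original_top_k, math.ceil(num_experts * ratio))
```

With no ratio set, calibration routes to all experts as documented; an explicit ratio reduces coverage but never below the model's own top_k.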

