
NVFP4 Performance issue with __Dyna layer #4735

@CHOcho-quan

Description


I wrote a simple two-layer GEMM model and used Model-Optimizer to quantize it.

Here are the TensorRT profiling results with bias=True on both GEMMs in PyTorch. Notice that the second Dyna layer (__myl_Dyna_myl0_5) takes 1.84 ms on average, roughly 3x slower than the GEMM itself.

[01/06/2026-19:14:38] [I] === Profile (566 iterations ) ===
[01/06/2026-19:14:38] [I] Time(ms) Avg.(ms) Median(ms) Time(%) Layer
[01/06/2026-19:14:38] [I] 0.88 0.0016 0.0015 0.0 __mye1641_myl0_0
[01/06/2026-19:14:38] [I] 406.71 0.7186 0.7185 9.6 __myl_ReshCast_myl0_1
[01/06/2026-19:14:38] [I] 265.11 0.4684 0.4684 6.3 __myl_DynaReshReshMove_myl0_2
[01/06/2026-19:14:38] [I] 546.56 0.9656 0.9657 12.9 /fc1/Gemm_myl0_3
[01/06/2026-19:14:38] [I] 1638.46 2.8948 2.8948 38.8 __myl_Cast_myl0_4
[01/06/2026-19:14:38] [I] 1044.57 1.8455 1.8455 24.7 __myl_Dyna_myl0_5
[01/06/2026-19:14:38] [I] 35.88 0.0634 0.0634 0.8 __myl_Move_myl0_6
[01/06/2026-19:14:38] [I] 283.67 0.5012 0.5012 6.7 /fc2/Gemm_myl0_7
[01/06/2026-19:14:38] [I] 0.60 0.0011 0.0011 0.0 __mye1643_myl0_8
[01/06/2026-19:14:38] [I] 4222.43 7.4601 7.4601 100.0 Total

However, if I set bias=False, the Dyna layer gets fused with the preceding layer and becomes much faster.

[01/06/2026-19:34:38] [I] === Profile (1411 iterations ) ===
[01/06/2026-19:34:38] [I] Time(ms) Avg.(ms) Median(ms) Time(%) Layer
[01/06/2026-19:34:38] [I] 1.51 0.0011 0.0011 0.0 __mye1674_myl0_0
[01/06/2026-19:34:38] [I] 931.95 0.6605 0.6605 27.1 __myl_ReshCast_myl0_1
[01/06/2026-19:34:38] [I] 707.04 0.5011 0.5011 20.6 __myl_DynaReshReshMove_myl0_2
[01/06/2026-19:34:38] [I] 929.85 0.6590 0.6590 27.1 __myl_FcCastDyna_myl0_3
[01/06/2026-19:34:38] [I] 110.53 0.0783 0.0783 3.2 __myl_Move_myl0_4
[01/06/2026-19:34:38] [I] 750.88 0.5322 0.5322 21.9 /fc2/MatMul_myl0_5
[01/06/2026-19:34:38] [I] 1.49 0.0011 0.0011 0.0 __mye1676_myl0_6
[01/06/2026-19:34:38] [I] 3433.27 2.4332 2.4332 100.0 Total

Even if I change the model to apply the bias as an explicit parameter (instead of using bias=True in PyTorch), it doesn't help. Is this expected, and why?
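To clarify, by "bias with parameter" I mean something like the following minimal sketch (the module name and shapes here are illustrative, not the exact code I benchmarked):

```python
import torch
import torch.nn as nn


class GemmWithExplicitBias(nn.Module):
    """Sketch: bias held as a separate nn.Parameter and added after the
    GEMM, instead of using nn.Linear(..., bias=True)."""

    def __init__(self, H: int):
        super().__init__()
        self.fc1 = nn.Linear(H, 4 * H, bias=False)
        # Explicit bias parameter, applied as a standalone elementwise add.
        self.bias = nn.Parameter(torch.zeros(4 * H))
        self.fc2 = nn.Linear(4 * H, H, bias=False)

    def forward(self, x):
        x = self.fc1(x) + self.bias  # GEMM followed by an explicit add
        return self.fc2(x)


# Quick shape check
m = GemmWithExplicitBias(64)
y = m(torch.randn(2, 64))
print(y.shape)  # torch.Size([2, 64])
```

The hope was that a standalone add would fuse with the GEMM the way an epilogue normally does, but the resulting engine still shows the separate Dyna layer.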

Related Model-Optimizer issue: NVIDIA/Model-Optimizer#751 (comment); they believe it is a TensorRT issue.

Environment

Container used (if applicable): nvcr.io/nvidia/pytorch:25.11-py3
OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): Ubuntu 22.04
CPU architecture (x86_64, aarch64): aarch64
GPU name (e.g. H100, A100, L40S): Nvidia Jetson Thor
GPU memory size: 128G
Number of GPUs: 1
Library versions (if applicable):
Python: 3.12
ModelOpt version or commit hash: NVIDIA/Model-Optimizer@cd0d185
CUDA: 13.1
TensorRT: 10

Relevant Files

N/A

Steps To Reproduce

I was testing on nvcr.io/nvidia/pytorch:25.11-py3 docker on Nvidia Jetson Thor (Blackwell).

The two-GEMM model is very simple:

import torch
import torch.nn as nn


class TwoGemm(nn.Module):
    # MLP-style: (B, T, H) → (B, T, 4H) → (B, T, H)
    def __init__(self, H: int):
        super().__init__()
        self.fc1 = nn.Linear(H, 4 * H, bias=False)
        # self.bias = Parameter(shape=(self.out_features, ), dtype=dtype)
        self.fc2 = nn.Linear(4 * H, H, bias=False)
        # self.fc3 = nn.Linear(H, 4 * H, bias=True)

    def forward(self, x):
        b, t, h = x.shape
        x = x.reshape(b * t, h)              # (B*T, H)
        x = self.fc1(x)                      # GEMM1
        # x = F.gelu(x)
        x = self.fc2(x)                      # GEMM2
        # x = self.fc3(x)
        x = x.reshape(b, t, h).contiguous()  # (B, T, H)
        return x

I used the Model-Optimizer torch-to-quant example to generate a quantized ONNX model and then built a TensorRT engine from it.
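For completeness, the engine build and per-layer profiling step can be sketched roughly as follows (file names are placeholders, and the exact flags are assumptions based on the standard trtexec workflow, not the exact commands I ran):

```shell
# Build a TensorRT engine from the quantized ONNX model.
# --stronglyTyped preserves the NVFP4 Q/DQ types from the ONNX graph.
trtexec --onnx=model.quant.onnx \
        --saveEngine=model.plan \
        --stronglyTyped

# Profile per-layer timings; this is where the __myl_Dyna layer shows up.
trtexec --loadEngine=model.plan \
        --dumpProfile \
        --separateProfileRun
```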
