
NVFP4 Performance issue with __Dyna layer #4735

@CHOcho-quan

Description


I wrote a simple two-layer GEMM model and used Model-Optimizer to quantize it.

Here are the TensorRT profiling results with bias=True on both GEMMs in PyTorch. Notice that the second Dyna layer (__myl_Dyna_myl0_5) takes 1.84 ms on average, roughly 3x slower than the GEMM itself.

[01/06/2026-19:14:38] [I] === Profile (566 iterations ) ===
[01/06/2026-19:14:38] [I] Time(ms) Avg.(ms) Median(ms) Time(%) Layer
[01/06/2026-19:14:38] [I] 0.88 0.0016 0.0015 0.0 __mye1641_myl0_0
[01/06/2026-19:14:38] [I] 406.71 0.7186 0.7185 9.6 __myl_ReshCast_myl0_1
[01/06/2026-19:14:38] [I] 265.11 0.4684 0.4684 6.3 __myl_DynaReshReshMove_myl0_2
[01/06/2026-19:14:38] [I] 546.56 0.9656 0.9657 12.9 /fc1/Gemm_myl0_3
[01/06/2026-19:14:38] [I] 1638.46 2.8948 2.8948 38.8 __myl_Cast_myl0_4
[01/06/2026-19:14:38] [I] 1044.57 1.8455 1.8455 24.7 __myl_Dyna_myl0_5
[01/06/2026-19:14:38] [I] 35.88 0.0634 0.0634 0.8 __myl_Move_myl0_6
[01/06/2026-19:14:38] [I] 283.67 0.5012 0.5012 6.7 /fc2/Gemm_myl0_7
[01/06/2026-19:14:38] [I] 0.60 0.0011 0.0011 0.0 __mye1643_myl0_8
[01/06/2026-19:14:38] [I] 4222.43 7.4601 7.4601 100.0 Total

However, if I set bias=False, the Dyna layer gets fused with the preceding layer and becomes much faster.

[01/06/2026-19:34:38] [I] === Profile (1411 iterations ) ===
[01/06/2026-19:34:38] [I] Time(ms) Avg.(ms) Median(ms) Time(%) Layer
[01/06/2026-19:34:38] [I] 1.51 0.0011 0.0011 0.0 __mye1674_myl0_0
[01/06/2026-19:34:38] [I] 931.95 0.6605 0.6605 27.1 __myl_ReshCast_myl0_1
[01/06/2026-19:34:38] [I] 707.04 0.5011 0.5011 20.6 __myl_DynaReshReshMove_myl0_2
[01/06/2026-19:34:38] [I] 929.85 0.6590 0.6590 27.1 __myl_FcCastDyna_myl0_3
[01/06/2026-19:34:38] [I] 110.53 0.0783 0.0783 3.2 __myl_Move_myl0_4
[01/06/2026-19:34:38] [I] 750.88 0.5322 0.5322 21.9 /fc2/MatMul_myl0_5
[01/06/2026-19:34:38] [I] 1.49 0.0011 0.0011 0.0 __mye1676_myl0_6
[01/06/2026-19:34:38] [I] 3433.27 2.4332 2.4332 100.0 Total

Even if I change the model to apply the bias as an explicit parameter (instead of using bias=True in PyTorch), it doesn't help. Is this expected, and why?
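To clarify, by "bias with parameter" I mean something like the following minimal sketch (the module name and shapes here are illustrative, not the exact code I benchmarked):

```python
import torch
import torch.nn as nn


class GemmWithExplicitBias(nn.Module):
    """Sketch: bias held as a separate nn.Parameter and added after the
    GEMM, instead of using nn.Linear(..., bias=True)."""

    def __init__(self, H: int):
        super().__init__()
        self.fc1 = nn.Linear(H, 4 * H, bias=False)
        # Explicit bias parameter, applied as a standalone elementwise add.
        self.bias = nn.Parameter(torch.zeros(4 * H))
        self.fc2 = nn.Linear(4 * H, H, bias=False)

    def forward(self, x):
        x = self.fc1(x) + self.bias  # GEMM followed by an explicit add
        return self.fc2(x)


# Quick shape check
m = GemmWithExplicitBias(64)
y = m(torch.randn(2, 64))
print(y.shape)  # torch.Size([2, 64])
```

The hope was that a standalone add would fuse with the GEMM the way an epilogue normally does, but the resulting engine still shows the separate Dyna layer.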

Related Model-Optimizer issue: NVIDIA/Model-Optimizer#751 (comment); they believe it is a TensorRT issue.

Environment

Container used (if applicable): nvcr.io/nvidia/pytorch:25.11-py3
OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): Ubuntu 22.04
CPU architecture (x86_64, aarch64): aarch64
GPU name (e.g. H100, A100, L40S): Nvidia Jetson Thor
GPU memory size: 128G
Number of GPUs: 1
Library versions (if applicable):
Python: 3.12
ModelOpt version or commit hash: NVIDIA/Model-Optimizer@cd0d185
CUDA: 13.1
TensorRT: 10

Relevant Files

N/A

Steps To Reproduce

I was testing on nvcr.io/nvidia/pytorch:25.11-py3 docker on Nvidia Jetson Thor (Blackwell).

The two-GEMM model is very simple:

import torch
import torch.nn as nn


class TwoGemm(nn.Module):
    # MLP-style: (B, T, H) → (B, T, 4H) → (B, T, H)
    def __init__(self, H: int):
        super().__init__()
        self.fc1 = nn.Linear(H, 4 * H, bias=False)
        # self.bias = Parameter(shape=(self.out_features, ), dtype=dtype)
        self.fc2 = nn.Linear(4 * H, H, bias=False)
        # self.fc3 = nn.Linear(H, 4 * H, bias=True)

    def forward(self, x):
        b, t, h = x.shape
        x = x.reshape(b * t, h)              # (B*T, H)
        x = self.fc1(x)                      # GEMM1
        # x = F.gelu(x)
        x = self.fc2(x)                      # GEMM2
        # x = self.fc3(x)
        x = x.reshape(b, t, h).contiguous()  # (B, T, H)
        return x

I used the Model-Optimizer torch-to-quant example to generate a quantized ONNX model and then built a TensorRT engine from it.
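For completeness, the engine build and per-layer profiling step can be sketched roughly as follows (file names are placeholders, and the exact flags are assumptions based on the standard trtexec workflow, not the exact commands I ran):

```shell
# Build a TensorRT engine from the quantized ONNX model.
# --stronglyTyped preserves the NVFP4 Q/DQ types from the ONNX graph.
trtexec --onnx=model.quant.onnx \
        --saveEngine=model.plan \
        --stronglyTyped

# Profile per-layer timings; this is where the __myl_Dyna layer shows up.
trtexec --loadEngine=model.plan \
        --dumpProfile \
        --separateProfileRun
```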
