[SM120] Emit grouped NVFP4 (ue4m3) block-scaled GEMM in cutlass_library #3346

Open
waynehacking8 wants to merge 1 commit into NVIDIA:main from waynehacking8:sm120-grouped-nvf4-gen

Conversation

@waynehacking8

The SM120 fp4 block-scaled generator skipped grouped NVFP4 (ue4m3) kernels behind a stale
"not yet available" comment, even though to_grouped_schedule() already maps the Nvf4 Sm120
tags onto the PtrArray tags and example 79d_blackwell_geforce_nvfp4_grouped_gemm.cu runs that
exact kernel. Because the grouped tags collapse to PtrArray*, is_nvf4(kernel_schedule) is
False in the grouped branch, so that branch must key on the scale-factor (SF) type instead.
This PR adds the is_grouped && ue4m3 clause and corrects the stale comment.
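
For readers who want the reasoning in code form, here is a minimal, self-contained sketch of why a schedule-based check misses the grouped case. It is not the cutlass_library source: the `Schedule` and `SFType` enums, the member names, and the two `accepts_ue4m3_*` helpers are hypothetical stand-ins, and `is_nvf4` / `to_grouped_schedule` here are toy re-implementations that only mimic the behavior described above.

```python
from enum import Enum, auto


class Schedule(Enum):
    # Hypothetical stand-ins for kernel schedule tags (not the real enum).
    NVF4_SM120 = auto()       # non-grouped Nvf4 Sm120 tag
    PTR_ARRAY_SM120 = auto()  # tag the grouped schedules collapse onto


class SFType(Enum):
    # Hypothetical stand-ins for scale-factor element types.
    UE4M3 = auto()
    UE8M0 = auto()


def is_nvf4(schedule: Schedule) -> bool:
    """Toy version: true only for the explicit Nvf4 tag."""
    return schedule is Schedule.NVF4_SM120


def to_grouped_schedule(schedule: Schedule) -> Schedule:
    """Toy version: every grouped schedule collapses to a PtrArray tag."""
    return Schedule.PTR_ARRAY_SM120


def accepts_ue4m3_before(schedule: Schedule) -> bool:
    """Assumed shape of the old check: keyed on the schedule tag alone."""
    return is_nvf4(schedule)


def accepts_ue4m3_after(schedule: Schedule, sf_type: SFType, is_grouped: bool) -> bool:
    """Assumed shape of the new check: the grouped branch keys on the SF type."""
    return is_nvf4(schedule) or (is_grouped and sf_type is SFType.UE4M3)


grouped = to_grouped_schedule(Schedule.NVF4_SM120)
print(is_nvf4(grouped))                                   # schedule tag no longer says Nvf4
print(accepts_ue4m3_before(grouped))                      # old check rejects the grouped kernel
print(accepts_ue4m3_after(grouped, SFType.UE4M3, True))   # new clause accepts it
```

In this toy model the Nvf4 information is lost once the tag collapses, so the SF type is the only remaining signal in the grouped branch, which is the motivation for the added clause.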

Emits 14 grouped NVFP4 SM120 kernels that were previously absent (verified by driving the
grouped generator path: 0 -> 14). The kernel template is the one proven by example 79d.

Fixes #3343.

Test plan / limits

  • Verified emission 0 -> 14 by driving GenerateSM120_TensorOp_fp4_UMMA_gemm_with_block_scaled(..., gemm_kind=GemmKind.GroupedBlockScaledUniversal3x).
  • Verified the kernel template via example 79d on RTX PRO 6000 (SM120, CUDA 13): both schedules report Disposition: Passed, at 303 and 253 TFLOPS.
  • I did not run a full cutlass_profiler build of all 14 emitted tile variants locally (heavy). Please run internal CI to confirm that every emitted tile compiles. Some SM120 tiles may hit Stages>=2 constraints; if so, they can be pruned in the same clause.

The SM120 fp4 block-scaled generator skipped grouped NVFP4 (ue4m3) kernels
behind a stale "not yet available" comment, although to_grouped_schedule()
already maps the Nvf4 Sm120 tags onto the PtrArray tags and example 79d
(79d_blackwell_geforce_nvfp4_grouped_gemm.cu) runs that exact kernel. Because
the grouped tags collapse to PtrArray*, is_nvf4(kernel_schedule) is False in
the grouped branch, so it must key on the SF type instead.

Emits 14 grouped NVFP4 SM120 kernels that were previously absent (verified by
driving the grouped generator path: 0 -> 14). The kernel template is the one
proven by example 79d (303 TFLOPS on RTX PRO 6000, SM120, CUDA 13).

Fixes NVIDIA#3343.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: WEI CHENG CHIU <waynehacking8@gmail.com>

Development

Successfully merging this pull request may close these issues.

[BUG] cutlass_library does not emit grouped NVFP4 (ue4m3) block-scaled GEMM for SM120, although example 79d runs that kernel