Fix SM120 blockscaled FP32 epilogue store layout by NVIDIA-JerryChen · Pull Request #3345 · NVIDIA/cutlass

NVIDIA-JerryChen · 2026-06-23T09:00:22Z

Both kernels used StMatrix8x8x16bOp(..., 2) to build the C atom passed to make_tiled_copy_C_atom. For Float32 outputs, the actual register-to-SMEM copy atom is CopyUniversalOp, but the C TV layout is still derived from this stmatrix C atom. The x2 layout does not cover the full C fragment on this epilogue path, so the SMEM writes are incomplete before the final TMA store. Changing the C atom to StMatrix8x8x16bOp(..., 4) gives a C TV layout that covers the full C fragment and fixes the observed FP32 output mismatch. We use x4 by default and x2 only in special cases, so I think the x2 here was a typo.
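To make the x2 versus x4 difference concrete, here is a toy sketch in plain Python. It is not CUTLASS code and does not model the kernel's real fragment shape, which this description does not give. It only counts how many 16-bit elements a warp-wide stmatrix addresses for each matrix count, assuming the PTX semantics in which each matrix is 8x8 16-bit elements shared across 32 threads. The helper names are hypothetical.

```python
# Toy model (not CUTLASS code): element counts addressed by one warp-wide
# stmatrix, assuming each matrix is 8x8 16-bit elements spread over 32 threads.
WARP_SIZE = 32
MATRIX_ELEMS = 8 * 8  # one 8x8 matrix of 16-bit elements


def stmatrix_elems_per_warp(num_matrices: int) -> int:
    """16-bit elements addressed by one stmatrix with the given matrix count."""
    if num_matrices not in (1, 2, 4):
        raise ValueError("stmatrix matrix count must be 1, 2, or 4")
    return MATRIX_ELEMS * num_matrices


def stmatrix_elems_per_thread(num_matrices: int) -> int:
    """16-bit elements each of the 32 threads contributes."""
    return stmatrix_elems_per_warp(num_matrices) // WARP_SIZE


if __name__ == "__main__":
    for n in (2, 4):
        print(
            f"x{n}: {stmatrix_elems_per_warp(n)} elements per warp, "
            f"{stmatrix_elems_per_thread(n)} per thread"
        )
```

An x4 atom describes twice as many values per thread as an x2 atom. A TV layout derived from the x2 atom therefore spans half as much of the tile per atom, which is consistent with the incomplete SMEM writes described above but does not by itself explain them.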

Fix SM120 blockscaled FP32 epilogue store layout

acd94b3

NVIDIA-JerryChen wants to merge 1 commit into NVIDIA:main from NVIDIA-JerryChen:cjerry/fix-sm120-blockscaled-fp32-epilogue
