Fix SM120 blockscaled FP32 epilogue store layout #3345

Open
NVIDIA-JerryChen wants to merge 1 commit into NVIDIA:main from NVIDIA-JerryChen:cjerry/fix-sm120-blockscaled-fp32-epilogue


@NVIDIA-JerryChen


Both kernels used StMatrix8x8x16bOp(..., 2) to build the C atom passed to make_tiled_copy_C_atom. For Float32 outputs, the actual register-to-SMEM copy atom is CopyUniversalOp, but the C TV layout is still derived from this stmatrix C atom. The x2 layout does not cover the full C fragment on this epilogue path, so the SMEM writes are incomplete by the time the final TMA store runs.

Changing the C atom to StMatrix8x8x16bOp(..., 4) gives a C TV layout that covers the full C fragment and fixes the observed FP32 output mismatch. We use x4 by default and reserve x2 for special cases, so I believe the x2 here was a typo.
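For reviewers who want a sense of how much smaller the x2 atom is, here is a rough plain-Python sketch of the footprint of one stmatrix instruction. It relies only on the PTX definition (8x8 matrices of 16-bit elements, issued by a 32-thread warp); the helper name `stmatrix_footprint` and the dataclass are hypothetical and are not part of CUTLASS or this PR. It does not model how make_tiled_copy_C_atom tiles the atom over the C fragment, which depends on kernel code not shown here.

```python
from dataclasses import dataclass

WARP_SIZE = 32      # threads that cooperate in one stmatrix instruction
MATRIX_ROWS = 8     # each stored matrix is 8x8
MATRIX_COLS = 8
ELEM_BITS = 16      # stmatrix operates on 16-bit (b16) elements
REG_BITS = 32       # each thread supplies 32-bit registers


@dataclass(frozen=True)
class StMatrixFootprint:
    """Hypothetical helper type: size of one stmatrix xN instruction."""
    num_matrices: int
    elements_per_warp: int
    elements_per_thread: int
    registers_per_thread: int


def stmatrix_footprint(num_matrices: int) -> StMatrixFootprint:
    """Count the b16 elements one stmatrix xN instruction stores."""
    if num_matrices not in (1, 2, 4):
        raise ValueError("stmatrix supports x1, x2, or x4")
    elements_per_warp = num_matrices * MATRIX_ROWS * MATRIX_COLS
    elements_per_thread = elements_per_warp // WARP_SIZE
    registers_per_thread = elements_per_thread * ELEM_BITS // REG_BITS
    return StMatrixFootprint(
        num_matrices,
        elements_per_warp,
        elements_per_thread,
        registers_per_thread,
    )


x2 = stmatrix_footprint(2)
x4 = stmatrix_footprint(4)
print(x2.elements_per_thread, x4.elements_per_thread)  # prints "4 8"
```

Under this model the x4 atom describes twice as many values per thread as the x2 atom, which matches the direction of the fix, though the exact coverage condition is determined by the kernel's epilogue tiling.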
