Add INT8 support for LDS transpose load#2214
Open
stefankoncarevic wants to merge 29 commits into lds-transpose-load-fp8 from
Conversation
Force-pushed from f3176a8 to a75ab7a
Allow kpack < k_base when k_base >= 8 and k_base % kpack == 0. This enables better utilization of double-rate MFMA instructions (e.g., f16/bf16/int8 on gfx950, int8/fp8 on gfx942) with kpack=4. As a necessary companion fix for the relaxed validation, LDS transpose is disabled for prefetch when kpack < k_base.
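The relaxed check can be sketched in Python (illustrative names, not the actual rocMLIR code):

```python
def allow_kpack_below_kbase(kpack: int, k_base: int) -> bool:
    """Hypothetical sketch of the relaxed validation: kpack may be
    smaller than k_base only when k_base >= 8 and k_base is an
    exact multiple of kpack."""
    return kpack < k_base and k_base >= 8 and k_base % kpack == 0

# e.g. kpack = 4 against k_base = 8 is now accepted
print(allow_kpack_below_kbase(4, 8))   # True
print(allow_kpack_below_kbase(3, 8))   # False: 8 % 3 != 0
print(allow_kpack_below_kbase(4, 4))   # False: kpack is not below k_base
```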
The relaxed kpack validation (kpack < k_base) now applies only to double-buffer pipelines (scheduleVersion 2 or 4).
Force-pushed from aaf7a7b to e0bb0cd
16x16x128) on gfx950 architecture. These tests cover:
- Single buffering (scheduleVersion 1, 3) with kpack=32 and kpack=1
- Double buffering (scheduleVersion 2, 4) with kpack=32
- Double buffering with kpack < k_base (kpack=1, 4, 8)
- All FP8 type combinations: FP8×FP8, BF8×BF8, FP8×BF8, BF8×FP8

The tests verify that amdgpu.scaled_mfma operations are correctly generated for OCP FP8 types (f8E4M3FN, f8E5M2) with implicit scale factors.
- Remove duplicate entries in getMfmaInsnInfoMap
- Clarify the neutral scale creation comment in AccelEmitter.cpp
- Rename zeroAttr to neutralScaleAttr for clarity
Implement ds_read_tr8_b64 offset formulas for FP8/BF8 MFMA (16x32, 32x16). Enable mixed fp8/bf8 type combinations for GEMM operations on gfx950.
Add FP8 GEMM heuristic to selectively disable LDS transpose: disable LDS transpose for FP8 GEMM when K >= 1280, or for small square matrices (K == N < 512), to avoid performance regressions while preserving compile-time benefits.
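The FP8 GEMM heuristic above reduces to a simple predicate; this is an illustrative Python sketch, not the actual C++ check:

```python
def disable_lds_transpose_fp8_gemm(k: int, n: int) -> bool:
    """Sketch of the FP8 GEMM heuristic: skip LDS transpose for
    large K or for small square problems (assumed reading of the
    commit message)."""
    return k >= 1280 or (k == n and k < 512)

print(disable_lds_transpose_fp8_gemm(1280, 64))   # True: large K
print(disable_lds_transpose_fp8_gemm(256, 256))   # True: small square matrix
print(disable_lds_transpose_fp8_gemm(512, 512))   # False: square but not small
```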
- Add (32,64) geometry support in LdsTransposeLoad.cpp:
  - New getBasePanelOffsets() branch for the 32x64 quad-rate formula
  - k_block = block_id / 2, m_block = block_id % 2
  - kOffsetBase = k_local + k_block * 32
  - mOffsetBase = m_parity * 8 + m_block * 16
  - Update isQuadRate detection to include 32x64
- Add validation for (32,64) in RockDialect.cpp
- Extend tuning ranges for scaled FP8 testing:
  - kPackPerBlock: added 64
  - kPack: added 32 (for k_base=32)

Co-authored-by: Cursor <cursoragent@cursor.com>
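The (32,64) quad-rate formulas listed above can be sketched directly; variable names follow the commit text, not the actual C++:

```python
def base_panel_offsets_32x64(block_id: int, k_local: int, m_parity: int):
    """Sketch of the (32,64) quad-rate base-panel offset formulas
    from the commit message."""
    k_block = block_id // 2          # k_block = block_id / 2
    m_block = block_id % 2           # m_block = block_id % 2
    k_offset_base = k_local + k_block * 32
    m_offset_base = m_parity * 8 + m_block * 16
    return k_offset_base, m_offset_base

print(base_panel_offsets_32x64(block_id=3, k_local=5, m_parity=1))  # (37, 24)
```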
Force-pushed from 24d9bf6 to 076a998
This commit extends the LDS transpose load optimization to support workgroups with 8 waves (blockSize=512) and 16 waves (blockSize=1024). Previously, the optimization was limited to 1-4 waves; this restriction has been lifted to enable LDS transpose load for the larger workgroup sizes commonly used in high-performance GEMM configurations.

Changes:
- Extended the numWaves limit from 4 to 16 in decideLDSTransposeForOperands()
- Added wave grid layout computation for 8 waves:
  - 2×4, 4×2 (preferred balanced layouts)
  - 1×8, 8×1 (fallback layouts)
- Added wave grid layout computation for 16 waves:
  - 4×4 (preferred balanced layout)
  - 2×8, 8×2 (semi-balanced layouts)
  - 1×16, 16×1 (fallback layouts)

Updated tests:
- lds_transpose_attributes_toblockwise.mlir: changed CHECK-NOT to CHECK for the 8- and 16-wave tests, confirming LDS transpose is now enabled for these configurations
- PrLdsTransposeLoad.toml: added e2e test cases for 8-wave (4×2, 1×8) and 16-wave (8×2, 1×16) grid configurations
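The wave grid preference order described in the commit can be sketched as a lookup (a hypothetical helper, not the rocMLIR function):

```python
def wave_grid_candidates(num_waves: int):
    """Sketch of the layout preference order for 8- and 16-wave
    workgroups, in the order listed by the commit message:
    balanced layouts first, 1×N / N×1 fallbacks last."""
    if num_waves == 8:
        return [(2, 4), (4, 2), (1, 8), (8, 1)]
    if num_waves == 16:
        return [(4, 4), (2, 8), (8, 2), (1, 16), (16, 1)]
    raise ValueError("only the 8- and 16-wave orders are listed in the commit")

print(wave_grid_candidates(8)[0])    # (2, 4): preferred balanced layout
print(wave_grid_candidates(16)[0])   # (4, 4): preferred balanced layout
```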
When one operand uses a regular load and the other uses an LDS transpose load, the regular load must use a compatible K-access pattern. The new formula is applied only when:
- useLdsTransposeLoad is true (hybrid scenario)
- kVec >= kBase (enough elements to decompose)

This ensures correct data alignment between regular and transpose loads for MFMA operations, and prevents assertion failures when kpack < kBase.

Changes:
- Add a useLdsTransposeLoad parameter to wrapLDSBufferForLoad
- Implement the hybrid K-access formula with a blk_d/blk_k split
- Pass LDS transpose state from BlockwiseGemmToThreadwise
- Update tests in PrLdsTransposeLoad.toml
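The gating condition for the hybrid path reduces to two checks; this sketch models only the gate, not the blk_d/blk_k decomposition itself, which the commit message does not spell out:

```python
def use_hybrid_k_access(use_lds_transpose_load: bool,
                        k_vec: int, k_base: int) -> bool:
    """Gate for the hybrid K-access formula, per the commit text:
    one operand must be on the LDS transpose path, and the vector
    must hold at least k_base elements to decompose."""
    return use_lds_transpose_load and k_vec >= k_base

print(use_hybrid_k_access(True, 32, 16))    # True: hybrid, decomposable
print(use_hybrid_k_access(True, 8, 16))     # False: too few elements
print(use_hybrid_k_access(False, 32, 16))   # False: no transpose operand
```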
- Add GEMM1 LDS transpose tests (V transpose + P prefetch) to nightly
- Create PrLdsTransposeLoadAttention.toml with 14 quick PR tests
- Add INT8 (i8) support in LdsTransposeLoad.cpp for ds_read_tr8_b64
- Support mfma_i32_16x16x32_i8, mfma_i32_16x16x64_i8, mfma_i32_32x32x16_i8, mfma_i32_32x32x32_i8
- Add INT8 16x64 and 32x32 MFMA geometries with double-rate K coverage
- Handle the kpack=1 case for INT8 MFMAs with kBase=16 in AccelEmitter.cpp
- Add validation for INT8 MFMA geometries in RockDialect.cpp
- Add e2e tests for INT8 LDS transpose in GEMM and Attention
Disable LDS transpose load for INT8 convolutions when N=1600 (40x40 spatial output) and K <= M or K > 2*M. This fixes two significant performance regressions:
- 1x64x40x40, K=64: -62.87% regression
- 1x384x40x40, K=128: -43.79% regression

The heuristic has no impact on INT8 GEMM (no problems with N=1600) and does not affect any INT8 convolution improvements.
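The INT8 convolution heuristic above can be sketched as a predicate; the parenthesization of the K conditions is an assumed reading of the commit message, and the real C++ check may differ:

```python
def disable_lds_transpose_int8_conv(n: int, k: int, m: int) -> bool:
    """Sketch of the INT8 conv heuristic: only the N=1600 (40x40
    spatial output) shapes are affected, and only when K is either
    no larger than M or more than twice M."""
    return n == 1600 and (k <= m or k > 2 * m)

print(disable_lds_transpose_int8_conv(1600, 64, 64))    # True: K <= M
print(disable_lds_transpose_int8_conv(1600, 300, 128))  # True: K > 2*M
print(disable_lds_transpose_int8_conv(1600, 128, 100))  # False: M < K <= 2*M
print(disable_lds_transpose_int8_conv(1024, 64, 64))    # False: N != 1600
```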
Force-pushed from 3ccbf35 to a6318df
Force-pushed from a6e9ccf to b8674ba
Force-pushed from b8674ba to b54ad4c
Force-pushed from 43f0c7e to b4f76ac
Motivation
Extends LDS transpose load optimization to support INT8 data types for GEMM and Attention kernels on gfx950. This enables hardware-accelerated transposed loads (ds_read_tr8_b64) for all INT8 MFMAs (16x16x32, 16x16x64, 32x32x16, 32x32x32), improving performance for INT8 quantized inference.
Technical Details
Test Plan
Added MLIR unit tests
Added E2E tests
All tests verified on gfx950 hardware with numerical correctness validation
Test Result
Submission Checklist