Add INT8 support for LDS transpose load#2214
Open
stefankoncarevic wants to merge 29 commits into lds-transpose-load-fp8 from
Conversation
Force-pushed from f3176a8 to a75ab7a
Allow kpack < k_base when k_base >= 8 and k_base % kpack == 0. This enables better utilization of double-rate MFMA instructions (e.g., f16/bf16/int8 on gfx950, int8/fp8 on gfx942) with kpack=4. As a necessary companion fix for the relaxed validation, LDS transpose is disabled for prefetch when kpack < k_base.
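The relaxed check can be sketched in Python (illustrative names, not the actual rocMLIR code):

```python
def allow_kpack_below_kbase(kpack: int, k_base: int) -> bool:
    """Hypothetical sketch of the relaxed validation: kpack may be
    smaller than k_base only when k_base >= 8 and k_base is an
    exact multiple of kpack."""
    return kpack < k_base and k_base >= 8 and k_base % kpack == 0

# e.g. kpack = 4 against k_base = 8 is now accepted
print(allow_kpack_below_kbase(4, 8))   # True
print(allow_kpack_below_kbase(3, 8))   # False: 8 % 3 != 0
print(allow_kpack_below_kbase(4, 4))   # False: kpack is not below k_base
```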
The relaxed kpack validation (kpack < k_base) now applies only to double-buffer pipelines (scheduleVersion 2 or 4).
Force-pushed from aaf7a7b to e0bb0cd
16x16x128) on gfx950 architecture. These tests cover:
- Single buffering (scheduleVersion 1, 3) with kpack=32 and kpack=1
- Double buffering (scheduleVersion 2, 4) with kpack=32
- Double buffering with kpack < k_base (kpack=1, 4, 8)
- All FP8 type combinations: FP8×FP8, BF8×BF8, FP8×BF8, BF8×FP8

The tests verify that amdgpu.scaled_mfma operations are correctly generated for OCP FP8 types (f8E4M3FN, f8E5M2) with implicit scale factors.
- Remove duplicate entries in getMfmaInsnInfoMap
- Clarify the neutral scale creation comment in AccelEmitter.cpp
- Rename zeroAttr to neutralScaleAttr for clarity
Implement ds_read_tr8_b64 offset formulas for FP8/BF8 MFMA (16x32, 32x16). Enable mixed fp8/bf8 type combinations for GEMM operations on gfx950.
Add FP8 GEMM heuristic to selectively disable LDS transpose: disable LDS transpose for FP8 GEMM when K >= 1280, or for small square matrices (K == N < 512), to avoid performance regressions while preserving compile-time benefits.
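The FP8 GEMM heuristic above reduces to a simple predicate; this is an illustrative Python sketch, not the actual C++ check:

```python
def disable_lds_transpose_fp8_gemm(k: int, n: int) -> bool:
    """Sketch of the FP8 GEMM heuristic: skip LDS transpose for
    large K or for small square problems (assumed reading of the
    commit message)."""
    return k >= 1280 or (k == n and k < 512)

print(disable_lds_transpose_fp8_gemm(1280, 64))   # True: large K
print(disable_lds_transpose_fp8_gemm(256, 256))   # True: small square matrix
print(disable_lds_transpose_fp8_gemm(512, 512))   # False: square but not small
```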
- Add (32,64) geometry support in LdsTransposeLoad.cpp:
  - New getBasePanelOffsets() branch for the 32x64 quad-rate formula
  - k_block = block_id / 2, m_block = block_id % 2
  - kOffsetBase = k_local + k_block * 32
  - mOffsetBase = m_parity * 8 + m_block * 16
  - Update isQuadRate detection to include 32x64
- Add validation for (32,64) in RockDialect.cpp
- Extend tuning ranges for scaled FP8 testing:
  - kPackPerBlock: added 64
  - kPack: added 32 (for k_base=32)

Co-authored-by: Cursor <cursoragent@cursor.com>
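The (32,64) quad-rate formulas listed above can be sketched directly; variable names follow the commit text, not the actual C++:

```python
def base_panel_offsets_32x64(block_id: int, k_local: int, m_parity: int):
    """Sketch of the (32,64) quad-rate base-panel offset formulas
    from the commit message."""
    k_block = block_id // 2          # k_block = block_id / 2
    m_block = block_id % 2           # m_block = block_id % 2
    k_offset_base = k_local + k_block * 32
    m_offset_base = m_parity * 8 + m_block * 16
    return k_offset_base, m_offset_base

print(base_panel_offsets_32x64(block_id=3, k_local=5, m_parity=1))  # (37, 24)
```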
Force-pushed from 24d9bf6 to 076a998
This commit extends the LDS transpose load optimization to support workgroups with 8 waves (blockSize=512) and 16 waves (blockSize=1024). Previously, the optimization was limited to 1-4 waves; this restriction has been lifted to enable LDS transpose load for the larger workgroup sizes commonly used in high-performance GEMM configurations.

Changes:
- Extended the numWaves limit from 4 to 16 in decideLDSTransposeForOperands()
- Added wave grid layout computation for 8 waves:
  - 2×4, 4×2 (preferred balanced layouts)
  - 1×8, 8×1 (fallback layouts)
- Added wave grid layout computation for 16 waves:
  - 4×4 (preferred balanced layout)
  - 2×8, 8×2 (semi-balanced layouts)
  - 1×16, 16×1 (fallback layouts)

Updated tests:
- lds_transpose_attributes_toblockwise.mlir: changed CHECK-NOT to CHECK for the 8- and 16-wave tests, confirming LDS transpose is now enabled for these configurations
- PrLdsTransposeLoad.toml: added e2e test cases for 8-wave (4×2, 1×8) and 16-wave (8×2, 1×16) grid configurations
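The wave grid preference order described in the commit can be sketched as a lookup (a hypothetical helper, not the rocMLIR function):

```python
def wave_grid_candidates(num_waves: int):
    """Sketch of the layout preference order for 8- and 16-wave
    workgroups, in the order listed by the commit message:
    balanced layouts first, 1×N / N×1 fallbacks last."""
    if num_waves == 8:
        return [(2, 4), (4, 2), (1, 8), (8, 1)]
    if num_waves == 16:
        return [(4, 4), (2, 8), (8, 2), (1, 16), (16, 1)]
    raise ValueError("only the 8- and 16-wave orders are listed in the commit")

print(wave_grid_candidates(8)[0])    # (2, 4): preferred balanced layout
print(wave_grid_candidates(16)[0])   # (4, 4): preferred balanced layout
```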
When one operand uses a regular load and the other uses an LDS transpose load, the regular load must use a compatible K-access pattern. The new formula is applied only when:
- useLdsTransposeLoad is true (hybrid scenario)
- kVec >= kBase (enough elements to decompose)

This ensures correct data alignment between regular and transpose loads for MFMA operations, and prevents assertion failures when kpack < kBase.

Changes:
- Add a useLdsTransposeLoad parameter to wrapLDSBufferForLoad
- Implement the hybrid K-access formula with a blk_d/blk_k split
- Pass LDS transpose state from BlockwiseGemmToThreadwise
- Update tests in PrLdsTransposeLoad.toml
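The gating condition for the hybrid path reduces to two checks; this sketch models only the gate, not the blk_d/blk_k decomposition itself, which the commit message does not spell out:

```python
def use_hybrid_k_access(use_lds_transpose_load: bool,
                        k_vec: int, k_base: int) -> bool:
    """Gate for the hybrid K-access formula, per the commit text:
    one operand must be on the LDS transpose path, and the vector
    must hold at least k_base elements to decompose."""
    return use_lds_transpose_load and k_vec >= k_base

print(use_hybrid_k_access(True, 32, 16))    # True: hybrid, decomposable
print(use_hybrid_k_access(True, 8, 16))     # False: too few elements
print(use_hybrid_k_access(False, 32, 16))   # False: no transpose operand
```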
- Add GEMM1 LDS transpose tests (V transpose + P prefetch) to nightly
- Create PrLdsTransposeLoadAttention.toml with 14 quick PR tests
- Add INT8 (i8) support in LdsTransposeLoad.cpp for ds_read_tr8_b64
- Support mfma_i32_16x16x32_i8, mfma_i32_16x16x64_i8, mfma_i32_32x32x16_i8, mfma_i32_32x32x32_i8
- Add INT8 16x64 and 32x32 MFMA geometries with double-rate K coverage
- Handle the kpack=1 case for INT8 MFMAs with kBase=16 in AccelEmitter.cpp
- Add validation for INT8 MFMA geometries in RockDialect.cpp
- Add e2e tests for INT8 LDS transpose in GEMM and Attention
Disable LDS transpose load for INT8 convolutions when N=1600 (40x40 spatial output) and K <= M or K > 2*M. This fixes two significant performance regressions:
- 1x64x40x40, K=64: -62.87% regression
- 1x384x40x40, K=128: -43.79% regression

The heuristic has no impact on INT8 GEMM (no problems with N=1600) and does not affect any INT8 convolution improvements.
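The INT8 convolution heuristic above can be sketched as a predicate; the parenthesization of the K conditions is an assumed reading of the commit message, and the real C++ check may differ:

```python
def disable_lds_transpose_int8_conv(n: int, k: int, m: int) -> bool:
    """Sketch of the INT8 conv heuristic: only the N=1600 (40x40
    spatial output) shapes are affected, and only when K is either
    no larger than M or more than twice M."""
    return n == 1600 and (k <= m or k > 2 * m)

print(disable_lds_transpose_int8_conv(1600, 64, 64))    # True: K <= M
print(disable_lds_transpose_int8_conv(1600, 300, 128))  # True: K > 2*M
print(disable_lds_transpose_int8_conv(1600, 128, 100))  # False: M < K <= 2*M
print(disable_lds_transpose_int8_conv(1024, 64, 64))    # False: N != 1600
```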
Force-pushed from 3ccbf35 to a6318df
Force-pushed from a6e9ccf to b8674ba
Force-pushed from b8674ba to b54ad4c
Force-pushed from 43f0c7e to b4f76ac
Motivation
Extends LDS transpose load optimization to support INT8 data types for GEMM and Attention kernels on gfx950. This enables hardware-accelerated transposed loads (ds_read_tr8_b64) for all INT8 MFMAs (16x16x32, 16x16x64, 32x32x16, 32x32x32), improving performance for INT8 quantized inference.
Technical Details
Test Plan
Added MLIR unit tests
Added E2E tests
All tests verified on gfx950 hardware with numerical correctness validation
Test Result
Submission Checklist