Add HIP MFMA dense GiMMiK kernel#20
Open
tomjen12 wants to merge 26 commits into
Open
Conversation
On memory-bound operators the B matrix is read once from HBM and reused
only within a work-group via LDS -- it is never re-read across blocks.
A normal global load still allocates B's line in L2, which is pure
overhead: the line is never reused, it evicts genuinely-reusable data,
and it adds cache-allocate/eviction traffic. This is the read-side mirror
of the non-temporal C store we already use.
NTB loads B with a non-temporal load (load_b -> __builtin_nontemporal
path) so B bypasses L2 instead of polluting it. It moves the same number
of bytes but keeps the cache clean, raising effective bandwidth.
Implemented as a flag on the existing templates rather than new files:
- base.mako: add a load_b() wrapper (non-temporal B load).
- bstream-msplit{,-preload-c}.mako: gate the B read behind an `ntload`
flag (context.get('ntload', False)); renders byte-identically to the
plain kernel when the flag is absent.
- hip.py: emit `*-ntb` variants by passing ntload=True inside the
existing width loop, so NTB combines with width (w1/w2) automatically.
Backward-compatible (plain variants unchanged) and CDNA-gated like the
other tuned variants. On MI300X (gfx942) NTB passes the accuracy check
(~1e-15) and wins the autotune in the majority of memory-bound cases
(~+4.5% bandwidth on those), being chosen over the plain bstream-msplit.
Add f64 MFMA dense kernel for CDNA3 (gfx94x) Add a dense double-precision GEMM kernel using the CDNA Matrix Cores (__builtin_amdgcn_mfma_f64_16x16x4f64) as an alternative to rocBLAS on the dense path. A is densified, padded and baked into the kernel in fragment order, B is streamed from global memory, C is non-temporal stored, and the epilogue is fully unrolled. Each 64-lane wavefront sweeps 4 consecutive 16-wide N-tiles (64 cols per block), keeping the cols-per-block = blockx contract so the existing grid logic launches it unchanged. Gated on gfx94x and f64 operands with density >= 0.5 and m,k <= 128. Verified offline (py_compile, emission gating, fragment bake, and a numerical emulation matching A @ B to 1.8e-15); on-device accuracy still to be confirmed.
Reduce the HIP tuned kernel search space from 28 variants to 12 and order the remaining variants to try common winners earlier.
The dense MFMA kernel held all m_tiles*4 v4f64 accumulators live in a
single wavefront, which on larger operators meant very high register
pressure (~350 VGPR), low occupancy, and poor latency hiding on these
bandwidth-bound problems. It also computed every tile of the densified
A, including all-zero blocks. Two knobs address both.
m-splitting: msplit wavefronts per block, placed in block.y so block.x
stays 64 (one wavefront = the cols-per-block grid contract is
unchanged). Each wavefront now owns ceil(m_tiles/msplit) m-tiles, so it
keeps only ceil(m_tiles/msplit)*4 accumulators live. For msplit > 1 the
B tile is staged once into LDS with a synchronous cooperative copy and
shared across the block, so B is not re-read per wavefront. Per-warp
work is emitted as compile-time blocks (if threadIdx.y == w) so
accumulators are scope-local and registers are reused across branches.
msplit = 1 keeps the original direct, no-LDS path.
Zero-tile skipping: _mfma_dense_bake now also returns amask, marking
16x4 A-tiles that contain a non-zero. All-zero tiles skip their MMA,
and on the direct path a fully-zero k-tile also skips its B load.
hip.py emits mfma-dense for msplit in {1, 2, 4} (capped at m_tiles),
with shared = k_pad*64*dsize for the LDS variants. The accumulator/store
path uses the accumulator for every m-tile (inactive ones stay zero), so
beta is handled uniformly.
Verified offline: the generator emits the expected s1/s2/s4 variants
with correct block/shared/grid, and a numerical emulation of both paths
reproduces A @ B (with beta) to <= 2.5e-14 for dense, structured-sparse,
large-m, and msplit 1/2/4 cases. On-device accuracy and performance
still to be benchmarked.
Add mfma-dense-pipe, a variant of the dense MFMA kernel's direct path that overlaps global B loads with Matrix-Core work. B for the next k-tile is loaded into a second register buffer before the current k-tile's MFMAs are issued, with the two buffers alternated per k-tile, so the global-load latency hides behind the MFMA pipeline. This targets the B-from-global, latency-bound case. The maths is identical to the msplit=1 direct path: A densified and baked in fragment order, B streamed, C non-temporal stored, epilogue fully unrolled, and zero-tile skipping via amask. It is emitted as a separate kernel so the prefetch restructuring does not complicate the main mfma-dense template. hip.py emits mfma-dense-pipe once per dense f64 operand on gfx94x, alongside the existing s1/s2/s4 mfma-dense variants. Verified offline: the generator emits the expected variant, and a numerical emulation of the pipelined path reproduces A @ B (with beta) to <= 1.4e-14 across dense, large-m, structured-sparse, single-k-tile, and beta cases. On-device performance still to be benchmarked.
… load The m-split MFMA dense path staged the full padded B tile into LDS with a scalar cooperative copy, reading every k-row even when A uses only a subset. On these bandwidth-bound operators that wastes the resource that actually limits us. Two changes reduce and speed up the B read. bix compaction: only the k-rows A actually uses (sorted(self.bix)) are read from global into LDS. Hole rows and the padded tail are zeroed, because A is zero there and the MMA needs a finite (not NaN) operand -- an all-zero row left uninitialised would make 0*NaN corrupt the whole output row. The n-tail needs no such care: an MFMA output column depends only on its own B column, which is already store-guarded. The zero pass (and its extra __syncthreads) is emitted only when bix does not already cover every k-row read and k is a multiple of 4. Vectorized load: when the layout is 2-aligned (aligne % 2 == 0) the global read is issued as a 16-byte f64x2, with Bs declared __align__(16); otherwise it falls back to a scalar copy, with the odd-n column handled separately. hip.py passes bix_rows and the vec2 flag into the m-split variants; the direct (msplit=1) path is unchanged. Verified offline: a numerical emulation of the new staging reproduces A @ B (with beta) to <= 1.4e-14 for operands with zero k-columns (bix 17/24), fully dense (40/40), single active k-tile, msplit 2/4, and the scalar fallback. On-device performance still to be benchmarked; this only reduces the B-read side, so the gain is expected to be modest while C writes dominate.
The m-split path staged a full padded B tile in LDS, sized to every k-tile. Drop the inactive 4-wide k-slabs: the active k-tile at position a now occupies LDS rows [a*4, a*4+4), and the compute reads index it via compact_pos[kt]. Each tile keeps its four k-rows contiguous so the hardware g = lane/16 -> row mapping stays a plain offset (a full per-row compaction would need a runtime-g-indexed lookup, which risks the local-array-indexed-by-runtime register-allocation hazard). The cooperative load now uses two compile-time tables, bg[] (global k-row) and bl[] (compact LDS row), to scatter the bix rows into their tile slots; hole and padded rows inside an active tile are zeroed. LDS usage drops to n_akt*4*64*dsize, where n_akt is the number of active k-tiles (computed in hip.py for the shared-memory meta). Verified offline: a numerical emulation of the compacted staging reproduces A @ B (with beta) to <= 1.1e-14 for operands with zeroed k-slabs (k_tiles 6 -> 4 active, shared 12288 -> 8192), fully dense (unchanged), single active k-tile, and msplit 2/4. Smaller LDS means more resident blocks / higher occupancy, but only when A actually has all-zero k-slabs; k-dense operators see no change. On-device perf still to be benchmarked.
On large-k operators (e.g. p5 tet m460: m~60, k~168) the m-split MFMA variants never ran: the single-chunk LDS B tile needed k_pad*4*64*8 = 84 KB, over the 64 KB limit, so only the scalar-B direct (s1) path was emitted -- leaving it ~4% behind a width-2 (double2) sparse kernel whose edge is exactly its vectorized B reads. Stage B into LDS in chunks of kc active k-tiles (kc = min(n_akt, 8), LDS fixed at 16 KB) and accumulate across the mako-unrolled chunk loop, so the m-split path fits any k and brings the double2 cooperative B load to large-k operators. To keep the cooperative (whole-block) load, cross-chunk accumulator persistence, and per-wavefront register scoping mutually compatible, the m-split path now uses a uniform m-tile assignment: wavefront owns m-tiles [wmt, wmt+mtpg) with wmt = threadIdx.y*mtpg evaluated at runtime, accumulators acc_j (j < mtpg) at function scope, and the real m-tile used at runtime for the Ag index and store row. The wmt+j < m_tiles guard is emitted only when m_tiles is not a multiple of msplit. This drops the compile-time zero-tile MMA skip on the m-split path (still present on the direct path); the bix B-load compaction is retained, so only the k-rows A uses are read. Verified offline: an emulation of the k-blocked path reproduces A @ B (with beta) to <= 5e-14 for multi-chunk large-k, single-chunk, zeroed-k-slab, non-divisible msplit, and beta cases; large-k operators now emit s1/s2/s4 with a bounded 16 KB shared allocation. On-device performance still to be benchmarked.
Keep the dense MFMA path focused on the m-split implementation, rename the template accordingly, and remove the unused direct pipelined variant.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Add an initial HIP dense GEMM path using CDNA F64 MFMA instructions.
The kernel densifies and bakes the constant A matrix into MFMA fragment order, stages B through LDS using an m-split/k-blocked path, and is emitted as an additional GiMMiK autotune candidate.
This PR is intended to stack on top of #19.
Results
On seven dense PyFR cases on MI300X, the new MFMA dense path reaches ~78% of rocBLAS geomean. It is added as an additional candidate rather than replacing the existing #19 HIP GiMMiK kernels.
Further tuning will continue in follow-up work, focusing on MFMA utilization, dependency stalls, and register pressure.
Test plan