Add HIP MFMA dense GiMMiK kernel by tomjen12 · Pull Request #20 · PyFR/GiMMiK

tomjen12 · 2026-07-01T06:39:12Z

Summary

Add an initial HIP dense GEMM path using CDNA F64 MFMA instructions.

The kernel densifies and bakes the constant A matrix into MFMA fragment order, stages B through LDS using an m-split/k-blocked path, and is emitted as an additional GiMMiK autotune candidate.

This PR is intended to stack on top of #19.

Results

On seven dense PyFR cases on MI300X, the new MFMA dense path reaches ~78% of rocBLAS geomean. It is added as an additional candidate rather than replacing the existing #19 HIP GiMMiK kernels.

Further tuning will continue in follow-up work, focusing on MFMA utilization, dependency stalls, and register pressure.

Test plan

Ran MI300X dense PyFR matmul benchmarks for the seven cases shown above.
Ran HIP GiMMiK correctness checks against NumPy reference results.

On memory-bound operators the B matrix is read once from HBM and reused only within a work-group via LDS -- it is never re-read across blocks. A normal global load still allocates B's line in L2, which is pure overhead: the line is never reused, it evicts genuinely-reusable data, and it adds cache-allocate/eviction traffic. This is the read-side mirror of the non-temporal C store we already use. NTB loads B with a non-temporal load (load_b -> __builtin_nontemporal path) so B bypasses L2 instead of polluting it. It moves the same number of bytes but keeps the cache clean, raising effective bandwidth. Implemented as a flag on the existing templates rather than new files: - base.mako: add a load_b() wrapper (non-temporal B load). - bstream-msplit{,-preload-c}.mako: gate the B read behind an `ntload` flag (context.get('ntload', False)); renders byte-identically to the plain kernel when the flag is absent. - hip.py: emit `*-ntb` variants by passing ntload=True inside the existing width loop, so NTB combines with width (w1/w2) automatically. Backward-compatible (plain variants unchanged) and CDNA-gated like the other tuned variants. On MI300X (gfx942) NTB passes the accuracy check (~1e-15) and wins the autotune in the majority of memory-bound cases (~+4.5% bandwidth on those), being chosen over the plain bstream-msplit.

Add f64 MFMA dense kernel for CDNA3 (gfx94x) Add a dense double-precision GEMM kernel using the CDNA Matrix Cores (__builtin_amdgcn_mfma_f64_16x16x4f64) as an alternative to rocBLAS on the dense path. A is densified, padded and baked into the kernel in fragment order, B is streamed from global memory, C is non-temporal stored, and the epilogue is fully unrolled. Each 64-lane wavefront sweeps 4 consecutive 16-wide N-tiles (64 cols per block), keeping the cols-per-block = blockx contract so the existing grid logic launches it unchanged. Gated on gfx94x and f64 operands with density >= 0.5 and m,k <= 128. Verified offline (py_compile, emission gating, fragment bake, and a numerical emulation matching A @ B to 1.8e-15); on-device accuracy still to be confirmed.

Reduce the HIP tuned kernel search space from 28 variants to 12 and order the remaining variants to try common winners earlier.

The dense MFMA kernel held all m_tiles*4 v4f64 accumulators live in a single wavefront, which on larger operators meant very high register pressure (~350 VGPR), low occupancy, and poor latency hiding on these bandwidth-bound problems. It also computed every tile of the densified A, including all-zero blocks. Two knobs address both. m-splitting: msplit wavefronts per block, placed in block.y so block.x stays 64 (one wavefront = the cols-per-block grid contract is unchanged). Each wavefront now owns ceil(m_tiles/msplit) m-tiles, so it keeps only ceil(m_tiles/msplit)*4 accumulators live. For msplit > 1 the B tile is staged once into LDS with a synchronous cooperative copy and shared across the block, so B is not re-read per wavefront. Per-warp work is emitted as compile-time blocks (if threadIdx.y == w) so accumulators are scope-local and registers are reused across branches. msplit = 1 keeps the original direct, no-LDS path. Zero-tile skipping: _mfma_dense_bake now also returns amask, marking 16x4 A-tiles that contain a non-zero. All-zero tiles skip their MMA, and on the direct path a fully-zero k-tile also skips its B load. hip.py emits mfma-dense for msplit in {1, 2, 4} (capped at m_tiles), with shared = k_pad*64*dsize for the LDS variants. The accumulator/store path uses the accumulator for every m-tile (inactive ones stay zero), so beta is handled uniformly. Verified offline: the generator emits the expected s1/s2/s4 variants with correct block/shared/grid, and a numerical emulation of both paths reproduces A @ B (with beta) to <= 2.5e-14 for dense, structured-sparse, large-m, and msplit 1/2/4 cases. On-device accuracy and performance still to be benchmarked.

Add mfma-dense-pipe, a variant of the dense MFMA kernel's direct path that overlaps global B loads with Matrix-Core work. B for the next k-tile is loaded into a second register buffer before the current k-tile's MFMAs are issued, with the two buffers alternated per k-tile, so the global-load latency hides behind the MFMA pipeline. This targets the B-from-global, latency-bound case. The maths is identical to the msplit=1 direct path: A densified and baked in fragment order, B streamed, C non-temporal stored, epilogue fully unrolled, and zero-tile skipping via amask. It is emitted as a separate kernel so the prefetch restructuring does not complicate the main mfma-dense template. hip.py emits mfma-dense-pipe once per dense f64 operand on gfx94x, alongside the existing s1/s2/s4 mfma-dense variants. Verified offline: the generator emits the expected variant, and a numerical emulation of the pipelined path reproduces A @ B (with beta) to <= 1.4e-14 across dense, large-m, structured-sparse, single-k-tile, and beta cases. On-device performance still to be benchmarked.

… load The m-split MFMA dense path staged the full padded B tile into LDS with a scalar cooperative copy, reading every k-row even when A uses only a subset. On these bandwidth-bound operators that wastes the resource that actually limits us. Two changes reduce and speed up the B read. bix compaction: only the k-rows A actually uses (sorted(self.bix)) are read from global into LDS. Hole rows and the padded tail are zeroed, because A is zero there and the MMA needs a finite (not NaN) operand -- an all-zero row left uninitialised would make 0*NaN corrupt the whole output row. The n-tail needs no such care: an MFMA output column depends only on its own B column, which is already store-guarded. The zero pass (and its extra __syncthreads) is emitted only when bix does not already cover every k-row read and k is a multiple of 4. Vectorized load: when the layout is 2-aligned (aligne % 2 == 0) the global read is issued as a 16-byte f64x2, with Bs declared __align__(16); otherwise it falls back to a scalar copy, with the odd-n column handled separately. hip.py passes bix_rows and the vec2 flag into the m-split variants; the direct (msplit=1) path is unchanged. Verified offline: a numerical emulation of the new staging reproduces A @ B (with beta) to <= 1.4e-14 for operands with zero k-columns (bix 17/24), fully dense (40/40), single active k-tile, msplit 2/4, and the scalar fallback. On-device performance still to be benchmarked; this only reduces the B-read side, so the gain is expected to be modest while C writes dominate.

The m-split path staged a full padded B tile in LDS, sized to every k-tile. Drop the inactive 4-wide k-slabs: the active k-tile at position a now occupies LDS rows [a*4, a*4+4), and the compute reads index it via compact_pos[kt]. Each tile keeps its four k-rows contiguous so the hardware g = lane/16 -> row mapping stays a plain offset (a full per-row compaction would need a runtime-g-indexed lookup, which risks the local-array-indexed-by-runtime register-allocation hazard). The cooperative load now uses two compile-time tables, bg[] (global k-row) and bl[] (compact LDS row), to scatter the bix rows into their tile slots; hole and padded rows inside an active tile are zeroed. LDS usage drops to n_akt*4*64*dsize, where n_akt is the number of active k-tiles (computed in hip.py for the shared-memory meta). Verified offline: a numerical emulation of the compacted staging reproduces A @ B (with beta) to <= 1.1e-14 for operands with zeroed k-slabs (k_tiles 6 -> 4 active, shared 12288 -> 8192), fully dense (unchanged), single active k-tile, and msplit 2/4. Smaller LDS means more resident blocks / higher occupancy, but only when A actually has all-zero k-slabs; k-dense operators see no change. On-device perf still to be benchmarked.

On large-k operators (e.g. p5 tet m460: m~60, k~168) the m-split MFMA variants never ran: the single-chunk LDS B tile needed k_pad*4*64*8 = 84 KB, over the 64 KB limit, so only the scalar-B direct (s1) path was emitted -- leaving it ~4% behind a width-2 (double2) sparse kernel whose edge is exactly its vectorized B reads. Stage B into LDS in chunks of kc active k-tiles (kc = min(n_akt, 8), LDS fixed at 16 KB) and accumulate across the mako-unrolled chunk loop, so the m-split path fits any k and brings the double2 cooperative B load to large-k operators. To keep the cooperative (whole-block) load, cross-chunk accumulator persistence, and per-wavefront register scoping mutually compatible, the m-split path now uses a uniform m-tile assignment: wavefront owns m-tiles [wmt, wmt+mtpg) with wmt = threadIdx.y*mtpg evaluated at runtime, accumulators acc_j (j < mtpg) at function scope, and the real m-tile used at runtime for the Ag index and store row. The wmt+j < m_tiles guard is emitted only when m_tiles is not a multiple of msplit. This drops the compile-time zero-tile MMA skip on the m-split path (still present on the direct path); the bix B-load compaction is retained, so only the k-rows A uses are read. Verified offline: an emulation of the k-blocked path reproduces A @ B (with beta) to <= 5e-14 for multi-chunk large-k, single-chunk, zeroed-k-slab, non-divisible msplit, and beta cases; large-k operators now emit s1/s2/s4 with a bounded 16 KB shared allocation. On-device performance still to be benchmarked.

Keep the dense MFMA path focused on the m-split implementation, rename the template accordingly, and remove the unused direct pipelined variant.

tomjen12 and others added 26 commits June 18, 2026 00:06

Add tuned HIP GiMMiK preload variants

8e23d63

Fix HIP GiMMiK block size metadata

8f4d03e

Address HIP GiMMiK review comments

96671a6

Handle ROCm feature suffixes for gfx942 tuning

739a82e

Enable tuned HIP variants on gfx90a

0633539

Parameterize HIP vector width and refine preload kernels

7b59fb0

Use blockx launch bounds for HIP cstream preload

e9b921a

Always use non-temporal C accesses for HIP

2aa2577

Use non-temporal B loads by default for HIP

280e948

Make HIP preload-C a template option

c06216d

Avoid HIP vector operator+= overloads

e014e4d

Update mfma-dense.mako

6d237ef

Prune HIP tuned variants to 12

f6bc308

Reduce the HIP tuned kernel search space from 28 variants to 12 and order the remaining variants to try common winners earlier.

Update mfma-dense.mako

6689a9c

Remove HIP variant arch gate

a3aee45

Update hip.py

b521427

Restore MI355 HIP baseline variants

2c7af9b

Merge branch 'hip-gimmik-mfma-dense' into hip-gimmik-mfma-dense-pr

9e289a1

Clean up HIP MFMA dense integration

ffc8aff

Keep the dense MFMA path focused on the m-split implementation, rename the template accordingly, and remove the unused direct pipelined variant.

tomjen12 mentioned this pull request Jul 1, 2026

Allow dense HIP GiMMiK autotune candidates PyFR/PyFR#570

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add HIP MFMA dense GiMMiK kernel#20

Add HIP MFMA dense GiMMiK kernel#20
tomjen12 wants to merge 26 commits into
PyFR:masterfrom
tomjen12:hip-gimmik-mfma-dense-pr

tomjen12 commented Jul 1, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

tomjen12 commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Results

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tomjen12 commented Jul 1, 2026 •

edited

Loading