Add Fused Multi-Head Attention example #16
AntonOresten wants to merge 10 commits into JuliaGPU:main
Conversation
Seeing some weird erroring when branching (being fixed in #53).

This works:

qk = if !EVEN_K[] && j >= mask_start
    offs_n = ((j - Int32(1)) * TILE_N[]) .+ offs_n_tile
    mask = ct.full((TILE_N[], TILE_M[]), true, Bool)
    mask = mask .& (offs_n .<= k_seqlen)
    mask = ct.where(mask, ct.zeros((TILE_N[], TILE_M[]), Float32), ct.full((TILE_N[], TILE_M[]), -Inf32, Float32))
    qk .+ mask
else
    qk
end

but this doesn't:

if !EVEN_K[] && j >= mask_start
    offs_n = ((j - Int32(1)) * TILE_N[]) .+ offs_n_tile
    mask = ct.full((TILE_N[], TILE_M[]), true, Bool)
    mask = mask .& (offs_n .<= k_seqlen)
    mask = ct.where(mask, ct.zeros((TILE_N[], TILE_M[]), Float32), ct.full((TILE_N[], TILE_M[]), -Inf32, Float32))
    qk = qk .+ mask
end

nor does this:

qk = if !EVEN_K[] && j >= mask_start
    offs_n = ((j - Int32(1)) * TILE_N[]) .+ offs_n_tile
    mask = ct.full((TILE_N[], TILE_M[]), true, Bool)
    if !EVEN_K[]
        mask .& (offs_n .<= k_seqlen)
    end
    mask = ct.where(mask, ct.zeros((TILE_N[], TILE_M[]), Float32), ct.full((TILE_N[], TILE_M[]), -Inf32, Float32))
    qk .+ mask
else
    qk
end

In the second and third block, I get "ERROR: SSAValue %___ not found in context". After removing the second condition, I can suddenly have a nested if block, and I don't need the outer else block:

if !EVEN_K[]
    offs_n = ((j - Int32(1)) * TILE_N[]) .+ offs_n_tile
    mask = ct.full((TILE_N[], TILE_M[]), true, Bool)
    if !EVEN_K[]
        mask = mask .& (offs_n .<= k_seqlen)
    end
    mask = ct.where(mask, ct.zeros((TILE_N[], TILE_M[]), Float32), ct.full((TILE_N[], TILE_M[]), -Inf32, Float32))
    qk = qk .+ mask
end

Does the if block need to depend on compile-time constants? I'd need this to construct the padding and causal masks properly.
That's an IRStructurizer error. Can you provide an MWE?
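A candidate distillation of the failing pattern in plain Julia (hypothetical and untested; assumes the error reproduces outside the full attention kernel):

# The value-producing `if` (first snippet above) compiles, while
# assigning to a variable defined outside the branch (second snippet)
# reportedly errors with "ERROR: SSAValue %___ not found in context".
function branch_mwe(qk::Float32, j::Int32, mask_start::Int32)
    # expression form: every path yields a value; works
    qk = if j >= mask_start
        qk + 1.0f0
    else
        qk
    end
    # statement form: mutates `qk` from the enclosing scope; reportedly fails
    if j >= mask_start
        qk = qk + 1.0f0
    end
    return qk
end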
Currently I need to wrap outside Float32 constants in Float32 within the kernel, because MulF otherwise fails (see the MethodError below):

qk_scale = Float32(qk_scale) * Float32(INV_LOG_2)
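For context, a self-contained sketch of the workaround (the constant is 1/log(2); the helper name is illustrative, not the PR's actual kernel):

const INV_LOG_2 = 1.4426950408889634f0  # 1/log(2), already a Float32 literal

# Re-wrapping both operands in Float32 inside the kernel avoids the
# MulF failure on the outside constant.
scale_logits(qk_scale::Float32) = Float32(qk_scale) * Float32(INV_LOG_2)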
Another concern is whether I should convert to
Can you elaborate?
Yeah, that's a common Julia pain point. It's why we have
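Assuming the elided reference is to integer literal promotion (the reason for the Int32(1) wrapping in the snippets above), the pain point in a nutshell:

j = Int32(3)
typeof(j - 1)         # Int64 on 64-bit systems: the literal 1 is an Int
typeof(j - Int32(1))  # Int32: explicit wrapping keeps the narrow type
# CUDA.jl ships an `i32` literal suffix (`j - 1i32`) for exactly this,
# which may be what is being referenced here.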
I define

ERROR: LoadError: MethodError: no method matching encode_MulFOp!(::cuTile.CodeBuilder, ::cuTile.TypeId, ::cuTile.Value, ::Nothing)
The function `encode_MulFOp!` exists, but no method is defined for this combination of argument types.
Closest candidates are:
  encode_MulFOp!(::cuTile.CodeBuilder, ::cuTile.TypeId, ::cuTile.Value, ::cuTile.Value; rounding_mode, flush_to_zero)
   @ cuTile ~/.julia/dev/cuTile/src/bytecode/encodings.jl:720
Oh, neat. I didn't know. I considered maybe a
In general, Julia's array indexing requires
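Presumably the requirement is Int indices; standard Julia normalizes other integer types on the way in:

A = rand(Float32, 4)
i = Int32(2)
A[i] == A[Int(i)]           # true: indices are converted to Int internally
Base.to_indices(A, (i,))    # (2,), a plain Int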
Constants should work without the type conversion now.
Seems to fall slightly short of my NNop / ONIONop baseline (no WMMA), although I haven't compared it to the Python version. On my GPU, it compiles and runs fastest with tile_n=32 and tile_m=32.

EDIT: this is without tensor cores. Simply switching the compute type to TFloat32 / BFloat16 and exploring the optimization and entry hint landscape makes forward and backward passes ~10x faster.
Notably, cutile-python has a latency argument for ct.load, as well as num_ctas and occupancy arguments for the kernel, which might affect performance. (EDIT: fixed in #32 and #27.) The Python version also does a kernel config autotune by searching a space of hand-picked configurations.

Another thing that might be important for correctness or for covering edge cases: exposing flush_to_zero? It's used in e.g. exp2. Haven't thought about in which cases this matters.
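For reference, a minimal hand-rolled autotune loop in Julia, in the spirit of the Python version's config search (run_attention and the config list are hypothetical placeholders, not this PR's API):

function autotune(q, k, v; configs = [(32, 32), (64, 32), (64, 64), (128, 64)])
    best_time, best_config = Inf, first(configs)
    for (tile_m, tile_n) in configs
        run_attention(q, k, v; tile_m, tile_n)            # first call compiles
        t = @elapsed run_attention(q, k, v; tile_m, tile_n)
        if t < best_time
            best_time, best_config = t, (tile_m, tile_n)
        end
    end
    return best_config, best_time
end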