Skip to content

[Tile][WIP] tile DeviceTransform port#9210

Draft
nanan-nvidia wants to merge 17 commits into
NVIDIA:mainfrom
nanan-nvidia:tile-device-transform
Draft

[Tile][WIP] tile DeviceTransform port#9210
nanan-nvidia wants to merge 17 commits into
NVIDIA:mainfrom
nanan-nvidia:tile-device-transform

Conversation

@nanan-nvidia
Copy link
Copy Markdown

Highly experimental, opening as Draft to share a work-in-progress port of cub::DeviceTransform (#8087, #9038) onto cutile, and to compare side-by-side with the existing CUB benches. Not for merge.

Before the benchmarks, it is important to note that SIMT-Tile interop for TileIR is still work in progress. Thus, right now, the custom function the user passes in must be a __tile__ function, and it must consist of tile operations and must be inlinable.

Current B200 benchmark on pytorch and babel:

pytorch (tile / cub/ delta = tile - cub, BW utilisation %)

op T 2^16 2^20 2^24 2^28 2^31
relu half $\ \ 0.4/\ \ 0.6/{\color{red}-\ \ 0.1}$ $\ \ 6.6/\ \ 6.8/{\color{red}-\ \ 0.2}$ $47.9/47.4/{\color{green}+\ \ 0.5}$ $83.5/83.1/{\color{green}+\ \ 0.4}$ $87.8/80.2/{\color{green}+\ \ 7.6}$
relu bf16 $\ \ 0.5/\ \ 0.5/{\color{red}-\ \ 0.1}$ $\ \ 6.7/\ \ 7.0/{\color{red}-\ \ 0.3}$ $47.5/46.2/{\color{green}+\ \ 1.3}$ $83.3/83.0/{\color{green}+\ \ 0.3}$ $87.7/80.9/{\color{green}+\ \ 6.8}$
relu f32 $\ \ 1.1/\ \ 1.1/{\color{red}-\ \ 0.0}$ $12.9/13.3/{\color{red}-\ \ 0.4}$ $64.7/65.0/{\color{red}-\ \ 0.3}$ $88.4/88.8/{\color{red}-\ \ 0.3}$ $90.8/91.1/{\color{red}-\ \ 0.3}$
sigmoid half $\ \ 0.5/\ \ 0.5/{\color{red}-\ \ 0.1}$ $\ \ 6.9/\ \ 6.6/{\color{green}+\ \ 0.2}$ $34.5/31.5/{\color{green}+\ \ 3.0}$ $48.1/45.5/{\color{green}+\ \ 2.5}$ $49.3/46.8/{\color{green}+\ \ 2.6}$
sigmoid bf16 $\ \ 0.5/\ \ 0.6/{\color{red}-\ \ 0.1}$ $\ \ 6.9/\ \ 6.7/{\color{green}+\ \ 0.3}$ $34.8/32.0/{\color{green}+\ \ 2.8}$ $48.1/45.5/{\color{green}+\ \ 2.6}$ $49.4/46.8/{\color{green}+\ \ 2.7}$
sigmoid f32 $\ \ 1.0/\ \ 1.1/{\color{red}-\ \ 0.1}$ $12.9/13.3/{\color{red}-\ \ 0.4}$ $61.3/56.8/{\color{green}+\ \ 4.5}$ $83.8/76.7/{\color{green}+\ \ 7.1}$ $79.3/73.3/{\color{green}+\ \ 6.0}$
tanh half $\ \ 0.5/\ \ 0.6/{\color{red}-\ \ 0.0}$ $\ \ 7.0/\ \ 6.7/{\color{green}+\ \ 0.3}$ $39.6/35.5/{\color{green}+\ \ 4.1}$ $58.5/51.6/{\color{green}+\ \ 6.9}$ $60.4/53.2/{\color{green}+\ \ 7.2}$
tanh bf16 $\ \ 0.5/\ \ 0.6/{\color{red}-\ \ 0.0}$ $\ \ 7.0/\ \ 6.7/{\color{green}+\ \ 0.2}$ $38.4/35.5/{\color{green}+\ \ 2.9}$ $55.3/51.4/{\color{green}+\ \ 3.9}$ $56.9/53.1/{\color{green}+\ \ 3.8}$
tanh f32 $\ \ 1.0/\ \ 1.1/{\color{red}-\ \ 0.1}$ $13.0/13.3/{\color{red}-\ \ 0.3}$ $64.8/57.5/{\color{green}+\ \ 7.3}$ $88.8/76.7/{\color{green}+12.0}$ $86.7/75.0/{\color{green}+11.6}$
gelu half $\ \ 0.5/\ \ 0.6/{\color{red}-\ \ 0.0}$ $\ \ 6.9/\ \ 6.7/{\color{green}+\ \ 0.2}$ $35.3/30.5/{\color{green}+\ \ 4.8}$ $48.8/43.2/{\color{green}+\ \ 5.5}$ $50.1/44.4/{\color{green}+\ \ 5.7}$
gelu bf16 $\ \ 0.5/\ \ 0.6/{\color{red}-\ \ 0.0}$ $\ \ 6.9/\ \ 6.7/{\color{green}+\ \ 0.2}$ $34.9/30.6/{\color{green}+\ \ 4.4}$ $48.3/43.3/{\color{green}+\ \ 5.1}$ $49.6/44.5/{\color{green}+\ \ 5.1}$
gelu f32 $\ \ 1.1/\ \ 1.1/{\color{red}-\ \ 0.0}$ $13.1/13.3/{\color{red}-\ \ 0.2}$ $64.2/53.5/{\color{green}+10.8}$ $85.2/72.8/{\color{green}+12.4}$ $80.5/69.2/{\color{green}+11.4}$
sin half $\ \ 0.5/\ \ 0.6/{\color{red}-\ \ 0.1}$ $\ \ 6.8/\ \ 6.6/{\color{green}+\ \ 0.2}$ $33.3/30.4/{\color{green}+\ \ 2.8}$ $46.9/41.5/{\color{green}+\ \ 5.3}$ $48.1/42.6/{\color{green}+\ \ 5.5}$
sin bf16 $\ \ 0.5/\ \ 0.5/{\color{red}-\ \ 0.1}$ $\ \ 6.8/\ \ 6.6/{\color{green}+\ \ 0.2}$ $33.2/30.4/{\color{green}+\ \ 2.8}$ $46.8/41.6/{\color{green}+\ \ 5.2}$ $48.1/42.6/{\color{green}+\ \ 5.5}$
sin f32 $\ \ 0.9/\ \ 1.1/{\color{red}-\ \ 0.2}$ $12.9/13.3/{\color{red}-\ \ 0.3}$ $60.4/53.4/{\color{green}+\ \ 7.0}$ $79.8/72.7/{\color{green}+\ \ 7.1}$ $76.2/69.5/{\color{green}+\ \ 6.7}$
exp half $\ \ 0.5/\ \ 0.6/{\color{red}-\ \ 0.0}$ $\ \ 6.8/\ \ 6.7/{\color{green}+\ \ 0.1}$ $46.9/38.8/{\color{green}+\ \ 8.1}$ $74.8/61.0/{\color{green}+13.8}$ $73.4/60.3/{\color{green}+13.1}$
exp bf16 $\ \ 0.5/\ \ 0.6/{\color{red}-\ \ 0.0}$ $\ \ 6.9/\ \ 6.7/{\color{green}+\ \ 0.2}$ $45.5/38.9/{\color{green}+\ \ 6.6}$ $71.2/62.1/{\color{green}+\ \ 9.1}$ $73.6/61.7/{\color{green}+11.9}$
exp f32 $\ \ 1.1/\ \ 1.1/{\color{red}-\ \ 0.0}$ $12.9/13.3/{\color{red}-\ \ 0.4}$ $64.7/65.4/{\color{red}-\ \ 0.7}$ $88.5/87.2/{\color{green}+\ \ 1.2}$ $90.8/82.8/{\color{green}+\ \ 8.1}$
add half $\ \ 0.8/\ \ 0.8/{\color{red}-\ \ 0.0}$ $10.1/10.0/{\color{green}+\ \ 0.1}$ $57.1/53.5/{\color{green}+\ \ 3.6}$ $88.6/86.8/{\color{green}+\ \ 1.8}$ $92.5/84.1/{\color{green}+\ \ 8.4}$
add bf16 $\ \ 0.8/\ \ 0.8/{\color{red}-\ \ 0.0}$ $10.2/10.0/{\color{green}+\ \ 0.2}$ $56.7/53.5/{\color{green}+\ \ 3.2}$ $88.6/86.9/{\color{green}+\ \ 1.8}$ $92.5/85.7/{\color{green}+\ \ 6.8}$
add f32 $\ \ 1.6/\ \ 1.6/{\color{red}-\ \ 0.1}$ $17.8/17.9/{\color{red}-\ \ 0.2}$ $70.1/69.9/{\color{green}+\ \ 0.2}$ $91.2/92.1/{\color{red}-\ \ 0.8}$ $92.8/93.6/{\color{red}-\ \ 0.9}$
sub half $\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$ $10.0/10.0/{\color{red}-\ \ 0.0}$ $57.2/53.4/{\color{green}+\ \ 3.8}$ $88.7/86.8/{\color{green}+\ \ 1.8}$ $92.5/84.0/{\color{green}+\ \ 8.5}$
sub bf16 $\ \ 0.8/\ \ 0.8/{\color{red}-\ \ 0.0}$ $10.0/10.0/{\color{green}+\ \ 0.1}$ $57.4/53.4/{\color{green}+\ \ 4.0}$ $88.6/86.9/{\color{green}+\ \ 1.8}$ $92.4/85.6/{\color{green}+\ \ 6.9}$
sub f32 $\ \ 1.6/\ \ 1.6/{\color{red}-\ \ 0.0}$ $17.7/17.7/{\color{green}+\ \ 0.1}$ $70.7/70.0/{\color{green}+\ \ 0.7}$ $91.2/92.1/{\color{red}-\ \ 0.8}$ $92.7/93.7/{\color{red}-\ \ 1.0}$
mul half $\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$ $10.0/10.0/{\color{green}+\ \ 0.1}$ $57.7/53.5/{\color{green}+\ \ 4.3}$ $88.6/86.8/{\color{green}+\ \ 1.8}$ $92.6/83.9/{\color{green}+\ \ 8.6}$
mul bf16 $\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$ $10.1/10.0/{\color{green}+\ \ 0.1}$ $57.7/53.5/{\color{green}+\ \ 4.3}$ $88.7/86.8/{\color{green}+\ \ 1.8}$ $92.5/85.5/{\color{green}+\ \ 6.9}$
mul f32 $\ \ 1.6/\ \ 1.6/{\color{green}+\ \ 0.0}$ $19.3/18.1/{\color{green}+\ \ 1.1}$ $71.0/70.0/{\color{green}+\ \ 1.0}$ $91.3/92.1/{\color{red}-\ \ 0.8}$ $92.8/93.6/{\color{red}-\ \ 0.8}$
div half $\ \ 0.7/\ \ 0.8/{\color{red}-\ \ 0.1}$ $10.0/10.0/{\color{red}-\ \ 0.0}$ $53.5/49.3/{\color{green}+\ \ 4.3}$ $83.2/73.2/{\color{green}+10.0}$ $79.4/69.9/{\color{green}+\ \ 9.5}$
div bf16 $\ \ 0.7/\ \ 0.8/{\color{red}-\ \ 0.1}$ $\ \ 9.5/10.0/{\color{red}-\ \ 0.5}$ $54.4/49.2/{\color{green}+\ \ 5.2}$ $82.5/69.7/{\color{green}+12.8}$ $79.8/71.6/{\color{green}+\ \ 8.2}$
div f32 $\ \ 1.5/\ \ 1.6/{\color{red}-\ \ 0.1}$ $17.5/17.2/{\color{green}+\ \ 0.3}$ $69.9/67.4/{\color{green}+\ \ 2.5}$ $92.3/85.1/{\color{green}+\ \ 7.2}$ $93.6/81.8/{\color{green}+11.8}$
le half $\ \ 0.8/\ \ 0.8/{\color{red}-\ \ 0.0}$ $10.0/10.0/{\color{green}+\ \ 0.0}$ $57.9/53.4/{\color{green}+\ \ 4.5}$ $89.2/84.7/{\color{green}+\ \ 4.5}$ $92.7/82.5/{\color{green}+10.2}$
le bf16 $\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.1}$ $10.0/\ \ 9.6/{\color{green}+\ \ 0.5}$ $57.9/54.0/{\color{green}+\ \ 3.9}$ $89.3/84.7/{\color{green}+\ \ 4.6}$ $92.7/83.1/{\color{green}+\ \ 9.6}$
le f32 $\ \ 1.6/\ \ 1.6/{\color{green}+\ \ 0.0}$ $18.2/17.6/{\color{green}+\ \ 0.7}$ $70.6/70.3/{\color{green}+\ \ 0.3}$ $91.3/92.1/{\color{red}-\ \ 0.8}$ $92.8/93.9/{\color{red}-\ \ 1.2}$
ge half $\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$ $10.0/10.0/{\color{green}+\ \ 0.0}$ $57.5/53.4/{\color{green}+\ \ 4.1}$ $89.3/84.7/{\color{green}+\ \ 4.6}$ $92.8/82.6/{\color{green}+10.2}$
ge bf16 $\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$ $10.0/\ \ 9.8/{\color{green}+\ \ 0.2}$ $57.5/53.4/{\color{green}+\ \ 4.1}$ $89.3/84.7/{\color{green}+\ \ 4.6}$ $92.7/83.1/{\color{green}+\ \ 9.6}$
ge f32 $\ \ 1.6/\ \ 1.6/{\color{green}+\ \ 0.1}$ $18.4/17.5/{\color{green}+\ \ 0.9}$ $70.2/69.9/{\color{green}+\ \ 0.3}$ $91.3/92.1/{\color{red}-\ \ 0.8}$ $92.7/93.9/{\color{red}-\ \ 1.2}$
fmin half $\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$ $10.0/\ \ 9.7/{\color{green}+\ \ 0.3}$ $57.4/53.5/{\color{green}+\ \ 3.9}$ $89.1/84.0/{\color{green}+\ \ 5.0}$ $92.7/80.2/{\color{green}+12.5}$
fmin bf16 $\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$ $10.0/\ \ 9.7/{\color{green}+\ \ 0.3}$ $57.9/53.4/{\color{green}+\ \ 4.5}$ $89.1/84.0/{\color{green}+\ \ 5.1}$ $92.6/81.7/{\color{green}+10.9}$
fmin f32 $\ \ 1.6/\ \ 1.6/{\color{green}+\ \ 0.0}$ $19.3/18.5/{\color{green}+\ \ 0.8}$ $70.7/70.6/{\color{green}+\ \ 0.1}$ $91.2/92.0/{\color{red}-\ \ 0.8}$ $92.8/93.5/{\color{red}-\ \ 0.7}$
fmax half $\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$ $10.0/10.0/{\color{green}+\ \ 0.0}$ $57.7/53.4/{\color{green}+\ \ 4.3}$ $89.1/83.8/{\color{green}+\ \ 5.3}$ $92.7/80.3/{\color{green}+12.4}$
fmax bf16 $\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$ $10.0/10.0/{\color{green}+\ \ 0.1}$ $58.1/53.5/{\color{green}+\ \ 4.7}$ $89.1/84.0/{\color{green}+\ \ 5.0}$ $92.6/81.7/{\color{green}+10.9}$
fmax f32 $\ \ 1.6/\ \ 1.6/{\color{green}+\ \ 0.0}$ $18.7/18.2/{\color{green}+\ \ 0.5}$ $69.8/68.5/{\color{green}+\ \ 1.3}$ $91.2/92.0/{\color{red}-\ \ 0.8}$ $92.8/93.5/{\color{red}-\ \ 0.7}$

babel

op T 2^16 2^20 2^24 2^28 2^31
mul i8 $\ \ 0.2/\ \ 0.3/{\color{red}-\ \ 0.1}$ $\ \ 3.5/\ \ 3.4/{\color{green}+\ \ 0.1}$ $34.5/30.2/{\color{green}+\ \ 4.3}$ $72.9/58.9/{\color{green}+14.0}$ $79.4/59.8/{\color{green}+19.6}$
mul i16 $\ \ 0.5/\ \ 0.5/{\color{red}-\ \ 0.1}$ $\ \ 6.7/\ \ 6.7/{\color{red}-\ \ 0.0}$ $48.2/47.5/{\color{green}+\ \ 0.7}$ $85.2/85.5/{\color{red}-\ \ 0.2}$ $89.4/82.2/{\color{green}+\ \ 7.2}$
mul f32 $\ \ 0.9/\ \ 1.1/{\color{red}-\ \ 0.2}$ $12.9/13.3/{\color{red}-\ \ 0.3}$ $61.6/65.3/{\color{red}-\ \ 3.7}$ $87.5/89.0/{\color{red}-\ \ 1.5}$ $89.6/91.0/{\color{red}-\ \ 1.5}$
mul f64 $\ \ 1.7/\ \ 2.1/{\color{red}-\ \ 0.4}$ $21.8/21.5/{\color{green}+\ \ 0.3}$ $73.8/76.1/{\color{red}-\ \ 2.2}$ $88.8/90.0/{\color{red}-\ \ 1.3}$ $89.8/91.1/{\color{red}-\ \ 1.3}$
add i8 $\ \ 0.3/\ \ 0.4/{\color{red}-\ \ 0.1}$ $\ \ 5.2/\ \ 5.0/{\color{green}+\ \ 0.1}$ $38.8/39.9/{\color{red}-\ \ 1.1}$ $66.9/71.3/{\color{red}-\ \ 4.4}$ $71.2/70.4/{\color{green}+\ \ 0.8}$
add i16 $\ \ 0.8/\ \ 0.8/{\color{red}-\ \ 0.0}$ $10.0/10.0/{\color{green}+\ \ 0.0}$ $58.4/54.7/{\color{green}+\ \ 3.7}$ $88.7/86.9/{\color{green}+\ \ 1.9}$ $92.5/85.3/{\color{green}+\ \ 7.2}$
add f32 $\ \ 1.6/\ \ 1.6/{\color{red}-\ \ 0.1}$ $17.7/17.5/{\color{green}+\ \ 0.2}$ $70.2/68.8/{\color{green}+\ \ 1.5}$ $91.2/92.1/{\color{red}-\ \ 0.9}$ $92.8/93.7/{\color{red}-\ \ 0.9}$
add f64 $\ \ 3.0/\ \ 3.1/{\color{red}-\ \ 0.1}$ $30.9/31.5/{\color{red}-\ \ 0.6}$ $80.3/79.9/{\color{green}+\ \ 0.4}$ $92.0/93.3/{\color{red}-\ \ 1.3}$ $93.0/94.1/{\color{red}-\ \ 1.1}$
triad i8 $\ \ 0.4/\ \ 0.4/{\color{red}-\ \ 0.0}$ $\ \ 5.1/\ \ 5.0/{\color{green}+\ \ 0.0}$ $36.9/36.9/{\color{green}+\ \ 0.0}$ $64.1/67.4/{\color{red}-\ \ 3.3}$ $66.2/64.6/{\color{green}+\ \ 1.6}$
triad i16 $\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$ $10.0/10.0/{\color{green}+\ \ 0.0}$ $57.9/53.5/{\color{green}+\ \ 4.4}$ $90.0/85.9/{\color{green}+\ \ 4.0}$ $93.0/83.3/{\color{green}+\ \ 9.6}$
triad f32 $\ \ 1.6/\ \ 1.6/{\color{green}+\ \ 0.0}$ $18.9/17.6/{\color{green}+\ \ 1.3}$ $71.0/70.0/{\color{green}+\ \ 1.1}$ $91.3/92.3/{\color{red}-\ \ 1.0}$ $92.9/91.6/{\color{green}+\ \ 1.3}$
triad f64 $\ \ 3.2/\ \ 3.1/{\color{green}+\ \ 0.1}$ $30.5/29.7/{\color{green}+\ \ 0.8}$ $79.9/79.8/{\color{green}+\ \ 0.1}$ $92.1/93.5/{\color{red}-\ \ 1.4}$ $93.0/93.8/{\color{red}-\ \ 0.8}$
nstream i8 $\ \ 0.4/\ \ 0.5/{\color{red}-\ \ 0.1}$ $\ \ 6.6/\ \ 6.7/{\color{red}-\ \ 0.0}$ $37.7/42.6/{\color{red}-\ \ 4.9}$ $56.5/65.8/{\color{red}-\ \ 9.2}$ $57.3/64.0/{\color{red}-\ \ 6.6}$
nstream i16 $\ \ 1.1/\ \ 1.1/{\color{green}+\ \ 0.0}$ $13.3/13.3/{\color{green}+\ \ 0.0}$ $61.9/60.9/{\color{green}+\ \ 1.0}$ $91.7/85.2/{\color{green}+\ \ 6.5}$ $95.0/81.5/{\color{green}+13.6}$
nstream f32 $\ \ 2.1/\ \ 2.1/{\color{green}+\ \ 0.1}$ $22.0/21.4/{\color{green}+\ \ 0.6}$ $74.6/73.4/{\color{green}+\ \ 1.2}$ $93.4/92.7/{\color{green}+\ \ 0.8}$ $95.3/93.4/{\color{green}+\ \ 1.9}$
nstream f64 $\ \ 3.8/\ \ 3.6/{\color{green}+\ \ 0.2}$ $35.5/35.5/{\color{green}+\ \ 0.0}$ $83.3/83.2/{\color{green}+\ \ 0.0}$ $94.5/94.5/{\color{red}-\ \ 0.1}$ $95.4/95.5/{\color{red}-\ \ 0.0}$

copy, grayscale, fill

op T 2^16 2^20 2^24 2^28 2^31
copy i8 $\ \ 0.3/\ \ 0.3/{\color{green}+\ \ 0.0}$ $\ \ 4.0/\ \ 3.4/{\color{green}+\ \ 0.6}$ $35.8/30.8/{\color{green}+\ \ 5.1}$ $82.1/72.0/{\color{green}+10.0}$ $89.8/68.5/{\color{green}+21.2}$
copy i16 $\ \ 0.5/\ \ 0.5/{\color{green}+\ \ 0.0}$ $\ \ 6.8/\ \ 6.8/{\color{green}+\ \ 0.1}$ $51.7/47.4/{\color{green}+\ \ 4.4}$ $86.5/84.8/{\color{green}+\ \ 1.7}$ $90.5/80.6/{\color{green}+\ \ 9.9}$
copy i32 $\ \ 0.9/\ \ 1.1/{\color{red}-\ \ 0.2}$ $13.0/13.1/{\color{red}-\ \ 0.1}$ $61.8/65.3/{\color{red}-\ \ 3.4}$ $87.7/88.7/{\color{red}-\ \ 1.1}$ $89.6/91.0/{\color{red}-\ \ 1.5}$
copy f32 $\ \ 0.9/\ \ 1.1/{\color{red}-\ \ 0.2}$ $13.0/13.2/{\color{red}-\ \ 0.1}$ $62.2/65.3/{\color{red}-\ \ 3.1}$ $87.6/88.8/{\color{red}-\ \ 1.1}$ $88.2/91.2/{\color{red}-\ \ 3.0}$
copy f64 $\ \ 2.1/\ \ 2.1/{\color{red}-\ \ 0.0}$ $23.0/22.2/{\color{green}+\ \ 0.8}$ $75.5/75.7/{\color{red}-\ \ 0.2}$ $89.8/90.1/{\color{red}-\ \ 0.3}$ $90.9/91.2/{\color{red}-\ \ 0.3}$
grayscale f32 $\ \ 1.7/\ \ 2.0/{\color{red}-\ \ 0.3}$ $21.7/21.3/{\color{green}+\ \ 0.4}$ $70.7/71.1/{\color{red}-\ \ 0.4}$ $92.3/92.6/{\color{red}-\ \ 0.4}$ $94.0/93.7/{\color{green}+\ \ 0.3}$
grayscale f64 $\ \ 3.5/\ \ 3.8/{\color{red}-\ \ 0.2}$ $35.1/35.1/{\color{red}-\ \ 0.0}$ $79.7/82.7/{\color{red}-\ \ 3.1}$ $92.9/94.4/{\color{red}-\ \ 1.5}$ $93.7/95.6/{\color{red}-\ \ 1.9}$
fill I8 $\ \ 0.1/\ \ 0.1/{\color{red}-\ \ 0.0}$ $\ \ 2.2/\ \ 2.2/{\color{red}-\ \ 0.0}$ $26.3/26.3/{\color{red}-\ \ 0.0}$ $85.1/85.1/{\color{red}-\ \ 0.0}$ $97.4/97.4/{\color{red}-\ \ 0.0}$
fill I16 $\ \ 0.3/\ \ 0.3/{\color{green}+\ \ 0.0}$ $\ \ 4.4/\ \ 4.4/{\color{green}+\ \ 0.0}$ $42.1/42.6/{\color{red}-\ \ 0.6}$ $91.8/91.9/{\color{red}-\ \ 0.1}$ $98.2/98.2/{\color{red}-\ \ 0.0}$
fill I32 $\ \ 0.6/\ \ 0.6/{\color{red}-\ \ 0.0}$ $\ \ 8.5/\ \ 8.3/{\color{green}+\ \ 0.2}$ $60.1/59.3/{\color{green}+\ \ 0.8}$ $95.1/95.0/{\color{green}+\ \ 0.1}$ $98.8/98.7/{\color{green}+\ \ 0.1}$
fill I64 $\ \ 1.1/\ \ 1.1/{\color{green}+\ \ 0.0}$ $13.9/15.1/{\color{red}-\ \ 1.2}$ $72.8/71.4/{\color{green}+\ \ 1.4}$ $97.4/97.0/{\color{green}+\ \ 0.4}$ $99.1/99.0/{\color{green}+\ \ 0.1}$

We did not do benchmarks on complex, fib and heavy:

  • for complex, cutile does not accept std::complex as a vaild type to form tiles.
  • for fib, with tile semantics, there is no 1-to-1 fair implementation in tile. We can get one by abusing ct::select but it is much slower.
  • for heavy, cutile lowers syntax like T reg[N] to heap allocation.

There was a more detailed write up on fib and heavy here: #9038 (comment)

@nanan-nvidia nanan-nvidia self-assigned this Jun 1, 2026
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 1, 2026
@copy-pr-bot
Copy link
Copy Markdown
Contributor

copy-pr-bot Bot commented Jun 1, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Progress in CCCL Jun 1, 2026
Copy link
Copy Markdown
Contributor

@miscco miscco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great already

Comment on lines +26 to +29
constexpr int ceil_div(int a, int b) { return (a + b - 1) / b; }
constexpr int round_up_pow2(int x) {
int p = 1; while (p < x) p *= 2; return p;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those could just be ::cuda::ceil_div and ::cuda::next_power_of_two

Comment on lines +78 to +79
auto num_items = ct::assume_bounded_below<0>(ct::assume_divisible<16>(num_items_));
auto out = ct::assume_aligned<16>(out_);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: For a fair comparison, the transform_kernel should support unaligned data as well. This can be just a branch dispatching two an aligned and an unaligned code path.

Copy link
Copy Markdown
Author

@nanan-nvidia nanan-nvidia Jun 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed! However, I have observed very bad performance on tile kernels in this case. When the pointer is not aligned, all loads will not be vectorized. CUB does way better with TMA & overcopying.

I am asking compiler people to see if they can add this heuristics, since having this will make a much more informative benchmark

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a very good idea! Please help them improve tile!

Unaligned inputs are less common but do appear. Examples are if you want to run a kernel on the output of a previous one, where the previous one also chose the problem size. Like, you partition an array by a predicate and then only transform the elements of the selected partition, etc. Such cases come up frequently in database or dataframe workloads.

@bernhardmgruber
Copy link
Copy Markdown
Contributor

I am genuinely impressed by the size of the tile kernel! It's really small and expressive. Nice!

The cub_tile::DeviceTransform implementation moves from
cub/benchmarks/bench/transform/tile/device_transform.cuh into
cub/cub/device/dispatch/dispatch_transform_tile.cuh so the CUB header
tree can reference it. The bench .cu files now include from the new
path. The hand-rolled ceil_div and round_up_pow2 helpers are replaced
with cuda::ceil_div and cuda::next_power_of_two from <cuda/cmath>.
The cub_tile namespace and its hand-rolled detail layout move under
cub::detail::transform::tile to match how CUB groups the existing
transform internals. A type alias is kept at cub_tile::DeviceTransform
so the benches and tests still compile during the transition.

The whole file body is also gated by _CCCL_CTK_AT_LEAST(13, 3) so
older toolchains never see the tile DSL types.
@nanan-nvidia
Copy link
Copy Markdown
Author

nanan-nvidia commented Jun 4, 2026

NOTE: Not ready to merge due to regular device transform SIMT kernels will fail to compile with --enable-tile. I temporarily unblocked myself with e0a31e4

This now introduces tile kernels with no call site change for users. To opt in at build time, please compile with --enable-tile and -DCCCL_ENABLE_TILE_TRANSFORM_DISPATCH.

The basic idea is, at compile time, eligible (Op, T, NIn) combos will be dispatched to tile kernels with traits. We can ship traits based on our benchmarks (i.e. if we know tile is better on some (Op, T, NIn)). The user can also self-register the combos they find tile to be beneficial. To self register, they need to provide three pieces of data:

First, a SIMT functor that will be called at the API:

struct my_tanh {
    template <class T>
    __host__ __device__ T operator()(T v) const {
        return static_cast<T>(::cuda::std::tanh(static_cast<float>(v)));
    }
};

Second, a tile functor that has the same semantical meaning:

#if defined(CCCL_ENABLE_TILE_TRANSFORM_DISPATCH) && _CCCL_TILE_COMPILATION()
struct tile_my_tanh {
    template <class T>
    __tile__ auto operator()(T v) const {
        namespace ct = cuda::tiles;
        return ct::element_cast<ct::tile_element_t<T>>(
            ct::tanh(ct::element_cast<float>(v)));
    }
};

And third, the trait specialization that links those two functors:

CUB_NAMESPACE_BEGIN
namespace detail::transform::tile {
    template <class T>
    struct tile_eligible<my_tanh, T, 1> : std::true_type {
        using tile_op_type = tile_my_tanh;
    };
    // Optional: hint the tile policy picker that this is MUFU-heavy.
    template <> struct tile_mufu_heavy<my_tanh> : std::true_type {};
}
CUB_NAMESPACE_END
#endif

The general idea is that as cutile c++ gets more and more mature and performant, we can correspondingly register more and more cases until it covers all the cases.

The dispatch flow is as follows:

                 cub::DeviceTransform::Transform(.., op, ..)
                                │
                                ▼
              if constexpr (tile_dispatch_eligible_v<Op, OutIter, InIters...>):
                                │
              ┌─────────────────┴─────────────────┐
              ▼                                   ▼
       runtime preconditions OK?            standard CUB dispatch
       (16B-aligned pointers, 
        num_items % 16 == 0, 
        num_items ≤ 2^31)
              │
       ┌──────┴──────┐
       ▼             ▼
   tile kernel    standard CUB

Those runtime preconditions exist because current bad cutile performance on those cases.

image

Copy link
Copy Markdown
Contributor

@fbusato fbusato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a first pass for the library implementation. The implementation is already great!!
I pointed out some compatibility and stylistic issues.

# pragma system_header
#endif // no system header

#if _CCCL_CTK_AT_LEAST(13, 3)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to check for C++20 and _CCCL_TILE_COMPILATION

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to create a dedicated macro for it


# include <cuda_tile.h>

# include <cstdint>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# include <cstdint>
# include <cuda/std/cstdint>


template <int TileSize, typename Fn, typename Out, typename... Ins>
__tile_global__ void
transform_kernel(int64_t num_items_, Out* __restrict__ out_, const Ins* __restrict__... ins_)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bit. I would remove the postfix _

__tile_global__ void
transform_kernel(int64_t num_items_, Out* __restrict__ out_, const Ins* __restrict__... ins_)
{
namespace ct = cuda::tiles;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
namespace ct = cuda::tiles;
namespace ct = ::cuda::tiles;

do it work, right?

const auto bx = ct::bid().x;
Fn fn{};

auto num_items = ct::assume_bounded_below<0>(ct::assume_divisible<16>(num_items_));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add a comment why we need assume_divisible and assume_bounded_below for future reference

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto num_items = ct::assume_bounded_below<0>(ct::assume_divisible<16>(num_items_));
const auto num_items = ct::assume_bounded_below<0>(ct::assume_divisible<16>(num_items_));

return 12 * 1024;
}

constexpr int min_size(int a)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use ::cuda::std::min

// registers and the compiler unpacks them and packs them back. reducing the
// compute work per thread helps here. need profiling to know the exact cause.
template <typename Out, typename... Ins>
constexpr int pick_tile_size(bool mufu_heavy = false, int cc_x10 = 1000)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use compute_capability object instead?

constexpr int pick_tile_size(bool mufu_heavy = false, int cc_x10 = 1000)
{
constexpr int threads_per_block = 128;
constexpr int vector_bytes = 16; // LDG.E.128 -> 16 bytes
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blackwell also has LDG.E.256 -> 32 bytes

constexpr int items_for_vec = static_cast<int>(::cuda::ceil_div(vector_bytes, min_elem));

// Fill (zero inputs) keeps the same latency target by counting output bytes.
constexpr int bytes_per_iter = (sizeof...(Ins) > 0) ? (int(sizeof(Ins)) + ... + 0) : int(sizeof(Out));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need for int()

const int items_for_latency =
static_cast<int>(::cuda::ceil_div(target, max_occupancy * threads_per_block * bytes_per_iter));

int items = items_for_vec > items_for_latency ? items_for_vec : items_for_latency;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use max

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

4 participants