[Tile][WIP] tile DeviceTransform port by nanan-nvidia · Pull Request #9210 · NVIDIA/cccl

nanan-nvidia · 2026-06-01T23:06:39Z

Highly experimental, opening as Draft to share a work-in-progress port of cub::DeviceTransform (#8087, #9038) onto cutile, and to compare side-by-side with the existing CUB benches. Not for merge.

Before the benchmarks, it is important to note that SIMT-Tile interop for TileIR is still work in progress. Thus, right now, the custom function the user passes in must be a __tile__ function, and it must consist of tile operations and must be inlinable.

Current B200 benchmark on pytorch and babel:

pytorch (tile / cub/ delta = tile - cub, BW utilisation %)

op	T	2^16	2^20	2^24	2^28	2^31
relu	half	$\ \ 0.4/\ \ 0.6/{\color{red}-\ \ 0.1}$	$\ \ 6.6/\ \ 6.8/{\color{red}-\ \ 0.2}$	$47.9/47.4/{\color{green}+\ \ 0.5}$	$83.5/83.1/{\color{green}+\ \ 0.4}$	$87.8/80.2/{\color{green}+\ \ 7.6}$
relu	bf16	$\ \ 0.5/\ \ 0.5/{\color{red}-\ \ 0.1}$	$\ \ 6.7/\ \ 7.0/{\color{red}-\ \ 0.3}$	$47.5/46.2/{\color{green}+\ \ 1.3}$	$83.3/83.0/{\color{green}+\ \ 0.3}$	$87.7/80.9/{\color{green}+\ \ 6.8}$
relu	f32	$\ \ 1.1/\ \ 1.1/{\color{red}-\ \ 0.0}$	$12.9/13.3/{\color{red}-\ \ 0.4}$	$64.7/65.0/{\color{red}-\ \ 0.3}$	$88.4/88.8/{\color{red}-\ \ 0.3}$	$90.8/91.1/{\color{red}-\ \ 0.3}$
sigmoid	half	$\ \ 0.5/\ \ 0.5/{\color{red}-\ \ 0.1}$	$\ \ 6.9/\ \ 6.6/{\color{green}+\ \ 0.2}$	$34.5/31.5/{\color{green}+\ \ 3.0}$	$48.1/45.5/{\color{green}+\ \ 2.5}$	$49.3/46.8/{\color{green}+\ \ 2.6}$
sigmoid	bf16	$\ \ 0.5/\ \ 0.6/{\color{red}-\ \ 0.1}$	$\ \ 6.9/\ \ 6.7/{\color{green}+\ \ 0.3}$	$34.8/32.0/{\color{green}+\ \ 2.8}$	$48.1/45.5/{\color{green}+\ \ 2.6}$	$49.4/46.8/{\color{green}+\ \ 2.7}$
sigmoid	f32	$\ \ 1.0/\ \ 1.1/{\color{red}-\ \ 0.1}$	$12.9/13.3/{\color{red}-\ \ 0.4}$	$61.3/56.8/{\color{green}+\ \ 4.5}$	$83.8/76.7/{\color{green}+\ \ 7.1}$	$79.3/73.3/{\color{green}+\ \ 6.0}$
tanh	half	$\ \ 0.5/\ \ 0.6/{\color{red}-\ \ 0.0}$	$\ \ 7.0/\ \ 6.7/{\color{green}+\ \ 0.3}$	$39.6/35.5/{\color{green}+\ \ 4.1}$	$58.5/51.6/{\color{green}+\ \ 6.9}$	$60.4/53.2/{\color{green}+\ \ 7.2}$
tanh	bf16	$\ \ 0.5/\ \ 0.6/{\color{red}-\ \ 0.0}$	$\ \ 7.0/\ \ 6.7/{\color{green}+\ \ 0.2}$	$38.4/35.5/{\color{green}+\ \ 2.9}$	$55.3/51.4/{\color{green}+\ \ 3.9}$	$56.9/53.1/{\color{green}+\ \ 3.8}$
tanh	f32	$\ \ 1.0/\ \ 1.1/{\color{red}-\ \ 0.1}$	$13.0/13.3/{\color{red}-\ \ 0.3}$	$64.8/57.5/{\color{green}+\ \ 7.3}$	$88.8/76.7/{\color{green}+12.0}$	$86.7/75.0/{\color{green}+11.6}$
gelu	half	$\ \ 0.5/\ \ 0.6/{\color{red}-\ \ 0.0}$	$\ \ 6.9/\ \ 6.7/{\color{green}+\ \ 0.2}$	$35.3/30.5/{\color{green}+\ \ 4.8}$	$48.8/43.2/{\color{green}+\ \ 5.5}$	$50.1/44.4/{\color{green}+\ \ 5.7}$
gelu	bf16	$\ \ 0.5/\ \ 0.6/{\color{red}-\ \ 0.0}$	$\ \ 6.9/\ \ 6.7/{\color{green}+\ \ 0.2}$	$34.9/30.6/{\color{green}+\ \ 4.4}$	$48.3/43.3/{\color{green}+\ \ 5.1}$	$49.6/44.5/{\color{green}+\ \ 5.1}$
gelu	f32	$\ \ 1.1/\ \ 1.1/{\color{red}-\ \ 0.0}$	$13.1/13.3/{\color{red}-\ \ 0.2}$	$64.2/53.5/{\color{green}+10.8}$	$85.2/72.8/{\color{green}+12.4}$	$80.5/69.2/{\color{green}+11.4}$
sin	half	$\ \ 0.5/\ \ 0.6/{\color{red}-\ \ 0.1}$	$\ \ 6.8/\ \ 6.6/{\color{green}+\ \ 0.2}$	$33.3/30.4/{\color{green}+\ \ 2.8}$	$46.9/41.5/{\color{green}+\ \ 5.3}$	$48.1/42.6/{\color{green}+\ \ 5.5}$
sin	bf16	$\ \ 0.5/\ \ 0.5/{\color{red}-\ \ 0.1}$	$\ \ 6.8/\ \ 6.6/{\color{green}+\ \ 0.2}$	$33.2/30.4/{\color{green}+\ \ 2.8}$	$46.8/41.6/{\color{green}+\ \ 5.2}$	$48.1/42.6/{\color{green}+\ \ 5.5}$
sin	f32	$\ \ 0.9/\ \ 1.1/{\color{red}-\ \ 0.2}$	$12.9/13.3/{\color{red}-\ \ 0.3}$	$60.4/53.4/{\color{green}+\ \ 7.0}$	$79.8/72.7/{\color{green}+\ \ 7.1}$	$76.2/69.5/{\color{green}+\ \ 6.7}$
exp	half	$\ \ 0.5/\ \ 0.6/{\color{red}-\ \ 0.0}$	$\ \ 6.8/\ \ 6.7/{\color{green}+\ \ 0.1}$	$46.9/38.8/{\color{green}+\ \ 8.1}$	$74.8/61.0/{\color{green}+13.8}$	$73.4/60.3/{\color{green}+13.1}$
exp	bf16	$\ \ 0.5/\ \ 0.6/{\color{red}-\ \ 0.0}$	$\ \ 6.9/\ \ 6.7/{\color{green}+\ \ 0.2}$	$45.5/38.9/{\color{green}+\ \ 6.6}$	$71.2/62.1/{\color{green}+\ \ 9.1}$	$73.6/61.7/{\color{green}+11.9}$
exp	f32	$\ \ 1.1/\ \ 1.1/{\color{red}-\ \ 0.0}$	$12.9/13.3/{\color{red}-\ \ 0.4}$	$64.7/65.4/{\color{red}-\ \ 0.7}$	$88.5/87.2/{\color{green}+\ \ 1.2}$	$90.8/82.8/{\color{green}+\ \ 8.1}$
add	half	$\ \ 0.8/\ \ 0.8/{\color{red}-\ \ 0.0}$	$10.1/10.0/{\color{green}+\ \ 0.1}$	$57.1/53.5/{\color{green}+\ \ 3.6}$	$88.6/86.8/{\color{green}+\ \ 1.8}$	$92.5/84.1/{\color{green}+\ \ 8.4}$
add	bf16	$\ \ 0.8/\ \ 0.8/{\color{red}-\ \ 0.0}$	$10.2/10.0/{\color{green}+\ \ 0.2}$	$56.7/53.5/{\color{green}+\ \ 3.2}$	$88.6/86.9/{\color{green}+\ \ 1.8}$	$92.5/85.7/{\color{green}+\ \ 6.8}$
add	f32	$\ \ 1.6/\ \ 1.6/{\color{red}-\ \ 0.1}$	$17.8/17.9/{\color{red}-\ \ 0.2}$	$70.1/69.9/{\color{green}+\ \ 0.2}$	$91.2/92.1/{\color{red}-\ \ 0.8}$	$92.8/93.6/{\color{red}-\ \ 0.9}$
sub	half	$\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$	$10.0/10.0/{\color{red}-\ \ 0.0}$	$57.2/53.4/{\color{green}+\ \ 3.8}$	$88.7/86.8/{\color{green}+\ \ 1.8}$	$92.5/84.0/{\color{green}+\ \ 8.5}$
sub	bf16	$\ \ 0.8/\ \ 0.8/{\color{red}-\ \ 0.0}$	$10.0/10.0/{\color{green}+\ \ 0.1}$	$57.4/53.4/{\color{green}+\ \ 4.0}$	$88.6/86.9/{\color{green}+\ \ 1.8}$	$92.4/85.6/{\color{green}+\ \ 6.9}$
sub	f32	$\ \ 1.6/\ \ 1.6/{\color{red}-\ \ 0.0}$	$17.7/17.7/{\color{green}+\ \ 0.1}$	$70.7/70.0/{\color{green}+\ \ 0.7}$	$91.2/92.1/{\color{red}-\ \ 0.8}$	$92.7/93.7/{\color{red}-\ \ 1.0}$
mul	half	$\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$	$10.0/10.0/{\color{green}+\ \ 0.1}$	$57.7/53.5/{\color{green}+\ \ 4.3}$	$88.6/86.8/{\color{green}+\ \ 1.8}$	$92.6/83.9/{\color{green}+\ \ 8.6}$
mul	bf16	$\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$	$10.1/10.0/{\color{green}+\ \ 0.1}$	$57.7/53.5/{\color{green}+\ \ 4.3}$	$88.7/86.8/{\color{green}+\ \ 1.8}$	$92.5/85.5/{\color{green}+\ \ 6.9}$
mul	f32	$\ \ 1.6/\ \ 1.6/{\color{green}+\ \ 0.0}$	$19.3/18.1/{\color{green}+\ \ 1.1}$	$71.0/70.0/{\color{green}+\ \ 1.0}$	$91.3/92.1/{\color{red}-\ \ 0.8}$	$92.8/93.6/{\color{red}-\ \ 0.8}$
div	half	$\ \ 0.7/\ \ 0.8/{\color{red}-\ \ 0.1}$	$10.0/10.0/{\color{red}-\ \ 0.0}$	$53.5/49.3/{\color{green}+\ \ 4.3}$	$83.2/73.2/{\color{green}+10.0}$	$79.4/69.9/{\color{green}+\ \ 9.5}$
div	bf16	$\ \ 0.7/\ \ 0.8/{\color{red}-\ \ 0.1}$	$\ \ 9.5/10.0/{\color{red}-\ \ 0.5}$	$54.4/49.2/{\color{green}+\ \ 5.2}$	$82.5/69.7/{\color{green}+12.8}$	$79.8/71.6/{\color{green}+\ \ 8.2}$
div	f32	$\ \ 1.5/\ \ 1.6/{\color{red}-\ \ 0.1}$	$17.5/17.2/{\color{green}+\ \ 0.3}$	$69.9/67.4/{\color{green}+\ \ 2.5}$	$92.3/85.1/{\color{green}+\ \ 7.2}$	$93.6/81.8/{\color{green}+11.8}$
le	half	$\ \ 0.8/\ \ 0.8/{\color{red}-\ \ 0.0}$	$10.0/10.0/{\color{green}+\ \ 0.0}$	$57.9/53.4/{\color{green}+\ \ 4.5}$	$89.2/84.7/{\color{green}+\ \ 4.5}$	$92.7/82.5/{\color{green}+10.2}$
le	bf16	$\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.1}$	$10.0/\ \ 9.6/{\color{green}+\ \ 0.5}$	$57.9/54.0/{\color{green}+\ \ 3.9}$	$89.3/84.7/{\color{green}+\ \ 4.6}$	$92.7/83.1/{\color{green}+\ \ 9.6}$
le	f32	$\ \ 1.6/\ \ 1.6/{\color{green}+\ \ 0.0}$	$18.2/17.6/{\color{green}+\ \ 0.7}$	$70.6/70.3/{\color{green}+\ \ 0.3}$	$91.3/92.1/{\color{red}-\ \ 0.8}$	$92.8/93.9/{\color{red}-\ \ 1.2}$
ge	half	$\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$	$10.0/10.0/{\color{green}+\ \ 0.0}$	$57.5/53.4/{\color{green}+\ \ 4.1}$	$89.3/84.7/{\color{green}+\ \ 4.6}$	$92.8/82.6/{\color{green}+10.2}$
ge	bf16	$\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$	$10.0/\ \ 9.8/{\color{green}+\ \ 0.2}$	$57.5/53.4/{\color{green}+\ \ 4.1}$	$89.3/84.7/{\color{green}+\ \ 4.6}$	$92.7/83.1/{\color{green}+\ \ 9.6}$
ge	f32	$\ \ 1.6/\ \ 1.6/{\color{green}+\ \ 0.1}$	$18.4/17.5/{\color{green}+\ \ 0.9}$	$70.2/69.9/{\color{green}+\ \ 0.3}$	$91.3/92.1/{\color{red}-\ \ 0.8}$	$92.7/93.9/{\color{red}-\ \ 1.2}$
fmin	half	$\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$	$10.0/\ \ 9.7/{\color{green}+\ \ 0.3}$	$57.4/53.5/{\color{green}+\ \ 3.9}$	$89.1/84.0/{\color{green}+\ \ 5.0}$	$92.7/80.2/{\color{green}+12.5}$
fmin	bf16	$\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$	$10.0/\ \ 9.7/{\color{green}+\ \ 0.3}$	$57.9/53.4/{\color{green}+\ \ 4.5}$	$89.1/84.0/{\color{green}+\ \ 5.1}$	$92.6/81.7/{\color{green}+10.9}$
fmin	f32	$\ \ 1.6/\ \ 1.6/{\color{green}+\ \ 0.0}$	$19.3/18.5/{\color{green}+\ \ 0.8}$	$70.7/70.6/{\color{green}+\ \ 0.1}$	$91.2/92.0/{\color{red}-\ \ 0.8}$	$92.8/93.5/{\color{red}-\ \ 0.7}$
fmax	half	$\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$	$10.0/10.0/{\color{green}+\ \ 0.0}$	$57.7/53.4/{\color{green}+\ \ 4.3}$	$89.1/83.8/{\color{green}+\ \ 5.3}$	$92.7/80.3/{\color{green}+12.4}$
fmax	bf16	$\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$	$10.0/10.0/{\color{green}+\ \ 0.1}$	$58.1/53.5/{\color{green}+\ \ 4.7}$	$89.1/84.0/{\color{green}+\ \ 5.0}$	$92.6/81.7/{\color{green}+10.9}$
fmax	f32	$\ \ 1.6/\ \ 1.6/{\color{green}+\ \ 0.0}$	$18.7/18.2/{\color{green}+\ \ 0.5}$	$69.8/68.5/{\color{green}+\ \ 1.3}$	$91.2/92.0/{\color{red}-\ \ 0.8}$	$92.8/93.5/{\color{red}-\ \ 0.7}$

babel

op	T	2^16	2^20	2^24	2^28	2^31
mul	i8	$\ \ 0.2/\ \ 0.3/{\color{red}-\ \ 0.1}$	$\ \ 3.5/\ \ 3.4/{\color{green}+\ \ 0.1}$	$34.5/30.2/{\color{green}+\ \ 4.3}$	$72.9/58.9/{\color{green}+14.0}$	$79.4/59.8/{\color{green}+19.6}$
mul	i16	$\ \ 0.5/\ \ 0.5/{\color{red}-\ \ 0.1}$	$\ \ 6.7/\ \ 6.7/{\color{red}-\ \ 0.0}$	$48.2/47.5/{\color{green}+\ \ 0.7}$	$85.2/85.5/{\color{red}-\ \ 0.2}$	$89.4/82.2/{\color{green}+\ \ 7.2}$
mul	f32	$\ \ 0.9/\ \ 1.1/{\color{red}-\ \ 0.2}$	$12.9/13.3/{\color{red}-\ \ 0.3}$	$61.6/65.3/{\color{red}-\ \ 3.7}$	$87.5/89.0/{\color{red}-\ \ 1.5}$	$89.6/91.0/{\color{red}-\ \ 1.5}$
mul	f64	$\ \ 1.7/\ \ 2.1/{\color{red}-\ \ 0.4}$	$21.8/21.5/{\color{green}+\ \ 0.3}$	$73.8/76.1/{\color{red}-\ \ 2.2}$	$88.8/90.0/{\color{red}-\ \ 1.3}$	$89.8/91.1/{\color{red}-\ \ 1.3}$
add	i8	$\ \ 0.3/\ \ 0.4/{\color{red}-\ \ 0.1}$	$\ \ 5.2/\ \ 5.0/{\color{green}+\ \ 0.1}$	$38.8/39.9/{\color{red}-\ \ 1.1}$	$66.9/71.3/{\color{red}-\ \ 4.4}$	$71.2/70.4/{\color{green}+\ \ 0.8}$
add	i16	$\ \ 0.8/\ \ 0.8/{\color{red}-\ \ 0.0}$	$10.0/10.0/{\color{green}+\ \ 0.0}$	$58.4/54.7/{\color{green}+\ \ 3.7}$	$88.7/86.9/{\color{green}+\ \ 1.9}$	$92.5/85.3/{\color{green}+\ \ 7.2}$
add	f32	$\ \ 1.6/\ \ 1.6/{\color{red}-\ \ 0.1}$	$17.7/17.5/{\color{green}+\ \ 0.2}$	$70.2/68.8/{\color{green}+\ \ 1.5}$	$91.2/92.1/{\color{red}-\ \ 0.9}$	$92.8/93.7/{\color{red}-\ \ 0.9}$
add	f64	$\ \ 3.0/\ \ 3.1/{\color{red}-\ \ 0.1}$	$30.9/31.5/{\color{red}-\ \ 0.6}$	$80.3/79.9/{\color{green}+\ \ 0.4}$	$92.0/93.3/{\color{red}-\ \ 1.3}$	$93.0/94.1/{\color{red}-\ \ 1.1}$
triad	i8	$\ \ 0.4/\ \ 0.4/{\color{red}-\ \ 0.0}$	$\ \ 5.1/\ \ 5.0/{\color{green}+\ \ 0.0}$	$36.9/36.9/{\color{green}+\ \ 0.0}$	$64.1/67.4/{\color{red}-\ \ 3.3}$	$66.2/64.6/{\color{green}+\ \ 1.6}$
triad	i16	$\ \ 0.8/\ \ 0.8/{\color{green}+\ \ 0.0}$	$10.0/10.0/{\color{green}+\ \ 0.0}$	$57.9/53.5/{\color{green}+\ \ 4.4}$	$90.0/85.9/{\color{green}+\ \ 4.0}$	$93.0/83.3/{\color{green}+\ \ 9.6}$
triad	f32	$\ \ 1.6/\ \ 1.6/{\color{green}+\ \ 0.0}$	$18.9/17.6/{\color{green}+\ \ 1.3}$	$71.0/70.0/{\color{green}+\ \ 1.1}$	$91.3/92.3/{\color{red}-\ \ 1.0}$	$92.9/91.6/{\color{green}+\ \ 1.3}$
triad	f64	$\ \ 3.2/\ \ 3.1/{\color{green}+\ \ 0.1}$	$30.5/29.7/{\color{green}+\ \ 0.8}$	$79.9/79.8/{\color{green}+\ \ 0.1}$	$92.1/93.5/{\color{red}-\ \ 1.4}$	$93.0/93.8/{\color{red}-\ \ 0.8}$
nstream	i8	$\ \ 0.4/\ \ 0.5/{\color{red}-\ \ 0.1}$	$\ \ 6.6/\ \ 6.7/{\color{red}-\ \ 0.0}$	$37.7/42.6/{\color{red}-\ \ 4.9}$	$56.5/65.8/{\color{red}-\ \ 9.2}$	$57.3/64.0/{\color{red}-\ \ 6.6}$
nstream	i16	$\ \ 1.1/\ \ 1.1/{\color{green}+\ \ 0.0}$	$13.3/13.3/{\color{green}+\ \ 0.0}$	$61.9/60.9/{\color{green}+\ \ 1.0}$	$91.7/85.2/{\color{green}+\ \ 6.5}$	$95.0/81.5/{\color{green}+13.6}$
nstream	f32	$\ \ 2.1/\ \ 2.1/{\color{green}+\ \ 0.1}$	$22.0/21.4/{\color{green}+\ \ 0.6}$	$74.6/73.4/{\color{green}+\ \ 1.2}$	$93.4/92.7/{\color{green}+\ \ 0.8}$	$95.3/93.4/{\color{green}+\ \ 1.9}$
nstream	f64	$\ \ 3.8/\ \ 3.6/{\color{green}+\ \ 0.2}$	$35.5/35.5/{\color{green}+\ \ 0.0}$	$83.3/83.2/{\color{green}+\ \ 0.0}$	$94.5/94.5/{\color{red}-\ \ 0.1}$	$95.4/95.5/{\color{red}-\ \ 0.0}$

copy, grayscale, fill

op	T	2^16	2^20	2^24	2^28	2^31
copy	i8	$\ \ 0.3/\ \ 0.3/{\color{green}+\ \ 0.0}$	$\ \ 4.0/\ \ 3.4/{\color{green}+\ \ 0.6}$	$35.8/30.8/{\color{green}+\ \ 5.1}$	$82.1/72.0/{\color{green}+10.0}$	$89.8/68.5/{\color{green}+21.2}$
copy	i16	$\ \ 0.5/\ \ 0.5/{\color{green}+\ \ 0.0}$	$\ \ 6.8/\ \ 6.8/{\color{green}+\ \ 0.1}$	$51.7/47.4/{\color{green}+\ \ 4.4}$	$86.5/84.8/{\color{green}+\ \ 1.7}$	$90.5/80.6/{\color{green}+\ \ 9.9}$
copy	i32	$\ \ 0.9/\ \ 1.1/{\color{red}-\ \ 0.2}$	$13.0/13.1/{\color{red}-\ \ 0.1}$	$61.8/65.3/{\color{red}-\ \ 3.4}$	$87.7/88.7/{\color{red}-\ \ 1.1}$	$89.6/91.0/{\color{red}-\ \ 1.5}$
copy	f32	$\ \ 0.9/\ \ 1.1/{\color{red}-\ \ 0.2}$	$13.0/13.2/{\color{red}-\ \ 0.1}$	$62.2/65.3/{\color{red}-\ \ 3.1}$	$87.6/88.8/{\color{red}-\ \ 1.1}$	$88.2/91.2/{\color{red}-\ \ 3.0}$
copy	f64	$\ \ 2.1/\ \ 2.1/{\color{red}-\ \ 0.0}$	$23.0/22.2/{\color{green}+\ \ 0.8}$	$75.5/75.7/{\color{red}-\ \ 0.2}$	$89.8/90.1/{\color{red}-\ \ 0.3}$	$90.9/91.2/{\color{red}-\ \ 0.3}$
grayscale	f32	$\ \ 1.7/\ \ 2.0/{\color{red}-\ \ 0.3}$	$21.7/21.3/{\color{green}+\ \ 0.4}$	$70.7/71.1/{\color{red}-\ \ 0.4}$	$92.3/92.6/{\color{red}-\ \ 0.4}$	$94.0/93.7/{\color{green}+\ \ 0.3}$
grayscale	f64	$\ \ 3.5/\ \ 3.8/{\color{red}-\ \ 0.2}$	$35.1/35.1/{\color{red}-\ \ 0.0}$	$79.7/82.7/{\color{red}-\ \ 3.1}$	$92.9/94.4/{\color{red}-\ \ 1.5}$	$93.7/95.6/{\color{red}-\ \ 1.9}$
fill	I8	$\ \ 0.1/\ \ 0.1/{\color{red}-\ \ 0.0}$	$\ \ 2.2/\ \ 2.2/{\color{red}-\ \ 0.0}$	$26.3/26.3/{\color{red}-\ \ 0.0}$	$85.1/85.1/{\color{red}-\ \ 0.0}$	$97.4/97.4/{\color{red}-\ \ 0.0}$
fill	I16	$\ \ 0.3/\ \ 0.3/{\color{green}+\ \ 0.0}$	$\ \ 4.4/\ \ 4.4/{\color{green}+\ \ 0.0}$	$42.1/42.6/{\color{red}-\ \ 0.6}$	$91.8/91.9/{\color{red}-\ \ 0.1}$	$98.2/98.2/{\color{red}-\ \ 0.0}$
fill	I32	$\ \ 0.6/\ \ 0.6/{\color{red}-\ \ 0.0}$	$\ \ 8.5/\ \ 8.3/{\color{green}+\ \ 0.2}$	$60.1/59.3/{\color{green}+\ \ 0.8}$	$95.1/95.0/{\color{green}+\ \ 0.1}$	$98.8/98.7/{\color{green}+\ \ 0.1}$
fill	I64	$\ \ 1.1/\ \ 1.1/{\color{green}+\ \ 0.0}$	$13.9/15.1/{\color{red}-\ \ 1.2}$	$72.8/71.4/{\color{green}+\ \ 1.4}$	$97.4/97.0/{\color{green}+\ \ 0.4}$	$99.1/99.0/{\color{green}+\ \ 0.1}$

We did not do benchmarks on complex, fib and heavy:

for complex, cutile does not accept std::complex as a vaild type to form tiles.
for fib, with tile semantics, there is no 1-to-1 fair implementation in tile. We can get one by abusing ct::select but it is much slower.
for heavy, cutile lowers syntax like T reg[N] to heap allocation.

There was a more detailed write up on fib and heavy here: #9038 (comment)

copy-pr-bot · 2026-06-01T23:06:43Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

miscco

This looks great already

miscco · 2026-06-02T07:18:28Z

+constexpr int ceil_div(int a, int b) { return (a + b - 1) / b; }
+constexpr int round_up_pow2(int x) {
+    int p = 1; while (p < x) p *= 2; return p;
+}


Those could just be ::cuda::ceil_div and ::cuda::next_power_of_two

bernhardmgruber · 2026-06-02T07:26:29Z

+    auto num_items = ct::assume_bounded_below<0>(ct::assume_divisible<16>(num_items_));
+    auto out       = ct::assume_aligned<16>(out_);


Suggestion: For a fair comparison, the transform_kernel should support unaligned data as well. This can be just a branch dispatching two an aligned and an unaligned code path.

Indeed! However, I have observed very bad performance on tile kernels in this case. When the pointer is not aligned, all loads will not be vectorized. CUB does way better with TMA & overcopying.

I am asking compiler people to see if they can add this heuristics, since having this will make a much more informative benchmark

That is a very good idea! Please help them improve tile!

Unaligned inputs are less common but do appear. Examples are if you want to run a kernel on the output of a previous one, where the previous one also chose the problem size. Like, you partition an array by a predicate and then only transform the elements of the selected partition, etc. Such cases come up frequently in database or dataframe workloads.

bernhardmgruber · 2026-06-02T07:29:02Z

I am genuinely impressed by the size of the tile kernel! It's really small and expressive. Nice!

The cub_tile::DeviceTransform implementation moves from cub/benchmarks/bench/transform/tile/device_transform.cuh into cub/cub/device/dispatch/dispatch_transform_tile.cuh so the CUB header tree can reference it. The bench .cu files now include from the new path. The hand-rolled ceil_div and round_up_pow2 helpers are replaced with cuda::ceil_div and cuda::next_power_of_two from <cuda/cmath>.

The cub_tile namespace and its hand-rolled detail layout move under cub::detail::transform::tile to match how CUB groups the existing transform internals. A type alias is kept at cub_tile::DeviceTransform so the benches and tests still compile during the transition. The whole file body is also gated by _CCCL_CTK_AT_LEAST(13, 3) so older toolchains never see the tile DSL types.

nanan-nvidia · 2026-06-04T01:30:05Z

NOTE: Not ready to merge due to regular device transform SIMT kernels will fail to compile with --enable-tile. I temporarily unblocked myself with e0a31e4

This now introduces tile kernels with no call site change for users. To opt in at build time, please compile with --enable-tile and -DCCCL_ENABLE_TILE_TRANSFORM_DISPATCH.

The basic idea is, at compile time, eligible (Op, T, NIn) combos will be dispatched to tile kernels with traits. We can ship traits based on our benchmarks (i.e. if we know tile is better on some (Op, T, NIn)). The user can also self-register the combos they find tile to be beneficial. To self register, they need to provide three pieces of data:

First, a SIMT functor that will be called at the API:

struct my_tanh {
    template <class T>
    __host__ __device__ T operator()(T v) const {
        return static_cast<T>(::cuda::std::tanh(static_cast<float>(v)));
    }
};

Second, a tile functor that has the same semantical meaning:

#if defined(CCCL_ENABLE_TILE_TRANSFORM_DISPATCH) && _CCCL_TILE_COMPILATION()
struct tile_my_tanh {
    template <class T>
    __tile__ auto operator()(T v) const {
        namespace ct = cuda::tiles;
        return ct::element_cast<ct::tile_element_t<T>>(
            ct::tanh(ct::element_cast<float>(v)));
    }
};

And third, the trait specialization that links those two functors:

CUB_NAMESPACE_BEGIN
namespace detail::transform::tile {
    template <class T>
    struct tile_eligible<my_tanh, T, 1> : std::true_type {
        using tile_op_type = tile_my_tanh;
    };
    // Optional: hint the tile policy picker that this is MUFU-heavy.
    template <> struct tile_mufu_heavy<my_tanh> : std::true_type {};
}
CUB_NAMESPACE_END
#endif

The general idea is that as cutile c++ gets more and more mature and performant, we can correspondingly register more and more cases until it covers all the cases.

The dispatch flow is as follows:

                 cub::DeviceTransform::Transform(.., op, ..)
                                │
                                ▼
              if constexpr (tile_dispatch_eligible_v<Op, OutIter, InIters...>):
                                │
              ┌─────────────────┴─────────────────┐
              ▼                                   ▼
       runtime preconditions OK?            standard CUB dispatch
       (16B-aligned pointers, 
        num_items % 16 == 0, 
        num_items ≤ 2^31)
              │
       ┌──────┴──────┐
       ▼             ▼
   tile kernel    standard CUB

Those runtime preconditions exist because current bad cutile performance on those cases.

fbusato

I did a first pass for the library implementation. The implementation is already great!!
I pointed out some compatibility and stylistic issues.

fbusato · 2026-06-04T17:41:20Z

+#  pragma system_header
+#endif // no system header
+
+#if _CCCL_CTK_AT_LEAST(13, 3)


We need to check for C++20 and _CCCL_TILE_COMPILATION

better to create a dedicated macro for it

fbusato · 2026-06-04T17:41:52Z

+
+#  include <cuda_tile.h>
+
+#  include <cstdint>


Suggested change

# include <cstdint>

# include <cuda/std/cstdint>

fbusato · 2026-06-04T17:42:58Z

+
+template <int TileSize, typename Fn, typename Out, typename... Ins>
+__tile_global__ void
+transform_kernel(int64_t num_items_, Out* __restrict__ out_, const Ins* __restrict__... ins_)


bit. I would remove the postfix _

fbusato · 2026-06-04T17:43:34Z

+__tile_global__ void
+transform_kernel(int64_t num_items_, Out* __restrict__ out_, const Ins* __restrict__... ins_)
+{
+  namespace ct = cuda::tiles;


Suggested change

namespace ct = cuda::tiles;

namespace ct = ::cuda::tiles;

do it work, right?

fbusato · 2026-06-04T17:44:36Z

+  const auto bx = ct::bid().x;
+  Fn fn{};
+
+  auto num_items = ct::assume_bounded_below<0>(ct::assume_divisible<16>(num_items_));


please add a comment why we need assume_divisible and assume_bounded_below for future reference

Suggested change

auto num_items = ct::assume_bounded_below<0>(ct::assume_divisible<16>(num_items_));

const auto num_items = ct::assume_bounded_below<0>(ct::assume_divisible<16>(num_items_));

fbusato · 2026-06-04T18:26:08Z

+  return 12 * 1024;
+}
+
+constexpr int min_size(int a)


please use ::cuda::std::min

fbusato · 2026-06-04T18:26:45Z

+// registers and the compiler unpacks them and packs them back. reducing the
+// compute work per thread helps here. need profiling to know the exact cause.
+template <typename Out, typename... Ins>
+constexpr int pick_tile_size(bool mufu_heavy = false, int cc_x10 = 1000)


can we use compute_capability object instead?

fbusato · 2026-06-04T18:27:18Z

+constexpr int pick_tile_size(bool mufu_heavy = false, int cc_x10 = 1000)
+{
+  constexpr int threads_per_block    = 128;
+  constexpr int vector_bytes         = 16; // LDG.E.128 -> 16 bytes


Blackwell also has LDG.E.256 -> 32 bytes

fbusato · 2026-06-04T18:27:54Z

+  constexpr int items_for_vec = static_cast<int>(::cuda::ceil_div(vector_bytes, min_elem));
+
+  // Fill (zero inputs) keeps the same latency target by counting output bytes.
+  constexpr int bytes_per_iter = (sizeof...(Ins) > 0) ? (int(sizeof(Ins)) + ... + 0) : int(sizeof(Out));


no need for int()

fbusato · 2026-06-04T18:28:15Z

+  const int items_for_latency =
+    static_cast<int>(::cuda::ceil_div(target, max_occupancy * threads_per_block * bytes_per_iter));
+
+  int items = items_for_vec > items_for_latency ? items_for_vec : items_for_latency;


nanan-nvidia added 7 commits June 1, 2026 14:29

tile DeviceTransform policy picker

7e4ca6b

tile DeviceTransform kernels + public API

76e4208

bench_init RNG helper

1631be8

babelstream tile bench

91e9b1a

pytorch tile bench

9dbdcad

copy/grayscale/fill tile benches

eb9df56

tile DeviceTransform tests

41fd20d

nanan-nvidia self-assigned this Jun 1, 2026

github-project-automation Bot added this to CCCL Jun 1, 2026

github-project-automation Bot moved this to Todo in CCCL Jun 1, 2026

cccl-authenticator-app Bot moved this from Todo to In Progress in CCCL Jun 1, 2026

nanan-nvidia mentioned this pull request Jun 1, 2026

Functional/Non-Functional C++ Tile Evaluation #9038

Open

miscco reviewed Jun 2, 2026

View reviewed changes

bernhardmgruber reviewed Jun 2, 2026

View reviewed changes

nanan-nvidia added 10 commits June 3, 2026 13:27

add tile dispatch trait header

c519dfa

split tile dispatch into kernel and tuning headers

baa42f5

wire tile dispatch hook into cub::DeviceTransform

f152947

drop runtime gates from tile dispatch helper

1f306a1

add runtime alignment check before routing to tile

e51fd87

drop _CCCL_TILE from _CCCL_API to unblock CUB under enable-tile

e0a31e4

fall back to standard CUB dispatch when tile preconditions fail

f86d492

migrate tile benches and tests to cub::DeviceTransform

42bb308

fbusato requested changes Jun 4, 2026

View reviewed changes

		auto num_items = ct::assume_bounded_below<0>(ct::assume_divisible<16>(num_items_));
		auto out = ct::assume_aligned<16>(out_);

	auto num_items = ct::assume_bounded_below<0>(ct::assume_divisible<16>(num_items_));
	const auto num_items = ct::assume_bounded_below<0>(ct::assume_divisible<16>(num_items_));

Conversation

nanan-nvidia commented Jun 1, 2026

Uh oh!

copy-pr-bot Bot commented Jun 1, 2026

Uh oh!

miscco left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nanan-nvidia Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bernhardmgruber commented Jun 2, 2026

Uh oh!

nanan-nvidia commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fbusato left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

nanan-nvidia Jun 2, 2026 •

edited

Loading

nanan-nvidia commented Jun 4, 2026 •

edited

Loading