[Tile][WIP] tile DeviceTransform port#9210
Conversation
| constexpr int ceil_div(int a, int b) { return (a + b - 1) / b; } | ||
| constexpr int round_up_pow2(int x) { | ||
| int p = 1; while (p < x) p *= 2; return p; | ||
| } |
There was a problem hiding this comment.
Those could just be ::cuda::ceil_div and ::cuda::next_power_of_two
| auto num_items = ct::assume_bounded_below<0>(ct::assume_divisible<16>(num_items_)); | ||
| auto out = ct::assume_aligned<16>(out_); |
There was a problem hiding this comment.
Suggestion: For a fair comparison, the transform_kernel should support unaligned data as well. This can be just a branch dispatching two an aligned and an unaligned code path.
There was a problem hiding this comment.
Indeed! However, I have observed very bad performance on tile kernels in this case. When the pointer is not aligned, all loads will not be vectorized. CUB does way better with TMA & overcopying.
I am asking compiler people to see if they can add this heuristics, since having this will make a much more informative benchmark
There was a problem hiding this comment.
That is a very good idea! Please help them improve tile!
Unaligned inputs are less common but do appear. Examples are if you want to run a kernel on the output of a previous one, where the previous one also chose the problem size. Like, you partition an array by a predicate and then only transform the elements of the selected partition, etc. Such cases come up frequently in database or dataframe workloads.
|
I am genuinely impressed by the size of the tile kernel! It's really small and expressive. Nice! |
The cub_tile::DeviceTransform implementation moves from cub/benchmarks/bench/transform/tile/device_transform.cuh into cub/cub/device/dispatch/dispatch_transform_tile.cuh so the CUB header tree can reference it. The bench .cu files now include from the new path. The hand-rolled ceil_div and round_up_pow2 helpers are replaced with cuda::ceil_div and cuda::next_power_of_two from <cuda/cmath>.
The cub_tile namespace and its hand-rolled detail layout move under cub::detail::transform::tile to match how CUB groups the existing transform internals. A type alias is kept at cub_tile::DeviceTransform so the benches and tests still compile during the transition. The whole file body is also gated by _CCCL_CTK_AT_LEAST(13, 3) so older toolchains never see the tile DSL types.
|
NOTE: Not ready to merge due to regular device transform SIMT kernels will fail to compile with This now introduces tile kernels with no call site change for users. To opt in at build time, please compile with The basic idea is, at compile time, eligible First, a SIMT functor that will be called at the API: Second, a tile functor that has the same semantical meaning: And third, the trait specialization that links those two functors: The general idea is that as cutile c++ gets more and more mature and performant, we can correspondingly register more and more cases until it covers all the cases. The dispatch flow is as follows: Those runtime preconditions exist because current bad cutile performance on those cases.
|
fbusato
left a comment
There was a problem hiding this comment.
I did a first pass for the library implementation. The implementation is already great!!
I pointed out some compatibility and stylistic issues.
| # pragma system_header | ||
| #endif // no system header | ||
|
|
||
| #if _CCCL_CTK_AT_LEAST(13, 3) |
There was a problem hiding this comment.
We need to check for C++20 and _CCCL_TILE_COMPILATION
There was a problem hiding this comment.
better to create a dedicated macro for it
|
|
||
| # include <cuda_tile.h> | ||
|
|
||
| # include <cstdint> |
There was a problem hiding this comment.
| # include <cstdint> | |
| # include <cuda/std/cstdint> |
|
|
||
| template <int TileSize, typename Fn, typename Out, typename... Ins> | ||
| __tile_global__ void | ||
| transform_kernel(int64_t num_items_, Out* __restrict__ out_, const Ins* __restrict__... ins_) |
There was a problem hiding this comment.
bit. I would remove the postfix _
| __tile_global__ void | ||
| transform_kernel(int64_t num_items_, Out* __restrict__ out_, const Ins* __restrict__... ins_) | ||
| { | ||
| namespace ct = cuda::tiles; |
There was a problem hiding this comment.
| namespace ct = cuda::tiles; | |
| namespace ct = ::cuda::tiles; |
do it work, right?
| const auto bx = ct::bid().x; | ||
| Fn fn{}; | ||
|
|
||
| auto num_items = ct::assume_bounded_below<0>(ct::assume_divisible<16>(num_items_)); |
There was a problem hiding this comment.
please add a comment why we need assume_divisible and assume_bounded_below for future reference
There was a problem hiding this comment.
| auto num_items = ct::assume_bounded_below<0>(ct::assume_divisible<16>(num_items_)); | |
| const auto num_items = ct::assume_bounded_below<0>(ct::assume_divisible<16>(num_items_)); |
| return 12 * 1024; | ||
| } | ||
|
|
||
| constexpr int min_size(int a) |
There was a problem hiding this comment.
please use ::cuda::std::min
| // registers and the compiler unpacks them and packs them back. reducing the | ||
| // compute work per thread helps here. need profiling to know the exact cause. | ||
| template <typename Out, typename... Ins> | ||
| constexpr int pick_tile_size(bool mufu_heavy = false, int cc_x10 = 1000) |
There was a problem hiding this comment.
can we use compute_capability object instead?
| constexpr int pick_tile_size(bool mufu_heavy = false, int cc_x10 = 1000) | ||
| { | ||
| constexpr int threads_per_block = 128; | ||
| constexpr int vector_bytes = 16; // LDG.E.128 -> 16 bytes |
There was a problem hiding this comment.
Blackwell also has LDG.E.256 -> 32 bytes
| constexpr int items_for_vec = static_cast<int>(::cuda::ceil_div(vector_bytes, min_elem)); | ||
|
|
||
| // Fill (zero inputs) keeps the same latency target by counting output bytes. | ||
| constexpr int bytes_per_iter = (sizeof...(Ins) > 0) ? (int(sizeof(Ins)) + ... + 0) : int(sizeof(Out)); |
| const int items_for_latency = | ||
| static_cast<int>(::cuda::ceil_div(target, max_occupancy * threads_per_block * bytes_per_iter)); | ||
|
|
||
| int items = items_for_vec > items_for_latency ? items_for_vec : items_for_latency; |

Highly experimental, opening as Draft to share a work-in-progress port of
cub::DeviceTransform(#8087, #9038) onto cutile, and to compare side-by-side with the existing CUB benches. Not for merge.Before the benchmarks, it is important to note that SIMT-Tile interop for TileIR is still work in progress. Thus, right now, the custom function the user passes in must be a
__tile__function, and it must consist of tile operations and must be inlinable.Current B200 benchmark on
pytorchandbabel:pytorch (tile / cub/ delta = tile - cub, BW utilisation %)
babel
copy, grayscale, fill
We did not do benchmarks on
complex,fibandheavy:complex, cutile does not acceptstd::complexas a vaild type to form tiles.fib, with tile semantics, there is no 1-to-1 fair implementation intile. We can get one by abusingct::selectbut it is much slower.heavy, cutile lowers syntax likeT reg[N]to heap allocation.There was a more detailed write up on
fibandheavyhere: #9038 (comment)