Commit ad15f55

Address review comments on build time docs
1 parent 9b3f767 commit ad15f55

1 file changed

Lines changed: 25 additions & 7 deletions

include/ck/BUILD_TIME_OPTIMIZATION.md

@@ -8,6 +8,16 @@ This document describes techniques for reducing C++ template instantiation overh
 
 Composable Kernel relies heavily on C++ template metaprogramming to achieve GPU kernels with no runtime abstraction penalty. However, deep template instantiation can significantly impact build times. A single translation unit may trigger hundreds of thousands of template instantiations, with each instantiation adding to compile time.
 
+## Key Types
+
+This codebase uses compile-time types to enable zero-overhead abstractions:
+
+- `Number<N>` - compile-time integer, enables static dispatch and compile-time arithmetic
+- `Sequence<Is...>` - compile-time integer sequence, used for dimension ordering and index manipulation
+- `Tuple<Ts...>` - heterogeneous container holding different types, used for tensor descriptors and transforms
+
+These types allow the compiler to fully unroll loops, eliminate branches, and inline all operations - producing GPU kernels with no runtime abstraction cost.
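A minimal sketch of what the three key types could look like, to make the list above concrete. These are simplified stand-ins written for illustration, not the definitions from the CK headers: `index_t` is assumed to be `int`, the `At<I>()` and `Size()` members mirror how the snippets in this document use them, and `Tuple` is aliased to `std::tuple` purely as a placeholder.

```cpp
#include <tuple>
#include <type_traits>

using index_t = int; // assumption: a signed 32-bit index type

// Compile-time integer: the value is part of the type.
template <index_t N>
struct Number : std::integral_constant<index_t, N>
{
};

// Adding two Numbers yields a new type, so the sum is known at compile time.
template <index_t A, index_t B>
constexpr Number<A + B> operator+(Number<A>, Number<B>)
{
    return Number<A + B>{};
}

// Compile-time integer sequence with size and element lookup.
template <index_t... Is>
struct Sequence
{
    static constexpr index_t Size() { return static_cast<index_t>(sizeof...(Is)); }

    template <index_t I>
    static constexpr index_t At()
    {
        static_assert(I >= 0 && I < Size(), "index out of range");
        constexpr index_t values[] = {Is..., 0}; // trailing 0 keeps the array non-empty
        return values[I];
    }
};

// Heterogeneous container; std::tuple stands in for CK's own Tuple here.
template <typename... Ts>
using Tuple = std::tuple<Ts...>;
```

Because every value lives in a type, an expression such as `Number<2>{} + Number<3>{}` has type `Number<5>`, and `Sequence<4, 8, 15>::At<1>()` can be used inside a `static_assert`.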
+
 ## Optimization Techniques
 
 ### 1. Replace Recursive Templates with Pack Expansion
@@ -65,7 +75,7 @@ struct sequence_gen
 };
 ```
 
-Note: While `std::make_integer_sequence` is the standard C++14 way to generate integer sequences, it only produces `std::integer_sequence<T, ...>`. We use `__make_integer_seq` directly because it accepts any template as its first argument, enabling this pattern where the helper class receives the index pack directly.
+Note: This document assumes C++17 or later. While `std::make_integer_sequence` (introduced in C++14) is the standard library facility for generating integer sequences, it only produces `std::integer_sequence<T, ...>`. We use `__make_integer_seq` directly because it accepts any template as its first argument, enabling this pattern where the helper class receives the index pack directly.
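To illustrate the difference the note describes, here is a hedged sketch that builds a custom sequence type both ways. `Seq`, `to_seq_helper`, `to_seq`, and `make_seq` are hypothetical names invented for this example, not CK identifiers. `__make_integer_seq` is a compiler builtin (available in Clang, among others), so the sketch guards it with `__has_builtin` and falls back to the standard facility elsewhere.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for a library-specific sequence type.
template <int... Is>
struct Seq
{
    static constexpr std::size_t size = sizeof...(Is);
};

#if defined(__has_builtin)
#if __has_builtin(__make_integer_seq)
#define SKETCH_HAS_MAKE_INTEGER_SEQ 1
#endif
#endif

#ifdef SKETCH_HAS_MAKE_INTEGER_SEQ
// The builtin instantiates to_seq_helper<int, 0, 1, ..., N-1> directly,
// so the helper receives the index pack with no intermediate type.
template <typename T, T... Is>
struct to_seq_helper
{
    using type = Seq<Is...>;
};

template <int N>
using make_seq = typename __make_integer_seq<to_seq_helper, int, N>::type;
#else
// Portable fallback: std::make_integer_sequence can only produce
// std::integer_sequence, so an extra specialization unpacks it.
template <typename S>
struct to_seq;

template <int... Is>
struct to_seq<std::integer_sequence<int, Is...>>
{
    using type = Seq<Is...>;
};

template <int N>
using make_seq = typename to_seq<std::make_integer_sequence<int, N>>::type;
#endif
```

Both branches yield the same type, for example `make_seq<4>` is `Seq<0, 1, 2, 3>`. The builtin branch skips the intermediate `std::integer_sequence` instantiation and the unpacking specialization.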
 
 ### 2. Replace Lambdas with Named Functors
 
@@ -153,11 +163,18 @@ Template recursion creates N template instantiations for N iterations. A constex
 **Before** (O(N) template instantiations):
 
 ```cpp
-template <index_t Target, typename Seq, index_t Pos>
+// Simplified example - actual CK code used more complex recursive patterns
+template <index_t Target, typename Seq, index_t Pos, bool AtEnd>
 struct find_source_index_impl
 {
     static constexpr index_t value =
-        (Seq::template At<Pos>() == Target) ? Pos : find_source_index_impl<Target, Seq, Pos+1>::value;
+        (Seq::template At<Pos>() == Target) ? Pos : find_source_index_impl<Target, Seq, Pos+1, (Pos+1 == Seq::Size())>::value;
+};
+
+template <index_t Target, typename Seq, index_t Pos>
+struct find_source_index_impl<Target, Seq, Pos, true>
+{
+    static constexpr index_t value = -1; // not found
 };
 ```
 
@@ -167,10 +184,11 @@ struct find_source_index_impl
 template <index_t Target, index_t... Is>
 __host__ __device__ constexpr index_t find_source_index(Sequence<Is...>)
 {
+    // Simplified example - actual implementation handles empty sequences
     constexpr index_t values[] = {Is...};
     for(index_t i = 0; i < sizeof...(Is); ++i)
         if(values[i] == Target) return i;
-    return 0;
+    return -1; // not found
 }
 ```
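The snippet above notes that the real implementation handles empty sequences. One way this could be done is sketched below as a host-only variant. It is not the actual CK code: `index_t` and `Sequence` are minimal stand-ins, the `__host__ __device__` qualifiers are dropped so it compiles as plain C++, and the trailing padding element is one possible workaround among several.

```cpp
#include <cstddef>

using index_t = int; // assumption: a signed 32-bit index type

// Minimal stand-in for the sequence type; it only carries the pack.
template <index_t... Is>
struct Sequence
{
};

template <index_t Target, index_t... Is>
constexpr index_t find_source_index(Sequence<Is...>)
{
    // A zero-length array is ill-formed, so one padding element is appended.
    // The loop bound below excludes it from the search.
    constexpr index_t values[] = {Is..., 0};
    constexpr index_t count    = static_cast<index_t>(sizeof...(Is));

    for(index_t i = 0; i < count; ++i)
        if(values[i] == Target)
            return i;
    return -1; // not found
}
```

The whole search runs inside a single function instantiation per `(Target, Is...)` combination, and the explicit cast avoids the signed/unsigned comparison between `index_t` and `sizeof...(Is)`.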
 
@@ -180,14 +198,14 @@ This reduced `sequence_map_inverse` instantiations from 45 to 10 (78% reduction)
 
 Fold expressions (C++17) can replace recursive template patterns for accumulation operations.
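As a standalone illustration of the idea, the sketch below accumulates `(length - 1) * stride` over all dimensions with one fold expression. It uses `std::array` and `std::index_sequence` rather than CK's tuple and `Number` types, and the function names are hypothetical, so treat it as a demonstration of the language feature rather than the code in this change.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Hypothetical helper: expands the index pack once and folds over '+'.
template <std::size_t N, std::size_t... Is>
constexpr long element_space_size_impl(const std::array<long, N>& lengths,
                                       const std::array<long, N>& strides,
                                       std::index_sequence<Is...>)
{
    // Binary right fold with initial value 1:
    // (l0-1)*s0 + ((l1-1)*s1 + (... + 1))
    return (((lengths[Is] - 1) * strides[Is]) + ... + 1L);
}

template <std::size_t N>
constexpr long element_space_size(const std::array<long, N>& lengths,
                                  const std::array<long, N>& strides)
{
    return element_space_size_impl(lengths, strides, std::make_index_sequence<N>{});
}
```

For a packed 2 x 3 x 4 layout with strides 12, 4, 1 this evaluates to 12 + 8 + 3 + 1 = 24. No per-element helper templates are instantiated beyond the single `_impl` function.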
 
-**Before** (implicit recursion through generate_tuple and container_reduce):
+**Before** (uses helper utilities that hide template recursion: `generate_tuple` recursively constructs a tuple of N elements, and `container_reduce` recursively reduces that tuple):
 
 ```cpp
 const auto element_space_size = container_reduce(
     generate_tuple([&](auto i) {
-        return (lengths[i] - I1) * strides[i];
+        return (lengths[i] - Number<1>{}) * strides[i];
     }, Number<N>{}),
-    math::plus{}, LongNumber<1>{});
+    math::plus{}, Number<1>{});
 ```
 
 **After** (single fold expression):
