```diff
  # Compute sigmoid(a) and silu(a)
- sigmoid_a = sigmoid(a_tile_f32)
+ sigmoid_a = 1.0 / (1.0 + ct.exp(-a_tile_f32))
```
Inlined this for now because I didn’t want to modify the backward kernel in this PR - that would require re-benchmarking it as well. I have additional optimizations planned that I’ll include in a separate PR, which will also make use of the new sigmoid implementation I added.
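For reference, the inlined expression matches the standard definitions. A minimal NumPy sketch of what the kernel computes (my illustration using plain float32 arrays, not cuTILE tiles):

```python
import numpy as np

def sigmoid_ref(x):
    # Standard logistic sigmoid: 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def silu_ref(x):
    # SiLU (a.k.a. swish): x * sigmoid(x)
    return x * sigmoid_ref(x)

a = np.array([-2.0, 0.0, 2.0], dtype=np.float32)
print(silu_ref(a))
```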
```diff
  def sigmoid(x):
-     return 1.0 / (1.0 + ct.exp(-x))
+     denom = ct.add(1.0, ct.exp(-x), flush_to_zero=True)
+     return ct.truediv(1.0, denom, flush_to_zero=True, rounding_mode=RMd.APPROX)
```
A good chunk of the savings came from RMd.APPROX, without losing precision (verified via tests).
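A quick way to sanity-check a precision claim like this is to compare a reduced-precision sigmoid against a float64 reference over a wide input range. A sketch in NumPy (not the PR's actual test; the 1e-3 tolerance is my assumption, not the project's threshold):

```python
import numpy as np

def sigmoid_f32(x):
    # float32 arithmetic stands in for the kernel's fast path
    x = x.astype(np.float32)
    return (1.0 / (1.0 + np.exp(-x))).astype(np.float32)

def sigmoid_f64(x):
    # float64 reference implementation
    x = x.astype(np.float64)
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-10.0, 10.0, 10001)
err = np.max(np.abs(sigmoid_f32(x) - sigmoid_f64(x)))
print(err)  # maximum absolute deviation from the float64 reference
```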
```diff
  # Sigmoid requires type float32
  c_tile = silu(a_tile.astype(ct.float32)).astype(a.dtype) * b_tile
  ct.store(c, index=(row, col), tile=c_tile)
+ a_tile = ct.gather(a, (row, offsets), check_bounds=True, padding_value=0.0)
```
A good chunk of the perf improvements came from using gather/scatter instead of load/store.
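The bounds-checked gather with a padding value behaves roughly like the following NumPy sketch (my illustration of the semantics; the real `ct.gather` operates on tiles on-device):

```python
import numpy as np

def gather_1d(row, offsets, check_bounds=True, padding_value=0.0):
    # Out-of-range offsets yield padding_value instead of faulting
    if check_bounds:
        valid = (offsets >= 0) & (offsets < row.shape[0])
        safe = np.where(valid, offsets, 0)          # clamp invalid indices
        return np.where(valid, row[safe], padding_value)
    return row[offsets]

row = np.array([10.0, 20.0, 30.0])
offsets = np.array([0, 2, 5])           # 5 is out of bounds
print(gather_1d(row, offsets))          # -> [10. 30.  0.]
```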
```python
create_benchmark_config(batch_size, M, dtype)
for batch_size in [1, 4, 8]  # Different batch sizes
for M in [128, 4096]  # Different rows
for dtype in [torch.float16, torch.bfloat16, torch.float32]
```
Most benchmarks test across various dtypes, so I thought this one should too.
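The nested comprehension sweeps the full cross-product of batch sizes, row counts, and dtypes. A self-contained sketch of the pattern, with a hypothetical stand-in for `create_benchmark_config` and dtype names as strings (the real benchmark uses torch dtypes):

```python
def create_benchmark_config(batch_size, M, dtype):
    # Hypothetical stand-in: the real helper builds a benchmark case
    return {"batch_size": batch_size, "M": M, "dtype": dtype}

configs = [
    create_benchmark_config(batch_size, M, dtype)
    for batch_size in [1, 4, 8]             # Different batch sizes
    for M in [128, 4096]                    # Different rows
    for dtype in ["float16", "bfloat16", "float32"]  # torch dtypes in the real sweep
]
print(len(configs))  # 3 * 2 * 3 = 18 configurations
```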
Hey @hannahli-nv, another day, another cuTILE perf upgrade!

@xjmxyt Any chance I could get a review :)
Description
flush_to_zero=True + approximate reciprocal via rounding_mode=RMd.APPROX to reduce scalar math cost.

Benchmark Results (added bfloat16 + float32 in addition to float16)
Notes for PR:
CI Configuration
Checklist
./format.sh)