Benchmarks

This directory contains standalone micro-benchmarks for key kernels.

Prerequisites

Install the project dependencies as described in the project README.
Then install the plotting and data dependencies used by the benchmarks:
```
pip install matplotlib pandas
```

Run all benchmarks

From this directory:

bash run_all.sh

💡 Note: All benchmarks are validated on NVIDIA B200 GPUs. If you encounter Out-of-Memory (OOM) errors on other Blackwell GPUs (e.g., RTX 5080, RTX 5090), please reduce the test sizes in the benchmark scripts.
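As a rough aid for picking smaller test sizes, the sketch below estimates the memory a dense matmul problem needs and drops the sizes that exceed a budget. It is illustrative only: `matmul_bytes` and `sizes_that_fit` are hypothetical helpers, not part of the benchmark scripts, and the estimate assumes 2-byte elements (e.g., fp16) and counts only the two inputs and the output, ignoring workspace and framework overhead.

```python
def matmul_bytes(m: int, n: int, k: int, bytes_per_element: int = 2) -> int:
    """Bytes for A (m x k), B (k x n), and C (m x n) stored densely.

    Assumption: 2 bytes per element (fp16/bf16) unless overridden.
    """
    return (m * k + k * n + m * n) * bytes_per_element


def sizes_that_fit(sizes, budget_bytes: int):
    """Keep only the (m, n, k) tuples whose estimated footprint fits the budget."""
    return [size for size in sizes if matmul_bytes(*size) <= budget_bytes]


# Hypothetical size list; the real scripts may define sizes differently.
candidate_sizes = [(1024, 1024, 1024), (8192, 8192, 8192)]
small_enough = sizes_that_fit(candidate_sizes, budget_bytes=16 * 2**20)
```

Leave generous headroom below the GPU's total memory when choosing the budget, since the estimate covers only the operand tensors.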

⚠️ Known Issue: Some benchmarks may show performance regressions after upgrading from CUDA 13.1 to CUDA 13.2.

Run a single benchmark

Execute the specific Python file, for example:

python bench_matrix_multiplication.py

Available benchmark scripts:

bench_bmm.py
bench_dropout.py
bench_fused_attention.py
bench_group_gemm.py
bench_layernorm.py
bench_matrix_multiplication.py
bench_mix_triton_cutile.py
bench_mla.py
bench_mla_decoding.py
bench_persistent_matmul.py
bench_rmsnorm.py
bench_rope.py
bench_silu_and_mul.py
bench_softmax.py
bench_swiglu.py
experimental/bench_attention_backward.py
experimental/bench_fused_linear_cross_entropy.py
experimental/bench_fused_swiglu_mlp.py
experimental/bench_mhc.py
experimental/bench_rmsnorm_backward.py
experimental/bench_silu_and_mul_backward.py
experimental/bench_sparse_mla.py
experimental/bench_swiglu_backward.py
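
To run a chosen subset of these scripts and see which ones fail, a small Python driver can help. The sketch below is a minimal example and not a description of what `run_all.sh` does: `find_benchmarks` and `run_benchmarks` are hypothetical helpers, and the sketch assumes every benchmark follows the `bench_*.py` naming shown above and can be launched without arguments.

```python
import subprocess
import sys
from pathlib import Path


def find_benchmarks(root: Path, include_experimental: bool = False) -> list[Path]:
    """Return the bench_*.py scripts under root, sorted by name."""
    scripts = sorted(root.glob("bench_*.py"))
    if include_experimental:
        scripts += sorted((root / "experimental").glob("bench_*.py"))
    return scripts


def run_benchmarks(scripts: list[Path], cwd: Path) -> dict[str, int]:
    """Run each script in its own interpreter and map its path to its exit code."""
    results = {}
    for script in scripts:
        # A separate process per script keeps one crash (e.g. OOM) from
        # stopping the remaining benchmarks.
        completed = subprocess.run([sys.executable, str(script)], cwd=cwd)
        results[script.relative_to(cwd).as_posix()] = completed.returncode
    return results
```

From this directory, `run_benchmarks(find_benchmarks(Path.cwd()), Path.cwd())` returns a dictionary of exit codes; any nonzero value marks a script that failed.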
