This directory contains standalone micro-benchmarks for key kernels.
- Install dependencies per the project README.
- Additionally install plotting/data dependencies used by benchmarks:
pip install matplotlib pandas
To run all benchmarks, from this directory:
bash run_all.sh

💡 Note: All benchmarks are validated on NVIDIA B200 GPUs. If you encounter Out-of-Memory (OOM) errors on other Blackwell GPUs (e.g., RTX 5080, RTX 5090), please reduce the test sizes in the benchmark scripts.
⚠️ Known Issue: Some benchmarks may show performance regressions after upgrading from CUDA 13.1 to CUDA 13.2.
To run a single benchmark, execute its Python file, for example:
python bench_matrix_multiplication.py

Available benchmark scripts:
- bench_bmm.py
- bench_dropout.py
- bench_fused_attention.py
- bench_group_gemm.py
- bench_layernorm.py
- bench_matrix_multiplication.py
- bench_mix_triton_cutile.py
- bench_mla.py
- bench_mla_decoding.py
- bench_persistent_matmul.py
- bench_rmsnorm.py
- bench_rope.py
- bench_silu_and_mul.py
- bench_softmax.py
- bench_swiglu.py
- experimental/bench_attention_backward.py
- experimental/bench_fused_linear_cross_entropy.py
- experimental/bench_fused_swiglu_mlp.py
- experimental/bench_mhc.py
- experimental/bench_rmsnorm_backward.py
- experimental/bench_silu_and_mul_backward.py
- experimental/bench_sparse_mla.py
- experimental/bench_swiglu_backward.py
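
To run several scripts without running the full suite, a small driver can loop over selected files. The following is a minimal sketch and is not part of this repository: the helper names `find_benchmarks` and `run_benchmarks` are hypothetical, and it assumes only that each script runs standalone via `python <script>` and follows the `bench_*.py` naming seen in the list. It does not reproduce whatever `run_all.sh` does internally.

```python
import subprocess
import sys
from pathlib import Path


def find_benchmarks(root, include_experimental=False):
    """Return absolute paths of bench_*.py files under root (hypothetical helper).

    Top-level scripts come first, sorted by name; scripts from the
    experimental/ subdirectory are appended only when requested.
    """
    root = Path(root)
    scripts = sorted(p.resolve() for p in root.glob("bench_*.py"))
    if include_experimental:
        scripts += sorted(p.resolve() for p in (root / "experimental").glob("bench_*.py"))
    return scripts


def run_benchmarks(scripts, python=sys.executable):
    """Run each script in its own process and return the paths that failed."""
    failed = []
    for script in scripts:
        print(f"==> {script.name}")
        # The child inherits the current working directory, so start this
        # driver from the benchmarks directory.
        result = subprocess.run([python, str(script)])
        if result.returncode != 0:
            failed.append(str(script))
    return failed
```

For example, `run_benchmarks(find_benchmarks("."))` runs only the top-level scripts and returns the ones that exited with a non-zero status. Running each script in a separate process keeps one failure, such as an OOM error, from stopping the remaining benchmarks.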