
Benchmarks

This directory contains standalone micro-benchmarks for key kernels.

Prerequisites

  • Install dependencies per the project README.
  • Additionally install plotting/data dependencies used by benchmarks:
    pip install matplotlib pandas

Run all benchmarks

From this directory:

bash run_all.sh

💡 Note: All benchmarks are validated on NVIDIA B200 GPUs. If you encounter Out-of-Memory (OOM) errors on other Blackwell GPUs (e.g., RTX 5080, RTX 5090), please reduce the test sizes in the benchmark scripts.

⚠️ Known Issue: After upgrading from CUDA 13.1 to CUDA 13.2, some benchmarks may show performance regressions.
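For reference, the behavior of a runner like run_all.sh can be sketched in Python. The discovery pattern (bench_*.py at the top level plus an experimental/ subdirectory) matches the script names listed below, but the exact behavior of run_all.sh itself is an assumption:

```python
import subprocess
import sys
from pathlib import Path


def discover_benchmarks(root: Path) -> list[Path]:
    """Find all bench_*.py scripts, including those under experimental/."""
    return sorted(root.glob("bench_*.py")) + sorted(root.glob("experimental/bench_*.py"))


def run_benchmarks(root: Path) -> dict[str, int]:
    """Run each discovered script with the current interpreter; return exit codes."""
    results = {}
    for script in discover_benchmarks(root):
        proc = subprocess.run([sys.executable, str(script)], cwd=root)
        results[script.relative_to(root).as_posix()] = proc.returncode
    return results


if __name__ == "__main__":
    codes = run_benchmarks(Path(__file__).parent)
    failed = [name for name, rc in codes.items() if rc != 0]
    print(f"{len(codes) - len(failed)}/{len(codes)} benchmarks passed")
```

Running the scripts through `sys.executable` keeps them in the same Python environment you installed the dependencies into.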

Run a single benchmark

Run the desired benchmark's Python file directly, for example:

python bench_matrix_multiplication.py

Available benchmark scripts:

  • bench_bmm.py
  • bench_dropout.py
  • bench_fused_attention.py
  • bench_group_gemm.py
  • bench_layernorm.py
  • bench_matrix_multiplication.py
  • bench_mix_triton_cutile.py
  • bench_mla.py
  • bench_mla_decoding.py
  • bench_persistent_matmul.py
  • bench_rmsnorm.py
  • bench_rope.py
  • bench_silu_and_mul.py
  • bench_softmax.py
  • bench_swiglu.py
  • experimental/bench_attention_backward.py
  • experimental/bench_fused_linear_cross_entropy.py
  • experimental/bench_fused_swiglu_mlp.py
  • experimental/bench_mhc.py
  • experimental/bench_rmsnorm_backward.py
  • experimental/bench_silu_and_mul_backward.py
  • experimental/bench_sparse_mla.py
  • experimental/bench_swiglu_backward.py
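The individual bench_*.py scripts are specific to this project's kernels, but the micro-benchmark pattern they follow (warm up, time repeated runs, report a robust statistic per problem size) can be sketched in plain Python. The SIZES list, warmup/repeat counts, and the placeholder workload below are illustrative assumptions; shrinking a list like SIZES is the kind of edit the OOM note above refers to:

```python
import statistics
import time

# Illustrative problem sizes; shrink this list if you hit OOM on smaller GPUs.
SIZES = [256, 512, 1024]


def workload(n: int) -> int:
    # Placeholder workload: stands in for a real kernel launch.
    return sum(i * i for i in range(n))


def bench(fn, *args, warmup: int = 3, repeats: int = 10) -> float:
    """Return the median wall-clock time of fn(*args) in milliseconds."""
    for _ in range(warmup):  # untimed runs to warm caches / JIT compilation
        fn(*args)
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        times.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(times)


if __name__ == "__main__":
    for n in SIZES:
        print(f"n={n}: {bench(workload, n):.3f} ms")
```

The median is reported rather than the mean so that one-off scheduling hiccups do not skew the result.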