1+ ---
2+ layout: default
3+ title: LLM-Speed
4+ ---
5+
6+ # LLM-Speed
7+
8+ CUDA LLM Kernel Optimization: a high-performance library of LLM inference operators, including FlashAttention (online softmax) and FP16/INT8 GEMM on Tensor Cores.
9+
10+ ## Key Features
11+
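12+ - **FlashAttention**: online softmax implementation with causal masking support
13+ - **FP16 HGEMM**: half-precision matrix multiplication accelerated by Tensor Cores
14+ - **INT8 GEMM**: quantized matrix multiplication on Tensor Cores (SM 75+)
15+ - **Warp Primitives**: efficient warp-level reduction / scan
16+ - **Shared-memory optimization**: bank-conflict-free access patterns
17+ - **Python bindings**: Python interface provided via pybind11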
18+
19+ ## Kernel Implementations
20+
21+ | Kernel | Key techniques | Architecture requirement |
22+ | --- | --- | --- |
23+ | Naive Attention | QK^T in shared memory | SM 70+ |
24+ | Tiled Attention | Tiled computation + streaming softmax | SM 70+ |
25+ | Flash Attention | Online softmax + causal masking | SM 70+ |
26+ | HGEMM | WMMA Tensor Core (FP16→FP32) | SM 70+ |
27+ | Tensor Core GEMM | INT8/FP16 mixed precision | SM 75+ |
28+
29+ ## Quick Start
30+
31+ ```bash
32+ # Build with CMake
33+ cmake --preset release
34+ cmake --build build/release -j$(nproc)
35+
36+ # Install the Python package
37+ pip install -e .
38+
39+ # Run the tests
40+ pytest tests/
41+ ```
42+
43+ ## Tech Stack
44+
45+ | Category | Technology |
46+ | --- | --- |
47+ | Language | CUDA C++17, Python |
48+ | Build | CMake 3.18+, setup.py (CUDAExtension) |
49+ | Bindings | pybind11 v2.11.1 |
50+ | GPU | SM 70+ (Volta → Hopper) |
51+ | Testing | pytest + Hypothesis |
52+
53+ ## Links
54+
55+ - [GitHub repository](https://github.com/LessUp/llm-speed)
56+ - [README](README.md)
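
The page names online softmax as the core of the FlashAttention kernels but does not show it. The pure-Python sketch below illustrates the general idea only: a softmax-weighted sum is computed in one pass over blocks of scores by keeping a running maximum, a running normalizer, and a rescaled accumulator. It is not taken from the repository; the function names `reference_weighted_sum` and `online_softmax_weighted_sum`, the `block_size` parameter, and the scalar (single query, scalar values) setting are assumptions made for illustration, and the real kernels operate on CUDA tiles.

```python
import math


def reference_weighted_sum(scores, values):
    """Two-pass softmax(scores) dot values, used only as a correctness check."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e * v for e, v in zip(exps, values)) / total


def online_softmax_weighted_sum(scores, values, block_size=4):
    """Single-pass version over blocks (hypothetical helper, non-empty input assumed)."""
    running_max = -math.inf  # largest score seen so far
    normalizer = 0.0         # sum of exp(score - running_max) so far
    acc = 0.0                # sum of exp(score - running_max) * value so far
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        new_max = max(running_max, max(s_blk))
        # Rescale everything accumulated under the old maximum.
        scale = math.exp(running_max - new_max)
        normalizer *= scale
        acc *= scale
        for s, v in zip(s_blk, v_blk):
            p = math.exp(s - new_max)
            normalizer += p
            acc += p * v
        running_max = new_max
    return acc / normalizer


if __name__ == "__main__":
    scores = [0.5, 2.0, -1.0, 3.5, 0.0, 1.25, -2.0]
    values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
    print(online_softmax_weighted_sum(scores, values, block_size=3))
    print(reference_weighted_sum(scores, values))
```

The two printed numbers should agree up to floating-point rounding. The point of the rescaling step is that no block needs to know the global maximum in advance, which is what lets a tiled kernel avoid materializing the full score row.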