Commit a84d491

Author: shijiashuai
Message: docs: add bilingual README and update configs
Parent: c4461d9

3 files changed: 79 additions & 1 deletion

README.en.md

Lines changed: 69 additions & 0 deletions
New file with 69 added lines:

# CUDA LLM Kernel Optimization

[![CI](https://github.com/LessUp/llm-speed/actions/workflows/ci.yml/badge.svg)](https://github.com/LessUp/llm-speed/actions/workflows/ci.yml)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
![CUDA](https://img.shields.io/badge/CUDA-11.0+-76B900?logo=nvidia&logoColor=white)

[简体中文](README.md) | English

High-performance CUDA operator library for LLM inference optimization, including FlashAttention and high-performance GEMM kernels.

## Features

- **FlashAttention**: Online softmax, O(N) memory, causal mask support
- **High-Performance GEMM**: FP32/FP16/INT8 mixed precision, Tensor Core (WMMA)
- **Progressive Optimization**: Naive → Tiled → FlashAttention (double-buffered)
- **Register Tiling GEMM**: 128×128 blocks + 8×8 register accumulation + double-buffered pipeline
- **PyTorch Integration**: pybind11 Python bindings, direct PyTorch Tensor I/O
- **Property Testing**: Hypothesis-driven property-based tests
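The online-softmax trick behind the FlashAttention feature can be illustrated with a small NumPy sketch. This is purely illustrative; the function and variable names below are not part of the library's API:

```python
import numpy as np

def online_softmax(x, block=4):
    """One-pass softmax over blocks of x, keeping a running max (m)
    and a rescaled running sum (l) -- the trick that lets FlashAttention
    avoid materializing the full attention-score row (hence O(N) memory)."""
    m = -np.inf  # running max seen so far
    l = 0.0      # running sum of exp(x - m)
    for i in range(0, len(x), block):
        xb = x[i:i + block]
        m_new = max(m, xb.max())
        # rescale the old sum to the new max, then fold in this block
        l = l * np.exp(m - m_new) + np.exp(xb - m_new).sum()
        m = m_new
    # second pass here only to produce the demo output; the kernel
    # instead rescales its partial outputs on the fly
    return np.exp(x - m) / l

x = np.random.randn(16)
ref = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(online_softmax(x), ref)
```

The key invariant is that `l` always equals the sum of exponentials relative to the current maximum, so no block's scores need to be kept around after processing.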
## Installation

```bash
pip install -r requirements.txt
pip install -e .
```

### CMake Build

```bash
cmake --preset release
cmake --build --preset release
```
## Usage

```python
from cuda_llm_ops import flash_attention, gemm, tensor_core_gemm

# FlashAttention (causal mask)
output = flash_attention(q, k, v, is_causal=True)

# High-performance GEMM
c = gemm(a, b, alpha=1.0, beta=0.0)

# Tensor Core GEMM (FP16 → FP32)
c_fp32 = tensor_core_gemm(a, b)
```
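Assuming `gemm` follows the conventional BLAS definition (C ← αAB + βC, which with `beta=0.0` as in the usage example reduces to α·AB), a NumPy reference is handy for validating kernel output. `gemm_ref` is a hypothetical helper, not part of the package:

```python
import numpy as np

def gemm_ref(a, b, alpha=1.0, beta=0.0, c=None):
    """NumPy reference for a BLAS-style GEMM: alpha * (a @ b) + beta * c.
    Hypothetical check helper; with beta=0.0 the c term drops out."""
    out = alpha * (a @ b)
    if beta != 0.0 and c is not None:
        out = out + beta * c
    return out

a = np.random.randn(8, 16).astype(np.float32)
b = np.random.randn(16, 4).astype(np.float32)
assert np.allclose(gemm_ref(a, b), a @ b, atol=1e-5)
```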
## Testing

```bash
pytest tests/ -v                          # All tests
pytest tests/ -v -m property              # Property tests
python benchmarks/benchmark_attention.py  # Benchmarks
```
## GPU Architecture Support

| Arch | SM | Features |
|------|-----|----------|
| Volta | 7.0 | FP16 Tensor Core |
| Turing | 7.5 | FP16 + INT8 |
| Ampere | 8.0, 8.6 | TF32 + async copy |
| Ada | 8.9 | FP8 |
| Hopper | 9.0 | TMA + Warp Group MMA |
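The SM versions in the table map directly onto `CMAKE_CUDA_ARCHITECTURES` values when configuring the CMake build. A hypothetical fragment (not taken from this repo's actual presets) covering every listed architecture might look like:

```cmake
# Illustrative only: build fatbins for SM 7.0/7.5/8.0/8.6/8.9/9.0
set(CMAKE_CUDA_ARCHITECTURES "70;75;80;86;89;90")
```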
## License

MIT License

README.md

Lines changed: 4 additions & 1 deletion
```diff
@@ -1,7 +1,10 @@
 # CUDA LLM Kernel Optimization
 
-[![CI](https://github.com/user/llm-speed/actions/workflows/ci.yml/badge.svg)](https://github.com/user/llm-speed/actions/workflows/ci.yml)
+[![CI](https://github.com/LessUp/llm-speed/actions/workflows/ci.yml/badge.svg)](https://github.com/LessUp/llm-speed/actions/workflows/ci.yml)
+[![Docs](https://img.shields.io/badge/Docs-GitHub%20Pages-blue?logo=github)](https://lessup.github.io/llm-speed/)
 [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
+
+简体中文 | [English](README.en.md)
 ![CUDA](https://img.shields.io/badge/CUDA-11.0+-76B900?logo=nvidia&logoColor=white)
 ![C++](https://img.shields.io/badge/C%2B%2B-17-00599C?logo=c%2B%2B&logoColor=white)
 ![Python](https://img.shields.io/badge/Python-3.8+-3776AB?logo=python&logoColor=white)
```

_config.yml

Lines changed: 6 additions & 0 deletions
```diff
@@ -1,5 +1,11 @@
 title: LLM-Speed
 description: CUDA LLM Kernel Optimization — FlashAttention, FP16/INT8 GEMM with Tensor Core
 remote_theme: pages-themes/cayman@v0.2.0
+
+url: "https://lessup.github.io"
+baseurl: "/llm-speed"
+lang: zh-CN
+
 plugins:
 - jekyll-remote-theme
+- jekyll-seo-tag
```
