# CUDA LLM Kernel Optimization

[![CI](https://github.com/LessUp/llm-speed/actions/workflows/ci.yml/badge.svg)](https://github.com/LessUp/llm-speed/actions/workflows/ci.yml)
[License: MIT](LICENSE)

[简体中文](README.md) | English

A high-performance CUDA operator library for LLM inference, featuring FlashAttention and optimized GEMM kernels.

## Features

- **FlashAttention**: online softmax (see the sketch after this list), O(N) memory, causal-mask support
- **High-Performance GEMM**: FP32/FP16/INT8 mixed precision, Tensor Core (WMMA)
- **Progressive Optimization**: naive → tiled → FlashAttention (double-buffered)
- **Register-Tiling GEMM**: 128×128 blocks + 8×8 register accumulation + double-buffered pipeline
- **PyTorch Integration**: pybind11 Python bindings, direct PyTorch tensor I/O
- **Property Testing**: Hypothesis-driven property-based tests

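The online-softmax recurrence is the heart of FlashAttention: attention scores are processed block by block while only a running row max, a running denominator, and an unnormalized accumulator are kept, which is what brings memory down to O(N). Below is a minimal single-head PyTorch sketch of that recurrence (a pedagogical reference, not the CUDA kernel itself; the function name and block size are illustrative):

```python
import torch

def online_softmax_attention(q, k, v, block=64):
    """Attention over K/V blocks using running (max, denominator) statistics."""
    n, d = q.shape
    scale = d ** -0.5
    m = torch.full((n, 1), float("-inf"))  # running row max
    l = torch.zeros(n, 1)                  # running softmax denominator
    acc = torch.zeros(n, d)                # unnormalized output accumulator
    for j in range(0, k.shape[0], block):
        s = (q @ k[j:j + block].T) * scale            # scores for this K/V block
        m_new = torch.maximum(m, s.max(-1, keepdim=True).values)
        correction = torch.exp(m - m_new)             # rescale old statistics
        p = torch.exp(s - m_new)                      # block-local exponentials
        l = l * correction + p.sum(-1, keepdim=True)
        acc = acc * correction + p @ v[j:j + block]
        m = m_new
    return acc / l                                    # normalize once at the end
```
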
## Installation

```bash
pip install -r requirements.txt
pip install -e .
```

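A quick smoke test that the extension built and loads (a sketch; it assumes a visible CUDA device):

```python
import torch
import cuda_llm_ops  # imports the compiled pybind11 extension

assert torch.cuda.is_available(), "a CUDA device is required"
print("cuda_llm_ops loaded on", torch.cuda.get_device_name())
```
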
### CMake Build

```bash
cmake --preset release
cmake --build --preset release
```

## Usage

```python
import torch
from cuda_llm_ops import flash_attention, gemm, tensor_core_gemm

# FlashAttention (causal mask); shapes here are illustrative:
# (batch, heads, seq_len, head_dim)
q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)
output = flash_attention(q, k, v, is_causal=True)

# High-performance GEMM (BLAS-style: C = alpha * A @ B + beta * C)
a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
c = gemm(a, b, alpha=1.0, beta=0.0)

# Tensor Core GEMM (FP16 inputs → FP32 output)
c_fp32 = tensor_core_gemm(a.half(), b.half())
```

## Testing

```bash
pytest tests/ -v                          # all tests
pytest tests/ -v -m property              # property-based tests only
python benchmarks/benchmark_attention.py  # benchmarks
```

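The property tests drive the kernels with Hypothesis-generated inputs and compare against a PyTorch reference. A minimal sketch of what such a test can look like (names, shapes, and tolerances are illustrative, not the repo's actual test code):

```python
import pytest
import torch
from hypothesis import given, settings, strategies as st

from cuda_llm_ops import flash_attention

@pytest.mark.property
@settings(deadline=None, max_examples=20)
@given(seq_len=st.integers(min_value=16, max_value=256))
def test_flash_attention_matches_reference(seq_len):
    q = torch.randn(1, 4, seq_len, 64, device="cuda", dtype=torch.float16)
    k, v = torch.randn_like(q), torch.randn_like(q)
    out = flash_attention(q, k, v, is_causal=True)
    # PyTorch's SDPA serves as the ground-truth oracle
    ref = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
    torch.testing.assert_close(out, ref, rtol=2e-2, atol=2e-2)
```
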
## GPU Architecture Support

| Architecture | Compute Capability | Features |
|--------------|--------------------|----------|
| Volta  | 7.0      | FP16 Tensor Core     |
| Turing | 7.5      | FP16 + INT8          |
| Ampere | 8.0, 8.6 | TF32 + async copy    |
| Ada    | 8.9      | FP8                  |
| Hopper | 9.0      | TMA + Warp Group MMA |

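To see which row applies to your GPU, query its compute capability from PyTorch:

```python
import torch

# e.g. (8, 6) on an Ampere RTX 30-series GPU
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
```
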
## License

MIT License