A GitHub-hosted PyPI index for pre-compiled Python wheels of CUDA-enabled packages. Build once, install anywhere - no local compilation required.
Building packages like flash-attn and xformers from source takes 30+ minutes and requires the CUDA toolkit, build tools, and significant CPU/memory. This repository:
- ✅ Pre-builds wheels for multiple Python and CUDA versions
- ✅ Hosts a PEP 503 index on GitHub Pages
- ✅ Eliminates local compilation - just `pip install`
- ✅ Supports CUDA 12.x and 13.x via PyTorch nightly builds
Current packages available:
- flash-attn 2.8.3
GPU Compatibility: Wheels are compiled for CUDA compute capability 8.6 ONLY (RTX 3080/3090/3090 Ti and other Ampere sm_86 GPUs; note the A100 is sm_80 and is NOT covered). This limitation is necessary to fit compilation within GitHub Actions runner memory constraints (7 GB RAM). Other GPUs will NOT work.
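As a quick way to check whether your GPU qualifies before installing, the helper below (illustrative only, not shipped with this repo) encodes the sm_86-only constraint; at runtime you can feed it `torch.cuda.get_device_capability(0)` once PyTorch is installed.

```python
# Illustrative pre-install check (not part of this repo): these wheels only
# target compute capability 8.6 (sm_86).

def is_supported(capability: tuple[int, int]) -> bool:
    """Return True only for sm_86, the sole architecture the wheels are built for."""
    return capability == (8, 6)

print(is_supported((8, 6)))  # e.g. RTX 3090 (sm_86) -> True
print(is_supported((8, 9)))  # e.g. RTX 4090 (sm_89) -> False
```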
See config/packages.yml for full configuration.
Install pre-built wheels using pip's --extra-index-url:
# Install PyTorch first (required for CUDA packages)
# For CUDA 12.9:
pip install torch --index-url https://download.pytorch.org/whl/cu129
# For CUDA 13.0:
pip install torch --index-url https://download.pytorch.org/whl/cu130
# Then install the package from this index
pip install flash-attn --extra-index-url https://DEVtheOPS.github.io/python-wheels/simple/

| Package | Version | Python | CUDA |
|---|---|---|---|
| flash-attn | 2.8.3 | 3.12, 3.13 | 12.9.1, 13.0.2 |
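The CUDA-version-to-index-URL mapping in the commands above is mechanical; a small helper (hypothetical, simply mirroring those commands) makes it explicit:

```python
# Hypothetical helper mirroring the install commands above: derive the PyTorch
# index URL ("cu129", "cu130", ...) from a CUDA version string.

def torch_index_url(cuda_version: str) -> str:
    major, minor = cuda_version.split(".")[:2]
    return f"https://download.pytorch.org/whl/cu{major}{minor}"

print(torch_index_url("12.9.1"))  # https://download.pytorch.org/whl/cu129
print(torch_index_url("13.0.2"))  # https://download.pytorch.org/whl/cu130
```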
- Go to the Actions tab
- Click "Build Wheels" workflow
- Click "Run workflow" button
- (Optional) Customize parameters:
  - Package: Leave empty to build all, or specify one (e.g., `flash-attn`)
  - Python versions: Comma-separated (default: `3.12,3.13`)
  - CUDA versions: Comma-separated (default: `12.9.1,13.0.2`)
- Click "Run workflow"
Builds trigger automatically when you:
- Push changes to `config/packages.yml` on the `main` branch
# Build all packages with defaults
gh workflow run build-wheels.yml
# Build specific package
gh workflow run build-wheels.yml -f package=flash-attn
# Build for specific Python version
gh workflow run build-wheels.yml -f python_versions=3.12
# Build for specific CUDA version
gh workflow run build-wheels.yml -f cuda_versions=13.0.2
# Combine options
gh workflow run build-wheels.yml \
-f package=flash-attn \
-f python_versions=3.12 \
-f cuda_versions=13.0.2

- Edit `config/packages.yml`:
packages:
my-package:
versions: ["1.0.0"]
build_args: "--no-build-isolation" # Optional
extra_deps: ["torch", "ninja"] # Build dependencies
test_import: "my_package" # Module name for import test
description: "My CUDA package" # Human-readable description

- Commit and push to the `main` branch (triggers an automatic build)
- Or manually trigger the workflow via the Actions UI
python-wheels/
├── .github/workflows/
│ └── build-wheels.yml # CI/CD workflow
├── config/
│ └── packages.yml # Package definitions
├── scripts/
│ ├── build_in_docker.sh # Docker build script
│ ├── generate_index.py # PyPI index generator
│ └── test_build_local.sh # Local testing script
├── tests/
│ └── test_generate_index.py # Unit tests
├── AGENTS.md # Technical documentation for AI assistants
├── CONTRIBUTING.md # Development guide
└── README.md # This file
- Matrix Generation: Workflow reads `config/packages.yml` and generates the build matrix
- Docker Build: Each combination builds in a CUDA Docker container
- Wheel Creation: `pip wheel` compiles the package with CUDA support
- Import Test: Verifies the wheel loads successfully in a clean environment
- Release: Creates GitHub release with wheels attached
- Index Generation: Generates PEP 503 index HTML
- GitHub Pages: Deploys index for pip consumption
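The matrix-generation step can be sketched as a simple product over the config, assuming `config/packages.yml` parses to a dict shaped like the example below (the real logic lives in `.github/workflows/build-wheels.yml` and may differ):

```python
# Sketch of the Matrix Generation step, under an assumed config shape.
from itertools import product

config = {"packages": {"flash-attn": {"versions": ["2.8.3"]}}}
python_versions = ["3.12", "3.13"]
cuda_versions = ["12.9.1", "13.0.2"]

# One build job per (package version, Python, CUDA) combination.
matrix = [
    {"package": name, "version": ver, "python": py, "cuda": cu}
    for name, spec in config["packages"].items()
    for ver, py, cu in product(spec["versions"], python_versions, cuda_versions)
]
print(len(matrix))  # 1 package x 1 version x 2 Pythons x 2 CUDAs = 4 jobs
```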
- Build Environment: `nvidia/cuda:{VERSION}-devel-ubuntu22.04`
- Test Environment: `nvidia/cuda:{VERSION}-runtime-ubuntu22.04`
- Python: Installed from the deadsnakes PPA
- PyTorch: Version-matched to CUDA (12.x stable, 13.x nightly)
- Index Format: PEP 503 compliant, hosted on GitHub Pages
GitHub Actions runners have limited space. The workflow includes cleanup steps, but you can:
- Build fewer packages at once using the `package` parameter
- Reduce Python/CUDA version combinations
flash-attn compilation is extremely CPU/memory intensive. The workflow runs builds sequentially (max-parallel: 1) to prevent overwhelming runners. This means:
- ⏱️ Each build takes 60-90 minutes
- ⏱️ All 4 combinations take ~4-6 hours total
- ✅ Builds complete successfully without timeouts
To speed up for testing:
- Build one package at a time using the `package` parameter
- Build one Python version at a time
- Build one CUDA version at a time
Common causes:
- Missing runtime dependencies: Package needs PyTorch + numpy installed
- CUDA version mismatch: Ensure PyTorch CUDA version matches wheel's CUDA version
- Wrong PyTorch version: CUDA 13.x requires PyTorch nightly from the `cu130` index
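For the version-mismatch case, the rule is that the wheel's CUDA major.minor must match the CUDA build of your installed PyTorch (readable at runtime as `torch.version.cuda`). A hypothetical checker for that rule:

```python
# Hypothetical compatibility check (not part of this repo): a wheel built for
# CUDA X.Y needs a PyTorch build whose CUDA major.minor matches. Compare the
# wheel's CUDA version with torch.version.cuda.

def cuda_match(torch_cuda: str, wheel_cuda: str) -> bool:
    return torch_cuda.split(".")[:2] == wheel_cuda.split(".")[:2]

print(cuda_match("12.9", "12.9.1"))  # True  - compatible
print(cuda_match("12.9", "13.0.2"))  # False - reinstall PyTorch for cu130
```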
See AGENTS.md for detailed troubleshooting.
You can test wheel builds locally before pushing to CI:
- Docker or Podman installed
- Python 3.12+ with PyYAML (`pip install pyyaml`)
# Using Docker (default)
./scripts/test_build_local.sh docker flash-attn 2.8.3 3.12 12.9.1
# Using Podman
./scripts/test_build_local.sh podman flash-attn 2.8.3 3.12 13.0.2
# With defaults (flash-attn 2.8.3, Python 3.12, CUDA 12.9.1)
./scripts/test_build_local.sh

The script will:
- ✅ Read package config from `config/packages.yml`
- ✅ Pull the CUDA Docker image
- ✅ Build the wheel in Docker
- ✅ Run import test in clean container
- ✅ Report success/failure
Wheels are written to the `./wheels/` directory.
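Downstream, the index generator has to recover the package name and tags from each wheel filename. A sketch of that PEP 427 parsing (illustrative only; the repo's actual logic is in `scripts/generate_index.py` and is covered by `tests/test_generate_index.py`):

```python
# Sketch of PEP 427 wheel-filename parsing: the kind of logic an index
# generator needs to group wheels by package. Illustrative, not the repo's
# actual implementation.

def parse_wheel_filename(filename: str) -> dict[str, str]:
    stem = filename.removesuffix(".whl")
    # Layout: name-version[-build]-python_tag-abi_tag-platform_tag
    parts = stem.split("-")
    return {
        "name": parts[0],
        "version": parts[1],
        "python_tag": parts[-3],
        "abi_tag": parts[-2],
        "platform_tag": parts[-1],
    }

print(parse_wheel_filename("flash_attn-2.8.3-cp312-cp312-linux_x86_64.whl"))
```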
# Generate index HTML from wheels directory
python scripts/generate_index.py \
--wheels-dir wheels \
--output-dir index/simple \
--base-url "https://github.com/USER/REPO/releases/download/TAG"
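A PEP 503 index is just static HTML: a root page listing normalized package names, and one page per package with an anchor per file. A minimal sketch of what `generate_index.py` has to emit (the real script's output may differ):

```python
# Minimal sketch of PEP 503 index pieces: name normalization and a per-package
# project page with one anchor per wheel. Illustrative only.
import re

def normalize(name: str) -> str:
    # PEP 503: lowercase, with runs of "-", "_", "." collapsed to "-"
    return re.sub(r"[-_.]+", "-", name).lower()

def project_page(wheel_urls: list[str]) -> str:
    anchors = "\n".join(
        f'    <a href="{url}">{url.rsplit("/", 1)[-1]}</a><br/>'
        for url in wheel_urls
    )
    return f"<!DOCTYPE html>\n<html>\n  <body>\n{anchors}\n  </body>\n</html>\n"

print(normalize("flash_attn"))  # flash-attn
```

pip requests `simple/<normalized-name>/`, which is why `flash_attn` wheels live under a `flash-attn` directory in the generated index.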
# View generated index
ls -R index/simple/

# Run all tests
python -m unittest discover -s tests -p "test_*.py"
# Run specific test
python -m unittest tests.test_generate_index.ParseWheelFilenameTests

See CONTRIBUTING.md for development guidelines.
See LICENSE file for details.
- flash-attention - The original flash-attn implementation
- xformers - Memory-efficient transformers
- pytorch - The foundation for all CUDA packages
- Issues: GitHub Issues
- Releases: GitHub Releases
- Index: GitHub Pages