This project uses bd (beads) for issue tracking. Run bd onboard to get started.
A GitHub-hosted PyPI index that builds and distributes pre-compiled Python wheels for CUDA-enabled packages (flash-attn, xformers, etc.) across multiple Python and CUDA versions. Eliminates the need for users to compile complex CUDA packages locally.
Key Architecture:
- GitHub Actions orchestrates builds using dynamic matrix strategy
- Docker containers with NVIDIA CUDA images provide build environment
- PEP 503-compliant index hosted on GitHub Pages
- Configuration-driven package definitions in config/packages.yml
bd ready # Find available work
bd show <id> # View issue details
bd update <id> --status in_progress # Claim work
bd close <id> # Complete work
bd sync # Sync with git
Run tests:
python -m unittest discover -s tests -p "test_*.py"
Run specific test:
python -m unittest tests.test_generate_index.ParseWheelFilenameTests.test_normalizes_name_and_handles_build_tag
Generate PyPI index locally:
python scripts/generate_index.py \
--wheels-dir wheels \
--output-dir index/simple \
  --base-url "https://github.com/USER/REPO/releases/download/TAG"
Build wheel locally (requires Docker or Podman):
# Using Docker (default)
./scripts/test_build_local.sh docker flash-attn 2.8.3 3.12 12.9.1
# Using Podman
./scripts/test_build_local.sh podman flash-attn 2.8.3 3.12 12.9.1
# Test with defaults (flash-attn 2.8.3, Python 3.12, CUDA 12.9.1)
./scripts/test_build_local.sh
Legacy build script (no testing):
./scripts/build_wheel.sh flash-attn 2.8.3 3.12 12.9.1
Trigger workflow manually:
- GitHub UI → Actions → "Build Wheels" → Run workflow
- Optional: Specify package, Python versions, CUDA versions
Automatic builds:
- Push changes to config/packages.yml on main branch
config/packages.yml
↓
GitHub Actions Prepare Job
↓ (generates matrix)
Build Jobs (parallel, one per combination)
↓ (Docker: nvidia/cuda:*-devel-ubuntu22.04)
Wheel Artifacts
↓
Release Job (creates GitHub release)
↓
PyPI Index Generation
↓
GitHub Pages Deployment
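The prepare job's dynamic matrix is essentially a cross product of package versions, Python versions, and CUDA versions. A minimal sketch (the actual field names and inputs in the workflow may differ):

```python
import itertools
import json

def build_matrix(packages, python_versions, cuda_versions):
    """Cross product of package versions x Python x CUDA as a GitHub Actions matrix."""
    include = []
    for name, spec in packages.items():
        for version, py, cuda in itertools.product(
            spec["versions"], python_versions, cuda_versions
        ):
            include.append(
                {"package": name, "version": version, "python": py, "cuda": cuda}
            )
    return {"include": include}

packages = {"flash-attn": {"versions": ["2.8.3"]}}
matrix = build_matrix(packages, ["3.11", "3.12"], ["12.9.1"])
# The workflow would emit this as JSON and consume it via fromJSON() in the build job.
print(json.dumps(matrix))
```

Each entry in `include` becomes one parallel build job.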
Located in config/packages.yml:
packages:
  <package-name>:
    versions: ["version1", "version2"]        # Package versions to build
    build_args: "--no-build-isolation"        # pip wheel arguments
    extra_deps: ["torch", "ninja", "psutil"]  # Build dependencies
    test_import: "module_name"                # Python import to test
    description: "Package description"        # Human-readable description
Critical Fields:
- versions: List of exact versions to build (e.g., ["2.8.3"])
- build_args: Pass flags like --no-build-isolation for complex builds
- extra_deps: Installed before building (e.g., torch for flash-attn)
- test_import: Module name for the import smoke test (defaults to the package name with - replaced by _)
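The defaulting behavior described above can be illustrated with a small helper (a hypothetical sketch for illustration, not code from this repository):

```python
def apply_defaults(name, spec):
    """Fill in the documented defaults for a package entry from config/packages.yml."""
    out = dict(spec)
    out.setdefault("build_args", "")
    out.setdefault("extra_deps", [])
    # test_import defaults to the package name with hyphens replaced by underscores
    out.setdefault("test_import", name.replace("-", "_"))
    return out

entry = apply_defaults("flash-attn", {"versions": ["2.8.3"]})
```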
Each wheel is built in isolation:
- Pull CUDA image: nvidia/cuda:{VERSION}-devel-ubuntu22.04
- Install Python from the deadsnakes PPA
- Create a virtualenv with the specific Python version
- Install build dependencies (extra_deps)
- Build wheel: pip wheel {PACKAGE}=={VERSION} {BUILD_ARGS}
- Test import in a runtime container
- Upload artifact to GitHub
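The wheel-build invocation from the steps above can be composed programmatically. A sketch (the real build scripts are shell; flag names beyond pip's own are assumptions):

```python
import shlex

def wheel_command(package, version, build_args=""):
    """Compose the `pip wheel` invocation run inside the build container."""
    cmd = ["pip", "wheel", f"{package}=={version}", "--wheel-dir", "/wheels"]
    # build_args comes from config/packages.yml as a single string of extra flags
    cmd += shlex.split(build_args)
    return cmd

cmd = wheel_command("flash-attn", "2.8.3", "--no-build-isolation")
```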
Follows PEP 503 (Simple Repository API):
index/simple/
├── index.html # Root index (lists all packages)
└── {normalized-package-name}/
└── index.html # Package index (lists all wheels)
Wheel naming: {package}-{version}-{python}-{abi}-{platform}.whl
Example: flash_attn-2.8.3-cp312-cp312-linux_x86_64.whl
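As a rough illustration of what parsing that naming scheme involves (a simplified sketch, not the repository's actual implementation in scripts/generate_index.py):

```python
import re

# PEP 427: name-version[-build]-python-abi-platform.whl
WHEEL_RE = re.compile(
    r"^(?P<name>[^-]+)-(?P<version>[^-]+)"
    r"(?:-(?P<build>\d[^-]*))?"          # optional build tag, must start with a digit
    r"-(?P<python>[^-]+)-(?P<abi>[^-]+)-(?P<platform>[^-]+)\.whl$"
)

def parse_wheel_filename(filename):
    m = WHEEL_RE.match(filename)
    if not m:
        raise ValueError(f"not a valid wheel filename: {filename}")
    return m.groupdict()

info = parse_wheel_filename("flash_attn-2.8.3-cp312-cp312-linux_x86_64.whl")
```

Note that this simplified regex does not handle names that themselves contain hyphens; in wheel filenames they are escaped to underscores, which is why the file says `flash_attn`, not `flash-attn`.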
scripts/generate_index.py:
- Parses wheel filenames (PEP 427)
- Normalizes package names (PEP 503: replace runs of [-_.] with -)
- Generates HTML index files
- Validates wheel format
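The PEP 503 name normalization mentioned above is defined in the spec itself as a one-liner:

```python
import re

def normalize(name):
    """Canonical PEP 503 normalization: collapse runs of -, _, . into a single - and lowercase."""
    return re.sub(r"[-_.]+", "-", name).lower()

canonical = normalize("Flash_Attn")  # "flash-attn"
```

This is why the index directory for flash_attn wheels is named `flash-attn`.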
scripts/build_wheel.sh:
- Input validation (package name, version, Python, CUDA)
- Docker orchestration
- Wheel output collection
- Error handling with line number reporting
1. Add entry to config/packages.yml:
packages:
  my-package:
    versions: ["1.0.0"]
    build_args: ""
    extra_deps: []
    test_import: "my_package"
    description: "My CUDA package"
2. Commit to main branch (triggers automatic build), or manually trigger workflow
3. Verify build in Actions tab
4. Check release artifacts and GitHub Pages index
Unit tests (tests/test_generate_index.py):
- Wheel filename parsing
- Package name normalization
- Index HTML generation
- End-to-end script execution
CI smoke tests:
- Import test in runtime container (validates wheel installs correctly)
- Wheel artifact upload verification
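The import smoke test boils down to running `python -c "import <module>"` in the runtime container and checking the exit code. A sketch of that check (the CI implements this in shell; the helper below is illustrative):

```python
import subprocess
import sys

def smoke_test(module):
    """Attempt to import a module in a fresh interpreter; return (success, stderr)."""
    result = subprocess.run(
        [sys.executable, "-c", f"import {module}"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stderr

ok, err = smoke_test("json")  # stdlib module used as a stand-in for the built wheel's module
```

Running the import in a subprocess rather than in-process ensures shared-library loading failures (like the undefined-symbol errors described below under troubleshooting) are caught.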
Build fails with missing CUDA:
- Ensure you are using a -devel CUDA image, not -runtime
- Check that the CUDA version matches build requirements
Import test fails:
- Verify test_import matches the actual module name
- Check extra_deps includes runtime dependencies
Wheel not found in index:
- Confirm wheel filename follows PEP 427 format
- Check that generate_index.py didn't skip it (warning in logs)
Docker pull timeout:
- Workflow uses nick-invision/retry@v3 with 3 attempts
- CUDA images are large (~5GB); may need increased timeout
Disk space exhausted during build:
- GitHub Actions runners have limited space (~14GB free)
- Workflow includes disk cleanup steps (removes dotnet, android, etc.)
- Cleans up Docker images after each build
- If still failing, reduce parallelism by building fewer versions at once
Build timeout / runner lost communication:
- Error: The hosted runner lost communication with the server or exit code 137 (OOM killed)
- Cause: flash-attn compilation is extremely CPU/memory intensive (compiles thousands of CUDA kernels)
- GitHub Actions runners: 2 CPU cores, 7GB RAM - easily overwhelmed by CUDA compilation
- Solution: Workflow limits resource usage aggressively:
  - max-parallel: 1 - only 1 build runs at a time (sequential processing)
  - MAX_JOBS=1 - single-threaded compilation (prevents memory spikes)
  - --memory=6g - Docker memory limit (prevents runaway processes)
  - TORCH_CUDA_ARCH_LIST="8.9;9.0" - only build for RTX 4000/H100 GPUs (reduced from 4 architectures to 2)
  - timeout-minutes: 120 - fail gracefully after 2 hours
- Result: Each build takes 60-90 minutes but completes successfully
- Trade-off: 4 builds take ~4-6 hours total instead of 1-2 hours parallel
- GPU compatibility: Wheels support compute capability 8.6 ONLY (RTX 3080/3090/3090 Ti; note the A100 is compute capability 8.0, not 8.6)
- Uses the FLASH_ATTENTION_CUDA_ARCHS="86" environment variable
- Building for multiple architectures causes OOM even with max-parallel: 1
- ❌ H100 (9.0), RTX 4000 (8.9), older GPUs NOT supported
- Manual trigger tip: Build one package at a time by specifying package name in workflow dispatch
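The resource limits described above translate into a container invocation along these lines (an illustrative sketch; the exact docker flags and env vars are taken from the notes above, the rest of the command is assumed):

```python
def docker_run_args(image, memory="6g", max_jobs=1, cuda_archs="86"):
    """Compose a resource-limited `docker run` command for a CUDA build."""
    return [
        "docker", "run", "--rm",
        f"--memory={memory}",                # hard cap: the container OOMs, not the runner
        "-e", f"MAX_JOBS={max_jobs}",        # single-threaded compilation avoids memory spikes
        "-e", f"FLASH_ATTENTION_CUDA_ARCHS={cuda_archs}",  # restrict target architectures
        image,
    ]

args = docker_run_args("nvidia/cuda:12.9.1-devel-ubuntu22.04")
```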
CUDA version mismatch with PyTorch:
- Error: RuntimeError: The detected CUDA version (X.X) mismatches the version that was used to compile PyTorch (Y.Y)
- Cause: PyTorch (required by flash-attn) normally enforces strict CUDA version matching
- Solution: Workflow automatically handles this
  - CUDA 12.x: Uses PyTorch stable with matching CUDA version from https://download.pytorch.org/whl/cu12X
  - CUDA 13.x: Uses PyTorch nightly with CUDA 13.0 support from https://download.pytorch.org/whl/cu130
- How it works: For CUDA 13.x, the workflow:
- Installs PyTorch nightly (2.9.0+cu130) with CUDA 13.0 support
- Patches torch.utils.cpp_extension._check_cuda_version to bypass the check
- Sets TORCH_CUDA_ARCH_LIST for compatible GPU architectures
- Builds successfully with proper CUDA 13.x runtime support
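Conceptually, the version-check bypass is a monkeypatch: replace the strict check with a no-op before building. The sketch below demonstrates the pattern on a stand-in class so it runs without torch installed; the real patch targets torch.utils.cpp_extension, and the stand-in's error message is abbreviated:

```python
# Stand-in for torch.utils.cpp_extension, so this sketch is runnable without torch.
class cpp_extension:
    @staticmethod
    def _check_cuda_version(compiler_name, compiler_version):
        raise RuntimeError("The detected CUDA version mismatches PyTorch's build version")

# The workflow's patch, conceptually: swap the strict check for a no-op.
cpp_extension._check_cuda_version = lambda *args, **kwargs: None

result = cpp_extension._check_cuda_version("nvcc", "13.0")  # no longer raises
```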
Missing dependencies (ninja, numpy, etc.):
- Error: ERROR: Could not find a version that satisfies the requirement ninja
- Cause: The PyTorch CUDA-specific index (https://download.pytorch.org/whl/cu121) only contains PyTorch packages
- Solution: Workflow separates dependency installation:
  - Installs torch from the PyTorch CUDA-specific index
  - Installs other dependencies (ninja, numpy, packaging, psutil) from default PyPI
- Implementation: scripts/build_in_docker.sh filters out torch from EXTRA_DEPS and installs it separately
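The filtering that build_in_docker.sh performs can be sketched as follows (the real script is shell; this Python version and its helper name are illustrative):

```python
def split_deps(extra_deps):
    """Separate torch (installed from the PyTorch CUDA index) from everything else (default PyPI)."""
    torch_deps = [d for d in extra_deps if d == "torch" or d.startswith("torch==")]
    pypi_deps = [d for d in extra_deps if d not in torch_deps]
    return torch_deps, pypi_deps

torch_deps, pypi_deps = split_deps(["torch", "ninja", "psutil"])
# pip install <torch_deps> --index-url https://download.pytorch.org/whl/cu121
# pip install <pypi_deps>            # from default PyPI
```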
Test import failures (undefined symbol errors):
- Error: undefined symbol: _ZN3c104cuda29c10_cuda_check_implementationEiPKcS2_ib
- Cause: flash-attn is a C++ extension that links against PyTorch's CUDA libraries at runtime
- Solution: Test step installs runtime dependencies before testing:
- Installs PyTorch (matching CUDA version) in test container
- Installs numpy
- Then installs and tests the wheel
- Why this works: PyTorch provides the shared libraries (.so files) that flash-attn needs to load at import time
CUDA 13.x support:
- Status: Fully supported via PyTorch nightly builds with CUDA 13.0 support
- PyTorch version: 2.9.0+cu130 from https://download.pytorch.org/whl/cu130
- Build process:
- Installs PyTorch nightly with CUDA 13.0 support
- Patches version check to allow compilation
- Compiles flash-attn against CUDA 13.0.2 toolkit
- Tests with same PyTorch nightly version
- For users: CUDA 13.x wheels work with PyTorch 2.9.0+cu130 or compatible versions. Install with:
pip install torch --index-url https://download.pytorch.org/whl/cu130
pip install flash-attn --extra-index-url https://USER.github.io/REPO/simple/
Required permissions in .github/workflows/build-wheels.yml:
- contents: write - Create releases and commit to gh-pages
- pages: write - Deploy to GitHub Pages
- id-token: write - GitHub Pages deployment authentication
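Collected as a workflow stanza, those permissions look like the following (the keys are standard GitHub Actions permission scopes; comments summarize the notes above):

```yaml
permissions:
  contents: write   # create releases, commit to gh-pages
  pages: write      # deploy to GitHub Pages
  id-token: write   # OIDC token for Pages deployment
```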
- Immutability: Wheel builds are reproducible (pinned versions, Docker)
- Fail-fast disabled: One package failure doesn't stop other builds
- Retention: Wheel artifacts kept for 90 days
- Release tags: Date-based (v2026-01-28)
- Base URL: Points to the GitHub release download URL for wheel files
Installation for users:
pip install flash-attn --extra-index-url https://USER.github.io/REPO/simple/
Index updates:
- Automatic on successful builds
- force_orphan: true - Keeps the gh-pages branch clean (no history)
- Deployed from the index/ directory after generation
When ending a work session, you MUST complete ALL steps below. Work is NOT complete until git push succeeds.
MANDATORY WORKFLOW:
- File issues for remaining work - Create issues for anything that needs follow-up
- Run quality gates (if code changed) - Tests, linters, builds
- Update issue status - Close finished work, update in-progress items
- PUSH TO REMOTE - This is MANDATORY:
git pull --rebase
bd sync
git push
git status # MUST show "up to date with origin"
- Clean up - Clear stashes, prune remote branches
- Verify - All changes committed AND pushed
- Hand off - Provide context for next session
CRITICAL RULES:
- Work is NOT complete until git push succeeds
- NEVER stop before pushing - that leaves work stranded locally
- NEVER say "ready to push when you are" - YOU must push
- If push fails, resolve and retry until it succeeds
IMPORTANT: This project uses bd (beads) for ALL issue tracking. Do NOT use markdown TODOs, task lists, or other tracking methods.
- Dependency-aware: Track blockers and relationships between issues
- Git-friendly: Auto-syncs to JSONL for version control
- Agent-optimized: JSON output, ready work detection, discovered-from links
- Prevents duplicate tracking systems and confusion
Check for ready work:
bd ready --json
Create new issues:
bd create "Issue title" --description="Detailed context" -t bug|feature|task -p 0-4 --json
bd create "Issue title" --description="What this issue is about" -p 1 --deps discovered-from:bd-123 --json
Claim and update:
bd update bd-42 --status in_progress --json
bd update bd-42 --priority 1 --json
Complete work:
bd close bd-42 --reason "Completed" --json
Issue types:
- bug - Something broken
- feature - New functionality
- task - Work item (tests, docs, refactoring)
- epic - Large feature with subtasks
- chore - Maintenance (dependencies, tooling)
Priorities:
- 0 - Critical (security, data loss, broken builds)
- 1 - High (major features, important bugs)
- 2 - Medium (default, nice-to-have)
- 3 - Low (polish, optimization)
- 4 - Backlog (future ideas)
- Check ready work: bd ready shows unblocked issues
- Claim your task: bd update <id> --status in_progress
- Work on it: Implement, test, document
- Discover new work? Create linked issue: bd create "Found bug" --description="Details about what was found" -p 1 --deps discovered-from:<parent-id>
- Complete: bd close <id> --reason "Done"
bd automatically syncs with git:
- Exports to .beads/issues.jsonl after changes (5s debounce)
- Imports from JSONL when newer (e.g., after git pull)
- No manual export/import needed!
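JSONL means one JSON object per line, which makes the export git-merge-friendly. A sketch of reading it (the field names in the sample are assumptions, not bd's actual schema):

```python
import json

def read_issues(jsonl_text):
    """Parse a JSONL export: one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

sample = '{"id": "bd-1", "status": "open"}\n{"id": "bd-2", "status": "closed"}\n'
issues = read_issues(sample)
```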
- ✅ Use bd for ALL task tracking
- ✅ Always use the --json flag for programmatic use
- ✅ Link discovered work with discovered-from dependencies
- ✅ Check bd ready before asking "what should I work on?"
- ❌ Do NOT create markdown TODO lists
- ❌ Do NOT use external issue trackers
- ❌ Do NOT duplicate tracking systems
For more details, see README.md and docs/QUICKSTART.md.