FLUX.2 [klein] 9B distilled image generation service following the lip-sync-v2 pattern.
Two-phase approach:
- Phase 1: Test directly on Vast AI (no Docker) - benchmark VRAM with reference images
- Phase 2: Dockerize once benchmarks confirm GPU requirements
- Create `pyproject.toml`
- Create `Makefile`
- Create `.gitignore`
- Create `imagegen/__init__.py`
- Create `imagegen/server/__init__.py`
- Create `imagegen/server/schemas.py`:
  - HealthResponse
  - GenerateRequest/Response
  - EditRequest/Response
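The request/response models could be sketched with Pydantic; the field names, bounds, and base64 transport format below are assumptions to be settled when the API contract is fixed:

```python
from typing import List, Optional

from pydantic import BaseModel, Field

class HealthResponse(BaseModel):
    status: str
    model_loaded: bool

class GenerateRequest(BaseModel):
    # Field names and bounds are assumptions; adjust to the real contract.
    prompt: str
    width: int = Field(default=1024, ge=256, le=2048)
    height: int = Field(default=1024, ge=256, le=2048)
    num_inference_steps: int = Field(default=4, ge=1, le=8)
    seed: Optional[int] = None

class GenerateResponse(BaseModel):
    image_base64: str
    seed: int

class EditRequest(BaseModel):
    prompt: str
    reference_images_base64: List[str]  # 1-4 references per the benchmark plan
    num_inference_steps: int = Field(default=4, ge=1, le=8)
    seed: Optional[int] = None

class EditResponse(BaseModel):
    image_base64: str
    seed: int
```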
- Create `scripts/benchmark_vram.py`:
  - Model loading with/without CPU offload
  - Text-to-image at various resolutions
  - Multi-reference editing tests
  - Peak VRAM logging
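A minimal helper for the peak-VRAM logging step, assuming PyTorch's CUDA memory stats (it returns None on machines without a GPU):

```python
from typing import Callable, Optional

import torch

def measure_peak_vram_gb(fn: Callable[[], None]) -> Optional[float]:
    """Run fn and return peak CUDA memory allocated, in GB (None without a GPU)."""
    if not torch.cuda.is_available():
        return None
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    fn()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1024**3
```

The benchmark script would wrap each workload in this (model load, text-to-image at each resolution, edits with 1-4 references) and log the results.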
- Create `imagegen/server/flux_pipeline.py`:
  - FluxConfig dataclass
  - FluxPipeline class with load/unload/generate/edit methods
  - CPU offload for RTX 4090 compatibility
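A possible skeleton for `flux_pipeline.py`. The diffusers import is deferred to `load()` so the module stays importable without GPU dependencies; using the generic `DiffusionPipeline` loader for this model is an assumption, so check the model card for the correct pipeline class:

```python
from dataclasses import dataclass

@dataclass
class FluxConfig:
    model_id: str = "black-forest-labs/FLUX.2-klein-9B"
    quantization: str = "none"    # none | fp8 | int8
    cpu_offload: bool = True      # needed to fit 24GB cards
    num_inference_steps: int = 4  # klein is step-distilled to 4 steps

class FluxPipeline:
    def __init__(self, config: FluxConfig):
        self.config = config
        self._pipe = None

    @property
    def loaded(self) -> bool:
        return self._pipe is not None

    def load(self) -> None:
        # Deferred imports keep this module importable without GPU deps.
        import torch
        from diffusers import DiffusionPipeline  # exact class is an assumption

        self._pipe = DiffusionPipeline.from_pretrained(
            self.config.model_id, torch_dtype=torch.bfloat16
        )
        if self.config.cpu_offload:
            self._pipe.enable_model_cpu_offload()

    def unload(self) -> None:
        self._pipe = None
```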
- Create `imagegen/server/main.py`:
  - Lifespan for model loading
  - GET /health endpoint
  - POST /generate endpoint
  - POST /edit endpoint
  - GET / root endpoint
- Create `README-VASTAI.md`: setup instructions for Vast AI
- Create `scripts/test_api.py`: API testing script
- Create `tests/conftest.py`: pytest fixtures
- Create `tests/test_schemas.py`: schema validation tests
- Create `README.md`: main project documentation
- Create `Dockerfile` based on lip-sync-v2
- Test Docker build locally
- Test on Vast AI with Docker
- Create `.github/workflows/image-gen.yml`:
  - Unit tests for schemas
  - Docker build and push to GHCR
| Variable | Required | Default | Description |
|---|---|---|---|
| `HUGGING_FACE_TOKEN` | Yes | - | HF token for gated model |
| `PRELOAD_MODELS` | No | `true` | Load model on startup |
| `QUANTIZATION` | No | `none` | Quantization mode: `none`, `fp8`, `int8` |
| `LOG_LEVEL` | No | `INFO` | Logging level |
| `HF_HOME` | No | `~/.cache/huggingface` | Model cache dir |
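The table maps naturally onto a small settings loader; `Settings` and `load_settings` are illustrative names, not part of the plan:

```python
import os
from dataclasses import dataclass

@dataclass
class Settings:
    hf_token: str
    preload_models: bool
    quantization: str
    log_level: str
    hf_home: str

def load_settings(env=os.environ) -> Settings:
    token = env.get("HUGGING_FACE_TOKEN")
    if not token:
        raise RuntimeError("HUGGING_FACE_TOKEN is required")
    quant = env.get("QUANTIZATION", "none")
    if quant not in {"none", "fp8", "int8"}:
        raise ValueError(f"invalid QUANTIZATION: {quant}")
    return Settings(
        hf_token=token,
        preload_models=env.get("PRELOAD_MODELS", "true").lower() == "true",
        quantization=quant,
        log_level=env.get("LOG_LEVEL", "INFO"),
        hf_home=env.get("HF_HOME", os.path.expanduser("~/.cache/huggingface")),
    )
```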
- Run `scripts/benchmark_vram.py` on RTX 4090
- Confirm model loads within 24GB with CPU offload
- Record max reference images supported before OOM
- Start server: `uvicorn imagegen.server.main:app --host 0.0.0.0 --port 7000`
- Test `/health` endpoint
- Test `/generate` with curl/httpie
- Test `/edit` with 1, 2, 3, 4 reference images
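`scripts/test_api.py` could start from a stdlib-only helper like this; the payload field names are assumptions that must match `schemas.py`:

```python
import json
import urllib.request

BASE_URL = "http://localhost:7000"  # assumed server address

def build_generate_payload(prompt: str, width: int = 1024,
                           height: int = 1024, steps: int = 4) -> dict:
    # Field names are assumptions; keep them in sync with schemas.py.
    return {"prompt": prompt, "width": width, "height": height,
            "num_inference_steps": steps}

def post_json(path: str, payload: dict) -> dict:
    """POST a JSON body to the running server and decode the JSON reply."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```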
- Build: `docker build -t image-gen:latest .`
- Run: `docker run --gpus all -p 7000:7000 -e HUGGING_FACE_TOKEN=xxx image-gen:latest`
- Wait for health check to pass
- Test all endpoints
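The "wait for health check" step can be a small polling helper (stdlib only; the URL and timeouts are placeholders):

```python
import time
import urllib.error
import urllib.request

def wait_for_health(url: str = "http://localhost:7000/health",
                    timeout: float = 600.0, interval: float = 5.0) -> bool:
    """Poll the health endpoint until it answers 200 or the deadline passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass
        time.sleep(interval)
    return False
```

A long default timeout matters here: first startup has to download the model weights before `/health` can go green.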
- Model: `black-forest-labs/FLUX.2-klein-9B`
  - 9B parameter rectified flow transformer
  - Step-distilled to 4 steps
  - Supports both text-to-image and multi-reference image editing
- Memory Requirements:
  - BF16 (no quantization): ~29GB VRAM, best quality
  - FP8 quantization: ~18GB VRAM, minor quality trade-off
- GPU Targets:
  - 32GB+ (A6000, etc.): BF16 without quantization (recommended)
  - 24GB (RTX 4090): FP8 quantization via TorchAO
- Reference files: `/Users/biz/Documents/projects/ScenemaAI/models/lip-sync-v2/`