Generate lip-synced talking-head videos from a still image and an audio file. Built on Lightricks' LTX-2 v2.3 — a 22B-parameter DiT-based audio-video foundation model.
Input: image + audio clip + text prompt → Output: MP4 video with synchronized lip movements
Pirate Captain — Lip-synced speech with scene animation
| Input Image | Output Video |
|---|---|
| ![]() | pirate_scene_upscaled_2x.mp4 |
🎵 Audio input:
Prompt: "A weathered pirate captain speaking boldly, his lips moving with each word, beard shifting as his jaw opens and closes, fierce eyes narrowing, face lit by warm flickering firelight"
Generated with `--seed 77 --upscale 2` (~10 min per chunk on M5 Max 128GB)
- Two-stage pipeline — Stage 1 denoises at half resolution, a 2x spatial upscale follows, and Stage 2 refines at full resolution with a distilled LoRA
- Automatic resolution & duration — matches the source image aspect ratio and the audio length
- Memory-efficient — staged transformer loading with aggressive cleanup; runs on 128GB Apple Silicon
- MPS support — includes a forked LTX-2 as a submodule with Apple Silicon MPS fixes
- Post-process upscaling — `--upscale 2` or `--upscale 4` for higher-resolution output via ffmpeg
- Diagnostic logging — peak RSS tracking, per-stage timing, sigma schedule, run summary
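
The post-process upscaling is a plain ffmpeg pass. A minimal sketch of what such a step looks like — the helper names and exact flags are assumptions, not the actual `generate.py` code:

```python
import subprocess

def build_upscale_cmd(src: str, dst: str, factor: int) -> list[str]:
    """Build an ffmpeg command that upscales `src` by `factor` with the
    Lanczos filter while stream-copying audio. Illustrative sketch only."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-vf", f"scale=iw*{factor}:ih*{factor}:flags=lanczos",
        "-c:a", "copy",  # audio passes through unchanged
        dst,
    ]

def upscale(src: str, dst: str, factor: int = 2) -> None:
    # Raises CalledProcessError if ffmpeg fails.
    subprocess.run(build_upscale_cmd(src, dst, factor), check=True)
```

Lanczos is a reasonable default for integer upscales because it preserves edges better than bilinear at the cost of slight ringing.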
```bash
# 1. Clone with submodules (includes MPS-patched LTX-2)
git clone --recurse-submodules https://github.com/techfreakworm/LTX2.3-ImageAudioToVideo.git
cd LTX2.3-ImageAudioToVideo

# 2. Run setup (creates venv, installs dependencies)
bash setup.sh

# 3. Download models (see Models section below)

# 4. Generate
source .venv/bin/activate
python generate.py \
  --image inputs/portrait.png \
  --audio inputs/speech.wav \
  --prompt "A person speaking clearly with natural expressions" \
  --upscale 2
```

Note: If you already cloned without `--recurse-submodules`, run `git submodule update --init --recursive`.
Download and place in the `models/` directory:
| Model | Size | Download |
|---|---|---|
| LTX-2.3 Checkpoint | ~44GB | ltx-2.3-22b-dev.safetensors |
| Distilled LoRA | ~1.5GB | ltx-2.3-22b-distilled-lora-384.safetensors |
| Spatial Upscaler 2x | ~150MB | ltx-2.3-spatial-upscaler-x2-1.1.safetensors |
Gemma 3 Text Encoder — download all files from google/gemma-3-12b-it-qat-q4_0-unquantized into `models/gemma_root/`:
```bash
pip install huggingface_hub
huggingface-cli download google/gemma-3-12b-it-qat-q4_0-unquantized --local-dir models/gemma_root
```

```bash
# Default mode (~5 min on Apple Silicon, with 2x upscale)
python generate.py \
  --image inputs/portrait.png \
  --audio inputs/speech.wav \
  --prompt "A woman speaking enthusiastically with expressive gestures" \
  --upscale 2

# Without upscale (faster, lower-resolution output)
python generate.py \
  --image inputs/portrait.png \
  --audio inputs/speech.wav \
  --prompt "A person speaking clearly"

# With overrides
python generate.py \
  --image inputs/photo.png \
  --audio inputs/voice.wav \
  --prompt "A man explaining something" \
  --output outputs/my_video.mp4 \
  --seed 123 \
  --upscale 4
```

| Argument | Default | Description |
|---|---|---|
| `--image, -i` | required | Input image |
| `--audio, -a` | required | Input audio file (stereo; mono auto-converted) |
| `--prompt, -p` | required | Text prompt describing the scene |
| `--output, -o` | `outputs/gen_<timestamp>.mp4` | Output video path |
| `--config, -c` | `config.yaml` | Config file path |
| `--seed, -s` | 42 | Random seed |
| `--upscale` | none | Post-process upscale: 2 for 2x, 4 for 4x (ffmpeg lanczos) |
| `--num-frames` | auto | Frame count override (must be 8k+1) |
| `--height` | auto | Height override (snapped to a 32-multiple) |
| `--width` | auto | Width override (snapped to a 32-multiple) |
| `--steps` | from config | Stage 1 inference steps |
| `--fps` | 25.0 | Frame rate |
| `--image-strength` | 0.7 | Image conditioning (0.0–1.0). Lower = more motion |
| `--pipeline` | auto | `auto`, `two-stage`, or `single-stage` |
| `--negative-prompt` | from config | Negative prompt override |
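
The `auto` defaults follow simple arithmetic: height and width snap down to multiples of 32, and the frame count must have the form 8k+1. A minimal sketch of those constraints (helper names are assumptions, not the actual `generate.py` internals):

```python
def snap_dim(x: int) -> int:
    # Height/width must be multiples of 32; snap down, never below 32.
    return max(32, (x // 32) * 32)

def auto_num_frames(audio_seconds: float, fps: float = 25.0) -> int:
    # Frame count must be of the form 8k+1; round down to the nearest
    # valid count so the video never outruns the audio.
    n = int(audio_seconds * fps)
    return max(9, ((n - 1) // 8) * 8 + 1)
```

For example, a 5-second clip at 25 fps is nominally 125 frames, which snaps down to 121 (8×15 + 1).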
This project is optimized for Apple Silicon Macs with MPS backend. Known limitations:
| Limitation | Workaround |
|---|---|
| STG breaks output | STG is disabled (`stg_scale=0.0`). See FUTURE_IMPROVEMENTS.md |
| >10 inference steps loses lip sync | Capped at 10 steps in config |
| >409600 `max_pixels` produces garbage | Resolution capped; use `--upscale 2` for higher-res output |
These limitations are likely caused by MPS bfloat16 precision differences. See FUTURE_IMPROVEMENTS.md for investigation plans.
The LTX-2 submodule includes MPS compatibility fixes (pending upstream merge) that guard CUDA-only API calls and add MPS fallbacks.
- Encode — prompts via Gemma, audio via audio VAE, image via video encoder; each model freed after use
- Stage 1 — denoise at half resolution (N steps), then free transformer (~44GB reclaimed)
- Upsample — 2x spatial latent upscale
- Stage 2 — refine at full resolution (4 distilled steps with LoRA), then free transformer
- Decode — video VAE decode + mux original audio into MP4
Audio is "frozen" during diffusion — it conditions video generation but is not modified. The original waveform is muxed into the output for full fidelity.
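
Muxing the untouched waveform back in amounts to a single ffmpeg stream-copy. A hypothetical sketch of that final step (the helper name and flag choices are illustrative, not the project's actual code):

```python
import subprocess

def build_mux_cmd(video: str, audio: str, out: str) -> list[str]:
    """Combine the decoded (silent) video with the original audio,
    re-encoding neither stream. Illustrative sketch only."""
    return [
        "ffmpeg", "-y",
        "-i", video, "-i", audio,
        "-map", "0:v:0", "-map", "1:a:0",  # video from input 0, audio from input 1
        "-c", "copy",                      # stream copy: zero quality loss
        "-shortest",                       # stop at the shorter stream
        out,
    ]

def mux(video: str, audio: str, out: str) -> None:
    subprocess.run(build_mux_cmd(video, audio, out), check=True)
```

Because nothing is re-encoded, the output audio is bit-identical to the input waveform.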
Denoises directly at the target resolution in one pass, with no upsampling. Use via `--pipeline single-stage`.
Designed for 128GB unified memory (Apple Silicon). Key optimizations:
- Staged transformer loading — only one 22B transformer in memory at a time
- Aggressive model cleanup — `model.to("meta")` releases storage immediately after each model's job
- Resolution cap — `max_pixels` in config prevents OOM (default: 409,600)
- Sequential pipeline — each encoder freed as soon as its output is captured
Peak memory: ~65-70GB during transformer diffusion steps.
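
The staged-loading pattern can be sketched in a few lines (names are assumptions; the real code additionally moves weights to the `"meta"` device and clears the MPS cache):

```python
import gc

def run_stage(load_model, stage_fn, *inputs):
    """Load one heavy model, run its stage, then free it before the next
    stage loads — so only one 22B transformer is resident at a time."""
    model = load_model()              # e.g. the Stage 1 transformer
    try:
        return stage_fn(model, *inputs)
    finally:
        del model                     # drop the only live reference...
        gc.collect()                  # ...and reclaim memory immediately
```

The `try/finally` guarantees cleanup runs even if a stage raises, which matters when a failed run would otherwise strand ~44GB of weights in memory.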
Both configs use identical guidance values (tuned for MPS):
| Parameter | Value | Effect |
|---|---|---|
| `num_inference_steps` | 10 | Denoising steps (Stage 1) |
| `video_cfg_scale` | 3.5 | Classifier-free guidance strength |
| `video_stg_scale` | 0.0 | STG disabled (MPS limitation) |
| `video_rescale_scale` | 0.7 | Reduces oversaturation |
| `a2v_guidance_scale` | 3.0 | Audio-to-video coupling strength |
| `max_pixels` | 409600 | Resolution cap (~512x768) |
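
As a config fragment, those values look roughly like this (key names taken from the table above; the exact file layout inside config.yaml is an assumption):

```yaml
num_inference_steps: 10    # Stage 1 denoising steps
video_cfg_scale: 3.5       # classifier-free guidance strength
video_stg_scale: 0.0       # STG disabled (MPS limitation)
video_rescale_scale: 0.7   # reduces oversaturation
a2v_guidance_scale: 3.0    # audio-to-video coupling strength
max_pixels: 409600         # resolution cap (~512x768)
```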
- Write detailed, chronological descriptions in a single flowing paragraph
- Start directly with the main action
- Include specific movements, gestures, and expressions
- Describe character appearance, background, lighting, camera angle
- Mention "lip sync", "speaking", or "mouth moving" for better audio alignment
- Keep within 200 words
- Python 3.11+
- PyTorch with MPS (macOS) or CUDA support
- 128GB unified memory (Apple Silicon) recommended
- ~50GB disk space for models
- ffmpeg (for `--upscale` and video encoding)
This project wraps Lightricks' LTX-2, which has its own license. See LTX-2/LICENSE.
- Lightricks for the LTX-2 foundation model
- Google for the Gemma 3 text encoder
