Generate lip-synced talking-head videos from a still image and an audio file. Built on Lightricks' LTX-2 v2.3 — a 22B-parameter DiT-based audio-video foundation model.
Input: image + audio clip + text prompt → Output: MP4 video with synchronized lip movements
Pirate Captain — Lip-synced speech with scene animation
| Input Image | Output Video |
|---|---|
| ![]() | pirate_scene_upscaled_2x.mp4 |
🎵 Audio input:
Prompt: "A weathered pirate captain speaking boldly, his lips moving with each word, beard shifting as his jaw opens and closes, fierce eyes narrowing, face lit by warm flickering firelight"
Generated with `--seed 77 --upscale 2` (~10 min per chunk on M5 Max 128GB)
- Two-stage pipeline — Stage 1 denoises at half resolution, a 2x spatial upscale follows, and Stage 2 refines at full resolution with a distilled LoRA
- Automatic resolution & duration — matches the source image aspect ratio and the audio length
- Memory-efficient — staged transformer loading with aggressive cleanup; runs on 128GB Apple Silicon
- MPS support — includes a forked LTX-2 as a submodule with Apple Silicon MPS fixes
- Post-process upscaling — `--upscale 2` or `--upscale 4` for higher-resolution output via ffmpeg
- Diagnostic logging — peak RSS tracking, per-stage timing, sigma schedule, run summary
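
The post-process upscaling is a plain ffmpeg pass. A minimal sketch of what such a step looks like — the helper names and exact flags are assumptions, not the actual `generate.py` code:

```python
import subprocess

def build_upscale_cmd(src: str, dst: str, factor: int) -> list[str]:
    """Build an ffmpeg command that upscales `src` by `factor` with the
    Lanczos filter while stream-copying audio. Illustrative sketch only."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-vf", f"scale=iw*{factor}:ih*{factor}:flags=lanczos",
        "-c:a", "copy",  # audio passes through unchanged
        dst,
    ]

def upscale(src: str, dst: str, factor: int = 2) -> None:
    # Raises CalledProcessError if ffmpeg fails.
    subprocess.run(build_upscale_cmd(src, dst, factor), check=True)
```

Lanczos is a reasonable default for integer upscales because it preserves edges better than bilinear at the cost of slight ringing.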
```bash
# 1. Clone with submodules (includes MPS-patched LTX-2)
git clone --recurse-submodules https://github.com/techfreakworm/LTX2.3-ImageAudioToVideo.git
cd LTX2.3-ImageAudioToVideo

# 2. Run setup (creates venv, installs dependencies)
bash setup.sh

# 3. Download models (see Models section below)

# 4. Generate
source .venv/bin/activate
python generate.py \
  --image inputs/portrait.png \
  --audio inputs/speech.wav \
  --prompt "A person speaking clearly with natural expressions" \
  --upscale 2
```

Note: If you already cloned without `--recurse-submodules`, run `git submodule update --init --recursive`.
Download and place in the `models/` directory:
| Model | Size | Download |
|---|---|---|
| LTX-2.3 Checkpoint | ~44GB | ltx-2.3-22b-dev.safetensors |
| Distilled LoRA | ~1.5GB | ltx-2.3-22b-distilled-lora-384.safetensors |
| Spatial Upscaler 2x | ~150MB | ltx-2.3-spatial-upscaler-x2-1.1.safetensors |
Gemma 3 Text Encoder — download all files from google/gemma-3-12b-it-qat-q4_0-unquantized into `models/gemma_root/`:
```bash
pip install huggingface_hub
huggingface-cli download google/gemma-3-12b-it-qat-q4_0-unquantized --local-dir models/gemma_root
```

```bash
# Default mode (~5 min on Apple Silicon, with 2x upscale)
python generate.py \
  --image inputs/portrait.png \
  --audio inputs/speech.wav \
  --prompt "A woman speaking enthusiastically with expressive gestures" \
  --upscale 2

# Without upscale (faster, lower-resolution output)
python generate.py \
  --image inputs/portrait.png \
  --audio inputs/speech.wav \
  --prompt "A person speaking clearly"

# With overrides
python generate.py \
  --image inputs/photo.png \
  --audio inputs/voice.wav \
  --prompt "A man explaining something" \
  --output outputs/my_video.mp4 \
  --seed 123 \
  --upscale 4
```

| Argument | Default | Description |
|---|---|---|
| `--image, -i` | required | Input image |
| `--audio, -a` | required | Input audio file (stereo; mono auto-converted) |
| `--prompt, -p` | required | Text prompt describing the scene |
| `--output, -o` | `outputs/gen_<timestamp>.mp4` | Output video path |
| `--config, -c` | `config.yaml` | Config file path |
| `--seed, -s` | 42 | Random seed |
| `--upscale` | none | Post-process upscale: 2 for 2x, 4 for 4x (ffmpeg lanczos) |
| `--num-frames` | auto | Frame count override (must be 8k+1) |
| `--height` | auto | Height override (snapped to a 32-multiple) |
| `--width` | auto | Width override (snapped to a 32-multiple) |
| `--steps` | from config | Stage 1 inference steps |
| `--fps` | 25.0 | Frame rate |
| `--image-strength` | 0.7 | Image conditioning (0.0–1.0). Lower = more motion |
| `--pipeline` | auto | `auto`, `two-stage`, or `single-stage` |
| `--negative-prompt` | from config | Negative prompt override |
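
The `auto` defaults follow simple arithmetic: height and width snap down to multiples of 32, and the frame count must have the form 8k+1. A minimal sketch of those constraints (helper names are assumptions, not the actual `generate.py` internals):

```python
def snap_dim(x: int) -> int:
    # Height/width must be multiples of 32; snap down, never below 32.
    return max(32, (x // 32) * 32)

def auto_num_frames(audio_seconds: float, fps: float = 25.0) -> int:
    # Frame count must be of the form 8k+1; round down to the nearest
    # valid count so the video never outruns the audio.
    n = int(audio_seconds * fps)
    return max(9, ((n - 1) // 8) * 8 + 1)
```

For example, a 5-second clip at 25 fps is nominally 125 frames, which snaps down to 121 (8×15 + 1).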
This project is optimized for Apple Silicon Macs with MPS backend. Known limitations:
| Limitation | Workaround |
|---|---|
| STG breaks output | STG is disabled (`stg_scale=0.0`). See FUTURE_IMPROVEMENTS.md |
| >10 inference steps loses lip sync | Capped at 10 steps in config |
| >409600 `max_pixels` produces garbage | Resolution capped; use `--upscale 2` for higher-res output |
These limitations are likely caused by MPS bfloat16 precision differences. See FUTURE_IMPROVEMENTS.md for investigation plans.
The LTX-2 submodule includes MPS compatibility fixes (pending upstream merge) that guard CUDA-only API calls and add MPS fallbacks.
- Encode — prompts via Gemma, audio via audio VAE, image via video encoder; each model freed after use
- Stage 1 — denoise at half resolution (N steps), then free transformer (~44GB reclaimed)
- Upsample — 2x spatial latent upscale
- Stage 2 — refine at full resolution (4 distilled steps with LoRA), then free transformer
- Decode — video VAE decode + mux original audio into MP4
Audio is "frozen" during diffusion — it conditions video generation but is not modified. The original waveform is muxed into the output for full fidelity.
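
Muxing the untouched waveform back in amounts to a single ffmpeg stream-copy. A hypothetical sketch of that final step (the helper name and flag choices are illustrative, not the project's actual code):

```python
import subprocess

def build_mux_cmd(video: str, audio: str, out: str) -> list[str]:
    """Combine the decoded (silent) video with the original audio,
    re-encoding neither stream. Illustrative sketch only."""
    return [
        "ffmpeg", "-y",
        "-i", video, "-i", audio,
        "-map", "0:v:0", "-map", "1:a:0",  # video from input 0, audio from input 1
        "-c", "copy",                      # stream copy: zero quality loss
        "-shortest",                       # stop at the shorter stream
        out,
    ]

def mux(video: str, audio: str, out: str) -> None:
    subprocess.run(build_mux_cmd(video, audio, out), check=True)
```

Because nothing is re-encoded, the output audio is bit-identical to the input waveform.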
Denoises directly at the target resolution in one pass, with no upsampling. Use via `--pipeline single-stage`.
Designed for 128GB unified memory (Apple Silicon). Key optimizations:
- Staged transformer loading — only one 22B transformer in memory at a time
- Aggressive model cleanup — `model.to("meta")` releases storage immediately after each model's job
- Resolution cap — `max_pixels` in config prevents OOM (default: 409,600)
- Sequential pipeline — each encoder freed as soon as its output is captured
Peak memory: ~65-70GB during transformer diffusion steps.
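
The staged-loading pattern can be sketched in a few lines (names are assumptions; the real code additionally moves weights to the `"meta"` device and clears the MPS cache):

```python
import gc

def run_stage(load_model, stage_fn, *inputs):
    """Load one heavy model, run its stage, then free it before the next
    stage loads — so only one 22B transformer is resident at a time."""
    model = load_model()              # e.g. the Stage 1 transformer
    try:
        return stage_fn(model, *inputs)
    finally:
        del model                     # drop the only live reference...
        gc.collect()                  # ...and reclaim memory immediately
```

The `try/finally` guarantees cleanup runs even if a stage raises, which matters when a failed run would otherwise strand ~44GB of weights in memory.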
Both configs use identical guidance values (tuned for MPS):
| Parameter | Value | Effect |
|---|---|---|
| `num_inference_steps` | 10 | Denoising steps (Stage 1) |
| `video_cfg_scale` | 3.5 | Classifier-free guidance strength |
| `video_stg_scale` | 0.0 | STG disabled (MPS limitation) |
| `video_rescale_scale` | 0.7 | Reduces oversaturation |
| `a2v_guidance_scale` | 3.0 | Audio-to-video coupling strength |
| `max_pixels` | 409600 | Resolution cap (~512x768) |
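
As a config fragment, those values look roughly like this (key names taken from the table above; the exact file layout inside config.yaml is an assumption):

```yaml
num_inference_steps: 10    # Stage 1 denoising steps
video_cfg_scale: 3.5       # classifier-free guidance strength
video_stg_scale: 0.0       # STG disabled (MPS limitation)
video_rescale_scale: 0.7   # reduces oversaturation
a2v_guidance_scale: 3.0    # audio-to-video coupling strength
max_pixels: 409600         # resolution cap (~512x768)
```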
- Write detailed, chronological descriptions in a single flowing paragraph
- Start directly with the main action
- Include specific movements, gestures, and expressions
- Describe character appearance, background, lighting, camera angle
- Mention "lip sync", "speaking", or "mouth moving" for better audio alignment
- Keep within 200 words
- Python 3.11+
- PyTorch with MPS (macOS) or CUDA support
- 128GB unified memory (Apple Silicon) recommended
- ~50GB disk space for models
- ffmpeg (for `--upscale` and video encoding)
This project wraps Lightricks' LTX-2, which has its own license. See LTX-2/LICENSE.
- Lightricks for the LTX-2 foundation model
- Google for the Gemma 3 text encoder
