One API to rule them all - unlimited-length TTS with automatic chunking.
Most TTS models have strict length limits:
| Model Type | Max Words | Max Chars |
|---|---|---|
| Voice Clones (VoxCPM, OpenAudio) | ~75 | ~400 |
| Neural TTS (Kokoro) | ~200 | ~1200 |
| Emotion Models (Kyutai/Moshi) | ~40 | ~250 |
| Generative (Higgs) | ~100 | ~600 |
| Cloud APIs (ElevenLabs) | ~2500 | ~15000 |
Push past these limits and quality degrades, audio cuts off, or generation fails outright.
The Community Has Been Asking
This isn't a solution looking for a problem. A quick search across Reddit shows hundreds of posts from people hitting these exact limitations:
| Community | What They're Asking |
|---|---|
| r/LocalLLaMA | "any opensource TTS without limit on character and can clone voice?" (14+ upvotes) |
| r/elevenlabs | 50+ posts about character limits and workarounds |
| r/TextToSpeech | Multiple threads on long-form audio generation |
| r/SillyTavern | TTS cutting off mid-sentence in roleplay |
| r/selfhosted | Requests for unlimited local TTS solutions |
Common pain points:
- "Text too long" errors when generating audiobooks or articles
- Voice clone quality degrading after ~75 words
- No seamless way to stitch multiple generations together
- Wanting one API that works with multiple backends
Open Unified TTS solves this by chunking text intelligently at natural boundaries (sentences, paragraphs), generating each chunk within model limits, and stitching results seamlessly with crossfade. The result: unlimited-length audio in any voice, with consistent quality throughout.
Key Features:
- Smart Chunking - Splits at sentence/paragraph boundaries, never mid-word
- Crossfade Stitching - 30-50ms overlap eliminates audio seams
- OpenAI-Compatible - Drop-in replacement for OpenAI TTS API
- Multi-Backend - Route different voices to different engines automatically
| I want to... | Use this |
|---|---|
| Just try it (no setup) | Hosted demo - Free during alpha |
| Simple web UI + API | Open TTS Studio - Most users start here |
| Raw API with multi-backend | This repo (instructions below) |
| Terminal UI | ./tui_client.py after setup |
New to this project? Start with Kokoro - 67 built-in voices, runs on CPU, no reference audio needed.
# CPU version (works anywhere)
docker run -d --name kokoro-tts -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-cpu:latest
# GPU version (faster, requires NVIDIA GPU)
docker run -d --name kokoro-tts --gpus all -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-gpu:latest

git clone https://github.com/loserbcc/open-unified-tts.git
cd open-unified-tts
pip install -r requirements.txt
python server.py

# Check health endpoint
curl http://localhost:8765/health
# Should return: {"status":"ok","backend":"kokoro",...}

# Short text - direct to MP3 (fast)
curl -X POST http://localhost:8765/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model":"tts-1","voice":"bf_emma","input":"Hello, this is a test."}' \
--output test.mp3
# Long text - auto-chunked and stitched
curl -X POST http://localhost:8765/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model":"tts-1","voice":"am_adam","input":"Your 2000 word article here..."}' \
--output audiobook.mp3

Full Kokoro Setup Guide → | Listen to Sample Output
- Watch the Intro (30 sec) - Quick overview
- Live Demo (4 min) - Chunking and stitching in action
- Kokoro Audiobook Sample - 73 seconds of seamless audio (details)
- VoxCPM 1.5 Demo - Morgan Freeman voice at 44.1kHz
- Project Explanation (audio) - Generated using this system
INPUT: 2000-word article + "morgan" voice
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 1. SMART CHUNKING │
│ │
│ Full text split at natural boundaries: │
│ • Sentence endings │
│ • Paragraph breaks │
│ • Never mid-word │
│ │
│ Chunk size based on backend profile (optimal < max) │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Chunk 1 │ │ Chunk 2 │ │ Chunk 3 │ │ Chunk N │ │
│ │ ~50 wds │ │ ~50 wds │ │ ~50 wds │ │ ~50 wds │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2. GENERATE EACH CHUNK │
│ │
│ Each chunk sent to backend (within its limits) │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Audio 1 │ │ Audio 2 │ │ Audio 3 │ │ Audio N │ │
│ │ ~5 sec │ │ ~5 sec │ │ ~5 sec │ │ ~5 sec │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 3. STITCH WITH CROSSFADE │
│ │
│ Audio chunks joined with crossfade to eliminate seams: │
│ │
│ ──────┐ │
│ ╲ ← crossfade (30-50ms) │
│ ╲──────┐ │
│ ╲ │
│ ╲────── │
│ │
│ Result: Seamless audio, indistinguishable from single gen │
└─────────────────────────────────────────────────────────────┘
│
▼
OUTPUT: Single audio file, unlimited length, consistent voice
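The chunking step is conceptually simple. Here is a minimal sketch (illustrative only; the project's chunker.py may differ in detail): split on sentence boundaries, then greedily pack whole sentences up to the backend's optimal word count.

```python
import re

def chunk_text(text: str, optimal_words: int = 150) -> list[str]:
    """Greedily pack whole sentences into chunks of roughly optimal_words words."""
    # Split on sentence-ending punctuation followed by whitespace - never mid-word.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        words = len(sentence.split())
        if current and count + words > optimal_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Note that a single sentence longer than the target stays intact as its own chunk, which is why profiles aim for an optimal size below the hard maximum.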
Without crossfade: With crossfade:
─────┐ ┌───── ─────╲ ╱─────
│ │ ← click! ╳ ← smooth
─────┘ └───── ─────╱ ╲─────
The 30-50ms crossfade eliminates audible clicks between chunks while preserving natural speech rhythm.
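A linear crossfade is enough to remove the click. A NumPy sketch of the join (illustrative; the project's stitcher.py also handles normalization and format conversion beyond this):

```python
import numpy as np

def crossfade_join(a: np.ndarray, b: np.ndarray, sample_rate: int, fade_ms: int = 40) -> np.ndarray:
    """Join two mono float chunks with a linear crossfade over the seam."""
    n = int(sample_rate * fade_ms / 1000)        # overlap length in samples
    assert len(a) >= n and len(b) >= n, "chunks must outlast the fade window"
    ramp = np.linspace(0.0, 1.0, n)              # 0 -> 1 across the overlap
    seam = a[-n:] * (1.0 - ramp) + b[:n] * ramp  # A fades out while B fades in
    return np.concatenate([a[:-n], seam, b[n:]])
```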
# Generate speech (OpenAI-compatible)
POST /v1/audio/speech
{
"model": "tts-1",
"voice": "bf_emma",
"input": "Your text here",
"response_format": "mp3" # Optional: mp3, wav, flac, opus
}
# List available voices
GET /v1/voices
# Check health and active backend
GET /health
# List available models (OpenAI-compatible)
GET /v1/models

# List backends and status
GET /v1/backends
# Set preferred backend
POST /v1/backends/switch
{"backend": "kokoro"}
# Get voice preferences (which voice uses which backend)
GET /v1/voice-prefs
# Set backend preference for specific voice
POST /v1/voice-prefs/morgan
{"backend": "voxcpm"}| Category | Voices |
|---|---|
| American Female | af_alloy, af_bella, af_heart, af_nova, af_sky, af_sarah, af_jessica, af_nicole, af_river |
| American Male | am_adam, am_echo, am_eric, am_onyx, am_michael, am_liam, am_fenrir, am_puck |
| British Female | bf_alice, bf_emma, bf_lily |
| British Male | bm_daniel, bm_fable, bm_george, bm_lewis |
| OpenAI Compatible | alloy, echo, fable, onyx, nova, shimmer |
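Any voice in the table above also works through the official openai Python SDK by overriding base_url. A sketch (assuming the server enforces no API key, matching the curl examples above; the key passed here is just a placeholder the client requires):

```python
from openai import OpenAI

# Point the official SDK at the local server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8765/v1", api_key="not-needed")

response = client.audio.speech.create(
    model="tts-1",
    voice="bf_emma",
    input="Any length of text works here - chunking happens server-side.",
)
response.write_to_file("speech.mp3")
```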
Just starting? → Kokoro (easiest setup, 67 voices, CPU-friendly)
Local-first with quality? → Qwen3-TTS (multilingual, native voices, runs on consumer GPUs)
Need voice cloning? → See BACKENDS.md for the various clone-capable backends.
Want everything? → Run multiple backends and let the router choose automatically
Note on backend preference (2026): the project deliberately stays connector-agnostic — that's the point — but the maintainer's current personal stack favors Qwen3-TTS and OmniVoice over VoxCPM for new work. The voxcpm-flavored backends still work for anyone running them, they just aren't where active polish is happening.
Need a backend that isn't here? Open a Backend Request issue and we'll wire it up — or send a PR using one of the existing adapters in adapters/ as a template.
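For orientation, an adapter is roughly a class that turns (text, voice) into audio bytes. The sketch below is purely illustrative; the real contract lives in adapters/base.py, and the existing adapters are the source of truth. The endpoint path here is hypothetical:

```python
import httpx

class MyBackendAdapter:
    """Illustrative adapter shape - see adapters/base.py for the real interface."""

    def __init__(self, base_url: str):
        self.base_url = base_url

    def synthesize(self, text: str, voice: str) -> bytes:
        # Forward one already-chunked piece of text to the backend.
        resp = httpx.post(
            f"{self.base_url}/synthesize",        # hypothetical endpoint
            json={"text": text, "voice": voice},
            timeout=120,
        )
        resp.raise_for_status()
        return resp.content                       # raw audio bytes (e.g. WAV)
```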
| Backend | Type | Voices | Best For | Setup |
|---|---|---|---|---|
| Kokoro | Neural TTS | 67 built-in | Quick start, high quality, no GPU | Guide |
| Qwen3-TTS | Neural TTS | Multilingual | Local-first, high-quality, multilingual | Requires GPU |
| Pocket-TTS | Neural TTS | 8 built-in + clones | Lightweight, laptop-class GPUs | ~4GB VRAM |
| openaudio | Voice Clone | Custom | Cloning specific voices | Requires separate setup |
| voxcpm | Voice Clone | Custom | High-quality voice cloning | Requires GPU |
| voxcpm15 | Voice Clone | 88+ pre-loaded | 44.1kHz output, lighter VRAM | Requires GPU |
| fishtts | Voice Clone | Custom | Fish Speech synthesis | Requires separate setup |
| chatterbox | Voice Clone | Custom | Emotion control | Requires separate setup |
| kyutai | Emotion | 8 emotions | Emotional expression | Requires separate setup |
| higgs | Generative | Scene-based | Creative voice generation | Requires GPU |
| maya1 | Voice Design | Custom | Emotional TTS, voice design | Requires GPU |
| vibevoice | Streaming | Microsoft | Real-time TTS | Requires separate setup |
| modelslab | Cloud | Multiple | Cloud TTS, voice cloning | API key required |
| typecast | Cloud | Custom | Emotion + prosody control | API key required |
| minimax | Cloud | Professional | Production voices | API key required |
| acestep | Musical | Singing | Music/vocals | Requires GPU |
| elevenlabs | Cloud | Many | Fallback/variety | API key required |
Note: Only Kokoro has an easy Docker setup. Other backends require manual installation. See BACKENDS.md for details.
Each backend has a profile defining its capabilities and optimal chunking strategy:
# backend_profiles.py
"kokoro": {
"max_words": 200, # Hard limit
"max_chars": 1200,
"optimal_words": 150, # Target for chunking
"needs_chunking": True,
"crossfade_ms": 30, # Stitch overlap
},
"voxcpm": {
"max_words": 75,
"max_chars": 400,
"optimal_words": 50,
"needs_chunking": True,
"crossfade_ms": 50,
},
"voxcpm15": {
"max_words": 150, # Handles longer chunks
"max_chars": 800,
"optimal_words": 100,
"needs_chunking": True,
"crossfade_ms": 50,
"sample_rate": 44100, # 2x quality of VoxCPM
}
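How these numbers might be consumed downstream (a sketch assuming a BACKEND_PROFILES dict shaped like the excerpt above; the real wiring lives in chunker.py and stitcher.py):

```python
# Hypothetical profile dict mirroring the excerpt above.
BACKEND_PROFILES = {
    "voxcpm15": {"optimal_words": 100, "crossfade_ms": 50, "sample_rate": 44100},
}

profile = BACKEND_PROFILES["voxcpm15"]
target_words = profile["optimal_words"]                   # chunker packs up to this size
sample_rate = profile.get("sample_rate", 24000)           # 24 kHz fallback is an assumption
overlap = sample_rate * profile["crossfade_ms"] // 1000   # crossfade length in samples (2205)
```

All configuration via environment variables: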
# Backend URLs
KOKORO_HOST=http://localhost:8880
OPENAUDIO_URL=http://localhost:8080
VOXCPM_URL=http://localhost:7860
VOXCPM15_HOST=http://mother:7870 # VoxCPM 1.5 (44.1kHz)
FISHTTS_URL=http://localhost:7861
KYUTAI_URL=http://localhost:8086
HIGGS_URL=http://localhost:8085
VIBEVOICE_URL=http://localhost:8087
# Cloud API keys
MINIMAX_API_KEY=your_minimax_key
ELEVENLABS_API_KEY=sk_...
# Server settings
UNIFIED_TTS_PORT=8765
UNIFIED_TTS_HOST=0.0.0.0
# Voice directory (for voice clones)
UNIFIED_TTS_VOICE_DIR=~/.unified-tts/voices

Full-featured Gradio web interface with voice organization, batch processing, and export options.
github.com/loserbcc/open-tts-studio
Simpler Gradio interface for document-to-audiobook conversion:
# Install additional dependencies
pip install gradio pypdf python-docx
# Start the web interface (after server.py is running)
python gradio_studio.py --port 7865

Features:
- Upload PDF, DOCX, or TXT files
- Edit extracted text before generation
- 67+ voices organized by category
- Multiple output formats (MP3, WAV, FLAC, Opus)
- Real-time API status monitoring
A modern terminal-based interface built with Textual:
# Install TUI dependencies
pip install textual httpx rich
# Start the server first, then run TUI
python tui_client.py

Controls:
- Ctrl+G - Generate speech from text input
- Ctrl+R - Refresh API status and voices
- Ctrl+Q - Quit
Don't want to self-host? Use our hosted API at tts.scrappylabs.ai:
- 67+ Kokoro voices + character clones (Morgan Freeman, Rick & Morty, Yoda, etc.)
- OpenAI-compatible API - drop-in replacement
- Free during alpha - no credit card required
# Just point at the hosted API instead of localhost
curl -X POST https://tts.scrappylabs.ai/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"voice":"bf_emma","input":"Hello from the cloud!"}' \
-o test.mp3

Coming Soon: API keys and MCP access tokens. Currently in dev/testing - free access while we build out authentication. Want early access? Email buddy@loser.com.
Use TTS directly from Claude Code, Cline, Cursor, or any AI agent.
Recommended (2026): wrap the API as a Claude Code Skill. Skills are
lighter-weight than MCP servers — a single Markdown file describes the
trigger, and the skill calls the OpenAI-compatible endpoint with curl
or httpx. No long-running process, no MCP wire protocol overhead, just
a prompt-time tool. The maintainer's fleet wraps this exact API as a
/speak skill (see k2-fsa/OmniVoice
and the skill examples in the Anthropic skill cookbook).
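What such a skill executes is tiny. A one-shot call it might make (sketch; the endpoint and payload come from this README, the wrapper function itself is hypothetical):

```python
import httpx

def speak(text: str, voice: str = "bf_emma", out_path: str = "speech.mp3") -> str:
    """One-shot request a /speak-style skill might run at prompt time."""
    resp = httpx.post(
        "http://localhost:8765/v1/audio/speech",
        json={"model": "tts-1", "voice": voice, "input": text},
        timeout=300,   # long texts take a while to chunk and stitch
    )
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)
    return out_path
```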
The legacy MCP path still works if you prefer it:
# Legacy MCP server (still functional, no longer the recommended pattern)
git clone https://github.com/loserbcc/open-tts-studio.git
cd open-tts-studio/mcp-server
uv sync
claude mcp add unified-tts-simple uv run python server.py

Then just ask your AI: "Read this article aloud with Emma's voice" — no API calls needed.
┌─────────────────────────────────────────────────────────────┐
│ YOUR APPLICATION │
│ (Any OpenAI TTS compatible client) │
└─────────────────────────────────────────────────────────────┘
│
│ POST /v1/audio/speech
│ {"voice": "morgan", "input": "..."}
▼
┌─────────────────────────────────────────────────────────────┐
│ OPEN UNIFIED TTS │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Router │ │ Chunker │ │ Stitcher │ │
│ │ │ │ │ │ │ │
│ │ • Backend │ │ • Smart │ │ • Crossfade │ │
│ │ selection │ │ splitting │ │ • Normalize │ │
│ │ • Failover │ │ • Profile- │ │ • Format │ │
│ │ • Voice │ │ aware │ │ convert │ │
│ │ prefs │ │ │ │ │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
│
┌───────────────┼───────────────┐
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Backend 1 │ │ Backend 2 │ │ Backend N │
│ (Kokoro) │ │ (VoxCPM) │ │ (ElevenLabs)│
│ │ │ │ │ │
│ 67 Neural │ │ Voice │ │ Cloud │
│ Voices │ │ Clones │ │ Fallback │
└─────────────┘ └─────────────┘ └─────────────┘
The server optimizes format conversion based on text length:
| Scenario | Processing | Why |
|---|---|---|
| Short text (<200 words) | Request final format (MP3) directly from backend | Efficient - no conversion needed |
| Long text (>200 words) | Generate WAV chunks → stitch → convert to final | WAV required for lossless crossfade stitching |
This means short requests are fast and efficient, while long requests maintain quality through proper audio processing.
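In outline, the dispatch might look like this (an illustrative sketch, not the server's actual internals: chunk_text is the sketch shown earlier, while stitch_with_crossfade and encode are hypothetical stand-ins):

```python
from typing import Callable

def generate(
    text: str,
    synthesize: Callable[[str, str], bytes],   # hypothetical: (chunk, fmt) -> audio bytes
    fmt: str = "mp3",
) -> bytes:
    chunks = chunk_text(text)                  # sentence-boundary chunker from earlier
    if len(chunks) == 1:
        return synthesize(chunks[0], fmt)      # short path: request final format directly
    wavs = [synthesize(c, "wav") for c in chunks]        # lossless intermediates
    return encode(stitch_with_crossfade(wavs), fmt)      # stitch, then convert once
```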
Tested benchmarks (Kokoro on CPU):
| Input | Output | Time | Hardware |
|---|---|---|---|
| 230 words | 73s audio (630KB MP3) | ~30s | AMD Ryzen + RTX 4070 (GPU unused) |
| 50 words | 15s audio | ~5s | Same |
Kokoro GPU mode is significantly faster for batch generation.
Route specific voices to specific backends for optimal quality:
# Set morgan to always use voxcpm (best quality for this clone)
curl -X POST http://localhost:8765/v1/voice-prefs/morgan \
-H "Content-Type: application/json" \
-d '{"backend": "voxcpm"}'Preferences are stored in ~/.unified-tts/voice_prefs.json.
Any TTS or audio generation model with an API can plug in as a backend. Voice cloning, emotion synthesis, even musical TTS. If it has an endpoint, it can join the party.
Because this is OpenAI TTS-compatible, it plugs directly into tools you already use - OpenWebUI, SillyTavern, or any app with OpenAI TTS support. Point them at this API, connect your backends, and you've got a production audio studio. No code changes needed.
open-unified-tts/
├── server.py # FastAPI application
├── router.py # Backend selection & failover
├── chunker.py # Smart text splitting
├── stitcher.py # Audio concatenation with crossfade
├── voices.py # Voice clone discovery
├── voice_prefs.py # Per-voice backend routing
├── backend_profiles.py # Backend capabilities
├── config.py # Environment configuration
├── gradio_studio.py # Simple web interface
├── tui_client.py # Terminal UI
├── adapters/
│ ├── base.py # Abstract backend interface
│ ├── kokoro.py # Kokoro neural TTS (67+ voices)
│ ├── openaudio.py # OpenAudio/Fish Speech S1-mini
│ ├── voxcpm.py # VoxCPM voice cloning
│ ├── voxcpm15.py # VoxCPM 1.5 (44.1kHz, 88+ voices)
│ ├── fishtts.py # FishTTS
│ ├── kyutai.py # Kyutai/Moshi emotions
│ ├── higgs.py # Higgs Audio generative
│ ├── vibevoice.py # Microsoft VibeVoice
│ ├── minimax.py # MiniMax TTS cloud
│ ├── acestep.py # ACE-Step musical TTS
│ └── elevenlabs.py # ElevenLabs cloud
├── docs/
│ ├── kokoro_setup_guide.md # Kokoro setup documentation
│ └── BACKENDS.md # Backend selection and setup guide
└── demo/
├── kokoro_audiobook_demo.mp3 # Kokoro sample output
├── kokoro_demo_info.md # Demo details
└── voxcpm15_intro_morgan.mp3 # VoxCPM 1.5 Morgan demo
Apache License 2.0 - See LICENSE file.