fix(tts/supertonic): lower inter-chunk silence 0.3s→0.05s + add --silence flag #737
Alex-Wengg wants to merge 1 commit into
…ence flag
The 70-char chunk cap (#669) splits a paragraph into several chunks, and the 0.3s silence padded at each seam stacks on top of the model's own trailing sentence silence, inflating natural ~0.5–1.0s pauses to ~1.1–1.2s. These are the "unintended pauses" reported in #736. Measured on a 4-sentence paragraph, 0.3s→0.0s removed ~2.1s of dead air; 0.05s keeps seams from butting tokens together while letting the model's intrinsic sentence prosody through.
- defaultSilenceDuration 0.3 → 0.05
- expose silenceDuration via CLI `--silence` (was hardcoded to the default)
PocketTTS Smoke Test ✅
Runtime: 1m9s
Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. The CI VM lacks a physical GPU, so audio quality and performance may differ from Apple Silicon.
VAD Benchmark Results
Performance Comparison
Dataset Details
✅: Average F1-Score above 70%
Supertonic3 Smoke Test ✅
Runtime: 0m19s
Note: CI VMs lack a physical Neural Engine; the ANE-bucketed VectorEstimator falls back to CPU here. This validates download + variant resolution + synthesis, not ANE residency/perf.
Sortformer High-Latency Benchmark Results
ES2004a Performance (30.4s latency config)
Sortformer High-Latency • ES2004a • Runtime: 5m 28s • 2026-06-24T18:02:19.007Z
Speaker Diarization Benchmark Results
Speaker Diarization Performance
Evaluating "who spoke when" detection accuracy
Diarization Pipeline Timing Breakdown
Time spent in each stage of speaker diarization
Speaker Diarization Research Comparison
Research baselines typically achieve 18-30% DER on standard datasets
Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:
🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 66.1s diarization time • Test runtime: 3m 11s • 06/24/2026, 02:03 PM EST
Parakeet EOU Benchmark Results ✅
Status: Benchmark passed
Performance Metrics
Streaming Metrics
Test runtime: 2m31s • 06/24/2026, 02:04 PM EST
RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O
Offline VBx Pipeline Results
Speaker Diarization Performance (VBx Batch Mode)
Optimal clustering with Hungarian algorithm for maximum accuracy
Offline VBx Pipeline Timing Breakdown
Time spent in each stage of batch diarization
Speaker Diarization Research Comparison
Offline VBx achieves competitive accuracy with batch processing
Pipeline Details:
🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 97.5s processing • Test runtime: 1m 43s • 06/24/2026, 02:06 PM EST
ASR Benchmark Results ✅
Status: All benchmarks passed
Parakeet v3 (multilingual)
Parakeet v2 (English-optimized)
Streaming (v3)
Streaming (v2)
Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming
25 files per dataset • Test runtime: 7m54s • 06/24/2026, 02:08 PM EST
RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Expected RTFx Performance on Physical M1 Hardware:
• M1 Mac: ~28x (clean), ~25x (other)
Testing methodology follows HuggingFace Open ASR Leaderboard
Problem (#736)
Supertonic-3 reads short sentences naturally and fast, but full paragraphs get unintended pauses between chunks. Root cause: the 70-char chunk cap (#669) splits a paragraph into several chunks, and a fixed 0.3 s of silence is concatenated at every seam. That pad stacks on top of the model's own trailing sentence silence, inflating natural ~0.5–1.0 s sentence pauses to ~1.1–1.2 s.
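The seam behavior described above can be illustrated with a minimal sketch. This is not the project's code (which is not shown here); `join_chunks` and `ASSUMED_SAMPLE_RATE` are hypothetical names, and the sample rate is an assumption chosen only for illustration. It shows how a fixed pad inserted between chunks adds silence at every seam, independent of any trailing silence the model already produced inside each chunk.

```python
from typing import List, Sequence

ASSUMED_SAMPLE_RATE = 44_100  # assumption for illustration; not taken from the PR


def join_chunks(chunks: Sequence[Sequence[float]], silence_s: float,
                sample_rate: int = ASSUMED_SAMPLE_RATE) -> List[float]:
    """Concatenate chunk audio, inserting `silence_s` seconds of zeros at each seam."""
    pad = [0.0] * round(silence_s * sample_rate)
    out: List[float] = []
    for i, chunk in enumerate(chunks):
        if i > 0:  # pad only between chunks, never before the first or after the last
            out.extend(pad)
        out.extend(chunk)
    return out
```

With N chunks there are N - 1 pads, so the fixed dead air grows with the number of chunks the cap produces, on top of whatever trailing silence each chunk already carries.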
Change
- `Supertonic3Constants.defaultSilenceDuration`: 0.3 → 0.05
- Expose `silenceDuration` through the CLI `--silence` flag (the `tts` command was calling `synthesize` without it, so CLI users were pinned to the default)
Measured
Same 4-sentence (351-char) paragraph, ~8 chunks:
(Results table keyed by `--silence` value; rows not preserved.)
Dropping the pad removes ~2.1 s of dead air across the paragraph; remaining gaps land at sentence/comma boundaries (the chunker splits there first), so there are no mid-word cuts.
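As a rough cross-check of the ~2.1 s figure, the padded silence can be estimated from the seam count alone. The sketch below assumes exactly 8 chunks (the PR says "~8"), so 7 seams; `seam_padding_total` is a hypothetical helper, not part of the codebase.

```python
def seam_padding_total(n_chunks: int, pad_s: float) -> float:
    """Total padded silence: one pad per seam, and n chunks have n - 1 seams."""
    return max(n_chunks - 1, 0) * pad_s


N_CHUNKS = 8  # the PR says "~8 chunks"; exactly 8 is an assumption

removed_at_zero = seam_padding_total(N_CHUNKS, 0.3) - seam_padding_total(N_CHUNKS, 0.0)
removed_at_default = seam_padding_total(N_CHUNKS, 0.3) - seam_padding_total(N_CHUNKS, 0.05)
print(f"0.3 -> 0.0 removes about {removed_at_zero:.2f} s")
print(f"0.3 -> 0.05 removes about {removed_at_default:.2f} s")
```

Under that assumption, 7 seams × 0.3 s = 2.1 s, which is consistent with the measured reduction. The same arithmetic suggests the new 0.05 s default would remove about 7 × 0.25 s = 1.75 s, though that number is derived here rather than measured.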
Not addressed (intrinsic to the base model)
The 70-char cap can't be raised without a WER cost, because the base Supertone `supertonic-3` weights degrade past ~90 chars (relative-position attention trained at sentence scale: ~0% WER ≤70 chars → ~6% by ~100 chars). Supertonic-3 is a per-sentence/streaming TTS by design; for seamless long-form, PocketTTS is the better fit. This PR only fixes the avoidable silence artifact.
🤖 Generated with Claude Code