fix(cli): improve TTS and model download reliability by miguel-heygen · Pull Request #1187 · heygen-com/hyperframes

miguel-heygen · 2026-06-04T02:14:17Z

Summary

Add retry with exponential backoff (3 retries, 1s/2s/4s) to the shared download utility, with socket timeouts (30s), response timeouts (60s), redirect loop protection, and specific HTTP 403/429 rate-limit error messages
Add file-level locking (.lock files) to TTS model and voice downloads to prevent concurrent processes from racing on the same temp file
Increase hasPythonPackage timeout from 10s to 30s to accommodate ONNX runtime cold-start
Set maxBuffer to 10 MB on the synthesis subprocess to prevent truncation from verbose ONNX warnings
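
The retry schedule from the first item can be sketched as follows. This is a minimal illustration, not the code in download.ts: `backoffDelays`, `describeHttpFailure`, `HttpError`, and `withRetry` are hypothetical names, the sketch retries every failure for brevity, and reusing the 403 wording for HTTP 429 is an assumption.

```typescript
// Hypothetical sketch; names and structure are illustrative, not taken from download.ts.

/** Delays before each retry: base, 2*base, 4*base, ... (1s/2s/4s for 3 retries). */
function backoffDelays(retries: number, baseMs: number): number[] {
  return Array.from({ length: retries }, (_, i) => baseMs * 2 ** i);
}

/** Builds the user-facing message; 403 and 429 are treated as rate limiting. */
function describeHttpFailure(status: number): string {
  const base = `Download failed: HTTP ${status}`;
  if (status === 403 || status === 429) {
    return `${base} (rate limited). GitHub throttles unauthenticated release downloads. Retry in a moment.`;
  }
  return base;
}

/** Error that carries the HTTP status so callers can inspect it. */
class HttpError extends Error {
  status: number;
  constructor(status: number) {
    super(describeHttpFailure(status));
    this.name = "HttpError";
    this.status = status;
  }
}

/**
 * Runs `operation` once, then retries after each delay in `delaysMs`.
 * Rethrows the last error when all attempts fail.
 * `sleep` is injectable so the schedule can be exercised without waiting.
 */
async function withRetry<T>(
  operation: () => Promise<T>,
  delaysMs: number[] = backoffDelays(3, 1000),
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= delaysMs.length; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      // Wait only if another attempt remains.
      if (attempt < delaysMs.length) await sleep(delaysMs[attempt] ?? 0);
    }
  }
  throw lastError;
}
```

With the defaults, a download that keeps failing is attempted four times in total (one initial attempt plus three retries) over roughly seven seconds before the final error surfaces.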

Problem

The TTS command fails frequently at scale due to four compounding issues:

GitHub rate limiting — the 311 MB model downloads from GitHub Releases have no retry logic. HTTP 403 responses are terminal.
Download races — concurrent agent-spawned TTS calls write to the same .tmp file, causing corruption.
False negative package checks — ONNX runtime cold import exceeds the 10s timeout on constrained machines, making hasPythonPackage("kokoro_onnx") report false even when it's installed.
maxBuffer overflow — ONNX prints verbose warnings that exceed Node's 1 MB default, killing the subprocess.
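
The last two issues both come down to limits placed on a child process. The sketch below shows how a timeout and an output cap surface as distinct errors in Node's `spawnSync`. It is an illustration under stated assumptions: `runWithLimits` and `RunResult` are hypothetical names, and the actual `hasPythonPackage` and synthesis code may use a different child_process API.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical result type for this sketch.
type RunResult =
  | { ok: true; stdout: string }
  | { ok: false; reason: "timeout" | "output-too-large" | "failed" };

/**
 * Runs a command with an explicit timeout and output cap.
 * Defaults mirror the values named in this PR: 30 s and 10 MB
 * (written here as 10 * 1024 * 1024 bytes).
 */
function runWithLimits(
  command: string,
  args: string[],
  timeoutMs = 30_000,
  maxBufferBytes = 10 * 1024 * 1024,
): RunResult {
  const result = spawnSync(command, args, {
    encoding: "utf8",
    timeout: timeoutMs,
    maxBuffer: maxBufferBytes, // applies to stdout and stderr alike
  });

  const code = (result.error as { code?: string } | undefined)?.code;
  // A slow import that outlives the timeout is not the same as a missing package.
  if (code === "ETIMEDOUT") return { ok: false, reason: "timeout" };
  // The child is terminated once its output exceeds maxBuffer.
  if (code === "ENOBUFS") return { ok: false, reason: "output-too-large" };
  if (result.error || result.status !== 0) return { ok: false, reason: "failed" };
  return { ok: true, stdout: result.stdout };
}
```

Reporting `timeout` separately from `failed` is a design choice of this sketch: it would let a caller say "the import check timed out" rather than "the package is not installed".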

Reproduction

# 1. Rate-limit failure (download.ts has no retry):
# Clear the model cache to force a fresh download:
rm -rf ~/.cache/hyperframes/tts/models/

# Run tts — if GitHub rate-limits the 311 MB download, it fails immediately
# with no retry. In agent environments with 3+ concurrent tts calls,
# this is near-guaranteed at scale.
hyperframes tts "Hello world" -o /tmp/test.wav
# Error: Download failed: HTTP 403

# With fix: retries 3 times with 1s/2s/4s backoff, and prints:
# "Download failed: HTTP 403 (rate limited). GitHub throttles
#  unauthenticated release downloads. Retry in a moment."

# 2. Concurrent download race (no file locking):
# Run two tts commands simultaneously — both see model as missing,
# both start downloading to the same .tmp file:
rm -rf ~/.cache/hyperframes/tts/models/
hyperframes tts "Hello" -o /tmp/a.wav &
hyperframes tts "World" -o /tmp/b.wav &
wait
# One or both fail with corrupted model file

# With fix: second process sees .lock file, waits for first to finish

# 3. ONNX cold-start false negative (10s timeout too short):
# On a cold machine or constrained sandbox:
pip install kokoro-onnx  # installed but slow to import
hyperframes tts "test" -o /tmp/test.wav
# Error: The kokoro-onnx package is not installed
# (false negative — import timed out at 10s)

# With fix: timeout increased to 30s

Scope

The download utility (packages/cli/src/utils/download.ts) is shared by TTS, Whisper, and background-removal commands — all three benefit from the retry and timeout improvements.

Test plan

Build passes (bun run build)
All pre-commit hooks pass (lint, format, typecheck, fallow)
Download retry logic handles HTTP 403 with clear rate-limit message
File locking prevents concurrent download races
hyperframes tts "Hello" -o test.wav works end-to-end
Verify TTS failure rate drops after deploy

Three changes to make TTS (and all model downloads) more reliable:

1. download.ts: add retry with exponential backoff (3 retries), socket and response timeouts, redirect loop protection, and specific HTTP 403/429 (rate limit) error messages. This fixes the primary failure mode: GitHub Releases rate-limiting model downloads at scale.

2. tts/manager.ts: add file-level locking (.lock files) for model and voice downloads to prevent concurrent processes from racing on the same .tmp file. The lock is stale-checked after 5 minutes.

3. tts/synthesize.ts: increase hasPythonPackage timeout from 10s to 30s (ONNX runtime cold import exceeds 10s on constrained machines), and set maxBuffer to 10 MB (ONNX prints verbose warnings that can exceed the 1 MB default, killing the subprocess).

The download retry also benefits whisper (transcribe) and background-removal commands, which share the same download utility.

PostHog data: TTS failure rate hit 61% on June 2 when usage spiked 3x.
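
Redirect loop protection, listed among the download.ts changes, is not spelled out in this description. One common approach is sketched below, assuming a hop limit combined with a visited-URL set. `resolveRedirects`, `RedirectStep`, the `probe` callback, and the default of 5 hops are all hypothetical, not taken from the PR.

```typescript
// Hypothetical sketch; the PR does not state how download.ts bounds redirects.

type RedirectStep = { status: number; location?: string };

/**
 * Follows redirects reported by `probe` until a non-redirect response,
 * throwing if a URL repeats or more than `maxRedirects` hops are needed.
 */
function resolveRedirects(
  startUrl: string,
  probe: (url: string) => RedirectStep,
  maxRedirects = 5,
): string {
  const visited = new Set<string>();
  let url = startUrl;

  for (let hops = 0; hops <= maxRedirects; hops++) {
    if (visited.has(url)) throw new Error(`Redirect loop detected at ${url}`);
    visited.add(url);

    const step = probe(url);
    const location = step.location;
    const isRedirect = step.status >= 300 && step.status < 400;
    if (!isRedirect || location === undefined) return url;

    // Location may be relative, so resolve it against the current URL.
    url = new URL(location, url).toString();
  }
  throw new Error(`Too many redirects (more than ${maxRedirects})`);
}
```

The `probe` callback stands in for the real HTTP request so the control flow can be read and tested without network access.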

+      const stat = statSync(lockPath);
+      if (Date.now() - stat.mtimeMs > LOCK_STALE_MS) {
+        unlinkSync(lockPath);
+        writeFileSync(lockPath, String(process.pid), { flag: "wx" });
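
For context, the stale-lock takeover in the excerpt above can be placed in a complete acquire/release pair. This is a sketch under assumptions, not the code in tts/manager.ts: `tryAcquireLock` and `releaseLock` are hypothetical names, and the 5-minute value for `LOCK_STALE_MS` is taken from the commit message.

```typescript
import { statSync, unlinkSync, writeFileSync } from "node:fs";

const LOCK_STALE_MS = 5 * 60 * 1000; // 5 minutes, per the commit message

/** Reads the error code from a thrown fs error, if present. */
function errorCode(error: unknown): string | undefined {
  return (error as { code?: string } | null)?.code;
}

/**
 * Tries to take the lock once. Returns true if this process now holds it,
 * false if another process holds a fresh lock (the caller should wait and retry).
 */
function tryAcquireLock(lockPath: string, staleMs = LOCK_STALE_MS): boolean {
  try {
    // "wx" fails with EEXIST when the file exists, so creation is atomic.
    writeFileSync(lockPath, String(process.pid), { flag: "wx" });
    return true;
  } catch (error) {
    if (errorCode(error) !== "EEXIST") throw error;
  }

  try {
    const stat = statSync(lockPath);
    if (Date.now() - stat.mtimeMs <= staleMs) return false; // lock is still fresh
    // Stale lock: remove it and try to claim it.
    unlinkSync(lockPath);
    writeFileSync(lockPath, String(process.pid), { flag: "wx" });
    return true;
  } catch (error) {
    const code = errorCode(error);
    // ENOENT: the holder released it mid-check. EEXIST: another waiter claimed it first.
    if (code === "ENOENT" || code === "EEXIST") return false;
    throw error;
  }
}

/** Releases the lock; a missing file is not an error. */
function releaseLock(lockPath: string): void {
  try {
    unlinkSync(lockPath);
  } catch (error) {
    if (errorCode(error) !== "ENOENT") throw error;
  }
}
```

Handling `ENOENT` and `EEXIST` narrows the race but does not remove it. The stat-then-unlink sequence is still a check-then-act window: two waiters that both observe a stale lock can each unlink and recreate it in turn.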


github-advanced-security AI found potential problems Jun 4, 2026


Comment thread packages/cli/src/tts/manager.ts

miguel-heygen wants to merge 1 commit into main from worktree-fix+tts-reliability
