Add transcribe command to transcripted-cli for agent-driven audio/video transcription by r3dbars · Pull Request #1348 · r3dbars/transcripted

r3dbars · 2026-07-01T22:15:43Z

Why

Coding agents can already read saved Transcripted captures through the CLI/MCP tools, but there was no way for Claude Code or Codex to use the on-device Parakeet model on an arbitrary file — the "download a bunch of YouTube videos and transcribe them" workflow. This adds a transcribe command to transcripted-cli so agents get straight-up words from any audio or video file, fully local, without creating a meeting in the app.

Product Impact

Affects: agent artifacts
Lane: agent workflow
Why this matters: agents can now use Transcripted's local model as a general transcription utility from any shell — plain text, timestamped JSON, or SRT — reusing the models the app already downloaded, with zero cloud calls.

What changed

New transcripted-cli transcribe <media...> subcommand: plain text to stdout by default, --json (text + timestamped segments + timing metadata), --srt subtitles, --output for a single file, --output-dir for batch runs (one <stem>.txt|json|srt per input), --no-download to fail instead of fetching models
TranscribeMediaLoader.swift: decodes audio files via AVAudioFile + AVAudioConverter (streaming, anti-aliased resample to 16kHz mono, mirroring the app's AudioResampler approach); video containers (MP4/MOV/M4V) fall back to AVAssetReader with all audio tracks mixed to mono; WebM/MKV get an explicit error suggesting an ffmpeg conversion
TranscribeModelResolver: model lookup order is --models-dir → installed Transcripted.app bundled models → the shared FluidAudio cache the app already uses (~/Library/Application Support/FluidAudio/Models/) → one-time ~600MB download into that cache. Sub-second clips are padded to Parakeet's 1s minimum instead of surfacing a cryptic model error
TranscribeOutput.swift: dependency-free output formatting — caption-sized segment grouping from token timings (gap/duration/length/sentence boundaries), SRT rendering, JSON payload encoding, output-path derivation — kept out of the gated command so it compiles and tests without the deps bundle
Package.swift: TRANSCRIPTEDCLI_ENABLE_TRANSCRIPTION=1 (or the existing TRANSCRIPTEDCLI_ENABLE_DIARIZATION=1) links the same prebuilt FluidAudio bundle and enables both offline-audio command groups; fresh checkouts still build retrieval-only, and the ungated transcribe stub explains how to enable the real one
Tests: TranscribeOutputTests.swift covers format resolution, segment grouping boundaries, SRT timestamps/rendering, output-path derivation, and JSON object-vs-array encoding — all runnable without the deps bundle
Docs: Tools/TranscriptedCLI/CLAUDE.md command reference + gotchas, new "Transcribe Files From an Agent" section in docs/agent-connect.md, one-line updates in Tools/README.md and root CLAUDE.md
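
The segment grouping and SRT rendering that TranscribeOutput.swift is described as doing can be illustrated with a minimal sketch. This is not the PR's code: `TimedToken`, `Caption`, `GroupingLimits`, `groupCaptions`, `srtTimestamp`, and `renderSRT` are hypothetical names, each token is treated as a whole word for simplicity, and the 0.8 s gap threshold is an assumption. Only the 6 s / 84 character limits come from the handoff note in this PR.

```swift
import Foundation

// Hypothetical stand-in for a timed token; not FluidAudio's TokenTiming type.
struct TimedToken {
    let text: String
    let start: Double  // seconds
    let end: Double    // seconds
}

struct Caption {
    var text: String
    var start: Double
    var end: Double
}

// 6 s and 84 characters are the defaults named in the handoff note;
// the 0.8 s gap is an assumed value for illustration.
struct GroupingLimits {
    var maxDuration = 6.0
    var maxCharacters = 84
    var maxGap = 0.8
}

// Closes the open caption on a sentence end, a long pause, or when adding
// the next token would exceed the duration or length limit.
func groupCaptions(_ tokens: [TimedToken], limits: GroupingLimits = GroupingLimits()) -> [Caption] {
    var captions: [Caption] = []
    var current: Caption? = nil
    for token in tokens {
        if let open = current {
            let candidate = open.text + " " + token.text
            let endsSentence = open.text.hasSuffix(".") || open.text.hasSuffix("?") || open.text.hasSuffix("!")
            let shouldBreak = endsSentence
                || token.start - open.end > limits.maxGap
                || token.end - open.start > limits.maxDuration
                || candidate.count > limits.maxCharacters
            if shouldBreak {
                captions.append(open)
                current = Caption(text: token.text, start: token.start, end: token.end)
            } else {
                current = Caption(text: candidate, start: open.start, end: token.end)
            }
        } else {
            current = Caption(text: token.text, start: token.start, end: token.end)
        }
    }
    if let open = current { captions.append(open) }
    return captions
}

// SRT timestamps use HH:MM:SS,mmm with a comma before the milliseconds.
func srtTimestamp(_ seconds: Double) -> String {
    let totalMillis = Int((max(0, seconds) * 1000).rounded())
    let h = totalMillis / 3_600_000
    let m = (totalMillis / 60_000) % 60
    let s = (totalMillis / 1000) % 60
    let ms = totalMillis % 1000
    return String(format: "%02d:%02d:%02d,%03d", h, m, s, ms)
}

// Each SRT cue is: 1-based index, time range, text, then a blank line.
func renderSRT(_ captions: [Caption]) -> String {
    captions.enumerated().map { index, caption in
        "\(index + 1)\n\(srtTimestamp(caption.start)) --> \(srtTimestamp(caption.end))\n\(caption.text)\n"
    }.joined(separator: "\n")
}
```

Keeping this logic free of model and AVFoundation types is what lets it be unit-tested without the deps bundle, which matches the stated reason for splitting it out of the gated command.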

How I checked it

scripts/dev/agent-preflight.sh (maps this diff to swift test --package-path Tools/TranscriptedCLI)
Selected checks from .agents/test-matrix.yml for the files changed — not run: this change was authored in a Linux container with no Swift toolchain, so swift test --package-path Tools/TranscriptedCLI needs a run on a Mac checkout before merge
bash build.sh --no-open — n/a, app target untouched
bash run-tests.sh — n/a, app target untouched
Performance budget — n/a
bash run-integration-smoke.sh — n/a
swift test — n/a, root Package.swift/core seam untouched
Manual check: on a Mac, bash build-deps.sh && cd Tools/TranscriptedCLI && TRANSCRIPTEDCLI_ENABLE_TRANSCRIPTION=1 swift build, then transcribe a WAV, an MP3, and an MP4 with audio; verify plain text, --json, --srt, --output-dir batch, and the stub error on a build without the flag

In place of a local test run, every FluidAudio API call was verified against the pinned v0.7.9 sources (AsrModels.load/download/modelsExist/defaultCacheDirectory, AsrManager.initialize/transcribe, ASRResult/TokenTiming fields), and an independent adversarial review pass was run over the new Swift files for compile and logic issues.

Risk Review

Privacy / local-first behavior reviewed — everything runs on-device; the only network call is FluidAudio's existing HuggingFace model download, and --no-download opts out of even that
Storage path or migration impact reviewed — no new storage; reuses the app's existing FluidAudio model cache location read-only-or-populate
Public-facing copy stays concrete and matches current product scope — agent-connect section is labeled contributor/source-build for now
Release/update impact reviewed — none; standalone Tools package, app target untouched
Agent PRs stay draft until human review
UI changes include sanitized .agent-review/visuals/ evidence — n/a, no UI
No private transcripts, audio, tokens, personal paths, or customer data are included

Notes

Whole-file decode holds ~230MB RAM per hour of audio (16kHz mono Float32); documented in the CLI CLAUDE.md. Streaming chunked transcription is a possible follow-up for very long files.
WebM (YouTube's other common container) is out of AVFoundation's reach — the error message points at ffmpeg. A yt-dlp-friendly note could go in docs later.
Follow-up candidate: expose a transcribe_file tool in transcripted-mcp so Claude Desktop gets the same utility without shell access.
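
The ~230MB-per-hour figure above follows directly from the decode format: 16,000 samples per second, one channel, 4 bytes per Float32 sample. A small sketch of the arithmetic (`decodedBytes` is a hypothetical helper, not part of the PR):

```swift
// Bytes needed to hold fully decoded audio in memory as Float32 samples.
func decodedBytes(durationSeconds: Double, sampleRate: Double = 16_000, channels: Int = 1) -> Int {
    let bytesPerSample = MemoryLayout<Float>.size  // Float32 is 4 bytes
    return Int(durationSeconds * sampleRate) * channels * bytesPerSample
}

let oneHour = decodedBytes(durationSeconds: 3_600)
print(Double(oneHour) / 1_000_000)  // 230.4 (MB, decimal)
```

Memory grows linearly with duration, so a three-hour recording would need roughly 690MB for the sample buffer alone, which is why chunked transcription is listed as a follow-up.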

Agent handoff

COORD_DONE: BRIEF | PR: this | added transcribe command + media loader + output formatting + tests + docs | none | decide whether SRT segment defaults (6s/84 chars) need tuning | agent-preflight run; swift test not runnable in Linux container — needs a Mac run | run swift test --package-path Tools/TranscriptedCLI on a Mac and exercise the manual check list

Generated by Claude Code

…eo transcription

Coding agents (Claude Code, Codex) can now use the on-device Parakeet model on arbitrary files (downloaded videos, voice memos, screen recordings) without going through the app's meeting flow:

- transcripted-cli transcribe <media...> outputs plain text by default, with --json (timestamped segments) and --srt options, plus --output / --output-dir for batch runs over folders of files
- audio files decode via AVAudioFile + AVAudioConverter; video containers (MP4/MOV/M4V) fall back to AVAssetReader mixing all audio tracks to 16kHz mono; WebM/MKV errors suggest an ffmpeg conversion
- models resolve from --models-dir, then the installed Transcripted.app bundle, then the shared FluidAudio cache the app already uses, and only download (~600MB, once) as a last resort; --no-download opts out
- TRANSCRIPTEDCLI_ENABLE_TRANSCRIPTION=1 (or the existing diarization flag) links the shared FluidAudio deps bundle; retrieval-only builds keep working on fresh checkouts with a clear stub error
- output formatting (segment grouping, SRT rendering, JSON payloads, output paths) lives in dependency-free TranscribeOutput.swift with swift-test coverage that runs without the deps bundle

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0173spvy7ZyMyBYhBJj4rT1z
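
The model lookup order in this commit message (--models-dir, then the app bundle, then the shared cache, then a download unless --no-download is set) can be sketched as a simple first-match chain. Everything here is illustrative: `ModelSource`, `ResolveError`, and `resolveModels` are hypothetical names, the existence check is injected as a closure so the sketch does not depend on FluidAudio or the filesystem, and treating a missing explicit directory as an error (rather than falling through) is an assumption.

```swift
import Foundation

enum ModelSource: Equatable {
    case explicitDirectory(String)  // value passed via --models-dir
    case appBundle(String)          // models bundled in the installed app
    case sharedCache(String)        // cache directory the app already uses
    case download                   // last resort: fetch into the cache
}

enum ResolveError: Error, Equatable {
    case explicitDirectoryMissing(String)
    case downloadDisabled           // --no-download was passed
}

// `modelsExist` is injected so this stays independent of any real model API.
func resolveModels(
    modelsDir: String?,
    appBundleDir: String?,
    cacheDir: String,
    allowDownload: Bool,
    modelsExist: (String) -> Bool
) throws -> ModelSource {
    if let dir = modelsDir {
        // Assumption: an explicit directory without models is a hard error.
        guard modelsExist(dir) else { throw ResolveError.explicitDirectoryMissing(dir) }
        return .explicitDirectory(dir)
    }
    if let dir = appBundleDir, modelsExist(dir) { return .appBundle(dir) }
    if modelsExist(cacheDir) { return .sharedCache(cacheDir) }
    guard allowDownload else { throw ResolveError.downloadDisabled }
    return .download
}
```

Ordering the local sources before the download means a machine that already runs the app never touches the network for this command.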

- batch --output-dir disambiguates inputs sharing a stem (talk.mp4 + talk.mov -> talk.mp4.txt / talk.mov.txt) instead of silently overwriting one transcript with the other
- media loading always falls through to the AVAssetReader path when the AVAudioFile path fails, and a total decode failure reports both errors instead of discarding the reader's usually-more-specific one
- the SRT minimum-caption-duration clamp no longer pushes a caption past the next caption's start
- rename the JSON realTimeFactor field to speedFactor since the value is FluidAudio's rtfx speedup (duration/processing), not conventional RTF
- --models-dir help and error text spell out FluidAudio's required parakeet-tdt-0.6b-v3-coreml folder-name layout
- declare ArgumentParser as an explicit test-target dependency

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0173spvy7ZyMyBYhBJj4rT1z
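
The stem disambiguation described in this commit can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: `outputNames` is a hypothetical helper that counts stems across the batch and keeps the original extension in the output name only for inputs whose stem is shared.

```swift
import Foundation

// Derives one output file name per input for a batch run.
// Inputs with a unique stem become "<stem>.<ext>"; inputs that share a stem
// keep their full file name so they cannot overwrite each other.
func outputNames(for inputs: [String], format ext: String) -> [String] {
    let urls = inputs.map { URL(fileURLWithPath: $0) }
    var stemCounts: [String: Int] = [:]
    for url in urls {
        stemCounts[url.deletingPathExtension().lastPathComponent, default: 0] += 1
    }
    return urls.map { url in
        let stem = url.deletingPathExtension().lastPathComponent
        let base = stemCounts[stem, default: 0] > 1 ? url.lastPathComponent : stem
        return "\(base).\(ext)"
    }
}

print(outputNames(for: ["talk.mp4", "talk.mov", "memo.wav"], format: "txt"))
// ["talk.mp4.txt", "talk.mov.txt", "memo.txt"]
```

This sketch does not handle two inputs with the same file name in different directories; whether and how the actual command covers that case is not stated here.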

claude added 2 commits July 1, 2026 22:14

r3dbars marked this pull request as ready for review July 3, 2026 01:34

r3dbars wants to merge 2 commits into main from claude/audio-video-transcription-37fe4b
