Add transcribe command to transcripted-cli for agent-driven audio/video transcription #1348

Open
r3dbars wants to merge 2 commits into main from claude/audio-video-transcription-37fe4b

@r3dbars r3dbars commented Jul 1, 2026

Owner

Why

Coding agents can already read saved Transcripted captures through the CLI/MCP tools, but Claude Code and Codex had no way to run the on-device Parakeet model on an arbitrary file, as in the "download a bunch of YouTube videos and transcribe them" workflow. This adds a transcribe command to transcripted-cli so agents get a plain transcript of any audio or video file AVFoundation can decode, fully local, without creating a meeting in the app.

Product Impact

  • Affects: agent artifacts
  • Lane: agent workflow
  • Why this matters: agents can now use Transcripted's local model as a general transcription utility from any shell — plain text, timestamped JSON, or SRT — reusing the models the app already downloaded, with zero cloud calls.

What changed

  • New transcripted-cli transcribe <media...> subcommand: plain text to stdout by default, --json (text + timestamped segments + timing metadata), --srt subtitles, --output for a single file, --output-dir for batch runs (one <stem>.txt|json|srt per input), --no-download to fail instead of fetching models
  • TranscribeMediaLoader.swift: decodes audio files via AVAudioFile + AVAudioConverter (streaming, anti-aliased resample to 16kHz mono, mirroring the app's AudioResampler approach); video containers (MP4/MOV/M4V) fall back to AVAssetReader with all audio tracks mixed to mono; WebM/MKV get an explicit error suggesting an ffmpeg conversion
  • TranscribeModelResolver: model lookup order is --models-dir → installed Transcripted.app bundled models → the shared FluidAudio cache the app already uses (~/Library/Application Support/FluidAudio/Models/) → one-time ~600MB download into that cache. Sub-second clips are padded to Parakeet's 1s minimum instead of surfacing a cryptic model error
  • TranscribeOutput.swift: dependency-free output formatting — caption-sized segment grouping from token timings (gap/duration/length/sentence boundaries), SRT rendering, JSON payload encoding, output-path derivation — kept out of the gated command so it compiles and tests without the deps bundle
  • Package.swift: TRANSCRIPTEDCLI_ENABLE_TRANSCRIPTION=1 (or the existing TRANSCRIPTEDCLI_ENABLE_DIARIZATION=1) links the same prebuilt FluidAudio bundle and enables both offline-audio command groups; fresh checkouts still build retrieval-only, and the ungated transcribe stub explains how to enable the real one
  • Tests: TranscribeOutputTests.swift covers format resolution, segment grouping boundaries, SRT timestamps/rendering, output-path derivation, and JSON object-vs-array encoding — all runnable without the deps bundle
  • Docs: Tools/TranscriptedCLI/CLAUDE.md command reference + gotchas, new "Transcribe Files From an Agent" section in docs/agent-connect.md, one-line updates in Tools/README.md and root CLAUDE.md
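
The caption-sized grouping described for TranscribeOutput.swift might look roughly like the sketch below. It is an illustration under stated assumptions, not the code in this PR: the `Token` and `Segment` types, the `groupIntoSegments` function, the word-level tokens joined by spaces, and the 0.8s gap threshold are all hypothetical. Only the 6s / 84-character defaults are taken from the handoff note in this description.

```swift
import Foundation

// Hypothetical token shape: a word plus its start/end time in seconds.
struct Token {
    let text: String
    let start: Double
    let end: Double
}

// Hypothetical caption segment built from consecutive tokens.
struct Segment: Equatable {
    var text: String
    var start: Double
    var end: Double
}

/// Groups tokens into caption-sized segments. A new segment starts when the
/// pause before a token exceeds `maxGap`, when adding the token would exceed
/// `maxDuration` or `maxChars`, or when the open segment already ends a sentence.
func groupIntoSegments(
    _ tokens: [Token],
    maxGap: Double = 0.8,       // assumed value, not taken from the PR
    maxDuration: Double = 6.0,  // default named in the handoff note
    maxChars: Int = 84          // default named in the handoff note
) -> [Segment] {
    var segments: [Segment] = []
    var current: Segment? = nil

    for token in tokens {
        // First token (or first after a flush) opens a new segment.
        guard let open = current else {
            current = Segment(text: token.text, start: token.start, end: token.end)
            continue
        }

        let candidate = open.text + " " + token.text
        let endsSentence = open.text.hasSuffix(".")
            || open.text.hasSuffix("?")
            || open.text.hasSuffix("!")
        let shouldBreak = token.start - open.end > maxGap
            || token.end - open.start > maxDuration
            || candidate.count > maxChars
            || endsSentence

        if shouldBreak {
            segments.append(open)
            current = Segment(text: token.text, start: token.start, end: token.end)
        } else {
            current = Segment(text: candidate, start: open.start, end: token.end)
        }
    }

    // Flush the last open segment, if any.
    if let open = current { segments.append(open) }
    return segments
}
```

Keeping this logic free of FluidAudio types is what lets it compile and be tested without the deps bundle, which matches the split the description calls out.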

How I checked it

  • scripts/dev/agent-preflight.sh (maps this diff to swift test --package-path Tools/TranscriptedCLI)
  • Selected checks from .agents/test-matrix.yml for the files changed — not run: this change was authored in a Linux container with no Swift toolchain, so swift test --package-path Tools/TranscriptedCLI needs a run on a Mac checkout before merge
  • bash build.sh --no-open — n/a, app target untouched
  • bash run-tests.sh — n/a, app target untouched
  • Performance budget — n/a
  • bash run-integration-smoke.sh — n/a
  • swift test — n/a, root Package.swift/core seam untouched
  • Manual check: on a Mac, bash build-deps.sh && cd Tools/TranscriptedCLI && TRANSCRIPTEDCLI_ENABLE_TRANSCRIPTION=1 swift build, then transcribe a WAV, an MP3, and an MP4 with audio; verify plain text, --json, --srt, --output-dir batch, and the stub error on a build without the flag

In place of a local test run, every FluidAudio API call was verified against the pinned v0.7.9 sources (AsrModels.load/download/modelsExist/defaultCacheDirectory, AsrManager.initialize/transcribe, ASRResult/TokenTiming fields), and an independent adversarial review pass was run over the new Swift files for compile and logic issues.

Risk Review

  • Privacy / local-first behavior reviewed — everything runs on-device; the only network call is FluidAudio's existing HuggingFace model download, and --no-download opts out of even that
  • Storage path or migration impact reviewed — no new storage; reuses the app's existing FluidAudio model cache location, reading from it when the models are present and populating it only on a first download
  • Public-facing copy stays concrete and matches current product scope — agent-connect section is labeled contributor/source-build for now
  • Release/update impact reviewed — none; standalone Tools package, app target untouched
  • Agent PRs stay draft until human review
  • UI changes include sanitized .agent-review/visuals/ evidence — n/a, no UI
  • No private transcripts, audio, tokens, personal paths, or customer data are included

Notes

  • Whole-file decode holds ~230MB RAM per hour of audio (16kHz mono Float32); documented in the CLI CLAUDE.md. Streaming chunked transcription is a possible follow-up for very long files.
  • WebM (YouTube's other common container) is out of AVFoundation's reach — the error message points at ffmpeg. A yt-dlp-friendly note could go in docs later.
  • Follow-up candidate: expose a transcribe_file tool in transcripted-mcp so Claude Desktop gets the same utility without shell access.
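
The ~230MB per hour figure in the first note follows directly from the decode format (16kHz, mono, 4-byte Float32 samples). A small worked calculation, where the `decodedBytes` helper is a hypothetical name used only for this illustration:

```swift
import Foundation

/// Bytes needed to hold a fully decoded clip as mono Float32 samples.
func decodedBytes(durationSeconds: Double, sampleRate: Double = 16_000) -> Int {
    // One Float32 sample per frame because the audio is mixed down to mono.
    let sampleCount = Int((durationSeconds * sampleRate).rounded(.up))
    return sampleCount * MemoryLayout<Float>.size
}

// 3,600 s x 16,000 samples/s x 4 bytes = 230,400,000 bytes, about 230MB.
let oneHourBytes = decodedBytes(durationSeconds: 3_600)
print(oneHourBytes)
```

This counts only the sample buffer itself; any extra working memory used by the decoder or the model is not included in the estimate.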

Agent handoff

COORD_DONE: BRIEF | PR: this | added transcribe command + media loader + output formatting + tests + docs | none | decide whether SRT segment defaults (6s/84 chars) need tuning | agent-preflight run; swift test not runnable in Linux container — needs a Mac run | run swift test --package-path Tools/TranscriptedCLI on a Mac and exercise the manual check list


Generated by Claude Code

claude added 2 commits July 1, 2026 22:14
…eo transcription

Coding agents (Claude Code, Codex) can now use the on-device Parakeet
model on arbitrary files — downloaded videos, voice memos, screen
recordings — without going through the app's meeting flow:

- transcripted-cli transcribe <media...> outputs plain text by default,
  with --json (timestamped segments) and --srt options, plus --output /
  --output-dir for batch runs over folders of files
- audio files decode via AVAudioFile + AVAudioConverter; video
  containers (MP4/MOV/M4V) fall back to AVAssetReader mixing all audio
  tracks to 16kHz mono; WebM/MKV errors suggest an ffmpeg conversion
- models resolve from --models-dir, then the installed Transcripted.app
  bundle, then the shared FluidAudio cache the app already uses, and
  only download (~600MB, once) as a last resort; --no-download opts out
- TRANSCRIPTEDCLI_ENABLE_TRANSCRIPTION=1 (or the existing diarization
  flag) links the shared FluidAudio deps bundle; retrieval-only builds
  keep working on fresh checkouts with a clear stub error
- output formatting (segment grouping, SRT rendering, JSON payloads,
  output paths) lives in dependency-free TranscribeOutput.swift with
  swift-test coverage that runs without the deps bundle

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0173spvy7ZyMyBYhBJj4rT1z
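
The model lookup order listed in this commit could be expressed as a small pure function. This is a sketch under assumptions, not the resolver from the PR: `ModelSource`, `ModelResolutionError`, `resolveModelSource`, and the injected `modelsExist` closure are hypothetical names, and treating an explicit `--models-dir` without models as an error (rather than falling through) is a guess.

```swift
import Foundation

// Hypothetical description of where the models will come from.
enum ModelSource: Equatable {
    case explicitDirectory(String)
    case appBundle(String)
    case sharedCache(String)
    case download(into: String)
}

// Hypothetical failure cases for the lookup.
enum ModelResolutionError: Error, Equatable {
    case explicitDirectoryMissingModels(String)
    case notFoundAndDownloadDisabled
}

/// Walks the lookup order: --models-dir, then the app bundle, then the
/// shared cache, and finally a download into that cache if allowed.
func resolveModelSource(
    modelsDir: String?,
    appBundleDir: String,
    cacheDir: String,
    allowDownload: Bool,
    modelsExist: (String) -> Bool
) throws -> ModelSource {
    if let dir = modelsDir {
        // Assumption: an explicit directory is authoritative, so a miss is an error.
        guard modelsExist(dir) else {
            throw ModelResolutionError.explicitDirectoryMissingModels(dir)
        }
        return .explicitDirectory(dir)
    }
    if modelsExist(appBundleDir) { return .appBundle(appBundleDir) }
    if modelsExist(cacheDir) { return .sharedCache(cacheDir) }

    // Mirrors --no-download: fail instead of fetching.
    guard allowDownload else {
        throw ModelResolutionError.notFoundAndDownloadDisabled
    }
    return .download(into: cacheDir)
}
```

Passing the existence check in as a closure keeps the ordering testable without touching the file system or the FluidAudio bundle.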
- batch --output-dir disambiguates inputs sharing a stem (talk.mp4 +
  talk.mov -> talk.mp4.txt / talk.mov.txt) instead of silently
  overwriting one transcript with the other
- media loading always falls through to the AVAssetReader path when the
  AVAudioFile path fails, and a total decode failure reports both
  errors instead of discarding the reader's usually-more-specific one
- the SRT minimum-caption-duration clamp no longer pushes a caption
  past the next caption's start
- rename the JSON realTimeFactor field to speedFactor since the value
  is FluidAudio's rtfx speedup (duration/processing), not conventional
  RTF
- --models-dir help and error text spell out FluidAudio's required
  parakeet-tdt-0.6b-v3-coreml folder-name layout
- declare ArgumentParser as an explicit test-target dependency
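
The stem disambiguation for `--output-dir` in the first bullet can be illustrated with a short sketch. The `outputNames` function is a hypothetical name, and the rule shown (keep the original extension only when a stem is shared) is inferred from the talk.mp4 / talk.mov example rather than taken from the PR's code.

```swift
import Foundation

/// Derives one output file name per input for a batch run. Inputs with a
/// unique stem get `<stem>.<ext>`; inputs that share a stem keep their
/// original extension in the name so they cannot overwrite each other.
func outputNames(for inputs: [String], ext: String) -> [String] {
    func fileName(_ path: String) -> String {
        URL(fileURLWithPath: path).lastPathComponent
    }
    func stem(_ path: String) -> String {
        URL(fileURLWithPath: path).deletingPathExtension().lastPathComponent
    }

    // Count how many inputs share each stem.
    var stemCounts: [String: Int] = [:]
    for input in inputs { stemCounts[stem(input), default: 0] += 1 }

    return inputs.map { input in
        let s = stem(input)
        let base = (stemCounts[s] ?? 0) > 1 ? fileName(input) : s
        return "\(base).\(ext)"
    }
}

print(outputNames(for: ["talk.mp4", "talk.mov", "memo.wav"], ext: "txt"))
```

This sketch does not handle two inputs with the same full file name in different directories; whether the PR covers that case is not stated here.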

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0173spvy7ZyMyBYhBJj4rT1z
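
The SRT minimum-duration fix in this commit amounts to clamping an extended caption end against the next caption's start. A minimal sketch, assuming a 1s minimum caption duration (the real value is not given here) and using the hypothetical helper names `srtTimestamp` and `clampedEnd`:

```swift
import Foundation

/// Formats seconds as an SRT timestamp: HH:MM:SS,mmm.
func srtTimestamp(_ seconds: Double) -> String {
    let totalMillis = Int((max(0, seconds) * 1000).rounded())
    let ms = totalMillis % 1000
    let s = (totalMillis / 1000) % 60
    let m = (totalMillis / 60_000) % 60
    let h = totalMillis / 3_600_000
    return String(format: "%02d:%02d:%02d,%03d", h, m, s, ms)
}

/// Extends a short caption to `minDuration`, but never past the start of
/// the caption that follows it, and never shortens the original caption.
func clampedEnd(
    start: Double,
    end: Double,
    nextStart: Double?,
    minDuration: Double = 1.0  // assumed value, not taken from the PR
) -> Double {
    let extended = max(end, start + minDuration)
    guard let next = nextStart else { return extended }
    return max(end, min(extended, next))
}

// A 0.2s caption at 2.0s would be stretched to 3.0s, but the next caption
// starts at 2.5s, so the end is clamped there instead.
let end = clampedEnd(start: 2.0, end: 2.2, nextStart: 2.5)
print("\(srtTimestamp(2.0)) --> \(srtTimestamp(end))")
```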
@r3dbars r3dbars marked this pull request as ready for review July 3, 2026 01:34