Add transcribe command to transcripted-cli for agent-driven audio/video transcription #1348
Open
r3dbars wants to merge 2 commits into
…eo transcription

Coding agents (Claude Code, Codex) can now use the on-device Parakeet model on arbitrary files (downloaded videos, voice memos, screen recordings) without going through the app's meeting flow:

- transcripted-cli transcribe <media...> outputs plain text by default, with --json (timestamped segments) and --srt options, plus --output / --output-dir for batch runs over folders of files
- audio files decode via AVAudioFile + AVAudioConverter; video containers (MP4/MOV/M4V) fall back to AVAssetReader mixing all audio tracks to 16kHz mono; WebM/MKV errors suggest an ffmpeg conversion
- models resolve from --models-dir, then the installed Transcripted.app bundle, then the shared FluidAudio cache the app already uses, and only download (~600MB, once) as a last resort; --no-download opts out
- TRANSCRIPTEDCLI_ENABLE_TRANSCRIPTION=1 (or the existing diarization flag) links the shared FluidAudio deps bundle; retrieval-only builds keep working on fresh checkouts with a clear stub error
- output formatting (segment grouping, SRT rendering, JSON payloads, output paths) lives in dependency-free TranscribeOutput.swift with swift-test coverage that runs without the deps bundle

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0173spvy7ZyMyBYhBJj4rT1z
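The model lookup order described in the commit message above can be pictured as a small pure function. This is only a sketch under stated assumptions: `ModelSource`, `ModelResolutionError`, `resolveModelSource`, and the injected `hasModels` predicate are hypothetical names for illustration and are not taken from the PR's resolver. The sketch also assumes an explicit directory without models falls through to the next candidate; the real command may report an error instead.

```swift
import Foundation

/// Where the models come from. All names in this sketch are hypothetical.
enum ModelSource: Equatable {
    case explicitDirectory(URL)   // --models-dir
    case appBundle(URL)           // models inside the installed Transcripted.app
    case sharedCache(URL)         // FluidAudio cache the app already uses
    case download(into: URL)      // one-time fetch into the shared cache
}

enum ModelResolutionError: Error {
    case notFoundAndDownloadDisabled   // --no-download with nothing on disk
}

/// Returns the first candidate that holds models, in priority order.
/// `hasModels` is injected so the ordering can be checked without touching disk.
func resolveModelSource(
    explicitDirectory: URL?,
    appBundleModels: URL?,
    sharedCache: URL,
    allowDownload: Bool,
    hasModels: (URL) -> Bool
) throws -> ModelSource {
    // Assumption: an explicit directory without models falls through.
    if let dir = explicitDirectory, hasModels(dir) {
        return .explicitDirectory(dir)
    }
    if let dir = appBundleModels, hasModels(dir) {
        return .appBundle(dir)
    }
    if hasModels(sharedCache) {
        return .sharedCache(sharedCache)
    }
    // Nothing on disk: download only if the caller did not opt out.
    guard allowDownload else {
        throw ModelResolutionError.notFoundAndDownloadDisabled
    }
    return .download(into: sharedCache)
}
```

Passing the existence check in as a closure keeps the ordering testable on machines with no models installed, which matters for a change that could not be run locally.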
- batch --output-dir disambiguates inputs sharing a stem (talk.mp4 + talk.mov -> talk.mp4.txt / talk.mov.txt) instead of silently overwriting one transcript with the other
- media loading always falls through to the AVAssetReader path when the AVAudioFile path fails, and a total decode failure reports both errors instead of discarding the reader's usually-more-specific one
- the SRT minimum-caption-duration clamp no longer pushes a caption past the next caption's start
- rename the JSON realTimeFactor field to speedFactor since the value is FluidAudio's rtfx speedup (duration/processing), not conventional RTF
- --models-dir help and error text spell out FluidAudio's required parakeet-tdt-0.6b-v3-coreml folder-name layout
- declare ArgumentParser as an explicit test-target dependency

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0173spvy7ZyMyBYhBJj4rT1z
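The stem disambiguation in the first bullet above (talk.mp4 + talk.mov -> talk.mp4.txt / talk.mov.txt) could be sketched roughly as follows. The function name `outputURLs(for:in:fileExtension:)` is hypothetical and this is not the PR's implementation; it only shows one way to get the naming described in the commit message.

```swift
import Foundation

/// Maps each input to an output file inside `outputDir` (hypothetical helper).
/// Inputs with a unique stem get "<stem>.<ext>"; inputs that share a stem keep
/// their full file name so the outputs cannot overwrite each other.
func outputURLs(for inputs: [URL], in outputDir: URL, fileExtension: String) -> [URL] {
    // Count how many inputs share each stem (file name without extension).
    var stemCounts: [String: Int] = [:]
    for input in inputs {
        let stem = input.deletingPathExtension().lastPathComponent
        stemCounts[stem, default: 0] += 1
    }
    return inputs.map { input in
        let stem = input.deletingPathExtension().lastPathComponent
        // Shared stem: fall back to the full name, e.g. "talk.mp4".
        let base = stemCounts[stem, default: 0] > 1 ? input.lastPathComponent : stem
        return outputDir.appendingPathComponent("\(base).\(fileExtension)")
    }
}

let inputs = [
    URL(fileURLWithPath: "/media/talk.mp4"),
    URL(fileURLWithPath: "/media/talk.mov"),
    URL(fileURLWithPath: "/media/intro.wav"),
]
let outputs = outputURLs(for: inputs, in: URL(fileURLWithPath: "/out"), fileExtension: "txt")
print(outputs.map { $0.lastPathComponent })
// ["talk.mp4.txt", "talk.mov.txt", "intro.txt"]
```

This sketch does not cover two inputs with identical file names in different folders, or stems that differ only by case on a case-insensitive volume; either would need an extra rule.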
Why
Coding agents can already read saved Transcripted captures through the CLI/MCP tools, but there was no way for Claude Code or Codex to use the on-device Parakeet model on an arbitrary file, as in the "download a bunch of YouTube videos and transcribe them" workflow. This adds a `transcribe` command to `transcripted-cli` so agents get straight-up words from any audio or video file, fully local, without creating a meeting in the app.

Product Impact
- agent artifacts
- agent workflow

What changed
- `transcripted-cli transcribe <media...>` subcommand: plain text to stdout by default, `--json` (text + timestamped segments + timing metadata), `--srt` subtitles, `--output` for a single file, `--output-dir` for batch runs (one `<stem>.txt|json|srt` per input), `--no-download` to fail instead of fetching models
- `TranscribeMediaLoader.swift`: decodes audio files via `AVAudioFile` + `AVAudioConverter` (streaming, anti-aliased resample to 16kHz mono, mirroring the app's `AudioResampler` approach); video containers (MP4/MOV/M4V) fall back to `AVAssetReader` with all audio tracks mixed to mono; WebM/MKV get an explicit error suggesting an `ffmpeg` conversion
- `TranscribeModelResolver`: model lookup order is `--models-dir` → installed `Transcripted.app` bundled models → the shared FluidAudio cache the app already uses (`~/Library/Application Support/FluidAudio/Models/`) → one-time ~600MB download into that cache. Sub-second clips are padded to Parakeet's 1s minimum instead of surfacing a cryptic model error
- `TranscribeOutput.swift`: dependency-free output formatting (caption-sized segment grouping from token timings using gap/duration/length/sentence boundaries, SRT rendering, JSON payload encoding, output-path derivation), kept out of the gated command so it compiles and tests without the deps bundle
- `Package.swift`: `TRANSCRIPTEDCLI_ENABLE_TRANSCRIPTION=1` (or the existing `TRANSCRIPTEDCLI_ENABLE_DIARIZATION=1`) links the same prebuilt FluidAudio bundle and enables both offline-audio command groups; fresh checkouts still build retrieval-only, and the ungated `transcribe` stub explains how to enable the real one
- `TranscribeOutputTests.swift` covers format resolution, segment grouping boundaries, SRT timestamps/rendering, output-path derivation, and JSON object-vs-array encoding, all runnable without the deps bundle
- `Tools/TranscriptedCLI/CLAUDE.md` command reference + gotchas, new "Transcribe Files From an Agent" section in `docs/agent-connect.md`, one-line updates in `Tools/README.md` and root `CLAUDE.md`

How I checked it
- `scripts/dev/agent-preflight.sh` (maps this diff to `swift test --package-path Tools/TranscriptedCLI`)
- `.agents/test-matrix.yml` for the files changed: not run. This change was authored in a Linux container with no Swift toolchain, so `swift test --package-path Tools/TranscriptedCLI` needs a run on a Mac checkout before merge
- `bash build.sh --no-open`: n/a, app target untouched
- `bash run-tests.sh`: n/a, app target untouched
- `bash run-integration-smoke.sh`: n/a
- `swift test`: n/a, root `Package.swift`/core seam untouched
- `bash build-deps.sh && cd Tools/TranscriptedCLI && TRANSCRIPTEDCLI_ENABLE_TRANSCRIPTION=1 swift build`, then `transcribe` a WAV, an MP3, and an MP4 with audio; verify plain text, `--json`, `--srt`, `--output-dir` batch, and the stub error on a build without the flag

In place of a local test run, every FluidAudio API call was verified against the pinned v0.7.9 sources (`AsrModels.load/download/modelsExist/defaultCacheDirectory`, `AsrManager.initialize/transcribe`, `ASRResult`/`TokenTiming` fields), and an independent adversarial review pass was run over the new Swift files for compile and logic issues.

Risk Review
- `--no-download` opts out of even that.
- agent-review/visuals/evidence: n/a, no UI

Notes
- `yt-dlp`-friendly note could go in docs later.
- `transcribe_file` tool in `transcripted-mcp` so Claude Desktop gets the same utility without shell access.

Agent handoff
COORD_DONE: BRIEF | PR: this | added transcribe command + media loader + output formatting + tests + docs | none | decide whether SRT segment defaults (6s/84 chars) need tuning | agent-preflight run; swift test not runnable in Linux container, needs a Mac run | run swift test --package-path Tools/TranscriptedCLI on a Mac and exercise the manual check list

Generated by Claude Code
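For the open question in the handoff about the SRT segment defaults (6s/84 chars), a rough sketch of how such limits shape captions may help when tuning them. Everything below is an assumption for illustration: `TimedToken`, `Caption`, `GroupingLimits`, `groupCaptions`, the 1-second gap threshold, the treatment of tokens as whole words, and the rule of always breaking after sentence-ending punctuation are hypothetical and are not read from `TranscribeOutput.swift`.

```swift
import Foundation

// Hypothetical types; the PR's own token and segment types may differ.
struct TimedToken {
    let text: String
    let start: Double   // seconds
    let end: Double     // seconds
}

struct Caption: Equatable {
    var text: String
    var start: Double
    var end: Double
}

struct GroupingLimits {
    var maxDuration = 6.0    // seconds, the default named in the handoff
    var maxCharacters = 84   // the default named in the handoff
    var maxGap = 1.0         // seconds of silence; assumed value
}

/// Groups tokens into captions, starting a new caption when the previous one
/// ended a sentence, or when adding the token would exceed a limit.
func groupCaptions(_ tokens: [TimedToken], limits: GroupingLimits = GroupingLimits()) -> [Caption] {
    var captions: [Caption] = []
    var current: Caption?
    for token in tokens {
        let word = token.text.trimmingCharacters(in: .whitespaces)
        if word.isEmpty { continue }
        guard let open = current else {
            current = Caption(text: word, start: token.start, end: token.end)
            continue
        }
        let endsSentence = open.text.hasSuffix(".") || open.text.hasSuffix("?") || open.text.hasSuffix("!")
        let tooLong = open.text.count + 1 + word.count > limits.maxCharacters
        let tooSlow = token.end - open.start > limits.maxDuration
        let silentGap = token.start - open.end > limits.maxGap
        if endsSentence || tooLong || tooSlow || silentGap {
            captions.append(open)
            current = Caption(text: word, start: token.start, end: token.end)
        } else {
            current = Caption(text: open.text + " " + word, start: open.start, end: token.end)
        }
    }
    if let open = current { captions.append(open) }
    return captions
}
```

With limits held in one struct, a reviewer can rerun the same token stream with, say, a shorter `maxDuration` and compare caption counts before deciding whether the defaults need to change.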