lulucatdev/openstt

OpenSTT

Local-first speech-to-text hub with multiple local engines.

A native macOS app that unifies Whisper and MLX models behind a single OpenAI-compatible API endpoint — plus system-wide dictation with a global hotkey.

Requirements: macOS on Apple Silicon (M1 / M2 / M3 / M4). Intel Macs are not supported.

Features

  • Multiple engines — Local Whisper (whisper.cpp with Metal), local MLX models, plus cloud options (ElevenLabs, Soniox)
  • OpenAI-compatible API — POST /v1/audio/transcriptions on localhost, drop-in replacement
  • System-wide dictation — Hold a global shortcut to record, release to transcribe, auto-paste into any app
    • Real-time tray timer showing listening & processing phases
    • Support for real-time streaming models (ElevenLabs Scribe V2 Realtime, Soniox STT-RT v4)
  • Model management — Download, switch, and delete models from the GUI with concurrent download support
  • Playground — Built-in record-and-transcribe for quick testing
  • Auto-updater — Automatic update checks on startup with one-click install (uses Sparkle to preserve macOS permissions)
  • Onboarding flow — First-launch setup wizard with permissions check and standalone Python auto-download
  • Real-time status — Overview page with MLX runtime status and sidebar indicators
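To illustrate the OpenAI-compatible endpoint listed above, here is a minimal client sketch using only the Python standard library. The `model` and `file` form fields follow OpenAI's transcription API. The port in `BASE_URL` is a hypothetical placeholder (this README does not state one), and `build_transcription_request` is a name introduced here for illustration, not part of OpenSTT.

```python
import io
import urllib.request
import uuid

# Assumption: hypothetical address; replace with the one shown in the OpenSTT app.
BASE_URL = "http://127.0.0.1:8787"


def build_transcription_request(audio: bytes, filename: str, model: str,
                                base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a multipart/form-data POST for /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    body = io.BytesIO()
    # "model" and "file" are the form fields used by OpenAI's transcription API.
    body.write(f"--{boundary}\r\n".encode())
    body.write(b'Content-Disposition: form-data; name="model"\r\n\r\n')
    body.write(model.encode() + b"\r\n")
    body.write(f"--{boundary}\r\n".encode())
    body.write(
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode()
    )
    body.write(b"Content-Type: application/octet-stream\r\n\r\n")
    body.write(audio + b"\r\n")
    body.write(f"--{boundary}--\r\n".encode())
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/transcriptions",
        data=body.getvalue(),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` sends it. If the server mirrors OpenAI's default response format, the transcript should be in the `text` field of the JSON body.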

Tech Stack

  • Frontend: React + TypeScript + Vite
  • Backend: Rust + Tauri v2
  • STT: whisper.cpp (via whisper-rs with Metal), MLX Audio sidecar
  • Updater: Sparkle (tauri-plugin-sparkle-updater) for atomic DMG-based updates that preserve code signature
  • Platform: macOS Apple Silicon only

Development

# Install dependencies
npm install

# Run in development mode
npm run tauri dev

# Build for production
npm run tauri build

Supported Models

Local Models

  • Whisper MLX — 12 models (tiny to large-v3) running locally with Metal acceleration
  • Whisper.cpp — Local Whisper implementation via whisper-rs

Cloud Models

  • ElevenLabs — Scribe V2, Scribe V2 Flash, Scribe V2 Realtime
  • Soniox — STT-RT v4 realtime dictation

Known Issues

ElevenLabs Scribe V2 Realtime — text injection unreliable (critical)

The elevenlabs:scribe_v2_realtime model streams partial_transcript and committed_transcript over WebSocket. The client must inject these into the focused app in real time.

Problem: macOS CGEvent keyboard events (backspace + set_string) are unreliable at high frequency. The OS and/or the active IME silently drop events, so characters are lost or garbled. The ElevenLabs API also frequently revises partial transcripts rather than only appending to them, which requires deleting previously typed text.

Approaches tried (all insufficient):

| Approach | Result |
| --- | --- |
| Full-draft rewrite (backspace N + type N) | Events dropped at high frequency → garbled text |
| Incremental diff (common-prefix, only touch suffix) | Helps for monotonic growth, but API revisions still cause large deletions |
| Clipboard paste (Cmd+V) for insertion | Insertion is reliable, but backspace deletion still drops events |
| Commit-only mode (skip partials, paste on commit) | Text is correct but loses real-time capability; there is also a timing issue where the typing task exits before the final flush arrives |
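The incremental diff approach can be sketched as a pure function that compares the text already typed with the newest partial transcript and returns how many backspaces to send and which suffix to type. This is an illustrative sketch, not the project's implementation. `plan_edit` is a hypothetical name, and it counts Python code points, which may differ from what a single backspace deletes in a real text field.

```python
from typing import Tuple


def plan_edit(typed: str, target: str) -> Tuple[int, str]:
    """Return (backspaces, suffix) that turns `typed` into `target`."""
    prefix = 0
    limit = min(len(typed), len(target))
    # Walk forward while both strings agree; this is the common prefix.
    while prefix < limit and typed[prefix] == target[prefix]:
        prefix += 1
    # Delete everything after the common prefix, then type the new tail.
    return len(typed) - prefix, target[prefix:]


# Monotonic growth: nothing to delete, only the new tail is typed.
print(plan_edit("hello wor", "hello world"))  # (0, 'ld')
# Revision at the start: the whole draft must be deleted and retyped.
print(plan_edit("I scream", "ice cream"))     # (8, 'ice cream')
```

The second call shows why this approach is insufficient on its own: a revision near the start of the partial transcript still produces a long run of backspaces.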

Current state: Commit-only mode is implemented as a stopgap. Partials are skipped; only committed transcripts are pasted via clipboard. The real-time typing experience is lost.
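A minimal sketch of commit-only handling, including one way to avoid the timing issue where the typing task exits before the final flush: the consumer stops only when it receives an explicit sentinel that the producer enqueues after the last event. The event shape (`message_type` and `text` keys) and the function name are assumptions for illustration, and `pasted.append` stands in for the real clipboard paste.

```python
import asyncio
from typing import List, Optional


async def paste_committed(queue: "asyncio.Queue[Optional[dict]]",
                          pasted: List[str]) -> None:
    """Consume transcript events; act only on committed transcripts."""
    while True:
        event = await queue.get()
        if event is None:
            # Sentinel: the producer has already enqueued the final flush,
            # so it is now safe for the typing task to exit.
            return
        if event.get("message_type") == "committed_transcript":
            pasted.append(event["text"])  # stand-in for clipboard paste
        # partial_transcript events are skipped in commit-only mode


async def demo() -> List[str]:
    queue: "asyncio.Queue[Optional[dict]]" = asyncio.Queue()
    pasted: List[str] = []
    task = asyncio.create_task(paste_committed(queue, pasted))
    await queue.put({"message_type": "partial_transcript", "text": "hel"})
    await queue.put({"message_type": "committed_transcript", "text": "hello world"})
    await queue.put({"message_type": "committed_transcript", "text": "bye"})
    await queue.put(None)  # enqueue the sentinel only after the final commit
    await task
    return pasted
```

Because the queue preserves order, every committed transcript enqueued before the sentinel is pasted before the task returns.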

Needs investigation: How other macOS dictation tools (Superwhisper, Wispr Flow, macOS built-in dictation, Talon Voice) achieve reliable text replacement. They likely use Input Method Kit (IMK) or the Accessibility API (AXUIElement) rather than raw CGEvent key injection.

Roadmap

  • Auto-updater with Sparkle to preserve macOS permissions after update
  • Real-time streaming dictation support (ElevenLabs Scribe V2 Realtime, Soniox STT-RT v4)
  • First-launch onboarding with permissions check
  • Standalone Python auto-download for MLX runtime
  • Migrate macOS MLX inference from Python sidecar to mlx-audio-swift for native performance and zero Python dependency
  • Fix real-time text injection for ElevenLabs Scribe V2 Realtime (see Known Issues)

License

MIT


