# Commit affa44f

**thereisnotimeclaude** committed

## feat: add local whisper.cpp voice transcription provider

Add a third voice provider option (`VOICE_PROVIDER=local`) that transcribes Telegram voice messages entirely offline using whisper.cpp and ffmpeg. No API keys or cloud services required.

- New local provider in `voice_handler.py` (OGG -> WAV via ffmpeg, then whisper.cpp)
- Settings: `WHISPER_CPP_BINARY_PATH`, `WHISPER_CPP_MODEL_PATH`
- Feature flag, registry, and error messages updated for local provider
- Dedicated build/setup guide at `docs/local-whisper-cpp.md`
- Full test coverage for the local provider path
- Updated `.env.example`, `CLAUDE.md`, `README.md`, `docs/configuration.md`

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

1 parent a1f8f84 · commit affa44f

15 files changed: 668 additions & 26 deletions
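The flow the commit describes (Telegram OGG/Opus converted to 16 kHz mono WAV by ffmpeg, then fed to whisper.cpp) can be sketched roughly as below. This is a minimal illustration under assumptions, not the actual `voice_handler.py` code: the function names are invented here, and the whisper.cpp flags mirror the `-m`/`-f`/`--no-timestamps`/`--language auto` usage shown in the commit's own docs.

```python
import shutil
import subprocess
from pathlib import Path


def ffmpeg_cmd(ogg_path: str, wav_path: str) -> list[str]:
    """Build the ffmpeg command converting Telegram OGG/Opus to 16 kHz mono WAV."""
    return ["ffmpeg", "-y", "-i", ogg_path, "-ar", "16000", "-ac", "1", wav_path]


def whisper_cmd(binary: str, model: str, wav_path: str) -> list[str]:
    """Build the whisper.cpp invocation (flags as used in the commit's docs)."""
    return [binary, "-m", model, "-f", wav_path,
            "--no-timestamps", "--language", "auto"]


def transcribe_local(
    ogg_path: str,
    binary: str = "whisper-cpp",
    model: str = str(Path.home() / ".cache" / "whisper-cpp" / "ggml-base.bin"),
) -> str:
    """Convert then transcribe. Raises if ffmpeg or whisper.cpp are missing."""
    if shutil.which("ffmpeg") is None:
        raise FileNotFoundError("ffmpeg is required but was not found")
    wav_path = str(Path(ogg_path).with_suffix(".wav"))
    subprocess.run(ffmpeg_cmd(ogg_path, wav_path), check=True, capture_output=True)
    result = subprocess.run(
        whisper_cmd(binary, model, wav_path),
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()
```

The actual provider also enforces `VOICE_MAX_FILE_SIZE_MB` and surfaces friendlier error messages; this sketch only shows the two-step convert-then-transcribe shape.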

## .env.example

28 additions, 0 deletions

```diff
@@ -140,6 +140,34 @@ QUICK_ACTIONS_TIMEOUT=120
 # Git operations timeout in seconds
 GIT_OPERATIONS_TIMEOUT=30
 
+# === VOICE TRANSCRIPTION ===
+# Enable voice message transcription
+ENABLE_VOICE_MESSAGES=true
+
+# Voice transcription provider: mistral, openai, or local
+# - mistral: Uses Mistral Voxtral (requires MISTRAL_API_KEY)
+# - openai: Uses OpenAI Whisper API (requires OPENAI_API_KEY)
+# - local: Uses whisper.cpp binary (requires ffmpeg + whisper.cpp installed)
+VOICE_PROVIDER=mistral
+
+# API keys (only needed for cloud providers)
+MISTRAL_API_KEY=
+OPENAI_API_KEY=
+
+# Override transcription model (optional)
+# Defaults: voxtral-mini-latest (mistral), whisper-1 (openai), base (local)
+VOICE_TRANSCRIPTION_MODEL=
+
+# Maximum voice message size in MB
+VOICE_MAX_FILE_SIZE_MB=20
+
+# Local whisper.cpp settings (only used when VOICE_PROVIDER=local)
+# Path to whisper.cpp binary (auto-detected from PATH if unset)
+WHISPER_CPP_BINARY_PATH=
+# Path to GGML model file, or model name like "base", "small", "medium"
+# Named models look for ~/.cache/whisper-cpp/ggml-{name}.bin
+WHISPER_CPP_MODEL_PATH=base
+
 # === PROJECT THREAD MODE ===
 # Enable strict routing by Telegram project topics
 ENABLE_PROJECT_THREADS=false
```

## CLAUDE.md

1 addition, 1 deletion

```diff
@@ -102,7 +102,7 @@ Multi-project topics: `ENABLE_PROJECT_THREADS` (default false), `PROJECT_THREADS
 
 Output verbosity: `VERBOSE_LEVEL` (default 1, range 0-2). Controls how much of Claude's background activity is shown to the user in real-time. 0 = quiet (only final response, typing indicator still active), 1 = normal (tool names + reasoning snippets shown during execution), 2 = detailed (tool names with input summaries + longer reasoning text). Users can override per-session via `/verbose 0|1|2`. A persistent typing indicator is refreshed every ~2 seconds at all levels.
 
-Voice transcription: `ENABLE_VOICE_MESSAGES` (default true), `VOICE_PROVIDER` (`mistral`|`openai`, default `mistral`), `MISTRAL_API_KEY`, `OPENAI_API_KEY`, `VOICE_TRANSCRIPTION_MODEL`. Provider implementation is in `src/bot/features/voice_handler.py`.
+Voice transcription: `ENABLE_VOICE_MESSAGES` (default true), `VOICE_PROVIDER` (`mistral`|`openai`|`local`, default `mistral`), `MISTRAL_API_KEY`, `OPENAI_API_KEY`, `VOICE_TRANSCRIPTION_MODEL`. For local provider: `WHISPER_CPP_BINARY_PATH`, `WHISPER_CPP_MODEL_PATH` (requires ffmpeg + whisper.cpp installed). Provider implementation is in `src/bot/features/voice_handler.py`.
 
 Feature flags in `src/config/features.py` control: MCP, git integration, file uploads, quick actions, session export, image uploads, voice messages, conversation mode, agentic mode, API server, scheduler.
```

## README.md

1 addition, 1 deletion

```diff
@@ -194,7 +194,7 @@ Enable with `ENABLE_API_SERVER=true` and `ENABLE_SCHEDULER=true`. See [docs/setu
 - Directory sandboxing with path traversal prevention
 - File upload handling with archive extraction
 - Image/screenshot upload with analysis
-- Voice message transcription (Mistral Voxtral / OpenAI Whisper)
+- Voice message transcription (Mistral Voxtral / OpenAI Whisper / [local whisper.cpp](docs/local-whisper-cpp.md))
 - Git integration with safe repository operations
 - Quick actions system with context-aware buttons
 - Session export in Markdown, HTML, and JSON formats
```

## docs/configuration.md

5 additions, 1 deletion

````diff
@@ -135,11 +135,15 @@ ENABLE_QUICK_ACTIONS=true
 
 # Enable voice message transcription
 ENABLE_VOICE_MESSAGES=true
-VOICE_PROVIDER=mistral      # 'mistral' (default) or 'openai'
+VOICE_PROVIDER=mistral      # 'mistral', 'openai', or 'local'
 MISTRAL_API_KEY=            # Required when VOICE_PROVIDER=mistral
 OPENAI_API_KEY=             # Required when VOICE_PROVIDER=openai
 VOICE_TRANSCRIPTION_MODEL=  # Default: voxtral-mini-latest (Mistral) or whisper-1 (OpenAI)
 VOICE_MAX_FILE_SIZE_MB=20   # Max Telegram voice file size to download (1-200MB)
+
+# Local whisper.cpp settings (only used when VOICE_PROVIDER=local)
+WHISPER_CPP_BINARY_PATH=    # Path to whisper.cpp binary (auto-detected from PATH if unset)
+WHISPER_CPP_MODEL_PATH=base # Path to GGML model file or model name (base, small, medium, large)
 ```
 
 #### Agentic Platform
````

## docs/local-whisper-cpp.md (new file)

170 additions, 0 deletions

# Local Voice Transcription with whisper.cpp

This guide explains how to build and configure [whisper.cpp](https://github.com/ggerganov/whisper.cpp) for **offline** voice message transcription — no API keys or cloud services required.

## Overview

When `VOICE_PROVIDER=local` the bot transcribes Telegram voice messages entirely on your machine using:

| Component | Purpose |
|---|---|
| **ffmpeg** | Converts Telegram OGG/Opus audio to 16 kHz mono WAV |
| **whisper.cpp** | Runs OpenAI's Whisper model locally via optimised C/C++ |
| **GGML model** | Quantised model weights (downloaded once) |

## Prerequisites

- A C/C++ toolchain (`gcc`/`clang`, `cmake`, `make`)
- `ffmpeg` installed and on PATH
- ~400 MB disk space for the `base` model (~1.5 GB for `medium`)

## 1. Install ffmpeg

### Ubuntu / Debian

```bash
sudo apt update && sudo apt install -y ffmpeg
```

### macOS (Homebrew)

```bash
brew install ffmpeg
```

### Alpine

```bash
apk add ffmpeg
```

Verify:

```bash
ffmpeg -version
```

## 2. Build whisper.cpp from source

```bash
# Clone the repository
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp

# Build with CMake (recommended)
cmake -B build
cmake --build build --config Release

# The binary is at build/bin/whisper-cli (or build/bin/main on older versions)
ls build/bin/whisper-cli
```

> **Tip:** For GPU acceleration add `-DWHISPER_CUBLAS=ON` (NVIDIA) or `-DWHISPER_METAL=ON` (Apple Silicon) to the cmake configure step.

### Install system-wide (optional)

```bash
sudo cp build/bin/whisper-cli /usr/local/bin/whisper-cpp
```

Or add the build directory to your `PATH`:

```bash
export PATH="$PWD/build/bin:$PATH"
```

## 3. Download a GGML model

Models are hosted on Hugging Face. Pick one based on your hardware:

| Model | Size | RAM (approx.) | Quality |
|---|---|---|---|
| `tiny` | ~75 MB | ~400 MB | Fast but lower accuracy |
| `base` | ~142 MB | ~500 MB | Good balance (default) |
| `small` | ~466 MB | ~1 GB | Better accuracy |
| `medium` | ~1.5 GB | ~2.5 GB | High accuracy |
| `large-v3` | ~3 GB | ~5 GB | Best accuracy, slow on CPU |

```bash
# Create the model cache directory
mkdir -p ~/.cache/whisper-cpp

# Download the base model (recommended starting point)
curl -L -o ~/.cache/whisper-cpp/ggml-base.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.bin

# Or download small for better accuracy
curl -L -o ~/.cache/whisper-cpp/ggml-small.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-small.bin
```

## 4. Configure the bot

Add the following to your `.env`:

```bash
# Enable voice transcription with local provider
ENABLE_VOICE_MESSAGES=true
VOICE_PROVIDER=local

# Path to the whisper.cpp binary (omit if already on PATH as "whisper-cpp")
WHISPER_CPP_BINARY_PATH=/usr/local/bin/whisper-cpp

# Model: a name like "base", "small", "medium" or a full file path
# Named models resolve to ~/.cache/whisper-cpp/ggml-{name}.bin
WHISPER_CPP_MODEL_PATH=base
```

### Minimal configuration

If `whisper-cpp` is on your PATH and you downloaded the `base` model to the default location, you only need:

```bash
VOICE_PROVIDER=local
```
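The named-model lookup described above can be sketched as a small helper. The resolution rule (a bare name maps to `~/.cache/whisper-cpp/ggml-{name}.bin`, a `.bin` path is used as-is) comes from this guide; the function name and the exact fallback behaviour are illustrative assumptions, not the bot's actual code:

```python
from pathlib import Path

# Default cache location used by this guide for named models
MODEL_CACHE = Path.home() / ".cache" / "whisper-cpp"
KNOWN_MODELS = {"tiny", "base", "small", "medium", "large-v3"}


def resolve_model_path(setting: str) -> Path:
    """Resolve WHISPER_CPP_MODEL_PATH: a .bin file path is used as-is,
    a bare model name maps to ~/.cache/whisper-cpp/ggml-{name}.bin."""
    candidate = Path(setting).expanduser()
    if candidate.suffix == ".bin":  # explicit path to a GGML file
        return candidate
    if setting in KNOWN_MODELS:     # named model -> cache directory
        return MODEL_CACHE / f"ggml-{setting}.bin"
    raise ValueError(f"unknown model name or path: {setting!r}")
```

For example, `resolve_model_path("base")` points at `ggml-base.bin` under the cache directory, while a full path such as `/models/ggml-small.bin` is returned unchanged.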
## 5. Verify the setup

```bash
# Test ffmpeg conversion
ffmpeg -f lavfi -i "sine=frequency=440:duration=2" -ar 16000 -ac 1 /tmp/test.wav -y

# Test whisper.cpp
whisper-cpp -m ~/.cache/whisper-cpp/ggml-base.bin -f /tmp/test.wav --no-timestamps
```

You should see a transcription attempt (it will be empty or nonsensical for a sine wave, but the binary should run without errors).

## Troubleshooting

### `whisper.cpp binary not found on PATH`

The bot could not locate the binary. Either:

- Install it system-wide: `sudo cp build/bin/whisper-cli /usr/local/bin/whisper-cpp`
- Or set the full path: `WHISPER_CPP_BINARY_PATH=/path/to/whisper-cli`

### `whisper.cpp model not found`

The model file does not exist at the expected path. Download it:

```bash
mkdir -p ~/.cache/whisper-cpp
curl -L -o ~/.cache/whisper-cpp/ggml-base.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.bin
```

### `ffmpeg is required but was not found`

Install ffmpeg for your platform (see step 1 above).

### Poor transcription quality

- Try a larger model (`small` or `medium` instead of `base`)
- Ensure audio is not too short (< 1 second) or too noisy
- whisper.cpp uses `--language auto` by default; this works well for most languages

### High CPU usage / slow transcription

- Use a smaller model (`tiny` or `base`)
- Enable GPU acceleration when building whisper.cpp (CUDA / Metal)
- Consider using the `mistral` or `openai` cloud providers for faster results on low-powered machines

## docs/setup.md

13 additions, 2 deletions

````diff
@@ -197,12 +197,23 @@ VOICE_PROVIDER=openai
 OPENAI_API_KEY=your-openai-api-key
 ```
 
-If you installed via pip/uv, make sure voice extras are installed:
+**Local whisper.cpp (offline, no API key needed):**
+```bash
+VOICE_PROVIDER=local
+# Optional — auto-detected from PATH if unset
+WHISPER_CPP_BINARY_PATH=/usr/local/bin/whisper-cpp
+# Model name ("base", "small", "medium") or full path to .bin file
+WHISPER_CPP_MODEL_PATH=base
+```
+
+Requires `ffmpeg` and a locally built `whisper.cpp` binary. See the full [local whisper.cpp setup guide](local-whisper-cpp.md) for build instructions and model downloads.
+
+If you installed via pip/uv, make sure voice extras are installed (cloud providers only):
 ```bash
 pip install "claude-code-telegram[voice]"
 ```
 
-Optionally override the transcription model with `VOICE_TRANSCRIPTION_MODEL` (defaults to `voxtral-mini-latest` for Mistral, `whisper-1` for OpenAI).
+Optionally override the transcription model with `VOICE_TRANSCRIPTION_MODEL` (defaults to `voxtral-mini-latest` for Mistral, `whisper-1` for OpenAI, `base` for local).
 
 ### Notification Recipients
````
## src/bot/features/registry.py

6 additions, 2 deletions

```diff
@@ -78,10 +78,14 @@ def _initialize_features(self):
         except Exception as e:
             logger.error("Failed to initialize image handler", error=str(e))
 
-        # Voice transcription - requires provider-specific API key
+        # Voice transcription - requires provider-specific API key (or local)
         voice_key_available = (
+            self.config.voice_provider == "local"
+        ) or (
             self.config.voice_provider == "openai" and self.config.openai_api_key
-        ) or (self.config.voice_provider == "mistral" and self.config.mistral_api_key)
+        ) or (
+            self.config.voice_provider == "mistral" and self.config.mistral_api_key
+        )
         if self.config.enable_voice_messages and voice_key_available:
             try:
                 self.features["voice_handler"] = VoiceHandler(config=self.config)
```
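The availability check in this diff reads as a small predicate: the local provider is always considered available, while each cloud provider needs its matching API key. A standalone restatement (the function below is illustrative, not part of the codebase, though the logic mirrors the diff):

```python
def voice_key_available(provider: str, mistral_key: str = "", openai_key: str = "") -> bool:
    """Local needs no API key; cloud providers need their matching key."""
    return (
        provider == "local"
        or (provider == "openai" and bool(openai_key))
        or (provider == "mistral" and bool(mistral_key))
    )
```

Note the grouping matters: without the added parentheses in the diff, a truthy `mistral_key` alone could not compensate for a missing provider match, and `local` would have required a key it does not need.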
