This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
whisper.cpp is a high-performance C/C++ implementation of OpenAI's Whisper automatic speech recognition (ASR) model. The implementation emphasizes portability, efficiency, and minimal dependencies, supporting multiple hardware acceleration backends.
```bash
# Configure and build (Release mode)
cmake -B build
cmake --build build -j --config Release

# Built binaries are in: build/bin/
```

```bash
# Download model and run on all samples
make base.en

# Just build
make build
```

CUDA (NVIDIA GPUs):

```bash
cmake -B build -DGGML_CUDA=1
cmake --build build -j --config Release
```

Metal (Apple Silicon): Metal is automatically enabled on macOS with Apple Silicon.

OpenVINO (Intel CPUs/GPUs):

```bash
# First source the OpenVINO environment
source /path/to/openvino/setupvars.sh
cmake -B build -DWHISPER_OPENVINO=1
cmake --build build -j --config Release
```

Core ML (Apple Neural Engine):

```bash
cmake -B build -DWHISPER_COREML=1
cmake --build build -j --config Release
```

Vulkan (Cross-vendor GPU):

```bash
cmake -B build -DGGML_VULKAN=1
cmake --build build -j --config Release
```

OpenBLAS (CPU acceleration):

```bash
cmake -B build -DGGML_BLAS=1
cmake --build build -j --config Release
```

FFmpeg support (Linux only, for additional audio formats):

```bash
cmake -B build -DWHISPER_FFMPEG=yes
cmake --build build -j --config Release
```

SDL2 support (for real-time audio input examples):

```bash
cmake -B build -DWHISPER_SDL2=ON
cmake --build build -j --config Release
```

```bash
# From the tests/ directory
cd tests
./run-tests.sh base.en

# With a specific thread count
./run-tests.sh base.en 4
```

The test suite downloads audio samples, transcribes them, and compares the output against reference transcripts, using git diff for visual inspection.
```bash
# Tests are built by default in standalone mode
cmake -B build -DWHISPER_BUILD_TESTS=ON
cmake --build build -j --config Release
ctest --test-dir build
```

```bash
# Download a model first
./models/download-ggml-model.sh base.en

# Transcribe an audio file (16-bit WAV, 16 kHz, mono)
./build/bin/whisper-cli -f samples/jfk.wav -m models/ggml-base.en.bin

# Convert audio to the correct format
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
```

```bash
# Requires SDL2
./build/bin/whisper-stream -m models/ggml-base.en.bin -t 8 --step 500 --length 5000
```

```bash
./build/bin/whisper-server -m models/ggml-base.en.bin
```

whisper.cpp/whisper.h (src/whisper.cpp, include/whisper.h):
- Main implementation of the Whisper model (9000+ lines)
- Contains the complete inference engine
- C-style API for language bindings
- Thread-safe when contexts are not shared
ggml (ggml/):
- Submodule containing the machine learning tensor library
- Provides low-level operations (matrix multiplication, attention, etc.)
- Supports multiple hardware backends (CPU, CUDA, Metal, Vulkan, etc.)
- whisper.cpp builds on top of ggml primitives
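To make "ggml primitives" concrete, here is a minimal sketch that builds and evaluates a small matrix-multiplication graph on the CPU. The exact calls vary a little between ggml versions, so treat it as an approximation rather than repo code:

```cpp
#include "ggml.h"

#include <cstdio>

int main() {
    // Small working context; ggml allocates tensors out of this arena
    struct ggml_init_params ip = { /*mem_size*/ 16*1024*1024, /*mem_buffer*/ NULL, /*no_alloc*/ false };
    struct ggml_context * ctx = ggml_init(ip);

    // a: 4x2, b: 4x3 -- ggml_mul_mat contracts over the first dimension,
    // so the result c has shape 2x3
    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 2);
    struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 3);
    ggml_set_f32(a, 1.0f);
    ggml_set_f32(b, 2.0f);

    struct ggml_tensor * c = ggml_mul_mat(ctx, a, b);

    // Build the compute graph and evaluate it on the CPU
    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads*/ 4);

    // Each output element is 4 * (1.0 * 2.0) = 8.0
    printf("c[0] = %f\n", ggml_get_f32_1d(c, 0));

    ggml_free(ctx);
    return 0;
}
```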
whisper-arch.h (src/whisper-arch.h):
- Defines tensor naming conventions mapping to Whisper architecture
- Enumerates encoder, decoder, and cross-attention tensor types
- Used for model loading and conversion
Whisper uses an encoder-decoder transformer architecture:
- Encoder: Processes audio spectrograms (mel-frequency features)
- Decoder: Generates text tokens autoregressively
- Cross-attention: Links encoder outputs to decoder
The encoder can be offloaded to specialized hardware:
- Core ML (Apple Neural Engine)
- OpenVINO (Intel devices)
- Standard backends (CUDA, Metal, etc.)
- Encoder acceleration: Primary target for hardware offload since it's the most computationally intensive part
- Decoder runs on CPU/GPU: Less intensive, runs after encoder
- Mixed precision: Uses F16/F32 for better performance
- Quantization support: Q5_0 and other quantized formats reduce memory and improve speed
Located in examples/, each with its own subdirectory:
- cli: Main command-line transcription tool (formerly "main")
- stream: Real-time audio transcription
- server: HTTP server with OpenAI-compatible API
- bench: Performance benchmarking tool
- command: Voice command recognition
- talk-llama: Integration with LLaMA for conversational AI
- quantize: Model quantization utility
Common code shared across examples:
- `examples/common.cpp/h`: General utilities (arg parsing, file I/O)
- `examples/common-whisper.cpp/h`: Whisper-specific utilities
- `examples/common-sdl.cpp/h`: SDL2 audio capture utilities
- `examples/grammar-parser.cpp/h`: Grammar parsing for constrained generation
Located in bindings/:
- go: Go bindings
- java: Java bindings
- javascript: Node.js/JavaScript bindings
- ruby: Ruby bindings
Each binding wraps the C API from include/whisper.h.
Models use custom ggml format (not PyTorch):
- Single-file format containing weights, vocabulary, and mel filters
- Download pre-converted models: `./models/download-ggml-model.sh [model-name]`
- Or convert manually: `models/convert-pt-to-ggml.py`
- Available models: tiny, base, small, medium, large-v1, large-v2, large-v3, large-v3-turbo
- Models ending in `.en` are English-only
- Models ending in `-q5_0` are quantized
- Models ending in `-tdrz` support speaker diarization
When Core ML is enabled:
- Generate the Core ML model: `./models/generate-coreml-model.sh base.en`
- This creates the `models/ggml-base.en-encoder.mlmodelc` directory
- At runtime, the encoder automatically uses Core ML if available
- First run compiles to device-specific format (slow), subsequent runs are fast
When OpenVINO is enabled:
- Convert the model: `python models/convert-whisper-to-openvino.py --model base.en`
- This creates `ggml-base.en-encoder-openvino.xml/.bin` files
- Source the OpenVINO environment before building
- First run compiles to device-specific blob (cached for future runs)
Whisper expects:
- Sample rate: 16kHz (WHISPER_SAMPLE_RATE = 16000)
- Format: 32-bit floating point PCM in range [-1.0, 1.0]
- Channels: Mono (single channel)
Most examples handle conversion internally, but whisper-cli requires pre-converted WAV files.
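When feeding audio to the C API directly, converting common 16-bit PCM to the expected float format is only a few lines. A minimal sketch (the helper name and the simple average-downmix are illustrative, not repo code):

```cpp
#include <cstdint>
#include <vector>

// Convert interleaved 16-bit stereo PCM to mono float32 in [-1.0, 1.0],
// the format whisper_full() expects (illustrative helper, not repo code).
static std::vector<float> pcm16_stereo_to_mono_f32(const int16_t * pcm, size_t n_frames) {
    std::vector<float> out(n_frames);
    for (size_t i = 0; i < n_frames; ++i) {
        // Average the two channels, then scale to [-1, 1]
        const float l = pcm[2*i + 0] / 32768.0f;
        const float r = pcm[2*i + 1] / 32768.0f;
        out[i] = 0.5f * (l + r);
    }
    return out;
}
```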
- Core library code in `src/`
- Public API in `include/whisper.h`
- Examples in `examples/` (self-contained with their own mains)
- Tests in `tests/`
- Model utilities in `models/`
- Hardware-specific code in `src/coreml/` and `src/openvino/`
Important CMake options (see CMakeLists.txt line 66+):
- `WHISPER_BUILD_EXAMPLES`: Build example programs (default: ON in standalone)
- `WHISPER_BUILD_TESTS`: Build test suite (default: ON in standalone)
- `WHISPER_BUILD_SERVER`: Build server example (default: ON in standalone)
- `WHISPER_COREML`: Enable Core ML support
- `WHISPER_OPENVINO`: Enable OpenVINO support
- `WHISPER_SDL2`: Enable SDL2 for audio input
- `WHISPER_FFMPEG`: Enable FFmpeg for additional audio formats (Linux only)
- `GGML_CUDA`: Enable CUDA support (replaces the deprecated WHISPER_CUDA)
- `GGML_METAL`: Enable Metal support (automatic on macOS)
- `GGML_VULKAN`: Enable Vulkan support
- `GGML_BLAS`: Enable BLAS/OpenBLAS support
- Load model file (whisper_init_from_file or whisper_init_from_buffer)
- Parse ggml format: magic number, hparams, mel filters, vocabulary, tensors
- Initialize hardware backend if available (Core ML, OpenVINO)
- Create inference state
- Ready for transcription via whisper_full()
- Convert audio to mel spectrogram (using mel filters from model)
- Run encoder on mel spectrogram (potentially on specialized hardware)
- Run decoder autoregressively to generate tokens
- Convert tokens to text using vocabulary
- Apply timestamp alignment if requested
- Return segments with text and timestamps
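Put together, a minimal caller of the C API follows the two flows above almost line for line. A sketch (model loading and audio decoding are elided; `pcmf32` is assumed to already hold 16 kHz mono float samples):

```cpp
#include "whisper.h"

#include <cstdio>
#include <vector>

int main() {
    std::vector<float> pcmf32; // 16 kHz mono samples in [-1.0, 1.0] go here

    // Load the model file and create the inference state
    struct whisper_context * ctx = whisper_init_from_file_with_params(
            "models/ggml-base.en.bin", whisper_context_default_params());

    // Run mel conversion, encoder, and decoder in one call
    whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    if (whisper_full(ctx, wparams, pcmf32.data(), (int) pcmf32.size()) != 0) {
        fprintf(stderr, "transcription failed\n");
        return 1;
    }

    // Read back segments with text and timestamps
    for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
        printf("[%lld -> %lld] %s\n",
               (long long) whisper_full_get_segment_t0(ctx, i),
               (long long) whisper_full_get_segment_t1(ctx, i),
               whisper_full_get_segment_text(ctx, i));
    }

    whisper_free(ctx);
    return 0;
}
```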
VAD preprocessing filters silence before transcription:
- Download the VAD model: `./models/download-vad-model.sh silero-v6.2.0`
- Use with `--vad` and `-vm path/to/vad-model.bin`
- Significantly speeds up processing by skipping silence
- Configurable thresholds and parameters (see README section on VAD)
- Use the `--debug-mode` flag in examples for verbose output
- Print internal state with `--print-special` and `--print-colors`
- Enable sanitizers: `-DWHISPER_SANITIZE_ADDRESS=ON`, `-DWHISPER_SANITIZE_THREAD=ON`
- Check the system info in the output to verify that hardware acceleration is active
- Create a directory in `examples/your-example/`
- Add `your-example.cpp` with a main()
- Create a `CMakeLists.txt` linking to the whisper library
- Add the subdirectory to `examples/CMakeLists.txt`
- Use common utilities from `examples/common.h` for arg parsing
Most backend support goes through ggml. For encoder offloading (Core ML, OpenVINO):
- Conversion script in `models/` to generate the backend-specific model
- Runtime code in `src/coreml/` or `src/openvino/`
- CMake option to enable/disable the backend
- Initialization in `whisper_init_state()` checks for backend availability
- Encoder execution routes through the backend if available, and falls back to ggml otherwise
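The resulting dispatch reduces to a simple pattern, sketched here with hypothetical names (the real checks live inside the encoder path in src/whisper.cpp):

```cpp
// Illustrative only; all names here are hypothetical stand-ins.
struct encoder_state {
    void * ctx_coreml   = nullptr; // non-null when a Core ML encoder loaded
    void * ctx_openvino = nullptr; // non-null when an OpenVINO encoder loaded
};

static bool encode_with_coreml  (encoder_state &, const float *) { return true; } // stub
static bool encode_with_openvino(encoder_state &, const float *) { return true; } // stub
static bool encode_with_ggml    (encoder_state &, const float *) { return true; } // stub

static bool encode(encoder_state & st, const float * mel) {
    if (st.ctx_coreml)   return encode_with_coreml(st, mel);   // Apple Neural Engine
    if (st.ctx_openvino) return encode_with_openvino(st, mel); // Intel devices
    return encode_with_ggml(st, mel);                          // portable fallback
}
```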
Model files are portable but backend-specific accelerator models are not:
- ggml models work everywhere
- Core ML models are macOS/iOS specific
- OpenVINO models are device-specific after first compilation
- Always include both base ggml model and accelerator model for full functionality
Adapt `examples/stream/stream.cpp` to support an enhanced real-time transcription use case with an initial voice-activation check, continuous silence detection, and UDP network output.
- Start with VAD model only for configurable duration (default: 10 seconds)
- Check audio on the capture device in 2-second increments
- If NO audio is detected during initial period, exit before loading whisper model
- This saves resources when no speech is present
- Uses the existing `vad_simple()` functionality from `examples/stream/stream.cpp:283` (see the sketch below)
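Sketched below as a fragment for main() in stream.cpp, reusing its existing `audio` capture object and `is_running` flag; `params.vad_startup_ms` is one of the proposed options listed later, and, matching the silence-detection snippet further down, a true return from `vad_simple()` is treated as silence:

```cpp
// Hypothetical startup gate: listen in 2-second chunks for up to
// params.vad_startup_ms (proposed option) before loading the whisper model.
bool speech_detected = false;
for (int32_t waited_ms = 0; waited_ms < params.vad_startup_ms && is_running; waited_ms += 2000) {
    std::this_thread::sleep_for(std::chrono::milliseconds(2000)); // let the capture ring buffer fill
    std::vector<float> chunk;
    audio.get(2000, chunk); // existing SDL capture helper in stream.cpp
    if (!::vad_simple(chunk, WHISPER_SAMPLE_RATE, 1000,
                      params.vad_thold, params.freq_thold, false)) {
        speech_detected = true; // a non-silent chunk was found
        break;
    }
}
if (!speech_detected) {
    fprintf(stderr, "no speech detected in the startup window, exiting\n");
    return 0; // bail out before the whisper model is ever loaded
}
```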
- Already partially implemented at `examples/stream/stream.cpp:249-252`
- Current code: exits after 3 minutes (180 seconds) of continuous silence
- Needs better integration with the existing code structure
- Should match the existing coding style and patterns
- Make the silence timeout configurable via a command-line parameter
- Send current transcription segment as UDP packets
- Use SRT format (see `examples/cli/cli.cpp:489-506` for reference)
- Each packet should contain:
- Segment number/ID
- Start time (t0)
- End time (t1)
- Transcription text
- Current timestamp (for network unreliability handling)
- Target host/port should be configurable
- Copy the `output_srt()` formatting logic into stream.cpp to minimize dependencies (see the sketch below)
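A minimal POSIX sketch of the sender side, formatting one segment as an SRT-style payload and firing it over UDP. The payload layout and helper name are assumptions to illustrate the idea, and Windows would need the Winsock equivalents:

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#include <cstdint>
#include <cstdio>
#include <cstring>
#include <ctime>
#include <string>

// to_timestamp() from examples/common-whisper.cpp (shown later in this file)
std::string to_timestamp(int64_t t, bool comma);

// Hypothetical helper: format one segment as an SRT-style payload and send it.
// The trailing wall-clock timestamp lets the receiver drop stale packets.
static void udp_send_segment(int sock, const sockaddr_in & dst,
                             int seg_id, int64_t t0, int64_t t1, const char * text) {
    char payload[1024];
    snprintf(payload, sizeof(payload), "%d\n%s --> %s\n%s\n%lld\n",
             seg_id,
             to_timestamp(t0, true).c_str(),
             to_timestamp(t1, true).c_str(),
             text,
             (long long) time(nullptr));
    sendto(sock, payload, strlen(payload), 0, (const sockaddr *) &dst, sizeof(dst));
}

// Setup sketch (once, at startup):
//   int sock = socket(AF_INET, SOCK_DGRAM, 0);
//   sockaddr_in dst{};
//   dst.sin_family = AF_INET;
//   dst.sin_port   = htons(params.udp_port);                     // --udp-port
//   inet_pton(AF_INET, params.udp_host.c_str(), &dst.sin_addr);  // --udp-host
```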
- Current behavior: the same segment is repeated (roughly length/step times) as it is refined
- Each inference step updates the current segment (same t0/t1, changing text)
- Need to identify segments consistently across updates (by t0/t1 or other means)
- Consider experiments with console output before implementing UDP transfer
- Do NOT modify existing code - only add new functionality
- This makes it easier to incorporate upstream changes from remote repo
- All new features should be toggleable via command-line parameters
- All features should be OFF by default (opt-in)
- Follow existing code style and patterns
New command-line options needed:
- `--vad-startup-ms N`: Duration of the initial VAD check before loading the whisper model (default: 10000 ms, 0 = disabled)
- `--silence-timeout-ms N`: Duration of continuous silence before exit (default: 180000 ms, 0 = disabled)
- `--udp-host HOST`: Target host for UDP output (default: none)
- `--udp-port PORT`: Target port for UDP output (default: none)
- `--udp-enable`: Enable UDP output (default: false)
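In stream.cpp these would become new fields on the existing `whisper_params` struct, parsed alongside the current options. The field names below are suggestions mirroring the flags above, not existing code:

```cpp
#include <cstdint>
#include <string>

// Suggested additions to whisper_params in examples/stream/stream.cpp.
// Defaults mirror the option list above; 0 disables the timed features.
struct stream_params_additions {
    int32_t     vad_startup_ms     = 10000;  // --vad-startup-ms
    int32_t     silence_timeout_ms = 180000; // --silence-timeout-ms
    std::string udp_host;                    // --udp-host (empty = none)
    int32_t     udp_port           = 0;      // --udp-port (0 = none)
    bool        udp_enable         = false;  // --udp-enable (off by default)
};
```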
From examples/cli/cli.cpp:489-506:

```cpp
static void output_srt(struct whisper_context * ctx, std::ofstream & fout,
                       const whisper_params & params, std::vector<std::vector<float>> pcmf32s) {
    const int n_segments = whisper_full_n_segments(ctx);

    for (int i = 0; i < n_segments; ++i) {
        const char * text = whisper_full_get_segment_text(ctx, i);
        const int64_t t0 = whisper_full_get_segment_t0(ctx, i);
        const int64_t t1 = whisper_full_get_segment_t1(ctx, i);

        fout << i + 1 + params.offset_n << "\n";
        fout << to_timestamp(t0, true) << " --> " << to_timestamp(t1, true) << "\n";
        fout << text << "\n\n";
    }
}
```

The to_timestamp() function is in examples/common-whisper.cpp:138-151:

```cpp
std::string to_timestamp(int64_t t, bool comma) {
    int64_t msec = t * 10;
    int64_t hr = msec / (1000 * 60 * 60);
    msec = msec - hr * (1000 * 60 * 60);
    int64_t min = msec / (1000 * 60);
    msec = msec - min * (1000 * 60);
    int64_t sec = msec / 1000;
    msec = msec - sec * 1000;

    char buf[32];
    snprintf(buf, sizeof(buf), "%02d:%02d:%02d%s%03d",
             (int) hr, (int) min, (int) sec, comma ? "," : ".", (int) msec);

    return std::string(buf);
}
```

VAD usage (examples/stream/stream.cpp:137):

```cpp
const bool use_vad = n_samples_step <= 0; // sliding window mode uses VAD
```

Current silence detection (examples/stream/stream.cpp:249-252):

```cpp
if (silence_count * params.step_ms >= 1000 * 60 * 3) {
    printf("silent for %d steps. bye!\n", silence_count);
    is_running = false;
}
```

VAD check logic (examples/stream/stream.cpp:283-287):

```cpp
if (::vad_simple(pcmf32_new, WHISPER_SAMPLE_RATE, 1000, params.vad_thold, params.freq_thold, false)) {
    silence_count++;
} else {
    silence_count = 0;
}
```

Typical usage pattern:

```bash
# User typically runs with a 2 s step and 10 s length
./build/bin/whisper-stream -m models/ggml-base.en.bin --step 2000 --length 10000
```

In stream mode (non-VAD), the same segment is refined across multiple inference steps:
- Same t0 and t1 values across multiple steps
- Text content may change as more context is available
- Console currently shows segment repeatedly
- UDP receiver needs to handle segment updates (same t0/t1 = update existing)
- Target device displays last N segments (e.g., 2) and updates based on timestamp
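One way to handle this on the sender side, as a suggestion rather than existing code: key segments by t0 and only retransmit when the text for that key changes.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical tracker: remembers the last text sent per segment start time
// (t0) and reports whether a segment is new or has changed since last send.
struct segment_tracker {
    std::map<int64_t, std::string> last_text; // t0 -> most recently sent text

    // Returns true if this (t0, text) pair should be (re)transmitted.
    bool should_send(int64_t t0, const std::string & text) {
        auto it = last_text.find(t0);
        if (it != last_text.end() && it->second == text) {
            return false; // same segment, unchanged text: skip
        }
        last_text[t0] = text; // new segment, or refined text for the same t0
        return true;
    }
};
```

A receiver can apply the same rule in reverse: a packet whose t0/t1 matches a displayed segment replaces it instead of appending a new one.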
- First, add the new command-line parameters to the `whisper_params` struct
- Implement the initial VAD check before whisper model loading
- Better integrate the existing silence detection code
- Add UDP socket functionality (platform-specific: POSIX vs Windows)
- Implement SRT formatting for UDP packets
- Test console output behavior across multiple steps
- Implement segment tracking and update logic for UDP transmission