CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Overview

whisper.cpp is a high-performance C/C++ implementation of OpenAI's Whisper automatic speech recognition (ASR) model. The implementation emphasizes portability, efficiency, and minimal dependencies, supporting multiple hardware acceleration backends.

Build Commands

Standard Build (CMake)

# Configure and build (Release mode)
cmake -B build
cmake --build build -j --config Release

# Built binaries are in: build/bin/

Quick Start (Makefile)

# Download model and run on all samples
make base.en

# Just build
make build

Build with Hardware Acceleration

CUDA (NVIDIA GPUs):

cmake -B build -DGGML_CUDA=1
cmake --build build -j --config Release

Metal (Apple Silicon): Metal is automatically enabled on macOS with Apple Silicon.

OpenVINO (Intel CPUs/GPUs):

# First source OpenVINO environment
source /path/to/openvino/setupvars.sh

cmake -B build -DWHISPER_OPENVINO=1
cmake --build build -j --config Release

Core ML (Apple Neural Engine):

cmake -B build -DWHISPER_COREML=1
cmake --build build -j --config Release

Vulkan (Cross-vendor GPU):

cmake -B build -DGGML_VULKAN=1
cmake --build build -j --config Release

OpenBLAS (CPU acceleration):

cmake -B build -DGGML_BLAS=1
cmake --build build -j --config Release

Build Options

FFmpeg support (Linux only, for additional audio formats):

cmake -B build -DWHISPER_FFMPEG=yes
cmake --build build -j --config Release

SDL2 support (for real-time audio input examples):

cmake -B build -DWHISPER_SDL2=ON
cmake --build build -j --config Release

Testing

Run Integration Tests

# From tests/ directory
cd tests
./run-tests.sh base.en

# With specific thread count
./run-tests.sh base.en 4

The test suite downloads audio samples, transcribes them, and compares output against reference transcripts using git diff for visual inspection.

Build and Run Tests

# Tests are built by default in standalone mode
cmake -B build -DWHISPER_BUILD_TESTS=ON
cmake --build build -j --config Release
ctest --test-dir build

Running Examples

Basic Transcription

# Download a model first
./models/download-ggml-model.sh base.en

# Transcribe an audio file (16-bit WAV, 16kHz, mono)
./build/bin/whisper-cli -f samples/jfk.wav -m models/ggml-base.en.bin

# Convert audio to correct format
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav

Real-time Stream Example

# Requires SDL2
./build/bin/whisper-stream -m models/ggml-base.en.bin -t 8 --step 500 --length 5000

Server Example

./build/bin/whisper-server -m models/ggml-base.en.bin

Architecture Overview

Core Components

whisper.cpp/whisper.h (src/whisper.cpp, include/whisper.h):

  • Main implementation of the Whisper model (9000+ lines)
  • Contains the complete inference engine
  • C-style API for language bindings
  • Thread-safe when contexts are not shared

ggml (ggml/):

  • Submodule containing the machine learning tensor library
  • Provides low-level operations (matrix multiplication, attention, etc.)
  • Supports multiple hardware backends (CPU, CUDA, Metal, Vulkan, etc.)
  • whisper.cpp builds on top of ggml primitives

whisper-arch.h (src/whisper-arch.h):

  • Defines tensor naming conventions mapping to Whisper architecture
  • Enumerates encoder, decoder, and cross-attention tensor types
  • Used for model loading and conversion

Model Architecture

Whisper uses an encoder-decoder transformer architecture:

  • Encoder: Processes audio spectrograms (mel-frequency features)
  • Decoder: Generates text tokens autoregressively
  • Cross-attention: Links encoder outputs to decoder

The encoder can be offloaded to specialized hardware:

  • Core ML (Apple Neural Engine)
  • OpenVINO (Intel devices)
  • Standard backends (CUDA, Metal, etc.)

Hardware Acceleration Strategy

  1. Encoder acceleration: Primary target for hardware offload since it's the most computationally intensive part
  2. Decoder runs on CPU/GPU: Less intensive, runs after encoder
  3. Mixed precision: Uses F16/F32 for better performance
  4. Quantization support: Q5_0 and other quantized formats reduce memory and improve speed
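
Quantized models can be produced with the bundled quantize tool, for example:

# Quantize a ggml model to Q5_0
./build/bin/quantize models/ggml-base.en.bin models/ggml-base.en-q5_0.bin q5_0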

Examples Structure

Located in examples/, each with its own subdirectory:

  • cli: Main command-line transcription tool (formerly "main")
  • stream: Real-time audio transcription
  • server: HTTP server with OpenAI-compatible API
  • bench: Performance benchmarking tool
  • command: Voice command recognition
  • talk-llama: Integration with LLaMA for conversational AI
  • quantize: Model quantization utility

Common code shared across examples:

  • examples/common.cpp/h: General utilities (arg parsing, file I/O)
  • examples/common-whisper.cpp/h: Whisper-specific utilities
  • examples/common-sdl.cpp/h: SDL2 audio capture utilities
  • examples/grammar-parser.cpp/h: Grammar parsing for constrained generation

Language Bindings

Located in bindings/:

  • go: Go bindings
  • java: Java bindings
  • javascript: Node.js/JavaScript bindings
  • ruby: Ruby bindings

Each binding wraps the C API from include/whisper.h.

Model Format

Models use a custom ggml format (not PyTorch):

  • Single-file format containing weights, vocabulary, and mel filters
  • Download pre-converted models: ./models/download-ggml-model.sh [model-name]
  • Or convert manually: models/convert-pt-to-ggml.py
  • Available models: tiny, base, small, medium, large-v1, large-v2, large-v3, large-v3-turbo
  • Models ending in .en are English-only
  • Models ending in -q5_0 are quantized
  • Models ending in -tdrz support speaker diarization

Core ML Integration

When Core ML is enabled:

  1. Generate Core ML model: ./models/generate-coreml-model.sh base.en
  2. This creates models/ggml-base.en-encoder.mlmodelc directory
  3. At runtime, encoder automatically uses Core ML if available
  4. First run compiles to device-specific format (slow), subsequent runs are fast

OpenVINO Integration

When OpenVINO is enabled:

  1. Convert model: python models/convert-whisper-to-openvino.py --model base.en
  2. This creates ggml-base.en-encoder-openvino.xml/.bin files
  3. Source OpenVINO environment before building
  4. First run compiles to device-specific blob (cached for future runs)

Development Notes

Audio Input Requirements

Whisper expects:

  • Sample rate: 16kHz (WHISPER_SAMPLE_RATE = 16000)
  • Format: 32-bit floating point PCM in range [-1.0, 1.0]
  • Channels: Mono (single channel)

Most examples handle conversion internally, but whisper-cli requires pre-converted WAV files.
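
When passing audio to the C API directly, 16-bit PCM must first be converted to normalized floats. A minimal sketch (pcm16 and n_samples are hypothetical inputs, not API names):

#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: convert 16-bit signed PCM into the normalized 32-bit float
// format whisper expects.
static std::vector<float> pcm16_to_f32(const int16_t * pcm16, size_t n_samples) {
    std::vector<float> pcmf32(n_samples);
    for (size_t i = 0; i < n_samples; ++i) {
        pcmf32[i] = float(pcm16[i]) / 32768.0f; // scale into [-1.0, 1.0]
    }
    return pcmf32;
}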

Code Organization

  • Core library code in src/
  • Public API in include/whisper.h
  • Examples in examples/ (self-contained with their own mains)
  • Tests in tests/
  • Model utilities in models/
  • Hardware-specific code in src/coreml/ and src/openvino/

CMake Configuration

Important CMake options (see CMakeLists.txt line 66+):

  • WHISPER_BUILD_EXAMPLES: Build example programs (default: ON in standalone)
  • WHISPER_BUILD_TESTS: Build test suite (default: ON in standalone)
  • WHISPER_BUILD_SERVER: Build server example (default: ON in standalone)
  • WHISPER_COREML: Enable Core ML support
  • WHISPER_OPENVINO: Enable OpenVINO support
  • WHISPER_SDL2: Enable SDL2 for audio input
  • WHISPER_FFMPEG: Enable FFmpeg for additional audio formats (Linux only)
  • GGML_CUDA: Enable CUDA support (replaces deprecated WHISPER_CUDA)
  • GGML_METAL: Enable Metal support (automatic on macOS)
  • GGML_VULKAN: Enable Vulkan support
  • GGML_BLAS: Enable BLAS/OpenBLAS support

Model Loading Flow

  1. Load model file (whisper_init_from_file or whisper_init_from_buffer)
  2. Parse ggml format: magic number, hparams, mel filters, vocabulary, tensors
  3. Initialize hardware backend if available (Core ML, OpenVINO)
  4. Create inference state
  5. Ready for transcription via whisper_full()

Transcription Flow

  1. Convert audio to mel spectrogram (using mel filters from model)
  2. Run encoder on mel spectrogram (potentially on specialized hardware)
  3. Run decoder autoregressively to generate tokens
  4. Convert tokens to text using vocabulary
  5. Apply timestamp alignment if requested
  6. Return segments with text and timestamps
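
Both flows map onto a handful of C API calls from include/whisper.h (the *_with_params variants are the current, non-deprecated init functions). A minimal sketch, with error handling mostly omitted and pcmf32 assumed to already hold 16kHz mono float samples:

#include "whisper.h"

#include <cstdio>
#include <vector>

// Sketch of the load + transcribe flow described above.
static void transcribe(const std::vector<float> & pcmf32) {
    struct whisper_context * ctx = whisper_init_from_file_with_params(
            "models/ggml-base.en.bin", whisper_context_default_params());
    if (ctx == nullptr) {
        return;
    }

    whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);

    if (whisper_full(ctx, wparams, pcmf32.data(), (int) pcmf32.size()) == 0) {
        for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
            printf("[%lld -> %lld] %s\n",
                   (long long) whisper_full_get_segment_t0(ctx, i),
                   (long long) whisper_full_get_segment_t1(ctx, i),
                   whisper_full_get_segment_text(ctx, i));
        }
    }

    whisper_free(ctx);
}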

Voice Activity Detection (VAD)

VAD preprocessing filters silence before transcription:

  • Download VAD model: ./models/download-vad-model.sh silero-v6.2.0
  • Use with --vad and -vm path/to/vad-model.bin
  • Significantly speeds up processing by skipping silence
  • Configurable thresholds and parameters (see README section on VAD)

Debugging

  • Use --debug-mode flag in examples for verbose output
  • Print internal state with --print-special and --print-colors
  • Enable sanitizers: -DWHISPER_SANITIZE_ADDRESS=ON, -DWHISPER_SANITIZE_THREAD=ON
  • Check system info in output to verify hardware acceleration is active

Common Patterns

Adding a New Example

  1. Create directory in examples/your-example/
  2. Add your-example.cpp with main()
  3. Create CMakeLists.txt linking to whisper library
  4. Add to examples/CMakeLists.txt
  5. Use common utilities from examples/common.h for arg parsing
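
A minimal CMakeLists.txt for step 3, modeled on the existing examples (the target name is illustrative):

set(TARGET your-example)
add_executable(${TARGET} your-example.cpp)
target_link_libraries(${TARGET} PRIVATE common whisper ${CMAKE_THREAD_LIBS_INIT})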

Adding Hardware Backend Support

Most backend support goes through ggml. For encoder offloading (Core ML, OpenVINO):

  1. Conversion script in models/ to generate backend-specific model
  2. Runtime code in src/coreml/ or src/openvino/
  3. CMake option to enable/disable
  4. Initialization in whisper_init_state() checks for backend availability
  5. Encoder execution routes through backend if available, falls back to ggml otherwise

Working with Models

Model files are portable but backend-specific accelerator models are not:

  • ggml models work everywhere
  • Core ML models are macOS/iOS specific
  • OpenVINO models are device-specific after first compilation
  • Always include both base ggml model and accelerator model for full functionality

Current Project: Enhanced whisper-stream

Goal

Adapt examples/stream/stream.cpp to support an enhanced real-time transcription use case: an initial voice-activation check, continual silence detection, and UDP network output.

Key Requirements

1. Initial Voice Activation Check

  • Start with VAD only (no whisper model loaded) for a configurable duration (default: 10 seconds)
  • Check audio from the capture device in 2-second increments
  • If NO speech is detected during the initial period, exit before loading the whisper model
  • This saves resources when no speech is present
  • Uses the existing vad_simple() functionality from examples/stream/stream.cpp:283 (see the sketch below)
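
A sketch of this startup loop, reusing the audio_async helper and the vad_simple() call from stream.cpp; vad_startup_ms is the proposed new option, and note that vad_simple() returns true when the tail of the buffer is quiet, matching its silence-count usage quoted under Technical Notes below:

// Sketch: poll the capture device in 2-second chunks for up to
// params.vad_startup_ms (proposed option) and exit before loading the
// whisper model if no speech activity is ever observed.
bool speech_detected = false;
for (int elapsed_ms = 0; elapsed_ms < params.vad_startup_ms; elapsed_ms += 2000) {
    std::this_thread::sleep_for(std::chrono::milliseconds(2000));
    audio.get(2000, pcmf32_new); // last 2 seconds from the capture device

    if (!::vad_simple(pcmf32_new, WHISPER_SAMPLE_RATE, 1000,
                      params.vad_thold, params.freq_thold, false)) {
        speech_detected = true; // activity detected, proceed to model load
        break;
    }
}
if (!speech_detected) {
    fprintf(stderr, "no speech detected in the startup window, exiting\n");
    return 2;
}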

2. Continual Silence Detection

  • Already partially implemented at examples/stream/stream.cpp:249-252
  • Current code: exits after 3 minutes (180 seconds) of continual silence
  • Needs better integration with existing code structure
  • Should match existing coding style and patterns
  • Make the silence timeout configurable via a command-line parameter (see the sketch below)
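
A sketch of the configurable variant of the existing check (the current hard-coded version is quoted under Technical Notes below); silence_timeout_ms is the proposed new parameter:

// Sketch: same exit condition as the current hard-coded 3-minute check,
// but driven by the proposed silence_timeout_ms option (0 disables it).
if (params.silence_timeout_ms > 0 &&
    silence_count * params.step_ms >= params.silence_timeout_ms) {
    printf("silent for %d steps. bye!\n", silence_count);
    is_running = false;
}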

3. UDP Network Output

  • Send current transcription segment as UDP packets
  • Use SRT format (see examples/cli/cli.cpp:489-506 for reference)
  • Each packet should contain:
    • Segment number/ID
    • Start time (t0)
    • End time (t1)
    • Transcription text
    • Current timestamp (for network unreliability handling)
  • Target host/port should be configurable
  • Copy output_srt() formatting logic into stream.cpp to minimize dependencies
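
A minimal POSIX sketch of the send path (Windows builds would need the winsock equivalents; udp_send and its arguments are illustrative, and a real implementation would keep one socket open instead of reopening it per packet):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <string>

// Sketch: fire-and-forget UDP send of one formatted segment.
static bool udp_send(const std::string & host, int port, const std::string & payload) {
    const int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) {
        return false;
    }

    sockaddr_in addr = {};
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(port);
    if (inet_pton(AF_INET, host.c_str(), &addr.sin_addr) != 1) {
        close(sock);
        return false; // this sketch expects a dotted-quad address
    }

    const ssize_t n = sendto(sock, payload.data(), payload.size(), 0,
                             (const sockaddr *) &addr, sizeof(addr));
    close(sock);
    return n == (ssize_t) payload.size();
}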

4. Console Output Enhancement

  • Current behavior: the same segment is printed repeatedly (roughly length/step times) as it is refined
  • Each inference step updates the current segment (same t0/t1, changing text)
  • Need to identify segments consistently across updates (by t0/t1 or other means)
  • Consider experimenting with console output before implementing the UDP transfer

5. Implementation Constraints

  • Do NOT modify existing code - only add new functionality
  • This makes it easier to incorporate upstream changes from remote repo
  • All new features should be toggleable via command-line parameters
  • All features should be OFF by default (opt-in)
  • Follow existing code style and patterns

Configuration Parameters to Add

New command-line options needed:

  • --vad-startup-ms N: Duration of initial VAD check before loading whisper model (default: 10000ms, 0 = disabled)
  • --silence-timeout-ms N: Duration of continual silence before exit (default: 180000ms, 0 = disabled)
  • --udp-host HOST: Target host for UDP output (default: none)
  • --udp-port PORT: Target port for UDP output (default: none)
  • --udp-enable: Enable UDP output (default: false)
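
These would translate into new fields on the existing whisper_params struct in stream.cpp, roughly as follows (a sketch; the defaults mirror the options above):

// Sketch of proposed additions to whisper_params in stream.cpp.
int32_t     vad_startup_ms     = 10000;  // 0 = skip the initial VAD check
int32_t     silence_timeout_ms = 180000; // 0 = never exit on silence
std::string udp_host;                    // target host for UDP output
int32_t     udp_port           = 0;      // target port for UDP output
bool        udp_enable         = false;  // UDP output off by default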

Technical Notes

SRT Format Reference

From examples/cli/cli.cpp:489-506:

static void output_srt(struct whisper_context * ctx, std::ofstream & fout,
                       const whisper_params & params, std::vector<std::vector<float>> pcmf32s) {
    const int n_segments = whisper_full_n_segments(ctx);
    for (int i = 0; i < n_segments; ++i) {
        const char * text = whisper_full_get_segment_text(ctx, i);
        const int64_t t0 = whisper_full_get_segment_t0(ctx, i);
        const int64_t t1 = whisper_full_get_segment_t1(ctx, i);

        fout << i + 1 + params.offset_n << "\n";
        fout << to_timestamp(t0, true) << " --> " << to_timestamp(t1, true) << "\n";
        fout << text << "\n\n";
    }
}

The to_timestamp() function is in examples/common-whisper.cpp:138-151:

std::string to_timestamp(int64_t t, bool comma) {
    int64_t msec = t * 10;
    int64_t hr = msec / (1000 * 60 * 60);
    msec = msec - hr * (1000 * 60 * 60);
    int64_t min = msec / (1000 * 60);
    msec = msec - min * (1000 * 60);
    int64_t sec = msec / 1000;
    msec = msec - sec * 1000;

    char buf[32];
    snprintf(buf, sizeof(buf), "%02d:%02d:%02d%s%03d",
             (int) hr, (int) min, (int) sec, comma ? "," : ".", (int) msec);
    return std::string(buf);
}

Current Stream Implementation Details

VAD usage (examples/stream/stream.cpp:137):

const bool use_vad = n_samples_step <= 0; // sliding window mode uses VAD

Current silence detection (examples/stream/stream.cpp:249-252):

if (silence_count * params.step_ms >= 1000 * 60 * 3) {
    printf("silent for %d steps. bye!\n", silence_count);
    is_running = false;
}

VAD check logic (examples/stream/stream.cpp:283-287):

if (::vad_simple(pcmf32_new, WHISPER_SAMPLE_RATE, 1000, params.vad_thold, params.freq_thold, false)) {
    silence_count++;
} else {
    silence_count = 0;
}

Typical usage pattern:

# User typically runs with 2s step and 10s length
./build/bin/whisper-stream -m models/ggml-base.en.bin --step 2000 --length 10000

Segment Update Behavior

In stream mode (non-VAD), the same segment is refined across multiple inference steps:

  • Same t0 and t1 values across multiple steps
  • Text content may change as more context is available
  • Console currently shows segment repeatedly
  • UDP receiver needs to handle segment updates (same t0/t1 = update existing)
  • Target device displays last N segments (e.g., 2) and updates based on timestamp
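
One way to implement the update detection is to key segments on their (t0, t1) pair and only resend when the text changes (a sketch, not existing code):

#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Sketch: remember the last text sent per (t0, t1) segment so the UDP
// sender only transmits when a segment is new or its text has changed.
static std::map<std::pair<int64_t, int64_t>, std::string> sent_segments;

static bool segment_changed(int64_t t0, int64_t t1, const std::string & text) {
    const auto key = std::make_pair(t0, t1);
    const auto it  = sent_segments.find(key);
    if (it != sent_segments.end() && it->second == text) {
        return false; // unchanged, nothing to resend
    }
    sent_segments[key] = text; // new segment or refined text
    return true;
}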

Implementation Strategy

  1. First, add new command-line parameters to whisper_params struct
  2. Implement initial VAD check before whisper model loading
  3. Better integrate existing silence detection code
  4. Add UDP socket functionality (platform-specific: POSIX vs Windows)
  5. Implement SRT formatting for UDP packets
  6. Test console output behavior across multiple steps
  7. Implement segment tracking and update logic for UDP transmission
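
For step 5, the SRT-style payload could be assembled as follows (a sketch; to_timestamp() is the helper quoted above, and the trailing wall-clock timestamp covers the network-unreliability requirement):

#include <chrono>
#include <cstdint>
#include <sstream>
#include <string>

// Sketch: format one segment as an SRT-style UDP payload, appending a
// wall-clock timestamp so the receiver can drop stale or reordered packets.
static std::string format_udp_segment(int id, int64_t t0, int64_t t1,
                                      const std::string & text) {
    const auto now_ms = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::system_clock::now().time_since_epoch()).count();

    std::ostringstream oss;
    oss << id << "\n"
        << to_timestamp(t0, true) << " --> " << to_timestamp(t1, true) << "\n"
        << text << "\n"
        << now_ms << "\n";
    return oss.str();
}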