A minimalistic, real-time speech-to-text dictation system using OpenAI's Whisper model.
- Real-Time Preview (Quick Chunk Mode)
  - Processes the last 3 seconds of audio in real time
  - Provides instant feedback as you speak
  - Fast, responsive transcription for monitoring (see the sketch after this list)
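Internally, the preview only ever has to look at the tail of the audio buffer. A minimal sketch of that idea, assuming 16 kHz mono audio; `audio_buffer` and `transcribe` are illustrative stand-ins, not the actual server.py code:

```python
import numpy as np

SAMPLE_RATE = 16_000        # assumed: Whisper consumes 16 kHz mono audio
QUICK_CHUNK_SECONDS = 3.0   # preview window (matches the config below)

def quick_preview(audio_buffer: np.ndarray, transcribe) -> str:
    """Transcribe only the most recent 3 seconds of the rolling buffer."""
    tail = int(QUICK_CHUNK_SECONDS * SAMPLE_RATE)
    return transcribe(audio_buffer[-tail:])  # slicing is cheap; the model call dominates
```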
- Final Transcription (Full Processing with Overlap)
  - When you click "Stop Recording", the entire recording is processed
  - Uses 30-second chunks with a 5-second overlap (see the sketch after this list)
  - Provides the most accurate "ground truth" transcription
  - Leverages Whisper's full context window for best results
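The chunking itself is easy to picture. A minimal sketch of 30-second windows with a 5-second overlap, again assuming 16 kHz audio in a NumPy array; merging the overlapping text afterwards is not shown:

```python
import numpy as np

SAMPLE_RATE = 16_000
FULL_CHUNK_SECONDS = 30.0
OVERLAP_SECONDS = 5.0

def overlapping_chunks(audio: np.ndarray):
    """Yield 30-second windows, each overlapping the previous one by 5 seconds."""
    chunk = int(FULL_CHUNK_SECONDS * SAMPLE_RATE)                     # 30 s of samples
    step = int((FULL_CHUNK_SECONDS - OVERLAP_SECONDS) * SAMPLE_RATE)  # 25 s stride
    for start in range(0, max(len(audio) - chunk, 0) + step, step):
        yield audio[start:start + chunk]
```

The overlap gives Whisper shared context at every chunk boundary, which is what keeps words from being clipped mid-chunk.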
- Quick preview: Gives you immediate feedback so you know the system is working
- Final processing: Ensures maximum accuracy by processing with proper overlap and context
- Best of both worlds: Real-time monitoring + production-quality output
- Python 3.8+
- CUDA-capable GPU (recommended) or CPU
- Microphone
- Double-click `setup.bat` to create the virtual environment and install dependencies
- Double-click `run.bat` to start the server
- Open your browser to http://localhost:8080
The Whisper model (~3GB) will download automatically on first run.
```bash
./setup.sh   # One-time setup
./run.sh     # Start server
```

Then open your browser to http://localhost:8080
Manual installation:
```bash
# Navigate to project directory
cd whisper_dictation

# Create virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# For CUDA support (recommended):
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118

# Start the server
python server.py

# Open browser to http://localhost:8080
```
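If you are unsure whether the CUDA wheel was actually picked up, a quick check from the activated environment:

```python
import torch

# True means the CUDA build of PyTorch sees your GPU; otherwise the
# server will fall back to the (slower) CPU path.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```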
Click "Start Recording"
- Grant microphone permissions when prompted
- Begin speaking
-
Monitor Real-Time Preview
- See transcription of your last 3 seconds of speech
- Updates continuously as you speak
-
Click "Stop Recording"
- Server processes entire recording with overlap
- Final transcription appears in the "Final Transcription" section
- This is your production-quality output
-
Click "Clear"
- Clears all text and resets for a new recording
- FastAPI web server with WebSocket support (a minimal endpoint sketch follows this list)
- Whisper Large v3 for transcription
- Real-time audio streaming via WebSocket
- Dual processing modes:
  - Quick: last 3 seconds for the preview
  - Full: entire recording in 30-second chunks with a 5-second overlap
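A sketch of what such an endpoint could look like. The `/ws` path, the message shapes, and the `transcribe_quick`/`transcribe_full` helpers are illustrative assumptions, not the actual server.py API:

```python
from fastapi import FastAPI, WebSocket

app = FastAPI()

def transcribe_quick(buffer: bytes) -> str:
    """Hypothetical helper: Whisper over the last ~3 s of audio."""
    ...

def transcribe_full(buffer: bytes) -> str:
    """Hypothetical helper: 30 s chunks with 5 s overlap over the whole buffer."""
    ...

@app.websocket("/ws")                      # path is an assumption
async def audio_ws(ws: WebSocket):
    await ws.accept()
    buffer = bytearray()                   # raw audio accumulated per connection
    while True:
        msg = await ws.receive()           # raw ASGI message dict
        if msg["type"] == "websocket.disconnect":
            break
        if msg.get("bytes"):
            buffer += msg["bytes"]
            # Quick mode: preview computed from the most recent ~3 s only
            await ws.send_json({"type": "preview", "text": transcribe_quick(bytes(buffer))})
        elif msg.get("text") == "stop":
            # Full mode: whole recording, 30 s chunks + 5 s overlap
            await ws.send_json({"type": "final", "text": transcribe_full(bytes(buffer))})
            break
```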
- Minimal HTML/CSS/JavaScript
- Web Audio API for microphone capture
- WebSocket for real-time audio streaming
- Responsive UI with visual feedback
Edit `server.py` to adjust parameters:

```python
QUICK_CHUNK_SECONDS = 3.0   # Preview window size
FULL_CHUNK_SECONDS = 30.0   # Full processing chunk size
OVERLAP_SECONDS = 5.0       # Overlap between chunks
MODEL_NAME = "openai/whisper-large-v3"  # Whisper model
```

- Preview: ~1-2 seconds (GPU) or ~3-5 seconds (CPU)
- Final processing: depends on recording length
  - ~2-3 seconds per 30-second chunk (GPU)
  - ~10-15 seconds per 30-second chunk (CPU)
- GPU: Recommended for real-time performance
- CPU: Works, but slower; the preview may lag
- ~6GB VRAM (GPU) or ~8GB RAM (CPU)
- Model loaded once at startup
- Scales with recording duration
On first run, Whisper models will be downloaded (~3GB):

- Models are cached in `~/.cache/huggingface/`
- Subsequent runs use the cached models (see the loading sketch below)
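Loading Whisper Large v3 through the Hugging Face `pipeline` API looks roughly like this; the exact loading code in server.py may differ, and `sample.wav` is just a placeholder file:

```python
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Constructing the pipeline triggers the ~3GB download on first run;
# later runs load from ~/.cache/huggingface/ instead.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",   # MODEL_NAME from the configuration above
    torch_dtype=torch.float16 if device != "cpu" else torch.float32,
    device=device,
)

print(asr("sample.wav")["text"])
```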
- Browser will prompt for microphone permissions
- Allow access for the application to work
- Check browser console for errors
- Ensure the server is running on port 8080 (a quick reachability check is sketched below)
- Check firewall settings
- Review server logs for errors
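A standard-library reachability check, assuming the default port:

```python
import urllib.request

try:
    with urllib.request.urlopen("http://localhost:8080", timeout=5) as resp:
        print("Server reachable, HTTP status:", resp.status)
except OSError as exc:
    print("Server not reachable:", exc)
```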
MIT License
- OpenAI Whisper for speech recognition
- FastAPI for web framework
- Hugging Face Transformers