A minimalistic, real-time speech-to-text dictation system using OpenAI's Whisper model.
- Real-Time Preview (Quick Chunk Mode)
  - Processes the last 3 seconds of audio in real time
  - Provides instant feedback as you speak
  - Fast, responsive transcription for monitoring (see the sketch after this list)
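Internally, the preview only ever has to look at the tail of the audio buffer. A minimal sketch of that idea, assuming 16 kHz mono audio; `audio_buffer` and `transcribe` are illustrative stand-ins, not the actual server.py code:

```python
import numpy as np

SAMPLE_RATE = 16_000        # assumed: Whisper consumes 16 kHz mono audio
QUICK_CHUNK_SECONDS = 3.0   # preview window (matches the config below)

def quick_preview(audio_buffer: np.ndarray, transcribe) -> str:
    """Transcribe only the most recent 3 seconds of the rolling buffer."""
    tail = int(QUICK_CHUNK_SECONDS * SAMPLE_RATE)
    return transcribe(audio_buffer[-tail:])  # slicing is cheap; the model call dominates
```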
- Final Transcription (Full Processing with Overlap)
  - When you click "Stop Recording", the entire recording is processed
  - Uses 30-second chunks with a 5-second overlap (see the sketch after this list)
  - Provides the most accurate "ground truth" transcription
  - Leverages Whisper's full context window for best results
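The chunking itself is easy to picture. A minimal sketch of 30-second windows with a 5-second overlap, again assuming 16 kHz audio in a NumPy array; merging the overlapping text afterwards is not shown:

```python
import numpy as np

SAMPLE_RATE = 16_000
FULL_CHUNK_SECONDS = 30.0
OVERLAP_SECONDS = 5.0

def overlapping_chunks(audio: np.ndarray):
    """Yield 30-second windows, each overlapping the previous one by 5 seconds."""
    chunk = int(FULL_CHUNK_SECONDS * SAMPLE_RATE)                     # 30 s of samples
    step = int((FULL_CHUNK_SECONDS - OVERLAP_SECONDS) * SAMPLE_RATE)  # 25 s stride
    for start in range(0, max(len(audio) - chunk, 0) + step, step):
        yield audio[start:start + chunk]
```

The overlap gives Whisper shared context at every chunk boundary, which is what keeps words from being clipped mid-chunk.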
- Quick preview: Gives you immediate feedback so you know the system is working
- Final processing: Ensures maximum accuracy by processing with proper overlap and context
- Best of both worlds: Real-time monitoring + production-quality output
- Python 3.8+
- CUDA-capable GPU (recommended) or CPU
- Microphone
- Double-click `setup.bat` to create the virtual environment and install dependencies
- Double-click `run.bat` to start the server
- Open your browser to http://localhost:8080
The Whisper model (~3GB) will download automatically on first run.
```bash
./setup.sh   # One-time setup
./run.sh     # Start server
```

Then open your browser to http://localhost:8080
Manual installation:
```bash
# Navigate to project directory
cd whisper_dictation

# Create virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# For CUDA support (recommended):
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118

# Start the server
python server.py

# Open browser to http://localhost:8080
```
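If you are unsure whether the CUDA wheel was actually picked up, a quick check from the activated environment:

```python
import torch

# True means the CUDA build of PyTorch sees your GPU; otherwise the
# server will fall back to the (slower) CPU path.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```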
Click "Start Recording"
- Grant microphone permissions when prompted
- Begin speaking
-
Monitor Real-Time Preview
- See transcription of your last 3 seconds of speech
- Updates continuously as you speak
-
Click "Stop Recording"
- Server processes entire recording with overlap
- Final transcription appears in the "Final Transcription" section
- This is your production-quality output
-
Click "Clear"
- Clears all text and resets for a new recording
- FastAPI web server with WebSocket support (a minimal endpoint sketch follows this list)
- Whisper Large v3 for transcription
- Real-time audio streaming via WebSocket
- Dual processing modes:
  - Quick: last 3 seconds for the preview
  - Full: entire recording in 30-second chunks with a 5-second overlap
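A sketch of what such an endpoint could look like. The `/ws` path, the message shapes, and the `transcribe_quick`/`transcribe_full` helpers are illustrative assumptions, not the actual server.py API:

```python
from fastapi import FastAPI, WebSocket

app = FastAPI()

def transcribe_quick(buffer: bytes) -> str:
    """Hypothetical helper: Whisper over the last ~3 s of audio."""
    ...

def transcribe_full(buffer: bytes) -> str:
    """Hypothetical helper: 30 s chunks with 5 s overlap over the whole buffer."""
    ...

@app.websocket("/ws")                      # path is an assumption
async def audio_ws(ws: WebSocket):
    await ws.accept()
    buffer = bytearray()                   # raw audio accumulated per connection
    while True:
        msg = await ws.receive()           # raw ASGI message dict
        if msg["type"] == "websocket.disconnect":
            break
        if msg.get("bytes"):
            buffer += msg["bytes"]
            # Quick mode: preview computed from the most recent ~3 s only
            await ws.send_json({"type": "preview", "text": transcribe_quick(bytes(buffer))})
        elif msg.get("text") == "stop":
            # Full mode: whole recording, 30 s chunks + 5 s overlap
            await ws.send_json({"type": "final", "text": transcribe_full(bytes(buffer))})
            break
```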
- Minimal HTML/CSS/JavaScript
- Web Audio API for microphone capture
- WebSocket for real-time audio streaming
- Responsive UI with visual feedback
Edit `server.py` to adjust parameters:

```python
QUICK_CHUNK_SECONDS = 3.0   # Preview window size
FULL_CHUNK_SECONDS = 30.0   # Full processing chunk size
OVERLAP_SECONDS = 5.0       # Overlap between chunks
MODEL_NAME = "openai/whisper-large-v3"  # Whisper model
```

- Preview: ~1-2 seconds (GPU) or ~3-5 seconds (CPU)
- Final processing: depends on recording length
  - ~2-3 seconds per 30-second chunk (GPU)
  - ~10-15 seconds per 30-second chunk (CPU)
- GPU: Recommended for real-time performance
- CPU: Works, but slower; the preview may lag
- ~6GB VRAM (GPU) or ~8GB RAM (CPU)
- Model loaded once at startup
- Scales with recording duration
On first run, Whisper models will be downloaded (~3GB):

- Models are cached in `~/.cache/huggingface/`
- Subsequent runs use the cached models (see the loading sketch below)
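Loading Whisper Large v3 through the Hugging Face `pipeline` API looks roughly like this; the exact loading code in server.py may differ, and `sample.wav` is just a placeholder file:

```python
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Constructing the pipeline triggers the ~3GB download on first run;
# later runs load from ~/.cache/huggingface/ instead.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",   # MODEL_NAME from the configuration above
    torch_dtype=torch.float16 if device != "cpu" else torch.float32,
    device=device,
)

print(asr("sample.wav")["text"])
```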
- Browser will prompt for microphone permissions
- Allow access for the application to work
- Check browser console for errors
- Ensure the server is running on port 8080 (a quick reachability check is sketched below)
- Check firewall settings
- Review server logs for errors
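A standard-library reachability check, assuming the default port:

```python
import urllib.request

try:
    with urllib.request.urlopen("http://localhost:8080", timeout=5) as resp:
        print("Server reachable, HTTP status:", resp.status)
except OSError as exc:
    print("Server not reachable:", exc)
```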
MIT License
- OpenAI Whisper for speech recognition
- FastAPI for web framework
- Hugging Face Transformers