Whisper Dictation

A minimalistic, real-time speech-to-text dictation system using OpenAI's Whisper model.

Features

Two-Stage Processing

  1. Real-Time Preview (Quick Chunk Mode)

    • Processes the last 3 seconds of audio in real-time
    • Provides instant feedback as you speak
    • Fast, responsive transcription for monitoring
  2. Final Transcription (Full Processing with Overlap)

    • When you click "Stop Recording", the server processes the entire recording
    • Uses 30-second chunks with a 5-second overlap
    • Produces the most accurate "ground truth" transcription
    • Leverages Whisper's full 30-second context window for best results
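The final pass can be pictured as a sliding window over the recording. A minimal sketch of that idea (the function name and exact boundary handling are illustrative, not taken from server.py):

```python
def chunk_bounds(total_seconds, chunk=30.0, overlap=5.0):
    """Yield (start, end) windows covering the recording: each new
    window starts `overlap` seconds before the previous one ended."""
    step = chunk - overlap          # 25 s of fresh audio per window
    start = 0.0
    while True:
        end = min(start + chunk, total_seconds)
        yield (start, end)
        if end >= total_seconds:
            break
        start += step

# A 70-second recording yields (0, 30), (25, 55), (50, 70).
```

The 5-second overlap means words that straddle a chunk boundary appear whole in at least one window, which is what lets the merged result stay accurate.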

Why Two Stages?

  • Quick preview: Gives you immediate feedback so you know the system is working
  • Final processing: Ensures maximum accuracy by processing with proper overlap and context
  • Best of both worlds: Real-time monitoring + production-quality output

Quick Start

Prerequisites

  • Python 3.8+
  • CUDA-capable GPU (recommended) or CPU
  • Microphone

Easy Installation (Windows)

  1. Double-click setup.bat to create a virtual environment and install dependencies
  2. Double-click run.bat to start the server
  3. Open a browser to http://localhost:8080

The Whisper model (~3GB) will download automatically on first run.

Easy Installation (Linux/Mac)

./setup.sh   # One-time setup
./run.sh     # Start server

Then open a browser to http://localhost:8080

Manual Installation

# Navigate to project directory
cd whisper_dictation

# Create virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# For CUDA support (recommended):
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118

Running Manually

# Start the server
python server.py

# Open browser
# Navigate to: http://localhost:8080

Usage

  1. Click "Start Recording"

    • Grant microphone permissions when prompted
    • Begin speaking
  2. Monitor Real-Time Preview

    • See transcription of your last 3 seconds of speech
    • Updates continuously as you speak
  3. Click "Stop Recording"

    • Server processes entire recording with overlap
    • Final transcription appears in the "Final Transcription" section
    • This is your production-quality output
  4. Click "Clear"

    • Clears all text and resets for a new recording

Architecture

Backend (server.py)

  • FastAPI web server with WebSocket support
  • Whisper Large v3 for transcription
  • Real-time audio streaming via WebSocket
  • Dual processing modes:
    • Quick: Last 3 seconds for preview
    • Full: Entire recording with 30s chunks + 5s overlap
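The quick-preview mode only ever needs the most recent 3 seconds of samples, which a bounded deque captures naturally. A sketch under assumed parameters (16 kHz mono audio; the class name is hypothetical, not the actual structure in server.py):

```python
from collections import deque

SAMPLE_RATE = 16000               # Whisper works on 16 kHz audio
QUICK_CHUNK_SECONDS = 3.0

class PreviewBuffer:
    """Rolling buffer that keeps only the last QUICK_CHUNK_SECONDS
    of audio; older samples fall off the front automatically."""

    def __init__(self):
        max_samples = int(SAMPLE_RATE * QUICK_CHUNK_SECONDS)
        self._samples = deque(maxlen=max_samples)

    def push(self, chunk):
        """Append newly streamed samples (any iterable of floats)."""
        self._samples.extend(chunk)

    def snapshot(self):
        """Samples to hand to Whisper for the preview pass."""
        return list(self._samples)
```

Because `deque(maxlen=...)` discards the oldest entries on overflow, each preview pass sees exactly the trailing window with no manual trimming.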

Frontend (static/index.html)

  • Minimal HTML/CSS/JavaScript
  • Web Audio API for microphone capture
  • WebSocket for real-time audio streaming
  • Responsive UI with visual feedback

Configuration

Edit server.py to adjust parameters:

QUICK_CHUNK_SECONDS = 3.0   # Preview window size
FULL_CHUNK_SECONDS = 30.0   # Full processing chunk size
OVERLAP_SECONDS = 5.0       # Overlap between chunks
MODEL_NAME = "openai/whisper-large-v3"  # Whisper model

Performance

Expected Latency

  • Preview: ~1-2 seconds (GPU) or ~3-5 seconds (CPU)
  • Final processing: Depends on recording length
    • ~2-3 seconds per 30s chunk (GPU)
    • ~10-15 seconds per 30s chunk (CPU)
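Those per-chunk figures make total latency easy to estimate. A back-of-the-envelope helper (the function name is illustrative):

```python
import math

def estimate_chunks(duration_s, chunk=30.0, overlap=5.0):
    """How many 30 s windows the final pass runs, given that
    consecutive windows share `overlap` seconds of audio."""
    if duration_s <= chunk:
        return 1
    step = chunk - overlap        # 25 s of new audio per window
    return 1 + math.ceil((duration_s - chunk) / step)

# A 2-minute recording: 1 + ceil(90 / 25) = 5 chunks,
# i.e. roughly 10-15 s of GPU time at 2-3 s per chunk.
```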

GPU vs CPU

  • GPU: Recommended for real-time performance
  • CPU: Works, but more slowly; the preview may lag

Memory Usage

  • ~6GB VRAM (GPU) or ~8GB RAM (CPU)
  • Model loaded once at startup
  • Scales with recording duration

Troubleshooting

Model Download

On first run, Whisper models will be downloaded (~3GB):

  • Models cached in ~/.cache/huggingface/
  • Subsequent runs use cached models

Microphone Access

  • Browser will prompt for microphone permissions
  • Allow access for the application to work
  • Check browser console for errors

WebSocket Connection

  • Ensure server is running on port 8080
  • Check firewall settings
  • Review server logs for errors
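If the browser cannot connect, a quick way to confirm whether anything is listening on the port at all (the helper name is illustrative):

```python
import socket

def server_reachable(host="localhost", port=8080, timeout=1.0):
    """True if a TCP connection to host:port succeeds, i.e. the
    server (or something else) is listening there."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns False while run.bat/run.sh appears to be running, check the server logs for a startup error or a port conflict.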

License

MIT License

Acknowledgments

  • OpenAI Whisper for speech recognition
  • FastAPI for web framework
  • Hugging Face Transformers
