daryljones/speech-to-text

speech-to-text — Flask UI for faster-whisper transcription

This project provides a small Flask web UI to upload audio files (mp3, wav, flac) and transcribe them using faster-whisper.

Key files

  • app.py — Flask application and transcription handler (uses faster-whisper)
  • templates/ — Jinja2 templates (index.html, result.html)
  • static/ — styles, images and client JS
  • pyproject.toml — project manifest (dependencies)
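As a rough illustration of what a transcription handler like the one in app.py might do, here is a minimal sketch using the public faster-whisper API. The function names `join_segments` and `transcribe_file` and the `"small"` model size are assumptions made for this example; they are not taken from the repository.

```python
from types import SimpleNamespace


def join_segments(segments):
    """Concatenate the text of transcription segments into one string."""
    return " ".join(seg.text.strip() for seg in segments if seg.text.strip())


def transcribe_file(path, model_size="small", device="cpu", compute_type="float32"):
    """Transcribe one audio file; model_size is an illustrative choice."""
    # Imported lazily so join_segments can be used without the model installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device=device, compute_type=compute_type)
    segments, info = model.transcribe(path)  # segments is a lazy generator
    return join_segments(segments), info.language


# Stand-in segments show the joining step without loading a model.
demo = [SimpleNamespace(text=" Hello there."), SimpleNamespace(text=" How are you?")]
print(join_segments(demo))
```

Because `model.transcribe` yields segments lazily, the actual decoding work happens while `join_segments` iterates over them.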

Quick start (development)

  1. Sync dependencies and create the virtual environment with uv:
uv sync
  2. Start the app (development):
uv run python -m app
# or use the venv python directly
./.venv/bin/python -m app

Open http://127.0.0.1:8000/ in your browser.

Running in production with Gunicorn

  1. Ensure dependencies are installed (see above):
uv sync
  2. Start Gunicorn using the included config:
gunicorn -c gunicorn_conf.py app:app

With the included config, Gunicorn binds to 0.0.0.0:8000 (all network interfaces); adjust gunicorn_conf.py as needed.
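For orientation, a Gunicorn config file of this kind could look like the sketch below. Only the bind address comes from this README; the worker count and timeout are assumptions chosen for illustration, and the repository's actual gunicorn_conf.py may differ.

```python
# gunicorn_conf.py (illustrative sketch; the repository's file may differ)
import multiprocessing

# Address and port stated in this README.
bind = "0.0.0.0:8000"

# Assumed value: keep the worker count modest, since each worker process
# may load its own copy of the model.
workers = max(1, multiprocessing.cpu_count() // 2)

# Assumed value: transcribing long audio can exceed Gunicorn's 30 s default.
timeout = 600

# Write the access log to stdout.
accesslog = "-"
```

A generous `timeout` matters here because a synchronous worker that is still transcribing when the timeout expires is killed and restarted.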

What the UI does

  • Upload an audio file (mp3, wav, flac, au).
  • The UI shows upload progress and an indeterminate "Transcribing..." state while the server runs faster-whisper.
  • After completion the transcript is displayed and you can copy it, download it as a .txt, or go back to upload another file.
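The format restriction on uploads can be sketched as a simple extension check. The names `ALLOWED_EXTENSIONS` and `is_allowed` are hypothetical, and the extension set is taken from the formats listed above; the real app may validate uploads differently.

```python
from pathlib import Path

# Extensions taken from the formats listed in this README (assumed complete).
ALLOWED_EXTENSIONS = {".mp3", ".wav", ".flac", ".au"}


def is_allowed(filename):
    """Return True when the file name ends in a supported audio extension."""
    return Path(filename).suffix.lower() in ALLOWED_EXTENSIONS


print(is_allowed("interview.MP3"), is_allowed("notes.txt"))
```

An extension check only filters obvious mistakes; it does not prove the file contains decodable audio.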

Where files are stored

  • Uploaded audio files are saved to a temporary file on disk (OS temp directory, e.g. /tmp) only for the duration of transcription and are deleted immediately after processing.
  • Transcribed text is not persisted to disk by default; it is rendered into the result page and kept in memory only for the request.
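The temporary-file lifecycle described above can be sketched as follows. The helper name `process_upload` and its `handler` callback are assumptions for this example, not code from app.py; the point is that the file is removed in a `finally` block, so it is deleted even when processing fails.

```python
import os
import tempfile


def process_upload(data, suffix, handler):
    """Write data to a temp file, run handler(path), then always delete the file."""
    tmp = tempfile.NamedTemporaryFile(delete=False, suffix=suffix)
    try:
        with tmp:  # closes the file so the handler can reopen it by name
            tmp.write(data)
        return handler(tmp.name)
    finally:
        os.remove(tmp.name)  # runs on success and on error


# Stand-in handler: report the size of the stored upload.
print(process_upload(b"RIFF", ".wav", os.path.getsize))
```

In a real handler, the callback would be the transcription call, and the returned transcript would be the only thing that outlives the request.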

Environment variables

  • FLASK_SECRET — Flask secret key used for session/flashing (default: change-me)
  • WHISPER_DEVICE — optional override for model device (cpu, cuda, mps)
  • WHISPER_COMPUTE_TYPE — optional override for compute type (float32, float16)

These env vars can be set before running the app, for example:

export WHISPER_DEVICE=cuda
export WHISPER_COMPUTE_TYPE=float16
export FLASK_SECRET='a-secret'
uv run python -m app
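On the Python side, reading these variables might look like the sketch below. The function name `load_settings` and the returned dictionary layout are hypothetical; only the variable names and the `change-me` default come from this README.

```python
import os


def load_settings(env=os.environ):
    """Collect the documented environment variables into one dictionary."""
    return {
        # Default documented above; replace it outside of local development.
        "secret": env.get("FLASK_SECRET", "change-me"),
        # None means "no override": the app picks a device and compute type itself.
        "device": env.get("WHISPER_DEVICE") or None,
        "compute_type": env.get("WHISPER_COMPUTE_TYPE") or None,
    }


print(load_settings({"WHISPER_DEVICE": "cuda"}))
```

Passing the environment as a parameter keeps the function easy to exercise with a plain dictionary.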

Notes on performance & model files

  • The first run downloads model weights to the Hugging Face (HF) cache on your machine. faster-whisper runs on the CTranslate2 inference engine, so the downloaded files are CTranslate2-format models; they stay in the standard HF cache location and are reused on later runs.
  • On CPU-only machines the app prefers float32 to avoid inefficient float16 conversions; on GPU-enabled machines the code attempts to select float16 for better throughput.
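The default choice of compute type described above can be sketched as a small function. The name `default_compute_type` is hypothetical, and treating only `cuda` as the GPU case is an assumption of this sketch; the logic in app.py may be more involved.

```python
def default_compute_type(device):
    """Pick float16 on a CUDA GPU and float32 everywhere else."""
    # Assumption: only "cuda" is treated as GPU-enabled in this sketch.
    return "float16" if device == "cuda" else "float32"


for device in ("cpu", "cuda"):
    print(device, default_compute_type(device))
```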

Security & privacy

  • Uploaded audio and generated transcripts are not retained by default. If you change the app to persist data, consider access controls and GDPR/privacy requirements.

Extending persistence (optional)

  • If you want to keep transcripts or uploads, two low-risk approaches are:
    • Save files to an uploads/ folder and transcripts to transcripts/ as .txt files (simple, file-based). Add rotation/cleanup.
    • Save transcripts to a small SQLite database with metadata (filename, timestamp, duration, language).
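The SQLite option could start from something like the sketch below, using only the standard library. The table name, column names, and helper functions are all hypothetical; the metadata fields mirror the ones listed above.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS transcripts (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    filename TEXT NOT NULL,
    created_at TEXT NOT NULL,
    duration_seconds REAL,
    language TEXT,
    text TEXT NOT NULL
)
"""


def open_db(path):
    """Open the database and create the table on first use."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_transcript(conn, filename, text, duration_seconds=None, language=None):
    """Insert one transcript row and return its id."""
    created_at = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO transcripts "
            "(filename, created_at, duration_seconds, language, text) "
            "VALUES (?, ?, ?, ?, ?)",
            (filename, created_at, duration_seconds, language, text),
        )
    return cur.lastrowid


conn = open_db(":memory:")  # use a file path for real persistence
row_id = save_transcript(conn, "talk.mp3", "Hello there.", 1.5, "en")
print(conn.execute(
    "SELECT filename, language FROM transcripts WHERE id = ?", (row_id,)
).fetchone())
```

Parameterized queries (the `?` placeholders) keep uploaded file names and transcript text from being interpreted as SQL.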

Troubleshooting

  • If you see a warning that float16 is being converted to float32, the selected device does not support float16 efficiently. Either set WHISPER_COMPUTE_TYPE=float32 or run on a machine with a suitable GPU and backend.
  • If uv sync shows a TOML warning about [tool.uv.scripts], you can ignore it (a config quirk); the dependencies should still install.
  • If uploads through an nginx proxy return HTTP 413 (Request Entity Too Large), increase client_max_body_size in your nginx config. Example:
server {
	listen 80;
	server_name example.com;

	client_max_body_size 100M; # set to desired size

	location / {
		proxy_pass http://127.0.0.1:8000;
		proxy_set_header Host $host;
		proxy_set_header X-Real-IP $remote_addr;
		proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
		proxy_set_header X-Forwarded-Proto $scheme;
	}
}

Test the configuration and reload nginx after the change:

sudo nginx -t && sudo systemctl reload nginx

Feedback & next steps

  • I can: add persistent storage, SRT/VTT export formats, WebSocket/SSE partial updates, or admin pages that show model/device state. Tell me which feature you want next.

Systemd unit (example)

Copy the example unit to /etc/systemd/system/ and the example environment file to /etc/default/, then reload systemd and enable the service:

sudo cp deploy/speech-to-text.service /etc/systemd/system/speech-to-text.service
sudo cp deploy/speech-to-text.env.example /etc/default/speech-to-text
sudo systemctl daemon-reload
sudo systemctl enable --now speech-to-text.service

The unit uses the project's .venv gunicorn binary and the gunicorn_conf.py config. Adjust paths or the User/Group in the unit as needed.
