speech-to-text — Flask UI for faster-whisper transcription
This project provides a small Flask web UI to upload audio files (mp3, wav, flac) and transcribe them using faster-whisper.
Key files
- app.py - Flask application and transcription handler (uses faster-whisper)
- templates/ - Jinja2 templates (index.html, result.html)
- static/ - styles, images and client JS
- pyproject.toml - project manifest (dependencies)
Quick start (development)
- Sync dependencies and create the virtual environment with uv:
uv sync
- Start the app (development):
uv run --reload
# or use the venv python directly
./.venv/bin/python -m app
Open http://127.0.0.1:8000/ in your browser.
Running in production with Gunicorn
- Ensure dependencies are installed (see above):
uv sync
- Start Gunicorn using the included config:
gunicorn -c gunicorn_conf.py app:app
Gunicorn will bind to 0.0.0.0:8000 by default; adjust gunicorn_conf.py as needed.
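The repository's actual gunicorn_conf.py is not reproduced in this README. As a rough illustration of what such a file can look like, here is a minimal sketch. Only the bind address comes from the text above; the worker count, timeout, and logging values are assumptions chosen for a transcription workload, not the project's real settings.

```python
# gunicorn_conf.py (illustrative sketch; the project's real file may differ)

# Address and port to listen on; matches the default described above.
bind = "0.0.0.0:8000"

# Assumption: a single worker, since each worker process would load
# its own copy of the transcription model into memory.
workers = 1

# Assumption: a generous request timeout (in seconds) so that long
# transcriptions are not killed before they finish.
timeout = 300

# Write the access log to stdout ("-" means stdout in Gunicorn).
accesslog = "-"
```

Gunicorn config files are plain Python modules, so values like these can also be computed, for example from environment variables.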
What the UI does
- Upload an audio file (mp3, wav, flac, au).
- The UI shows upload progress and an indeterminate "Transcribing..." state while the server runs faster-whisper.
- After completion the transcript is displayed and you can copy it, download it as a .txt, or go back to upload another file.
Where files are stored
- Uploaded audio files are saved to a temporary file on disk (OS temp directory, e.g. /tmp) only for the duration of transcription and are deleted immediately after processing.
- Transcribed text is not persisted to disk by default; it is rendered into the result page and kept in memory only for the request.
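The temporary-file lifecycle described above can be sketched as follows. This is not the code in app.py. The function name `transcribe_upload` and the `transcribe` callback are hypothetical stand-ins, used here so the example runs without faster-whisper installed.

```python
import os
import tempfile
from typing import BinaryIO, Callable


def transcribe_upload(
    upload: BinaryIO,
    suffix: str,
    transcribe: Callable[[str], str],
) -> str:
    """Write the upload to a temp file, transcribe it, and always delete the file.

    `transcribe` is a hypothetical callback that takes a file path and
    returns the transcript text.
    """
    # delete=False keeps the file on disk after it is closed, so the
    # transcriber can reopen it by path; it is removed explicitly below.
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        with tmp:
            # Copy in chunks to avoid holding the whole upload in memory.
            for chunk in iter(lambda: upload.read(64 * 1024), b""):
                tmp.write(chunk)
        # The transcript is returned to the caller; nothing is persisted here.
        return transcribe(tmp.name)
    finally:
        # Runs on success and on error, so the audio never outlives the request.
        os.remove(tmp.name)
```

Putting the cleanup in a `finally` block means the audio file is removed even when transcription raises an exception.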
Environment variables
- FLASK_SECRET - Flask secret key used for session/flashing (default: change-me)
- WHISPER_DEVICE - optional override for model device (cpu, cuda, mps)
- WHISPER_COMPUTE_TYPE - optional override for compute type (float32, float16)
These env vars can be set before running the app, for example:
export WHISPER_DEVICE=cuda
export WHISPER_COMPUTE_TYPE=float16
export FLASK_SECRET='a-secret'
uv run --reload
Notes on performance & model files
- The first run downloads the model weights to your machine (Hugging Face cache). faster-whisper is built on CTranslate2 and may use other compiled backends; their caches live in the usual Hugging Face/CTranslate2 cache locations on your system.
- On CPU-only machines the app prefers float32 to avoid inefficient float16 conversions; on GPU-enabled machines the code attempts to select float16 for better throughput.
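The way the environment overrides and the float32/float16 preference could fit together is sketched below. This is an illustration, not the logic in app.py. The function name `resolve_model_settings` and the `gpu_available` flag are hypothetical, how the app detects a GPU is not described here, and treating every non-cpu device as float16-capable is an assumption.

```python
import os
from typing import Mapping, Tuple


def resolve_model_settings(
    env: Mapping[str, str] = os.environ,
    gpu_available: bool = False,
) -> Tuple[str, str]:
    """Return (device, compute_type), letting env vars override the defaults."""
    # An explicit WHISPER_DEVICE wins; otherwise fall back to what the
    # caller detected (gpu_available is a hypothetical input).
    device = env.get("WHISPER_DEVICE") or ("cuda" if gpu_available else "cpu")

    # float32 on CPU avoids float16 conversions; float16 elsewhere for throughput.
    default_compute = "float32" if device == "cpu" else "float16"

    # An explicit WHISPER_COMPUTE_TYPE wins over the per-device default.
    compute_type = env.get("WHISPER_COMPUTE_TYPE") or default_compute
    return device, compute_type
```

The resulting pair could then be passed to faster-whisper, for example as `WhisperModel(model_name, device=device, compute_type=compute_type)`, where `model_name` is whichever model the app is configured to load.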
Security & privacy
- Uploaded audio and generated transcripts are not retained by default. If you change the app to persist data, consider access controls and GDPR/privacy requirements.
Extending persistence (optional)
- If you want to keep transcripts or uploads, two low-risk approaches are:
  - Save files to an uploads/ folder and transcripts to transcripts/ as .txt files (simple, file-based). Add rotation/cleanup.
  - Save transcripts to a small SQLite database with metadata (filename, timestamp, duration, language).
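For the SQLite option, a minimal sketch using only the standard library might look like this. The table layout, the default database path `transcripts.db`, and the helper names `open_db` and `save_transcript` are all assumptions for illustration; none of them exist in the project.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Optional

# Hypothetical schema covering the metadata fields listed above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS transcripts (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    filename TEXT NOT NULL,
    created_at TEXT NOT NULL,
    duration_seconds REAL,
    language TEXT,
    text TEXT NOT NULL
)
"""


def open_db(path: str = "transcripts.db") -> sqlite3.Connection:
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_transcript(
    conn: sqlite3.Connection,
    filename: str,
    text: str,
    duration_seconds: Optional[float] = None,
    language: Optional[str] = None,
) -> int:
    """Insert one transcript row and return its id."""
    # "with conn" commits on success and rolls back if the insert fails.
    with conn:
        cur = conn.execute(
            "INSERT INTO transcripts "
            "(filename, created_at, duration_seconds, language, text) "
            "VALUES (?, ?, ?, ?, ?)",
            (
                filename,
                datetime.now(timezone.utc).isoformat(),  # UTC timestamp
                duration_seconds,
                language,
                text,
            ),
        )
    return cur.lastrowid
```

A call such as `save_transcript(open_db(), "talk.mp3", transcript_text, 93.4, "en")` would store one row. Parameterized queries keep user-supplied filenames from being interpreted as SQL.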
Troubleshooting
- If you see a warning about float16 -> float32, either force WHISPER_COMPUTE_TYPE=float32 or run on a machine with appropriate GPU/backends.
- If uv sync shows a TOML warning about [tool.uv.scripts], ignore it (config quirk). The project should still install dependencies.
- If uploads through an nginx proxy return HTTP 413 (Request Entity Too Large), increase client_max_body_size in your nginx config. Example:
server {
    listen 80;
    server_name example.com;
    client_max_body_size 100M; # set to desired size
    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
Reload nginx after the change:
sudo nginx -t && sudo systemctl reload nginx
Feedback & next steps
- Possible next features: persistent storage, SRT/VTT export formats, WebSocket/SSE partial updates, or admin pages that show model/device state.
Systemd unit (example)
Copy the example unit to /etc/systemd/system/ and an environment file to /etc/default/:
sudo cp deploy/speech-to-text.service /etc/systemd/system/speech-to-text.service
sudo cp deploy/speech-to-text.env.example /etc/default/speech-to-text
sudo systemctl daemon-reload
sudo systemctl enable --now speech-to-text.service
The unit uses the project's .venv gunicorn binary and the gunicorn_conf.py config. Adjust paths or the User/Group in the unit as needed.