This guide is for first-time deployers. Expect 15–30 minutes, most of it waiting for model weights to download.
| Platform | Path | Quality | Notes |
|---|---|---|---|
| Linux + NVIDIA GPU | docker-compose (main flow below) | Best | Recommended, the main path of this doc |
| Windows 11 + WSL2 + NVIDIA GPU | docker-compose (Linux flow inside WSL2) | Best | See 0.2 |
| macOS Apple Silicon (M1/M2/M3/M4) | native venv, CPU-only | Usable but slow | Docker Desktop on macOS cannot pass through the GPU; see 0.3 |
| macOS Intel | native venv, CPU-only | Usable but very slow | Same as Apple Silicon; see 0.3 |
CPU / low-VRAM deployments: use WHISPER_MODEL=medium (covered in step 2).
It's 3–4× faster than large-v3 and quality stays acceptable, especially for
Chinese and English.
- Create a read token at https://huggingface.co/settings/tokens (it starts with hf_). Tokens are account-level, so you can create one at any time; order doesn't matter.
- Click "Agree and access repository" at https://huggingface.co/pyannote/speaker-diarization-3.1.
- Do the same at https://huggingface.co/pyannote/segmentation-3.0.
- And at https://huggingface.co/pyannote/wespeaker-voxceleb-resnet34-LM (added in 0.3.0 — used for speaker embeddings).
Creating the token and accepting gated-model terms are independent. Order doesn't matter, but all three model agreements + a valid token must be in place to download weights: token without accepted terms → 403, accepted terms without token → 401.
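The 401/403 rule above can be turned into a small lookup for when a weight download fails. This is an illustrative sketch, not part of voscript: hf_status_hint is a hypothetical helper, and the probe URL shown in the comment assumes the gated repository exposes a config.yaml at that path.

```bash
# Map the HTTP status of a Hugging Face weight download to its likely cause.
# hf_status_hint is a hypothetical helper, not something voscript ships.
hf_status_hint() {
  case "$1" in
    200) echo "ok: token valid and terms accepted" ;;
    401) echo "401: HF_TOKEN missing, wrong, or expired" ;;
    403) echo "403: token is valid but gated-model terms are not accepted" ;;
    *)   echo "$1: unexpected status, check network or HF_ENDPOINT" ;;
  esac
}

# Possible probe (assumed file path, needs network access):
#   code=$(curl -s -o /dev/null -w '%{http_code}' \
#     -H "Authorization: Bearer $HF_TOKEN" \
#     https://huggingface.co/pyannote/segmentation-3.0/resolve/main/config.yaml)
#   hf_status_hint "$code"
hf_status_hint 403
```

Running the probe once per gated repository shows which of the three agreements is still missing.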
- Docker 24+
- NVIDIA Container Toolkit (without it, compose fails with could not select device driver):
# Ubuntu example
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# verify
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
VRAM guidance:
- ≥ 12 GB → default large-v3
- 8–12 GB → large-v3 still fits (~9 GB in practice), just don't share the GPU with another heavy job
- < 8 GB → set WHISPER_MODEL=medium
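The VRAM guidance can be sketched as a threshold check. pick_whisper_model is a hypothetical helper, not part of voscript; it takes total VRAM in MiB (the unit nvidia-smi reports) and treats 8192 MiB as the 8 GB boundary, which is an approximation since some cards report slightly less than their nominal size.

```bash
# Pick WHISPER_MODEL from total VRAM in MiB, following the guidance above:
# 8 GB or more keeps large-v3, anything smaller drops to medium.
# pick_whisper_model is a hypothetical helper for illustration.
pick_whisper_model() {
  local vram_mib=$1
  if [ "$vram_mib" -ge 8192 ]; then
    echo "large-v3"
  else
    echo "medium"
  fi
}

# Real use (needs an NVIDIA driver on the host):
#   vram=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n1)
#   echo "WHISPER_MODEL=$(pick_whisper_model "$vram")"
pick_whisper_model 24576   # a 24 GB card
pick_whisper_model 6144    # a 6 GB card
```

On an 8–12 GB card the result is still large-v3, so the caveat about not sharing the GPU with another heavy job applies.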
Then jump to step 1.
The officially supported path is WSL2 + NVIDIA Container Toolkit. Docker Desktop's GPU passthrough is itself routed through WSL2 under the hood, so the two are equivalent.
Prereqs:
- Windows 11, or Windows 10 21H2+
- WSL2 Ubuntu installed (wsl --install -d Ubuntu)
- NVIDIA driver ≥ 470 on Windows
- Docker available inside WSL2 (either install docker directly in WSL2, or install Docker Desktop on Windows and enable "Use WSL 2 based engine" + "Enable integration with my default WSL distro")
From then on, every command runs inside the WSL2 Ubuntu shell. Follow 0.1 Linux. Verify:
# inside WSL2 Ubuntu
nvidia-smi # should see your NVIDIA GPU
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
If both work, the Linux flow applies verbatim.
Important: Docker Desktop on macOS cannot pass a GPU into containers
(no CUDA, no Metal). The docker-compose path on macOS is CPU-only and
large-v3 is effectively unusable there. Use native venv + CPU and
drop the model to medium.
Prereqs:
- Python 3.11 (recommend brew install python@3.11)
- ffmpeg (brew install ffmpeg)
- libsndfile (brew install libsndfile)
- 16 GB RAM or more (Apple Silicon unified memory counts)
# 1. clone
git clone https://github.com/MapleEve/voscript.git
cd voscript
# 2. create venv at the repo root
python3.11 -m venv .venv
source .venv/bin/activate
# 3. install deps (on macOS, torch resolves to the CPU/MPS wheel, not CUDA)
pip install --upgrade pip
pip install -r app/requirements.txt
# 4. set env vars
export HF_TOKEN="<HF_TOKEN>"
export API_KEY=$(openssl rand -hex 32)
export DEVICE=cpu # macOS must be cpu (pyannote's MPS support is incomplete)
export WHISPER_MODEL=medium # large-v3 on CPU is too slow
export DATA_DIR=$(pwd)/data
mkdir -p "$DATA_DIR"
# Note this API_KEY — BetterAINote needs the exact same value
# 5. launch
cd app
uvicorn main:app --host 0.0.0.0 --port 8780
Expected performance (ballpark):
- M2 Pro / M3 Pro + medium + 1 minute of audio ≈ 30–60 s
- M1 / Intel + medium + 1 minute of audio ≈ 1.5–3 min
- large-v3 on CPU is 3–5× slower. Not recommended.
Known limitations:
- Docker-compose path is not supported on macOS
- No MPS acceleration (pyannote 3.1 has unimplemented ops on MPS, so it either errors or silently falls back to CPU)
- If you have access to a Linux / Windows host with an NVIDIA GPU, run the service there and use the Mac as the client
Everything after this point (config, wiring into BetterAINote) is the same, just skip every docker step.
git clone https://github.com/MapleEve/voscript.git
cd voscript
cp .env.example .env
Edit .env. At minimum fill in:
HF_TOKEN=<HF_TOKEN>
API_KEY=<API_KEY>
If you're short on VRAM (< 12 GB), or deploying on macOS / CPU-only, drop the model one size:
WHISPER_MODEL=medium
Choices: tiny / base / small / medium / large-v3. medium gives a
~3–4× speed-up with only a small quality drop, especially for Chinese and
English.
If you are on a China network, also add:
HF_ENDPOINT=https://hf-mirror.com
Generate a strong API key:
openssl rand -hex 32
Every other env var has a sane default — see .env.example.
For the complete list, defaults, API override semantics, and tuning boundaries
that are not exposed yet, see configuration.en.md.
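Before starting the container, the two required settings can be checked with a short preflight. This is a sketch under assumptions: env_get and check_env are hypothetical helpers, the parser handles only plain KEY=value lines (no quoting or export prefixes), and the hf_ prefix test mirrors the token format described in step 0.

```bash
# Read one KEY=value entry from an env file (last match wins, no quote handling).
# env_get and check_env are hypothetical helpers, not part of voscript.
env_get() {
  local file=$1 key=$2
  sed -n "s/^${key}=//p" "$file" | tail -n1
}

# Print each problem found and return 1 if a required setting is missing.
check_env() {
  local file=$1 status=0 token key
  token=$(env_get "$file" HF_TOKEN)
  key=$(env_get "$file" API_KEY)
  case "$token" in
    hf_?*) ;;  # looks like a Hugging Face token
    *) echo "HF_TOKEN is missing or does not start with hf_"; status=1 ;;
  esac
  if [ -z "$key" ]; then
    echo "API_KEY is empty"
    status=1
  fi
  return "$status"
}
```

A typical call would be `check_env .env && docker compose up -d --build`, so the build is skipped while the file is incomplete.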
A few worth knowing about:
| Variable | Default | Effect |
|---|---|---|
| MAX_UPLOAD_BYTES | 2147483648 (2 GiB) | Per-request upload cap; requests past this get HTTP 413 |
| APP_UID | 1000 | uid the container runs as; must match the owner of DATA_DIR on the host |
| APP_GID | 1000 | same, gid |
| JOBS_MAX_CACHE | 200 | LRU cap for the in-memory job dictionary; evicted jobs remain queryable via disk status.json |
| FFMPEG_TIMEOUT_SEC | 1800 | Timeout in seconds for ffmpeg conversion; returns 504 on expiry |
| CUDA_VISIBLE_DEVICES | unset | Optional NVIDIA visibility limit; by default this variable is not injected and compose requests every Docker-exposed GPU. To restrict visibility, add it through docker-compose.override.yml or another explicit operator env override. When set, in-container cuda:N indexes are remapped from the visible set and may not match physical host GPU numbers |
| MODEL_IDLE_TIMEOUT_SEC | 180 | GPU model idle-unload timeout, defaulting to 180 seconds (3 minutes); set 0 to disable idle unload and keep models resident. When enabled, ASR, diarization, and embedding each reselect the best visible CUDA device on their next lazy load |
| ALLOW_NO_AUTH | 0 | Set to 1 to suppress the startup warning when no API_KEY is configured (explicitly confirms unauthenticated mode) |
| DENOISE_MODEL | none | Service default denoise backend: none, deepfilternet, or noisereduce; API requests may override it per job |
| DENOISE_SNR_THRESHOLD | 10.0 | DeepFilterNet SNR gate in dB; audio at or above this value skips DeepFilterNet when deepfilternet is selected; noisereduce is not gated |
| VOICEPRINT_THRESHOLD | 0.75 | Base raw-cosine voiceprint threshold before per-speaker adaptive adjustment |
| PYANNOTE_MIN_DURATION_OFF | 0.5 | Pyannote off-turn smoothing, used to reduce over-segmentation of short pauses |
| MIN_EMBED_DURATION | 1.5 | Minimum diarization turn duration used for speaker embedding extraction |
| MAX_EMBED_DURATION | 10.0 | Maximum per-turn audio window used for speaker embedding extraction |
| WHISPERX_ALIGN_DISABLED_LANGUAGES | empty | Comma-separated languages that explicitly skip WhisperX forced alignment; use only as a temporary operational fallback |
| WHISPERX_ALIGN_DEVICE | cpu | Runtime device for WhisperX forced alignment; CPU is the default to keep alignment isolated from GPU ASR / speaker-embedding runtimes |
| WHISPERX_ALIGN_MODEL_MAP | empty | Comma-separated lang=model overrides, for example zh=your-org/your-zh-align-model |
| WHISPERX_ALIGN_MODEL_DIR | empty | Optional alignment model cache directory passed through when the installed WhisperX supports it |
| WHISPERX_ALIGN_CACHE_ONLY | 0 | Set to 1 to request cache-only alignment model loading when supported by the installed WhisperX version |
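The CUDA_VISIBLE_DEVICES remapping is easy to misread in logs, so here is a small sketch of it. physical_gpu_for is a hypothetical helper; it assumes the variable holds a comma-separated list of numeric host ids and that host numbering follows PCI bus order (CUDA_DEVICE_ORDER=PCI_BUS_ID), which is what nvidia-smi displays.

```bash
# Translate an in-container cuda:N index to the host GPU id listed in
# CUDA_VISIBLE_DEVICES: the Nth entry of the list becomes cuda:N.
# physical_gpu_for is a hypothetical helper for illustration.
physical_gpu_for() {
  local visible=$1 index=$2
  local -a ids
  IFS=',' read -r -a ids <<< "$visible"
  if [ "$index" -ge "${#ids[@]}" ]; then
    echo "cuda:$index is not visible" >&2
    return 1
  fi
  echo "${ids[$index]}"
}

physical_gpu_for "2,3" 0   # prints 2
physical_gpu_for "2,3" 1   # prints 3
```

With CUDA_VISIBLE_DEVICES=2,3, a log line mentioning cuda:0 therefore refers to host GPU 2 under these assumptions.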
For POST /api/transcribe, omitting denoise_model means "use the service
default from DENOISE_MODEL". Sending denoise_model=none is the explicit
per-request opt-out. Sending snr_threshold always overrides
DENOISE_SNR_THRESHOLD for that request only, but only affects
deepfilternet; noisereduce runs whenever selected.
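The override rules above can be restated as three small functions. This is a sketch of the documented semantics only, not the service's code: the function names are hypothetical, an omitted request field is modeled as an empty string, and the measured SNR is passed in because how the service computes it is not covered here.

```bash
# $1 = denoise_model from the request ("" when omitted), $2 = DENOISE_MODEL.
effective_denoise() {
  local requested=$1 service_default=${2:-none}
  if [ -n "$requested" ]; then echo "$requested"; else echo "$service_default"; fi
}

# $1 = snr_threshold from the request ("" when omitted), $2 = DENOISE_SNR_THRESHOLD.
effective_threshold() {
  echo "${1:-$2}"
}

# $1 = backend, $2 = measured SNR in dB, $3 = threshold in dB.
# Returns 0 when denoising would run for this job.
will_denoise() {
  case "$1" in
    none)          return 1 ;;
    noisereduce)   return 0 ;;  # not gated by SNR
    deepfilternet) awk -v snr="$2" -v thr="$3" 'BEGIN { exit !(snr < thr) }' ;;
    *)             echo "unknown backend: $1" >&2; return 2 ;;
  esac
}
```

For example, a request that omits both fields on a service with DENOISE_MODEL=deepfilternet and the default threshold would denoise audio measured at 4.2 dB and skip audio measured at 10.0 dB or higher.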
For every supported setting, the Whisper / ASR parameters that are not exposed
as env yet, and AS-norm cohort preservation semantics, see
configuration.en.md.
Chinese word-level alignment is attempted by default and runs on CPU by
default to keep wav2vec2 alignment isolated from GPU ASR / speaker-embedding
runtimes. The Docker image uses PyTorch 2.6.0 so recent transformers safety
checks can load the default Chinese .bin alignment weights. If you run a
custom image with older torch, use torch>=2.6 or a trusted replacement
alignment model that provides safetensors; only set
WHISPERX_ALIGN_DISABLED_LANGUAGES=zh if you intentionally want a temporary
segment-level fallback.
The container runs as uid 1000 by default, so DATA_DIR (default
./data) and MODEL_CACHE_DIR (default ./models) must be writable
by uid 1000 on the host. On most Linux distros the first regular user
is uid 1000, so plain mkdir -p data models before docker compose up just works. If your user is a different uid, pick one:
# A. change the host dirs to 1000:1000
sudo chown -R 1000:1000 ./data ./models
# B. or tell the container to use YOUR uid
echo "APP_UID=$(id -u)" >> .env
echo "APP_GID=$(id -g)" >> .env
docker compose up -d --build
The first boot downloads ~5 GB of model weights into ./models/. Watch progress with:
docker logs -f voscript
You are good when you see Uvicorn running on http://0.0.0.0:8780.
Or run the bundled helper:
./scripts/deploy.sh
It checks .env, starts the container, and waits for /healthz.
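If you start the container by hand, waiting for /healthz can be approximated with a generic retry loop. This is a minimal sketch and makes no claim about how scripts/deploy.sh is implemented; wait_for is a hypothetical helper and the attempt count and delay are arbitrary.

```bash
# Retry a command until it succeeds or the attempts run out.
# $1 = max attempts, $2 = seconds between attempts, rest = command to run.
# wait_for is a hypothetical helper for illustration.
wait_for() {
  local attempts=$1 delay=$2 i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Real use: allow roughly 10 minutes for the first-boot model download.
#   wait_for 120 5 curl -sf http://localhost:8780/healthz
```

Taking the command as arguments keeps the loop reusable for other readiness checks, such as the authenticated /api/voiceprints call.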
# Health check (always unauthenticated)
curl -sf http://localhost:8780/healthz
# → {"ok":true}
# Any /api/* call needs API_KEY
curl -sS http://localhost:8780/api/voiceprints \
-H "Authorization: Bearer $API_KEY"
# → [] (empty on first boot)
Open http://localhost:8780/ in a browser for a minimal web UI you can upload audio to.
In BetterAINote → Settings → Transcription, set:
- Private transcription base URL: http://<host>:8780
- Private transcription API key: the exact API_KEY from .env
Once saved, the BetterAINote worker will route every recording through
this service. See api.en.md for the full contract.
cd voscript
git pull
docker compose up -d --build
Model weights in ./models/ are cached, so a rebuild won't re-download them.
Two extra steps:
- Accept one more HuggingFace gated model at https://huggingface.co/pyannote/wespeaker-voxceleb-resnet34-LM — WeSpeaker replaces ECAPA-TDNN for speaker embeddings. Without the click-through you'll get a 403 on first boot.
- Re-enroll every existing voiceprint. 0.3.0 uses a new embedding space (WeSpeaker ≠ ECAPA, cosine distances between the two spaces are meaningless), so:
  - Quickest: rm data/voiceprints/voiceprints.db, let the container rebuild it empty, then re-enroll from fresh transcriptions.
  - Or per-speaker: curl -X DELETE -H "Authorization: Bearer $API_KEY" http://host:8780/api/voiceprints/<spk_id> for each enrolled id, then re-enroll.
Legacy index.json + .npy files from 0.2.x are auto-imported into the
new sqlite store on first boot — that doesn't lose data, but the
imported embeddings are still ECAPA-based and won't match any
WeSpeaker-generated queries. You still have to re-enroll.
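For the per-speaker route, the DELETE call can be wrapped in a loop with a dry-run default so the list is reviewed before anything is removed. This is a sketch: delete_voiceprints is a hypothetical helper, the ids spk_a and spk_b are placeholders, and you supply the real ids yourself because the response shape of GET /api/voiceprints is not described in this guide.

```bash
# Delete voiceprints one id at a time. With DRY_RUN=1 (the default here) the
# curl commands are only printed, so the list can be reviewed first.
# delete_voiceprints is a hypothetical helper; the ids below are placeholders.
delete_voiceprints() {
  local base_url=$1 id
  shift
  for id in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "curl -X DELETE -H 'Authorization: Bearer \$API_KEY' $base_url/api/voiceprints/$id"
    else
      curl -sS -X DELETE -H "Authorization: Bearer $API_KEY" "$base_url/api/voiceprints/$id"
    fi
  done
}

delete_voiceprints http://localhost:8780 spk_a spk_b
```

Setting DRY_RUN=0 in front of the call performs the deletions; back up data/voiceprints/ first if you may want the old store for reference.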
→ NVIDIA Container Toolkit missing or Docker wasn't restarted. Redo step 0.
→ Gated-model terms not accepted. Revisit HuggingFace prep in step 0.
→ HF_TOKEN missing, wrong, or expired. Check .env.
→ Confirm DEVICE=cpu and WHISPER_MODEL=medium. Large models on CPU
really are this slow — consider running the service on a Linux/Windows
host with an NVIDIA GPU instead.
→ Your requirements.txt has been edited and numpy upgraded to 2.x. Keep
the numpy<2.0 pin.
→ Check that API_KEY matches exactly on both sides (case/whitespace),
and that BetterAINote's host can actually reach :8780 (firewall,
docker networks).
→ Just data/voiceprints/. Everything else can be re-derived from the
original audio.
See security.en.md for deployment-risk details.