Quickstart

This guide is for first-time deployers. Expect 15–30 minutes, most of it waiting for model weights to download.

0. Pick your deployment path

Platform	Path	Quality	Notes
Linux + NVIDIA GPU	docker-compose (main flow below)	Best	Recommended, the main path of this doc
Windows 11 + WSL2 + NVIDIA GPU	docker-compose (Linux flow inside WSL2)	Best	See 0.2
macOS Apple Silicon (M1/M2/M3/M4)	native venv, CPU-only	Usable but slow	Docker Desktop on macOS cannot pass through the GPU; see 0.3
macOS Intel	native venv, CPU-only	Usable but very slow	Same as Apple Silicon; see 0.3

CPU / low-VRAM deployments: use WHISPER_MODEL=medium (covered in step 2). It's 3–4× faster than large-v3 and quality stays acceptable, especially for Chinese and English.

HuggingFace prep (all platforms)

Create a read token at https://huggingface.co/settings/tokens (starts with hf_). Tokens are account-level — you can create it any time, order doesn't matter.
Click Agree and access repository at https://huggingface.co/pyannote/speaker-diarization-3.1.
Do the same at https://huggingface.co/pyannote/segmentation-3.0.
And at https://huggingface.co/pyannote/wespeaker-voxceleb-resnet34-LM (added in 0.3.0 — used for speaker embeddings).

Creating the token and accepting gated-model terms are independent. Order doesn't matter, but all three model agreements + a valid token must be in place to download weights: token without accepted terms → 403, accepted terms without token → 401.
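
The two failure modes above can be told apart from the HTTP status alone. The following is a minimal sketch, assuming curl is installed. `hf_status_hint` and `hf_probe` are hypothetical helper names, and the file you request (for example `config.yaml`) is an assumption about the repository contents, so substitute any file you know exists there.

```shell
# Map an HTTP status from huggingface.co to the likely cause described above.
hf_status_hint() {
    case "$1" in
        2??|3??) echo "ok: token valid and terms accepted" ;;
        401)     echo "401: HF_TOKEN missing, wrong, or expired" ;;
        403)     echo "403: gated-model terms not accepted for this repository" ;;
        *)       echo "unexpected status $1" ;;
    esac
}

# Probe one gated repository and print the diagnosis.
# $1 = repo id, $2 = file to request (assumed to exist in the repo).
hf_probe() {
    status=$(curl -s -o /dev/null -w '%{http_code}' \
        -H "Authorization: Bearer ${HF_TOKEN:-}" \
        "https://huggingface.co/$1/resolve/main/$2")
    echo "$1 -> $(hf_status_hint "$status")"
}
```

After exporting HF_TOKEN, call it once per gated repository, for example `hf_probe pyannote/segmentation-3.0 config.yaml`. Redirects are not followed, so a 3xx answer is treated the same as a 200: access was granted.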

0.1 Linux + NVIDIA GPU (main path)

Docker 24+

NVIDIA Container Toolkit (without it, compose fails with could not select device driver):

# Ubuntu example
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
    sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# verify
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

VRAM guidance:

≥ 12 GB → default large-v3
8–12 GB → large-v3 still fits (~9 GB in practice), just don't share the GPU with another heavy job
< 8 GB → set WHISPER_MODEL=medium
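
The thresholds above can be turned into a small helper. This is a sketch only: `pick_whisper_model` and `first_gpu_vram_mib` are hypothetical names, and the 8192 MiB cut-off is an approximation of "8 GB". Some cards marketed as 8 GB report slightly less, so treat the boundary as a hint rather than a rule.

```shell
# Pick a Whisper model from total VRAM in MiB, following the guidance above:
# 8 GB (8192 MiB) or more -> large-v3, otherwise medium.
pick_whisper_model() {
    if [ "$1" -ge 8192 ]; then
        echo "large-v3"
    else
        echo "medium"
    fi
}

# Total memory of the first GPU in MiB, as reported by nvidia-smi.
first_gpu_vram_mib() {
    nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1
}
```

On a GPU host, `pick_whisper_model "$(first_gpu_vram_mib)"` prints the value to put in WHISPER_MODEL.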

Then jump to step 1.

0.2 Windows 11 + WSL2

The officially supported path is WSL2 + NVIDIA Container Toolkit. Docker Desktop's GPU passthrough is itself routed through WSL2 under the hood, so the two are equivalent.

Prereqs:

Windows 11, or Windows 10 21H2+
WSL2 Ubuntu installed (wsl --install -d Ubuntu)
NVIDIA driver ≥ 470 on Windows
Docker available inside WSL2 (either install docker directly in WSL2, or install Docker Desktop on Windows and enable "Use WSL 2 based engine" + "Enable integration with my default WSL distro")

From then on, every command runs inside the WSL2 Ubuntu shell. Follow 0.1 Linux. Verify:

# inside WSL2 Ubuntu
nvidia-smi                        # should see your NVIDIA GPU
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

If both work, the Linux flow applies verbatim.

0.3 macOS (Apple Silicon / Intel)

Important: Docker Desktop on macOS cannot pass a GPU into containers (no CUDA, no Metal). The docker-compose path on macOS is CPU-only and large-v3 is effectively unusable there. Use native venv + CPU and drop the model to medium.

Prereqs:

Python 3.11 (recommend brew install python@3.11)
ffmpeg (brew install ffmpeg)
libsndfile (brew install libsndfile)
16 GB RAM or more (Apple Silicon unified memory counts)

# 1. clone
git clone https://github.com/MapleEve/voscript.git
cd voscript

# 2. create venv at the repo root
python3.11 -m venv .venv
source .venv/bin/activate

# 3. install deps (on macOS, torch resolves to the CPU/MPS wheel, not CUDA)
pip install --upgrade pip
pip install -r app/requirements.txt

# 4. set env vars
export HF_TOKEN="<HF_TOKEN>"
export API_KEY=$(openssl rand -hex 32)
export DEVICE=cpu                 # macOS must be cpu (pyannote's MPS support is incomplete)
export WHISPER_MODEL=medium       # large-v3 on CPU is too slow
export DATA_DIR=$(pwd)/data
mkdir -p "$DATA_DIR"

# Print and note this API_KEY: BetterAINote needs the exact same value
echo "API_KEY=$API_KEY"

# 5. launch
cd app
uvicorn main:app --host 0.0.0.0 --port 8780

Expected performance (ballpark):

M2 Pro / M3 Pro + medium + 1 minute of audio ≈ 30–60 s
M1 / Intel + medium + 1 minute of audio ≈ 1.5–3 min
large-v3 on CPU is 3–5× slower. Not recommended.
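
To turn these ballpark rates into a planning number for a longer recording, a rough estimate can be computed as below. This sketch assumes processing time scales linearly with audio length, which this guide does not state. `estimate_minutes` is a hypothetical helper.

```shell
# Estimate wall-clock processing time in whole minutes (rounded up).
# $1 = audio length in minutes, $2 = processing seconds per audio minute.
estimate_minutes() {
    echo $(( ($1 * $2 + 59) / 60 ))
}

# A 45-minute recording with the medium model:
estimate_minutes 45 30    # M2 Pro / M3 Pro, fast end  -> 23
estimate_minutes 45 60    # M2 Pro / M3 Pro, slow end  -> 45
estimate_minutes 45 180   # M1 / Intel, slow end       -> 135
```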

Known limitations:

Docker-compose path is not supported on macOS
No MPS acceleration (pyannote 3.1 hits unimplemented ops on MPS and either errors out or silently falls back to CPU)
If you have access to a Linux / Windows host with an NVIDIA GPU, run the service there and use the Mac as the client

Everything after this point (config, wiring into BetterAINote) is the same; just skip every docker step.

1. Clone the repo

git clone https://github.com/MapleEve/voscript.git
cd voscript

2. Configure `.env`

cp .env.example .env

Edit .env. At minimum fill in:

HF_TOKEN=<HF_TOKEN>
API_KEY=<API_KEY>

If you're short on VRAM (under 8 GB, or 8–12 GB on a GPU shared with other heavy jobs; see the VRAM guidance in 0.1), or deploying on macOS / CPU-only, drop the model one size:

WHISPER_MODEL=medium

Choices: tiny / base / small / medium / large-v3. medium gives a ~3–4× speed-up with only a small quality drop, especially for Chinese and English.

If you are on a network in mainland China, also add:

HF_ENDPOINT=https://hf-mirror.com

Generate a strong API key: openssl rand -hex 32
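
Before starting the service, it can help to sanity-check the two required values. The sketch below uses hypothetical helpers (`env_value`, `check_env_file`) and assumes unquoted `KEY=value` lines as shown above. It only checks the shape of the values, not whether the token is actually valid.

```shell
# Read the value of a key from a dotenv-style file (last assignment wins).
# $1 = file, $2 = key
env_value() {
    sed -n "s/^$2=//p" "$1" | tail -n 1
}

# Return 0 when both required values look plausible, 1 otherwise.
# $1 = path to the .env file
check_env_file() {
    rc=0
    case "$(env_value "$1" HF_TOKEN)" in
        hf_?*) ;;   # read tokens start with hf_
        *) echo "HF_TOKEN missing or not starting with hf_" >&2; rc=1 ;;
    esac
    case "$(env_value "$1" API_KEY)" in
        ""|"<"*) echo "API_KEY empty or still a placeholder" >&2; rc=1 ;;
    esac
    return "$rc"
}
```

Run `check_env_file .env` from the repo root. A non-zero exit status means at least one value still needs attention.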

Every other env var has a sane default — see .env.example. For the complete list, defaults, API override semantics, and tuning boundaries that are not exposed yet, see configuration.en.md. A few worth knowing about:

Variable	Default	Effect
`MAX_UPLOAD_BYTES`	`2147483648` (2 GiB)	Per-request upload cap; requests past this get `HTTP 413`
`APP_UID`	`1000`	uid the container runs as — must match the owner of `DATA_DIR` on the host
`APP_GID`	`1000`	same, gid
`JOBS_MAX_CACHE`	`200`	LRU cap for the in-memory job dictionary; evicted jobs remain queryable via disk status.json
`FFMPEG_TIMEOUT_SEC`	`1800`	Timeout in seconds for ffmpeg conversion; returns 504 on expiry
`CUDA_VISIBLE_DEVICES`	unset	Optional NVIDIA visibility limit; by default this variable is not injected and compose requests every Docker-exposed GPU. To restrict visibility, add it through `docker-compose.override.yml` or another explicit operator env override. When set, in-container `cuda:N` indexes are remapped from the visible set and may not match physical host GPU numbers
`MODEL_IDLE_TIMEOUT_SEC`	`180`	GPU model idle-unload timeout, defaulting to 180 seconds (3 minutes); set `0` to disable idle unload and keep models resident. When enabled, ASR, diarization, and embedding each reselect the best visible CUDA device on their next lazy load
`ALLOW_NO_AUTH`	`0`	Set to 1 to suppress the startup warning when no API_KEY is configured (explicitly confirms unauthenticated mode)
`DENOISE_MODEL`	`none`	Service default denoise backend: `none`, `deepfilternet`, or `noisereduce`; API requests may override it per job
`DENOISE_SNR_THRESHOLD`	`10.0`	DeepFilterNet SNR gate in dB; audio at or above this value skips DeepFilterNet when `deepfilternet` is selected; `noisereduce` is not gated
`VOICEPRINT_THRESHOLD`	`0.75`	Base raw-cosine voiceprint threshold before per-speaker adaptive adjustment
`PYANNOTE_MIN_DURATION_OFF`	`0.5`	Pyannote off-turn smoothing, used to reduce over-segmentation of short pauses
`MIN_EMBED_DURATION`	`1.5`	Minimum diarization turn duration used for speaker embedding extraction
`MAX_EMBED_DURATION`	`10.0`	Maximum per-turn audio window used for speaker embedding extraction
`WHISPERX_ALIGN_DISABLED_LANGUAGES`	empty	Comma-separated languages that explicitly skip WhisperX forced alignment; use only as a temporary operational fallback
`WHISPERX_ALIGN_DEVICE`	`cpu`	Runtime device for WhisperX forced alignment; CPU is the default to keep alignment isolated from GPU ASR / speaker-embedding runtimes
`WHISPERX_ALIGN_MODEL_MAP`	empty	Comma-separated `lang=model` overrides, for example `zh=your-org/your-zh-align-model`
`WHISPERX_ALIGN_MODEL_DIR`	empty	Optional alignment model cache directory passed through when the installed WhisperX supports it
`WHISPERX_ALIGN_CACHE_ONLY`	`0`	Set to 1 to request cache-only alignment model loading when supported by the installed WhisperX version

For POST /api/transcribe, omitting denoise_model means "use the service default from DENOISE_MODEL". Sending denoise_model=none is the explicit per-request opt-out. Sending snr_threshold always overrides DENOISE_SNR_THRESHOLD for that request only, but only affects deepfilternet; noisereduce runs whenever selected. For every supported setting, the Whisper / ASR parameters that are not exposed as env yet, and AS-norm cohort preservation semantics, see configuration.en.md.
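
The override rules above are easy to misread, so here they are modelled in plain shell. This is an illustration of the semantics as described in this section, not the service's code. `effective_denoise` and `denoise_runs` are hypothetical names.

```shell
# Effective denoise backend: the per-request value wins when it was sent,
# otherwise the service default (DENOISE_MODEL) applies.
# $1 = request value ("" means the field was omitted), $2 = service default
effective_denoise() {
    if [ -n "$1" ]; then echo "$1"; else echo "$2"; fi
}

# Decide whether a backend actually runs for a clip with the given SNR.
# $1 = backend, $2 = measured SNR in dB, $3 = threshold in dB
denoise_runs() {
    case "$1" in
        noisereduce) return 0 ;;   # not gated by SNR
        deepfilternet)
            # awk does the floating-point comparison; run only below the threshold
            awk -v snr="$2" -v thr="$3" 'BEGIN { exit !(snr + 0 < thr + 0) }' ;;
        *) return 1 ;;             # "none" or anything unknown: nothing runs
    esac
}
```

For example, with DENOISE_MODEL=deepfilternet and no denoise_model in the request, a clip measured at 6.5 dB is denoised, while one at 10.0 dB or above is skipped.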

Chinese word-level alignment is attempted by default and runs on CPU, which keeps wav2vec2 alignment isolated from the GPU ASR / speaker-embedding runtimes. The Docker image uses PyTorch 2.6.0 so that recent transformers safety checks can load the default Chinese .bin alignment weights. If you run a custom image with an older torch, either upgrade to torch>=2.6 or use a trusted replacement alignment model that provides safetensors; only set WHISPERX_ALIGN_DISABLED_LANGUAGES=zh if you intentionally want a temporary segment-level fallback.

Host directory ownership

The container runs as uid 1000 by default, so DATA_DIR (default ./data) and MODEL_CACHE_DIR (default ./models) must be writable by uid 1000 on the host. On most Linux distros the first regular user is uid 1000, so plain mkdir -p data models before docker compose up just works. If your user is a different uid, pick one:

# A. change the host dirs to 1000:1000
sudo chown -R 1000:1000 ./data ./models

# B. or tell the container to use YOUR uid
echo "APP_UID=$(id -u)" >> .env
echo "APP_GID=$(id -g)" >> .env
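
To find out whether either fix is needed, compare the directory owner with the uid the container will use. This is a minimal sketch with a hypothetical helper name (`dir_owned_by`). It relies on GNU `stat -c`, which Linux hosts provide.

```shell
# Succeed when the directory exists and is owned by the given uid.
# $1 = directory, $2 = expected uid
dir_owned_by() {
    [ -d "$1" ] && [ "$(stat -c %u "$1")" = "$2" ]
}

# Report every listed directory that fails the check.
# $1 = expected uid, remaining arguments = directories
report_ownership() {
    uid=$1
    shift
    for d in "$@"; do
        dir_owned_by "$d" "$uid" || echo "$d: missing or not owned by uid $uid"
    done
}
```

A typical call is `report_ownership "${APP_UID:-1000}" ./data ./models`. No output means both directories are owned by the expected uid.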

3. Start the service

docker compose up -d --build

The first boot downloads ~5 GB of model weights into ./models/. Watch progress with:

docker logs -f voscript

You are good when you see Uvicorn running on http://0.0.0.0:8780.

Or run the bundled helper:

./scripts/deploy.sh

It checks .env, starts the container, and waits for /healthz.
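
If you start the container by hand and want the same "wait until healthy" behaviour, a polling loop along these lines works. This is a sketch of the idea, not the contents of deploy.sh. `wait_for_healthz` is a hypothetical name.

```shell
# Poll a health URL until it answers with a success status or attempts run out.
# $1 = url, $2 = max attempts, $3 = seconds to sleep between attempts
wait_for_healthz() {
    attempt=1
    while [ "$attempt" -le "$2" ]; do
        # -f makes curl fail on HTTP errors; --max-time bounds each try
        if curl -sf -o /dev/null --max-time 5 "$1"; then
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "$3"
    done
    return 1
}
```

For a first boot that still has to download weights, something like `wait_for_healthz http://localhost:8780/healthz 180 10` allows up to about 30 minutes.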

4. Verify the deployment

# Health check (always unauthenticated)
curl -sf http://localhost:8780/healthz
# → {"ok":true}

# Any /api/* call needs API_KEY
curl -sS http://localhost:8780/api/voiceprints \
    -H "Authorization: Bearer $API_KEY"
# → [] (empty on first boot)

Open http://localhost:8780/ in a browser for a minimal web UI you can upload audio to.

5. Wire it into BetterAINote

In BetterAINote → Settings → Transcription, set:

Private transcription base URL: http://<host>:8780
Private transcription API key: the exact API_KEY from .env

Once saved, the BetterAINote worker will route every recording through this service. See api.en.md for the full contract.

Upgrades

cd voscript
git pull
docker compose up -d --build

Model weights in ./models/ are cached, so a rebuild won't re-download them.

Upgrading from 0.2.x to 0.3.0

Two extra steps:

Accept one more HuggingFace gated model at https://huggingface.co/pyannote/wespeaker-voxceleb-resnet34-LM — WeSpeaker replaces ECAPA-TDNN for speaker embeddings. Without the click-through you'll get a 403 on first boot.
Re-enroll every existing voiceprint. 0.3.0 uses a new embedding space (WeSpeaker ≠ ECAPA, cosine distances between the two spaces are meaningless), so:
- Quickest: rm data/voiceprints/voiceprints.db, let the container rebuild it empty, then re-enroll from fresh transcriptions.
- Or per-speaker: curl -X DELETE -H "Authorization: Bearer $API_KEY" http://<host>:8780/api/voiceprints/<spk_id> for each enrolled id, then re-enroll.

Legacy index.json + .npy files from 0.2.x are auto-imported into the new sqlite store on first boot — that doesn't lose data, but the imported embeddings are still ECAPA-based and won't match any WeSpeaker-generated queries. You still have to re-enroll.
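
For the per-speaker route, the DELETE call can be wrapped in a loop. The sketch below takes the ids as arguments rather than discovering them, because the response shape of /api/voiceprints is not described here. `delete_voiceprints`, `BASE_URL`, and `DRY_RUN` are hypothetical names introduced for illustration.

```shell
# Delete each listed voiceprint id using the DELETE call shown above.
# Set DRY_RUN=1 to print the requests instead of sending them.
# BASE_URL defaults to the local service; API_KEY must be exported.
delete_voiceprints() {
    for spk_id in "$@"; do
        url="${BASE_URL:-http://localhost:8780}/api/voiceprints/$spk_id"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "DELETE $url"
        else
            curl -sS -X DELETE -H "Authorization: Bearer $API_KEY" "$url"
        fi
    done
}
```

Run it once with DRY_RUN=1 to confirm the list of ids, then again without it to send the requests.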

Troubleshooting

`nvidia-smi` not found inside container

→ NVIDIA Container Toolkit missing or Docker wasn't restarted. Redo the toolkit install and Docker restart in 0.1.

`403 Forbidden` downloading pyannote models

→ Gated-model terms not accepted. Revisit HuggingFace prep in step 0.

`401 Unauthorized` downloading pyannote models

→ HF_TOKEN missing, wrong, or expired. Check .env.

macOS runs but is painfully slow

→ Confirm DEVICE=cpu and WHISPER_MODEL=medium. Large models on CPU really are this slow — consider running the service on a Linux/Windows host with an NVIDIA GPU instead.

Crashes with `np.NaN was removed`

→ Your requirements.txt has been edited and numpy upgraded to 2.x. Keep the numpy<2.0 pin.

Service is up but BetterAINote can't reach it

→ Check that API_KEY matches exactly on both sides (case/whitespace), and that BetterAINote's host can actually reach :8780 (firewall, docker networks).

What do I back up?

→ Just data/voiceprints/. Everything else can be re-derived from the original audio.
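
A backup can be as simple as a timestamped tarball. This is a minimal sketch with a hypothetical helper name (`backup_voiceprints`). Consider stopping the container first, since copying a live SQLite file can capture a partial write.

```shell
# Archive <data dir>/voiceprints into a timestamped tarball and print its path.
# $1 = data dir (the one containing voiceprints/), $2 = output dir
backup_voiceprints() {
    [ -d "$1/voiceprints" ] || { echo "no voiceprints dir in $1" >&2; return 1; }
    mkdir -p "$2" || return 1
    archive="$2/voiceprints-$(date +%Y%m%d-%H%M%S).tar.gz"
    # -C keeps paths inside the archive relative to the data dir
    tar -czf "$archive" -C "$1" voiceprints && echo "$archive"
}
```

For example, `backup_voiceprints ./data ./backups` writes an archive under ./backups. Restoring is the reverse: `tar -xzf <archive> -C ./data`.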

See security.en.md for deployment-risk details.
