Quickstart

简体中文 | English

This guide is for first-time deployers. Expect 15–30 minutes, most of it waiting for model weights to download.

0. Pick your deployment path

| Platform | Path | Quality | Notes |
| --- | --- | --- | --- |
| Linux + NVIDIA GPU | docker-compose (main flow below) | Best | Recommended, the main path of this doc |
| Windows 11 + WSL2 + NVIDIA GPU | docker-compose (Linux flow inside WSL2) | Best | See 0.2 |
| macOS Apple Silicon (M1/M2/M3/M4) | native venv, CPU-only | Usable but slow | Docker Desktop on macOS cannot pass through the GPU; see 0.3 |
| macOS Intel | native venv, CPU-only | Usable but very slow | Same as Apple Silicon; see 0.3 |

CPU / low-VRAM deployments: use WHISPER_MODEL=medium (covered in step 2). It's 3–4× faster than large-v3 and quality stays acceptable, especially for Chinese and English.

HuggingFace prep (all platforms)

Creating the token and accepting the gated-model terms are independent steps and can be done in either order. Weights only download once all three model agreements and a valid token are in place: a token without accepted terms returns 403, and accepted terms without a token returns 401.
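
The 401/403 distinction above can be captured in a small lookup, which is handy when reading download errors in the logs. This is a minimal sketch; `explain_hf_status` is a hypothetical helper name and is not part of voscript.

```shell
# Hypothetical helper (not part of voscript): map the HTTP status of a failed
# weight download to the fix described in the HuggingFace prep notes.
explain_hf_status() {
  case "$1" in
    401) echo "401: HF_TOKEN is missing, wrong, or expired; check .env" ;;
    403) echo "403: token accepted, but gated-model terms are not accepted yet" ;;
    *)   echo "status $1: not one of the two gated-download failures" ;;
  esac
}

explain_hf_status 403
```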

0.1 Linux + NVIDIA GPU (main path)

  • Docker 24+
  • NVIDIA Container Toolkit (without it, compose fails with "could not select device driver"):
    # Ubuntu example
    curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
        sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
    curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
        sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
        sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
    sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
    sudo nvidia-ctk runtime configure --runtime=docker
    sudo systemctl restart docker
    
    # verify
    docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

VRAM guidance:

  • ≥ 12 GB → default large-v3
  • 8–12 GB → large-v3 still fits (~9 GB in practice), as long as the GPU isn't shared with another heavy job
  • < 8 GB → set WHISPER_MODEL=medium
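
The VRAM guidance can be written as a one-branch rule. The sketch below assumes VRAM is given in MiB and treats 1 GB as 1024 MiB; `pick_whisper_model` is a hypothetical name used only for illustration.

```shell
# Hypothetical helper: suggest a WHISPER_MODEL from total VRAM in MiB.
# Assumption: "8 GB" in the guidance is read as 8192 MiB.
pick_whisper_model() {
  vram_mib=$1
  if [ "$vram_mib" -lt 8192 ]; then
    echo medium      # below 8 GB
  else
    echo large-v3    # 8 GB and up; at 8-12 GB keep the GPU free of other heavy jobs
  fi
}

echo "WHISPER_MODEL=$(pick_whisper_model 6144)"
echo "WHISPER_MODEL=$(pick_whisper_model 12288)"
```

On a host with the NVIDIA driver installed, the input value can be read with `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits`, which reports MiB per GPU.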

Then jump to step 1.

0.2 Windows 11 + WSL2

The officially supported path is WSL2 + NVIDIA Container Toolkit. Docker Desktop's GPU passthrough is itself routed through WSL2 under the hood, so the two are equivalent.

Prereqs:

  • Windows 11, or Windows 10 21H2+
  • WSL2 Ubuntu installed (wsl --install -d Ubuntu)
  • NVIDIA driver ≥ 470 on Windows
  • Docker available inside WSL2 (either install docker directly in WSL2, or install Docker Desktop on Windows and enable "Use WSL 2 based engine" + "Enable integration with my default WSL distro")

From then on, every command runs inside the WSL2 Ubuntu shell. Follow 0.1 Linux. Verify:

# inside WSL2 Ubuntu
nvidia-smi                        # should see your NVIDIA GPU
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

If both work, the Linux flow applies verbatim.

0.3 macOS (Apple Silicon / Intel)

Important: Docker Desktop on macOS cannot pass a GPU into containers (no CUDA, no Metal). The docker-compose path on macOS is CPU-only and large-v3 is effectively unusable there. Use native venv + CPU and drop the model to medium.

Prereqs:

  • Python 3.11 (recommend brew install python@3.11)
  • ffmpeg (brew install ffmpeg)
  • libsndfile (brew install libsndfile)
  • 16 GB RAM or more (Apple Silicon unified memory counts)

# 1. clone
git clone https://github.com/MapleEve/voscript.git
cd voscript

# 2. create venv at the repo root
python3.11 -m venv .venv
source .venv/bin/activate

# 3. install deps (on macOS, torch resolves to the CPU/MPS wheel, not CUDA)
pip install --upgrade pip
pip install -r app/requirements.txt

# 4. set env vars
export HF_TOKEN="<HF_TOKEN>"
export API_KEY=$(openssl rand -hex 32)
export DEVICE=cpu                 # macOS must be cpu (pyannote's MPS support is incomplete)
export WHISPER_MODEL=medium       # large-v3 on CPU is too slow
export DATA_DIR=$(pwd)/data
mkdir -p "$DATA_DIR"

# Print and note this API_KEY: BetterAINote needs the exact same value
echo "API_KEY=$API_KEY"

# 5. launch
cd app
uvicorn main:app --host 0.0.0.0 --port 8780

Expected performance (ballpark):

  • M2 Pro / M3 Pro + medium + 1 minute of audio ≈ 30–60 s
  • M1 / Intel + medium + 1 minute of audio ≈ 1.5–3 min
  • large-v3 on CPU is 3–5× slower. Not recommended.

Known limitations:

  • Docker-compose path is not supported on macOS
  • No MPS acceleration (pyannote 3.1 has unimplemented ops on MPS, which either error out or silently fall back to CPU)
  • If you have access to a Linux / Windows host with an NVIDIA GPU, run the service there and use the Mac as the client

Everything after this point (config, wiring into BetterAINote) is the same; just skip every Docker step.

1. Clone the repo

git clone https://github.com/MapleEve/voscript.git
cd voscript

2. Configure .env

cp .env.example .env

Edit .env. At minimum fill in:

HF_TOKEN=<HF_TOKEN>
API_KEY=<API_KEY>

If you're short on VRAM (< 8 GB, or 8–12 GB on a GPU shared with other heavy jobs; see the VRAM guidance in 0.1), or deploying on macOS / CPU-only, drop the model one size:

WHISPER_MODEL=medium

Choices: tiny / base / small / medium / large-v3. medium gives a ~3–4× speed-up with only a small quality drop, especially for Chinese and English.

If you are on a China network, also add:

HF_ENDPOINT=https://hf-mirror.com

Generate a strong API key: openssl rand -hex 32
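
Before starting the service it can help to confirm that both required keys were actually filled in. The following is a minimal sketch, assuming a plain KEY=value file and treating any `<...>` value as an unfilled placeholder; `check_env_file` is a hypothetical helper and is not the check performed by the project's own scripts.

```shell
# Hypothetical helper: report required keys that are absent, empty, or still
# set to a "<...>" placeholder. Returns non-zero if any key fails the check.
check_env_file() {
  env_file=$1
  missing=0
  for key in HF_TOKEN API_KEY; do
    # If a key appears more than once, inspect its last assignment
    # (an assumption about precedence, not confirmed by the project).
    value=$(sed -n "s/^${key}=//p" "$env_file" 2>/dev/null | tail -n 1)
    case "$value" in
      ""|"<"*">") echo "$key is not set in $env_file"; missing=1 ;;
    esac
  done
  return "$missing"
}
```

A typical call would be `check_env_file .env || exit 1` right before `docker compose up`.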

Every other env var has a sane default — see .env.example. For the complete list, defaults, API override semantics, and tuning boundaries that are not exposed yet, see configuration.en.md. A few worth knowing about:

| Variable | Default | Effect |
| --- | --- | --- |
| MAX_UPLOAD_BYTES | 2147483648 (2 GiB) | Per-request upload cap; requests past this get HTTP 413 |
| APP_UID | 1000 | uid the container runs as; must match the owner of DATA_DIR on the host |
| APP_GID | 1000 | same, gid |
| JOBS_MAX_CACHE | 200 | LRU cap for the in-memory job dictionary; evicted jobs remain queryable via disk status.json |
| FFMPEG_TIMEOUT_SEC | 1800 | Timeout in seconds for ffmpeg conversion; returns 504 on expiry |
| CUDA_VISIBLE_DEVICES | unset | Optional NVIDIA visibility limit; by default this variable is not injected and compose requests every Docker-exposed GPU. To restrict visibility, add it through docker-compose.override.yml or another explicit operator env override. When set, in-container cuda:N indexes are remapped from the visible set and may not match physical host GPU numbers |
| MODEL_IDLE_TIMEOUT_SEC | 180 | GPU model idle-unload timeout, defaulting to 180 seconds (3 minutes); set 0 to disable idle unload and keep models resident. When enabled, ASR, diarization, and embedding each reselect the best visible CUDA device on their next lazy load |
| ALLOW_NO_AUTH | 0 | Set to 1 to suppress the startup warning when no API_KEY is configured (explicitly confirms unauthenticated mode) |
| DENOISE_MODEL | none | Service default denoise backend: none, deepfilternet, or noisereduce; API requests may override it per job |
| DENOISE_SNR_THRESHOLD | 10.0 | DeepFilterNet SNR gate in dB; audio at or above this value skips DeepFilterNet when deepfilternet is selected; noisereduce is not gated |
| VOICEPRINT_THRESHOLD | 0.75 | Base raw-cosine voiceprint threshold before per-speaker adaptive adjustment |
| PYANNOTE_MIN_DURATION_OFF | 0.5 | Pyannote off-turn smoothing, used to reduce over-segmentation of short pauses |
| MIN_EMBED_DURATION | 1.5 | Minimum diarization turn duration used for speaker embedding extraction |
| MAX_EMBED_DURATION | 10.0 | Maximum per-turn audio window used for speaker embedding extraction |
| WHISPERX_ALIGN_DISABLED_LANGUAGES | empty | Comma-separated languages that explicitly skip WhisperX forced alignment; use only as a temporary operational fallback |
| WHISPERX_ALIGN_DEVICE | cpu | Runtime device for WhisperX forced alignment; CPU is the default to keep alignment isolated from GPU ASR / speaker-embedding runtimes |
| WHISPERX_ALIGN_MODEL_MAP | empty | Comma-separated lang=model overrides, for example zh=your-org/your-zh-align-model |
| WHISPERX_ALIGN_MODEL_DIR | empty | Optional alignment model cache directory passed through when the installed WhisperX supports it |
| WHISPERX_ALIGN_CACHE_ONLY | 0 | Set to 1 to request cache-only alignment model loading when supported by the installed WhisperX version |
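
As an illustration of the MAX_UPLOAD_BYTES row, a client can compare a file's size against the cap before uploading and avoid a guaranteed HTTP 413. This is a rough sketch: `fits_upload_cap` is a hypothetical helper, and it only looks at the file itself, so the real request (which carries some multipart overhead) is slightly larger.

```shell
# Default mirrors the MAX_UPLOAD_BYTES default listed above (2 GiB).
MAX_UPLOAD_BYTES="${MAX_UPLOAD_BYTES:-2147483648}"

# Hypothetical helper: succeeds when the file is no larger than the cap.
fits_upload_cap() {
  size=$(wc -c < "$1" | tr -d '[:space:]')
  [ "$size" -le "$MAX_UPLOAD_BYTES" ]
}
```

For example, `fits_upload_cap meeting.wav || echo "file exceeds the upload cap"` (the file name is a placeholder).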

For POST /api/transcribe, omitting denoise_model means "use the service default from DENOISE_MODEL". Sending denoise_model=none is the explicit per-request opt-out. Sending snr_threshold always overrides DENOISE_SNR_THRESHOLD for that request only, but only affects deepfilternet; noisereduce runs whenever selected. For every supported setting, the Whisper / ASR parameters that are not exposed as env yet, and AS-norm cohort preservation semantics, see configuration.en.md.
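
The override rules in the paragraph above can be modelled as two small functions. This is only a sketch of the described semantics, not the service's code; `resolve_denoise` and `denoise_decision` are hypothetical names, and an empty string stands for "field omitted from the request".

```shell
# Hypothetical: which backend a job uses, given the request's denoise_model.
resolve_denoise() {
  requested=$1
  if [ -n "$requested" ]; then
    echo "$requested"                 # explicit per-request value, including "none"
  else
    echo "${DENOISE_MODEL:-none}"     # service default
  fi
}

# Hypothetical: prints "run" or "skip" for a backend, a measured SNR in dB,
# and an optional per-request snr_threshold (third argument).
denoise_decision() {
  backend=$1
  snr_db=$2
  threshold=${3:-${DENOISE_SNR_THRESHOLD:-10.0}}
  case "$backend" in
    none)        echo skip ;;
    noisereduce) echo run ;;          # never gated by SNR
    deepfilternet)
      # Audio at or above the threshold skips DeepFilterNet.
      if awk -v s="$snr_db" -v t="$threshold" 'BEGIN { exit !(s < t) }'; then
        echo run
      else
        echo skip
      fi ;;
  esac
}

DENOISE_MODEL=deepfilternet
denoise_decision "$(resolve_denoise "")" 7.5
```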

Chinese word-level alignment is attempted by default and runs on CPU by default to keep wav2vec2 alignment isolated from GPU ASR / speaker-embedding runtimes. The Docker image uses PyTorch 2.6.0 so recent transformers safety checks can load the default Chinese .bin alignment weights. If you run a custom image with older torch, use torch>=2.6 or a trusted replacement alignment model that provides safetensors; only set WHISPERX_ALIGN_DISABLED_LANGUAGES=zh if you intentionally want a temporary segment-level fallback.

Host directory ownership

The container runs as uid 1000 by default, so DATA_DIR (default ./data) and MODEL_CACHE_DIR (default ./models) must be writable by uid 1000 on the host. On most Linux distros the first regular user is uid 1000, so plain mkdir -p data models before docker compose up just works. If your user is a different uid, pick one:

# A. change the host dirs to 1000:1000
sudo chown -R 1000:1000 ./data ./models

# B. or tell the container to use YOUR uid
echo "APP_UID=$(id -u)" >> .env
echo "APP_GID=$(id -g)" >> .env
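
To see whether either option is needed, compare the owner of each host directory with the uid the container will use. The sketch below checks ownership, which is a sufficient (though stricter than necessary) condition for the directory being writable; `owner_uid` and `dir_matches_app_uid` are hypothetical helper names.

```shell
# Numeric owner uid of a path, taken from the third field of `ls -ldn`.
owner_uid() {
  ls -ldn "$1" | awk '{print $3}'
}

# Hypothetical helper: succeeds when the directory is owned by APP_UID
# (falling back to the default of 1000).
dir_matches_app_uid() {
  [ "$(owner_uid "$1")" = "${APP_UID:-1000}" ]
}

# Report any of the default host directories that exist but have another owner.
for d in ./data ./models; do
  [ -d "$d" ] || continue
  dir_matches_app_uid "$d" || echo "$d is not owned by uid ${APP_UID:-1000}"
done
```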

3. Start the service

docker compose up -d --build

The first boot downloads ~5 GB of model weights into ./models/. Watch progress with:

docker logs -f voscript

You are good when you see Uvicorn running on http://0.0.0.0:8780.

Or run the bundled helper:

./scripts/deploy.sh

It checks .env, starts the container, and waits for /healthz.
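
The helper's own implementation is not reproduced here. If you want a similar wait-for-health step in your own automation, a generic retry loop is one way to do it; `wait_until` below is a hypothetical function, and the attempt count and delay in the usage comment are arbitrary choices.

```shell
# wait_until <attempts> <delay_seconds> <command...>
# Hypothetical helper: re-run the command until it succeeds or attempts run out.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while :; do
    "$@" >/dev/null 2>&1 && return 0
    [ "$i" -ge "$attempts" ] && return 1
    i=$((i + 1))
    sleep "$delay"
  done
}

# Usage after `docker compose up -d --build` (commented out so pasting the
# snippet has no side effects):
# wait_until 120 5 curl -sf http://localhost:8780/healthz && echo "service is up"
```

The generous attempt budget in the usage line reflects that the first boot spends most of its time downloading model weights.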

4. Verify the deployment

# Health check (always unauthenticated)
curl -sf http://localhost:8780/healthz
# → {"ok":true}

# Any /api/* call needs API_KEY
curl -sS http://localhost:8780/api/voiceprints \
    -H "Authorization: Bearer $API_KEY"
# → [] (empty on first boot)

Open http://localhost:8780/ in a browser for a minimal web UI you can upload audio to.

5. Wire it into BetterAINote

In BetterAINote → Settings → Transcription, set:

  • Private transcription base URL: http://<host>:8780
  • Private transcription API key: the exact API_KEY from .env

Once saved, the BetterAINote worker will route every recording through this service. See api.en.md for the full contract.

Upgrades

cd voscript
git pull
docker compose up -d --build

Model weights in ./models/ are cached, so a rebuild won't re-download them.

Upgrading from 0.2.x to 0.3.0

Two extra steps:

  1. Accept one more HuggingFace gated model at https://huggingface.co/pyannote/wespeaker-voxceleb-resnet34-LM — WeSpeaker replaces ECAPA-TDNN for speaker embeddings. Without the click-through you'll get a 403 on first boot.
  2. Re-enroll every existing voiceprint. 0.3.0 uses a new embedding space (WeSpeaker ≠ ECAPA, cosine distances between the two spaces are meaningless), so:
    • Quickest: rm data/voiceprints/voiceprints.db, let the container rebuild it empty, then re-enroll from fresh transcriptions.
    • Or per-speaker: curl -X DELETE -H "Authorization: Bearer $API_KEY" http://<host>:8780/api/voiceprints/<spk_id> for each enrolled id, then re-enroll.

Legacy index.json + .npy files from 0.2.x are auto-imported into the new SQLite store on first boot. No data is lost, but the imported embeddings are still ECAPA-based and won't match any WeSpeaker-generated queries, so you still have to re-enroll.

Troubleshooting

nvidia-smi not found inside container

→ NVIDIA Container Toolkit missing or Docker wasn't restarted. Redo the toolkit setup in 0.1.

403 Forbidden downloading pyannote models

→ Gated-model terms not accepted. Revisit HuggingFace prep in step 0.

401 Unauthorized downloading pyannote models

→ HF_TOKEN missing, wrong, or expired. Check .env.

macOS runs but is painfully slow

→ Confirm DEVICE=cpu and WHISPER_MODEL=medium. Large models on CPU really are this slow — consider running the service on a Linux/Windows host with an NVIDIA GPU instead.

Crashes with np.NaN was removed

→ Your requirements.txt has been edited and numpy upgraded to 2.x. Keep the numpy<2.0 pin.

Service is up but BetterAINote can't reach it

→ Check that API_KEY matches exactly on both sides (case/whitespace), and that BetterAINote's host can actually reach :8780 (firewall, docker networks).
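
Stray whitespace is the hardest mismatch to spot by eye. A quick local check, assuming you can paste both values into a shell, might look like the sketch below; `key_is_clean` and `keys_match` are hypothetical helpers.

```shell
# Hypothetical helper: succeeds only if the key is non-empty and contains no
# spaces, tabs, or carriage returns (common copy-paste and CRLF leftovers).
key_is_clean() {
  cr=$(printf '\r')
  tab=$(printf '\t')
  case "$1" in
    "")                      return 1 ;;
    *" "*|*"$tab"*|*"$cr"*)  return 1 ;;
    *)                       return 0 ;;
  esac
}

# Hypothetical helper: exact, case-sensitive comparison of the two sides.
keys_match() {
  [ "$1" = "$2" ]
}
```

For example, `key_is_clean "$API_KEY" || echo "API_KEY contains whitespace"` flags a value that was saved with a trailing space or Windows line ending.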

What do I back up?

→ Just data/voiceprints/. Everything else can be re-derived from the original audio.

See security.en.md for deployment-risk details.