OpenClaw voice assistant — ML wake word, speaker ID, voice-gated security, multi-provider TTS, and Apple Silicon support
Kiwi Voice is a real-time voice interface that turns OpenClaw into a hands-free assistant. It captures audio from your microphone, detects the wake word via ML model or text matching, recognizes speech locally via Faster Whisper (or MLX Whisper on Apple Silicon), identifies who is speaking, enforces voice-based security policies, talks to any LLM through OpenClaw's WebSocket gateway, and speaks the response back through one of five TTS providers — all in a continuous loop.
Think of it as Alexa/Siri, but self-hosted, privacy-first, and plugged into your own AI stack.
| Feature | Description |
|---|---|
| 🗣️ Wake Word | Text-based fuzzy matching or ML detection via OpenWakeWord — low-latency ONNX model on raw audio (~2% CPU) |
| 🎭 Speaker ID | Voiceprint recognition via pyannote embeddings — knows who's talking |
| 🔐 Voice Security | Priority hierarchy (Owner → Friend → Guest → Blocked) with Telegram approval for dangerous commands |
| 🔊 Multi-Provider TTS | ElevenLabs (cloud), Piper (local/free), Qwen3-TTS (local GPU / RunPod), Kokoro ONNX (local/free, 14 voices) |
| ⚡ Streaming TTS | Sentence-aware chunking — starts speaking before the LLM finishes |
| 🛑 Barge-In | Interrupt the assistant mid-sentence by speaking over it |
| 🧠 Auto-Learning | Automatically remembers new voices after first interaction |
| 🔌 WebSocket | Native OpenClaw Gateway v3 protocol with delta/final streaming |
| 🌍 Multi-Language | Built-in i18n with YAML locale files — switch language with a single config field |
| 🍎 MLX Whisper | Optional Apple Silicon optimized STT via Lightning Whisper MLX (~10x faster on M-series) |
| 🌐 Web Dashboard | Real-time glassmorphism dashboard with live status, event log, controls, and personality switching |
| 🎙️ Web Microphone | Talk to Kiwi from any browser — WebSocket audio streaming with AudioWorklet, no local mic setup needed |
```
Mic → [OpenWakeWord OR VAD + Energy Detection]
   → Faster Whisper STT (or MLX Whisper on Apple Silicon)
   → Wake Word Check (text fuzzy match or ML pre-detection)
   → Speaker ID (pyannote) → Priority Gate → Voice Security
   → OpenClaw Gateway (WebSocket) → LLM response stream
   → Real-time streaming TTS → Speaker Output (with barge-in)
   → Back to listening
```
- Python 3.10+
- FFmpeg (for audio processing)
- OpenClaw running locally
- GPU with CUDA recommended (for STT & local TTS), but not required
```bash
git clone https://github.com/ekleziast/kiwi-voice.git
cd kiwi-voice

python -m venv venv
# Linux / macOS
source venv/bin/activate
# Windows
venv\Scripts\activate

pip install -r requirements.txt

cp .env.example .env
# Fill in your API keys (ElevenLabs, RunPod, Telegram — all optional)
```

Edit `config.yaml` to match your setup:
```yaml
# Language: controls UI strings, STT, TTS, wake word, and command patterns
language: "ru"                # ru | en (add more in kiwi/locales/)

# TTS provider: elevenlabs | piper | qwen3 | kokoro
tts:
  provider: "piper"           # Free, local, no API key needed

# STT engine
stt:
  engine: "faster-whisper"    # faster-whisper | mlx-whisper | elevenlabs
  model: "small"              # small = fast, large = accurate
  device: "cuda"              # cuda | cpu

# Wake word
wake_word:
  engine: "text"              # text (fuzzy match) | openwakeword (ML model)
  keyword: "kiwi"
  model: "hey_jarvis"         # OpenWakeWord model (built-in or path to .onnx)
  threshold: 0.5              # Detection sensitivity (0.0–1.0)

# Owner name (used for voice commands like "I'm <name>")
speaker_priority:
  owner:
    name: "Owner"             # Change to your name
```

On first launch, Kiwi registers as a new device with the Gateway. You need to approve the pairing request before it can connect:
1. Start Kiwi in one terminal:

   ```bash
   python -m kiwi
   ```

2. In a second terminal, approve the pending device:

   ```bash
   openclaw devices list     # see pending requests
   openclaw devices approve  # approve the latest one
   ```
Kiwi will automatically reconnect once approved. This is a one-time step — the device identity is saved to `device-identity.json` and persists across restarts.
Note: If you skip device pairing, Kiwi will fall back to OpenClaw CLI mode, which is functional but slower due to process spawning overhead. WebSocket pairing is recommended for real-time voice interaction.
```bash
python -m kiwi
```

Or use the launcher scripts:
```bash
# Windows
start.bat
# or
.\start.ps1

# Linux / macOS
python -m kiwi
```

| Engine | Quality | Latency | Cost | Local | GPU |
|---|---|---|---|---|---|
| Faster Whisper | Excellent | ~1–3s | Free | Yes | Optional (CUDA) |
| ElevenLabs | Excellent | ~0.3–0.5s | ~$0.018/min | No | No |
| MLX Whisper | Excellent | ~0.5–1s | Free | Yes | Apple Silicon |
Switch engines in `config.yaml` or via environment variable:

```bash
KIWI_STT_ENGINE=elevenlabs python -m kiwi
```

| Provider | Quality | Latency | Cost | Local GPU |
|---|---|---|---|---|
| ElevenLabs | Excellent | ~0.3–0.5s | ~$0.30/1K chars | No |
| Qwen3-TTS (local) | High | ~1–3s | Free | Yes (CUDA) |
| Qwen3-TTS (RunPod) | High | ~2–5s | ~$0.0003/sec | No |
| Kokoro ONNX | High | <0.5s | Free | No |
| Piper | Good | <0.5s | Free | No |
Switch providers in `config.yaml` or via environment variable:

```bash
KIWI_TTS_PROVIDER=kokoro python -m kiwi
```

Kokoro ONNX — free, fully local TTS with 14 voices at 24 kHz. Models auto-download on first use (~340 MB). Supports English, Japanese, Chinese, Korean, and several European languages. Russian is not yet supported.
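To sanity-check Kokoro outside of Kiwi, here is a standalone sketch with the kokoro-onnx package (file names follow that package's docs; Kiwi manages its own model downloads):

```python
import sounddevice as sd
from kokoro_onnx import Kokoro

# Standalone sanity check, not Kiwi's TTS wrapper. The model and
# voices files come from the kokoro-onnx project's releases.
kokoro = Kokoro("kokoro-v1.0.onnx", "voices-v1.0.bin")
samples, sample_rate = kokoro.create(
    "Hello from Kokoro!", voice="af_sarah", speed=1.0, lang="en-us"
)
sd.play(samples, sample_rate)
sd.wait()
```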
Instead of running full Whisper transcription to detect the wake word, you can use OpenWakeWord — a small ONNX model (~10MB) that listens to raw audio in real time with ~80ms latency and ~2% CPU usage.
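Under the hood, detection is just scoring fixed-size audio frames. A minimal standalone sketch using the openwakeword package directly (frame size and threshold are illustrative, not Kiwi's internal code):

```python
import numpy as np
import sounddevice as sd
from openwakeword.model import Model

# Illustrative sketch, not Kiwi's internal code. Pretrained models
# must be fetched once, e.g. via openwakeword.utils.download_models().
model = Model(wakeword_models=["hey_jarvis"])

FRAME = 1280  # 80 ms at 16 kHz, the frame size openWakeWord expects
with sd.InputStream(samplerate=16000, channels=1, dtype="int16") as mic:
    while True:
        frame, _ = mic.read(FRAME)
        scores = model.predict(np.squeeze(frame))  # {model_name: score}
        if max(scores.values()) > 0.5:             # cf. threshold in config
            print("Wake word detected!")
```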
Enable it in `config.yaml`:

```yaml
wake_word:
  engine: "openwakeword"
  model: "hey_jarvis"   # Built-in: hey_jarvis, alexa, hey_mycroft
  threshold: 0.5
```

Train a custom wake word (e.g. "hey kiwi") using Google Colab — no voice recordings needed:
```bash
python scripts/train_wake_word.py --phrase "hey kiwi"
```

Kiwi identifies speakers by voiceprint and enforces a priority hierarchy (a minimal embedding-matching sketch follows the list):
- **OWNER** (priority 0) — Full access, cannot be blocked
- **FRIEND** (priority 1) — Dangerous commands require Telegram approval
- **GUEST** (priority 2) — All sensitive commands require approval
- **BLOCKED** (priority 99) — Completely ignored
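Conceptually, speaker ID reduces to comparing voice embeddings. A hedged sketch with pyannote.audio (the model name, token, and 0.5 cutoff are illustrative; Kiwi's actual enrollment logic differs):

```python
from pyannote.audio import Inference, Model
from scipy.spatial.distance import cosine

# Sketch only. "pyannote/embedding" is gated on Hugging Face,
# so an access token is required.
model = Model.from_pretrained("pyannote/embedding", use_auth_token="hf_...")
inference = Inference(model, window="whole")

owner = inference("owner_sample.wav")         # enrolled voiceprint
current = inference("current_utterance.wav")  # who is speaking now?

# Lower cosine distance means more likely the same speaker;
# 0.5 is an illustrative cutoff, not Kiwi's tuned value.
if cosine(owner, current) < 0.5:
    print("Speaker identified as OWNER")
```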
| Command | Action |
|---|---|
| "Kiwi, remember my voice" | Register your voiceprint as owner |
| "Kiwi, this is my friend [name]" | Add someone as a friend |
| "Kiwi, block them" | Block the last speaker |
| "Kiwi, who is speaking?" | Identify the current speaker |
| "Kiwi, what voices do you know?" | List all known voiceprints |
> 💡 Voice commands are language-dependent. Set `language` in `config.yaml` to match your locale. See `kiwi/locales/*.yaml` for the full command lists.
**Pre-filter (Kiwi)** — catches dangerous commands before they reach the LLM:

- Regex-based `DangerousCommandDetector` classifies commands as SAFE / WARNING / DANGEROUS / CRITICAL
- Non-owner actionable commands are held until the owner approves (by voice or Telegram)
**Post-filter (OpenClaw)** — catches dangerous shell commands the LLM tries to execute:

- When the OpenClaw agent calls the `exec` tool, the Gateway broadcasts an `exec.approval.requested` event
- Kiwi subscribes to this event and announces the command to the owner via voice
- The owner approves or denies by voice ("allow" / "deny") or via Telegram inline buttons
- The decision is sent back to OpenClaw via `exec.approval.resolve`
- Auto-deny on timeout (55 s) if no response
This means even if a voice command passes the pre-filter, the actual shell execution still requires approval through OpenClaw's own security layer.
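For orientation, the round trip could look roughly like this over the Gateway WebSocket. This is a hypothetical sketch: the URL and payload fields (`id`, `decision`) are invented placeholders; only the two event names come from the protocol described above:

```python
import asyncio
import json

import websockets

async def approval_loop():
    # Hypothetical URL and message shapes; only the event names
    # exec.approval.requested / exec.approval.resolve are real.
    async with websockets.connect("ws://localhost:18789") as ws:
        async for raw in ws:
            event = json.loads(raw)
            if event.get("type") == "exec.approval.requested":
                # Kiwi would announce the command and wait for the
                # owner; here we auto-approve for illustration.
                await ws.send(json.dumps({
                    "type": "exec.approval.resolve",
                    "id": event.get("id"),   # hypothetical field
                    "decision": "allow",
                }))

asyncio.run(approval_loop())
```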
When a non-owner speaker issues a potentially dangerous command, Kiwi sends a confirmation request to the owner via Telegram. The owner can approve or deny it from their phone. Telegram is also used as a secondary channel for OpenClaw exec approvals.
Set `KIWI_TELEGRAM_BOT_TOKEN` and `KIWI_TELEGRAM_CHAT_ID` in `.env` to enable.
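The approval prompt itself is plain Telegram Bot API. A sketch of the kind of message Kiwi might send (the text and callback data are illustrative, not Kiwi's own):

```python
import os

import requests

# Sketch of an approval prompt via the Telegram Bot API; the text
# and callback_data values are illustrative placeholders.
token = os.environ["KIWI_TELEGRAM_BOT_TOKEN"]
chat_id = os.environ["KIWI_TELEGRAM_CHAT_ID"]

requests.post(
    f"https://api.telegram.org/bot{token}/sendMessage",
    json={
        "chat_id": chat_id,
        "text": "Guest asked: 'delete the backups'. Allow?",
        "reply_markup": {"inline_keyboard": [[
            {"text": "✅ Approve", "callback_data": "approve"},
            {"text": "❌ Deny", "callback_data": "deny"},
        ]]},
    },
    timeout=10,
)
```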
| Variable | Description |
|---|---|
| `KIWI_ELEVENLABS_API_KEY` | ElevenLabs API key |
| `RUNPOD_API_KEY` | RunPod API key (for Qwen3-TTS serverless) |
| `RUNPOD_TTS_ENDPOINT_ID` | RunPod endpoint ID |
| `KIWI_TELEGRAM_BOT_TOKEN` | Telegram bot token (voice security) |
| `KIWI_TELEGRAM_CHAT_ID` | Telegram chat ID for approval messages |
| `KIWI_TTS_PROVIDER` | Override TTS provider (`elevenlabs`, `piper`, `qwen3`, `kokoro`) |
| `KIWI_WAKE_ENGINE` | Override wake word engine (`text`, `openwakeword`) |
| `KIWI_WAKE_MODEL` | Override OpenWakeWord model name or path |
| `KIWI_WAKE_THRESHOLD` | Override OpenWakeWord detection threshold |
| `KIWI_STT_ENGINE` | Override STT engine (`faster-whisper`, `mlx-whisper`) |
| `KIWI_FFMPEG_PATH` | Custom FFmpeg path |
| `KIWI_LANGUAGE` | Override language/locale (`ru`, `en`, etc.) |
| `KIWI_DEBUG` | Enable debug logging |
| `LLM_MODEL` | Override LLM model |
See `.env.example` for the full list.
```bash
# Run tests
pytest tests/
```

Code conventions:

- Logging: `kiwi_log("TAG", "message", level="INFO")` — never `print()`
- Paths: `PROJECT_ROOT` from the `kiwi` package
- Optional modules: try/except + `*_AVAILABLE` flags (see the sketch below)
- Threads: daemon threads + crash protection
- GPU: auto-detect CUDA with CPU fallback
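The optional-module convention in practice, as a minimal sketch (the module name is only an example):

```python
# Optional dependency guarded by an *_AVAILABLE flag, per the
# conventions above. "kokoro_onnx" is only an example module.
try:
    from kokoro_onnx import Kokoro
    KOKORO_AVAILABLE = True
except ImportError:
    KOKORO_AVAILABLE = False

def speak(text: str) -> None:
    if not KOKORO_AVAILABLE:
        return  # degrade gracefully instead of crashing
    ...
```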
Kiwi uses YAML-based locale files in `kiwi/locales/`. All user-facing strings, voice commands, wake word variants, hallucination filters, and security patterns are loaded from locale files.

Switch language:

```yaml
# config.yaml
language: "en"   # or "ru", etc.
```

Add a new language:
1. Copy `kiwi/locales/en.yaml` to `kiwi/locales/{lang}.yaml`
2. Translate all strings
3. Set `language: "{lang}"` in `config.yaml`
Currently shipped — 15 languages:

`ru` Russian | `en` English | `es` Spanish | `pt` Portuguese | `fr` French | `it` Italian | `de` German | `tr` Turkish | `pl` Polish | `zh` Chinese | `ja` Japanese | `ko` Korean | `hi` Hindi | `ar` Arabic | `id` Indonesian
Kiwi supports dynamic personality switching via markdown-based "souls" in `kiwi/souls/`. Each soul defines a system prompt overlay that shapes how Kiwi responds.

Built-in souls: Mindful Companion (default), Storyteller, Comedian, Hype Person, Siren (NSFW)

The base system prompt is loaded from `SOUL.md` in the project root. Soul personalities are layered on top of it.
Switch via voice command, Web UI (click the card), or API:

```bash
curl -X POST http://localhost:7789/api/soul -d '{"soul_id": "comedian"}'
```

The Siren (NSFW) soul routes to a separate OpenClaw agent with its own LLM model, configured in `config.yaml`:
```yaml
souls:
  default: "mindful-companion"
  nsfw:
    model: "openrouter/mistralai/mistral-7b-instruct"
    session: "kiwi-nsfw"
```

Kiwi includes a built-in REST API and a real-time web dashboard.
```
# Starts automatically with the service
http://localhost:7789/
```

Dashboard features:
- Live state orb — animated indicator that changes color and pulse speed with assistant state (idle / listening / thinking / speaking)
- Real-time event log — terminal-style feed of all system events via WebSocket
- Personality cards — horizontal carousel with holographic accents; click to activate, NSFW souls highlighted in ruby
- Speaker management — table with voiceprint priority badges, block/unblock/delete actions
- Controls — stop playback, reset context, restart/shutdown, TTS test
- Language switcher — change locale on the fly
- Web Microphone — talk to Kiwi directly from the browser via WebSocket audio streaming
API endpoints: `/api/status`, `/api/config`, `/api/speakers`, `/api/languages`, `/api/souls`, `/api/soul`, `/api/tts/test`, `/api/stop`, `/api/reset-context`, `/api/restart`, `/api/shutdown`, `/api/audio` (WebSocket for browser mic), plus `/api/events` for real-time event streaming.
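For instance, tailing the event stream from a script (a sketch, assuming `/api/events` speaks WebSocket and emits JSON messages, as the dashboard description suggests):

```python
import asyncio

import websockets

# Tail Kiwi's event feed; assumes /api/events is a WebSocket
# emitting JSON, per the dashboard description above.
async def tail_events() -> None:
    async with websockets.connect("ws://localhost:7789/api/events") as ws:
        async for message in ws:
            print(message)

asyncio.run(tail_events())
```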
Configure in `config.yaml`:

```yaml
api:
  enabled: true
  host: "0.0.0.0"
  port: 7789
```

By default the API is open (no auth). To secure it, enable token-based authentication with scopes:
```yaml
api:
  auth:
    enabled: true
    tokens:
      - token: "your-secret-token-here"
        name: "Home Assistant"
        scopes: ["read", "control", "tts"]
      - token: "admin-token-here"
        name: "Admin"
        scopes: ["read", "control", "tts", "speakers", "admin"]
```

Each token gets a set of scopes that control what it can access:
| Scope | Allows |
|---|---|
| `read` | GET endpoints — status, config, speakers, languages, souls |
| `control` | POST stop, reset-context, language, soul; PATCH config |
| `tts` | POST tts/test — speak arbitrary text |
| `speakers` | DELETE/block/unblock speaker profiles |
| `admin` | POST restart, shutdown |
Requests must include `Authorization: Bearer <token>`. Static files (`/`, `/static/*`) are always open.

Use `GET /api/auth/scopes` to discover what scopes a token has.

Override via env var: `KIWI_API_AUTH_ENABLED=true`
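For example, a read-scoped status check from Python (the token comes from the sample config above; assumes the requests package):

```python
import requests

# /api/status is a GET endpoint covered by the "read" scope.
resp = requests.get(
    "http://localhost:7789/api/status",
    headers={"Authorization": "Bearer your-secret-token-here"},
    timeout=5,
)
resp.raise_for_status()
print(resp.json())
```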
> **CLI mode caveat:** API scopes only apply to the REST API. When Kiwi falls back to CLI mode (direct `openclaw` subprocess calls during a WebSocket outage), exec approval is not available — that's an architectural boundary, not a bug. WebSocket mode is recommended for full security coverage.
The dashboard includes a Web Microphone that lets you talk to Kiwi directly from the browser — no local microphone setup or pyaudio installation needed.
How it works:

- The browser captures audio via AudioWorklet (low-latency, runs off the main thread)
- PCM Int16 audio is streamed to Kiwi over a WebSocket (`/api/audio`)
- Kiwi processes it through the same STT → wake word → LLM → TTS pipeline
- TTS responses are streamed back to the browser and played via AudioWorklet
Click the microphone button in the dashboard to connect. Volume bars show live audio level.
Configure in `config.yaml`:

```yaml
web_audio:
  enabled: true
  sample_rate: 16000
  max_clients: 3
```
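Since the wire format is just PCM Int16 over a WebSocket, a non-browser client can feed audio too. A hedged sketch (any handshake or framing beyond raw Int16 bytes is an assumption):

```python
import asyncio

import sounddevice as sd
import websockets

# Hedged sketch: /api/audio accepts PCM Int16 per the docs; any
# handshake beyond raw bytes is an assumption.
async def stream_mic() -> None:
    async with websockets.connect("ws://localhost:7789/api/audio") as ws:
        with sd.InputStream(samplerate=16000, channels=1,
                            dtype="int16") as mic:
            while True:
                frame, _ = mic.read(1600)   # 100 ms of audio
                await ws.send(frame.tobytes())

asyncio.run(stream_mic())
```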
Bidirectional integration: control Kiwi from the HA dashboard, and control your smart home by voice through Kiwi via the Conversation API.

Copy `custom_components/kiwi_voice/` to your HA `custom_components/` directory. Add the integration via the HA UI — enter host, port, and optionally an API token (if auth is enabled in `config.yaml`). The token can also be changed later in the integration options without re-pairing.
Entities: state sensor, language sensor, HA connection sensor, speakers count, uptime, listening switch, stop/reset/TTS buttons, TTS platform, voice control services. Entities are created based on token scopes — e.g. the stop button only appears if the token has the `control` scope.
Voice control: Say "Kiwi, turn on the lights" — the command is routed to the HA Conversation API and the response is spoken back. Configure in `config.yaml`:
```yaml
homeassistant:
  enabled: true
  url: "http://homeassistant.local:8123"
  token: ""   # Long-Lived Access Token
```
MIT — do whatever you want with it.

Built with 🥝 and too much coffee
