Kiwi Voice

OpenClaw voice assistant — ML wake word, speaker ID, voice-gated security, multi-provider TTS, and Apple Silicon support

Website · Documentation · Issues · OpenClaw Docs

What is Kiwi Voice?

Kiwi Voice is a real-time voice interface that turns OpenClaw into a hands-free assistant. It runs a continuous loop: capture audio from your microphone, detect the wake word (ML model or text matching), transcribe speech locally with Faster Whisper (or MLX Whisper on Apple Silicon), identify who is speaking, enforce voice-based security policies, send the request to any LLM through OpenClaw's WebSocket gateway, and speak the response back through one of five TTS providers.

Think of it as Alexa/Siri, but self-hosted, privacy-first, and plugged into your own AI stack.

Key Features

Feature	Description
🗣️ Wake Word	Text-based fuzzy matching or ML detection via OpenWakeWord — low-latency ONNX model on raw audio (~2% CPU)
🎭 Speaker ID	Voiceprint recognition via pyannote embeddings — knows who's talking
🔐 Voice Security	Priority hierarchy (Owner → Friend → Guest → Blocked) with Telegram approval for dangerous commands
🔊 Multi-Provider TTS	ElevenLabs (cloud), Piper (local/free), Qwen3-TTS (local GPU / RunPod), Kokoro ONNX (local/free, 14 voices)
⚡ Streaming TTS	Sentence-aware chunking — starts speaking before the LLM finishes
🛑 Barge-In	Interrupt the assistant mid-sentence by speaking over it
🧠 Auto-Learning	Automatically remembers new voices after first interaction
🔌 WebSocket	Native OpenClaw Gateway v3 protocol with delta/final streaming
🌍 Multi-Language	Built-in i18n with YAML locale files — switch language with a single config field
🍎 MLX Whisper	Optional Apple Silicon optimized STT via Lightning Whisper MLX (~10x faster on M-series)
🌐 Web Dashboard	Real-time glassmorphism dashboard with live status, event log, controls, and personality switching
🎙️ Web Microphone	Talk to Kiwi from any browser — WebSocket audio streaming with AudioWorklet, no local mic setup needed
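
As an illustration of the sentence-aware chunking mentioned under Streaming TTS, here is a minimal sketch. It assumes the LLM response arrives as a stream of text deltas and that a sentence ends at `.`, `!`, `?`, or `…` followed by whitespace. The function name `sentence_chunks` and the boundary heuristic are illustrative assumptions, not Kiwi's actual implementation.

```python
import re
from typing import Iterable, Iterator

# Assumed heuristic: a sentence ends at . ! ? or … followed by whitespace.
_BOUNDARY = re.compile(r"(?<=[.!?…])\s+")


def sentence_chunks(deltas: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a stream of deltas."""
    buffer = ""
    for delta in deltas:
        buffer += delta
        parts = _BOUNDARY.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        # The last part may still be growing, so keep it buffered.
        buffer = parts[-1]
    # Flush whatever is left when the stream ends.
    if buffer.strip():
        yield buffer.strip()
```

Each yielded sentence could be handed to the TTS provider immediately, which is what lets playback start before the LLM has finished.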

Architecture

Mic → [OpenWakeWord OR VAD + Energy Detection]
  → Faster Whisper STT (or MLX Whisper on Apple Silicon)
  → Wake Word Check (text fuzzy match or ML pre-detection)
  → Speaker ID (pyannote) → Priority Gate → Voice Security
  → OpenClaw Gateway (WebSocket) → LLM response stream
  → Real-time streaming TTS → Speaker Output (with barge-in)
  → Back to listening
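
The text-based wake word check in the pipeline above can be pictured roughly as follows. This is a sketch only: it uses the standard-library `difflib` for the fuzzy match, and the function name and the `min_ratio` cutoff are assumptions rather than Kiwi's real matching code.

```python
import re
from difflib import SequenceMatcher


def contains_wake_word(transcript: str, keyword: str = "kiwi",
                       min_ratio: float = 0.75) -> bool:
    """Return True if any word in the transcript is close enough to the keyword."""
    target = keyword.lower()
    words = re.findall(r"\w+", transcript.lower())
    return any(
        SequenceMatcher(None, word, target).ratio() >= min_ratio
        for word in words
    )
```

A tolerant match like this helps when the STT engine mishears the keyword slightly, at the cost of occasional false positives on similar-sounding words.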

Quick Start

Requirements

Python 3.10+
FFmpeg (for audio processing)
OpenClaw running locally
GPU with CUDA recommended (for STT & local TTS), but not required

Installation

git clone https://github.com/ekleziast/kiwi-voice.git
cd kiwi-voice

python -m venv venv
# Linux / macOS
source venv/bin/activate
# Windows
venv\Scripts\activate

pip install -r requirements.txt

Configuration

cp .env.example .env
# Fill in your API keys (ElevenLabs, RunPod, Telegram — all optional)

Edit config.yaml to match your setup:

# Language: controls UI strings, STT, TTS, wake word, and command patterns
language: "ru"               # ru | en (add more in kiwi/locales/)

# TTS provider: elevenlabs | piper | qwen3 | kokoro
tts:
  provider: "piper"          # Free, local, no API key needed

# STT engine
stt:
  engine: "faster-whisper"   # faster-whisper | mlx-whisper | elevenlabs
  model: "small"             # small = fast, large = accurate
  device: "cuda"             # cuda | cpu

# Wake word
wake_word:
  engine: "text"             # text (fuzzy match) | openwakeword (ML model)
  keyword: "kiwi"
  model: "hey_jarvis"        # OpenWakeWord model (built-in or path to .onnx)
  threshold: 0.5             # Detection sensitivity (0.0–1.0)

# Owner name (used for voice commands like "I'm <name>")
speaker_priority:
  owner:
    name: "Owner"            # Change to your name

Pair with OpenClaw Gateway (first time only)

On first launch, Kiwi registers as a new device with the Gateway. You need to approve the pairing request before it can connect:

Start Kiwi in one terminal:
```
python -m kiwi
```

In a second terminal, approve the pending device:

openclaw devices list      # see pending requests
openclaw devices approve   # approve the latest one

Kiwi will automatically reconnect once approved. This is a one-time step — the device identity is saved to device-identity.json and persists across restarts.

Note: If you skip device pairing, Kiwi will fall back to OpenClaw CLI mode, which is functional but slower due to process spawning overhead. WebSocket pairing is recommended for real-time voice interaction.

Run

python -m kiwi

Or use the launcher scripts (Windows only; on Linux and macOS, run the module directly):

# Windows
start.bat
.\start.ps1

# Linux / macOS
python -m kiwi

STT Engines

Engine	Quality	Latency	Cost	Local	GPU
Faster Whisper	Excellent	~1–3s	Free	Yes	Optional (CUDA)
ElevenLabs	Excellent	~0.3–0.5s	~$0.018/min	No	No
MLX Whisper	Excellent	~0.5–1s	Free	Yes	Apple Silicon

Switch engines in config.yaml or via environment variable:

KIWI_STT_ENGINE=elevenlabs python -m kiwi
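
The override order (environment variable first, then `config.yaml`) can be sketched like this. The helper `resolve_setting` is hypothetical and works on an already-parsed config dictionary; Kiwi's own config loader may be structured differently.

```python
import os
from typing import Any, Mapping, Tuple


def resolve_setting(config: Mapping[str, Any], path: Tuple[str, ...],
                    env_var: str,
                    environ: Mapping[str, str] = os.environ) -> Any:
    """Prefer a non-empty environment variable, else fall back to the config value."""
    value = environ.get(env_var)
    if value:
        return value
    node: Any = config
    for key in path:
        node = node[key]  # walk down the nested config, e.g. stt -> engine
    return node


config = {"stt": {"engine": "faster-whisper"}}
engine = resolve_setting(config, ("stt", "engine"), "KIWI_STT_ENGINE")
```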

TTS Providers

Provider	Quality	Latency	Cost	Requires local GPU
ElevenLabs	Excellent	~0.3–0.5s	~$0.30/1K chars	No
Qwen3-TTS (local)	High	~1–3s	Free	Yes (CUDA)
Qwen3-TTS (RunPod)	High	~2–5s	~$0.0003/sec	No
Kokoro ONNX	High	<0.5s	Free	No
Piper	Good	<0.5s	Free	No

Switch providers in config.yaml or via environment variable:

KIWI_TTS_PROVIDER=kokoro python -m kiwi

Kokoro ONNX is a free, fully local TTS provider with 14 voices at 24 kHz. Models auto-download on first use (~340 MB). It supports English, Japanese, Chinese, Korean, and several European languages. Russian is not yet supported.

OpenWakeWord (ML Wake Word Detection)

Instead of running full Whisper transcription to detect the wake word, you can use OpenWakeWord — a small ONNX model (~10MB) that listens to raw audio in real time with ~80ms latency and ~2% CPU usage.

wake_word:
  engine: "openwakeword"
  model: "hey_jarvis"        # Built-in: hey_jarvis, alexa, hey_mycroft
  threshold: 0.5
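
To show what `threshold` controls, here is a small sketch that turns a stream of per-frame model scores (0.0 to 1.0) into wake events. The cooldown that suppresses repeated triggers is an assumption added for illustration, and `detections` is a hypothetical name, not part of Kiwi or OpenWakeWord.

```python
from typing import Iterable, Iterator, Optional


def detections(scores: Iterable[float], threshold: float = 0.5,
               cooldown_frames: int = 10) -> Iterator[int]:
    """Yield frame indices where the score reaches the threshold.

    Frames inside the cooldown window after a hit are ignored so that one
    spoken wake word does not fire several times.
    """
    last_hit: Optional[int] = None
    for index, score in enumerate(scores):
        in_cooldown = last_hit is not None and index - last_hit <= cooldown_frames
        if score >= threshold and not in_cooldown:
            last_hit = index
            yield index
```

Raising the threshold reduces false activations but makes the detector harder to trigger; lowering it does the opposite.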

Train a custom wake word (e.g. "hey kiwi") using Google Colab — no voice recordings needed:

python scripts/train_wake_word.py --phrase "hey kiwi"

Voice Security

Kiwi identifies speakers by voiceprint and enforces a priority hierarchy:

OWNER (priority 0)   — Full access, cannot be blocked
FRIEND (priority 1)  — Dangerous commands require Telegram approval
GUEST (priority 2)   — All sensitive commands require approval
BLOCKED (priority 99) — Completely ignored
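
The hierarchy above maps naturally onto an integer enum. The following sketch assumes each command has already been flagged as `dangerous` and/or `sensitive`; those flags and the `gate` function are illustrative, not Kiwi's actual API.

```python
from enum import IntEnum


class Priority(IntEnum):
    OWNER = 0
    FRIEND = 1
    GUEST = 2
    BLOCKED = 99


def gate(priority: Priority, *, dangerous: bool = False,
         sensitive: bool = False) -> str:
    """Return 'allow', 'ask_owner', or 'ignore' for a speaker and command."""
    if priority is Priority.BLOCKED:
        return "ignore"          # blocked speakers are never processed
    if priority is Priority.OWNER:
        return "allow"           # the owner has full access
    if priority is Priority.FRIEND:
        return "ask_owner" if dangerous else "allow"
    # GUEST: anything sensitive or dangerous needs the owner's approval
    return "ask_owner" if (dangerous or sensitive) else "allow"
```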

Voice Commands

Command	Action
"Kiwi, remember my voice"	Register your voiceprint as owner
"Kiwi, this is my friend [name]"	Add someone as a friend
"Kiwi, block them"	Block the last speaker
"Kiwi, who is speaking?"	Identify the current speaker
"Kiwi, what voices do you know?"	List all known voiceprints

💡 Voice commands are language-dependent. Set language in config.yaml to match your locale. See kiwi/locales/*.yaml for the full command lists.

Two-Layer Security

Pre-filter (Kiwi) — catches dangerous commands before they reach the LLM:

Regex-based DangerousCommandDetector classifies commands as SAFE / WARNING / DANGEROUS / CRITICAL
Non-owner actionable commands are held until owner approves (by voice or Telegram)
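
A regex-based classifier of this kind might look like the sketch below. The four risk levels come from the description above, but the patterns are made up for illustration; the real security patterns are loaded from the locale files, and this `classify` function is not the actual `DangerousCommandDetector`.

```python
import re
from enum import IntEnum


class Risk(IntEnum):
    SAFE = 0
    WARNING = 1
    DANGEROUS = 2
    CRITICAL = 3


# Hypothetical patterns, for illustration only.
PATTERNS = [
    (Risk.CRITICAL, re.compile(r"\b(format|wipe)\b.*\b(disk|drive)\b", re.I)),
    (Risk.DANGEROUS, re.compile(r"\b(delete|remove)\b.*\b(files?|folders?)\b", re.I)),
    (Risk.WARNING, re.compile(r"\b(install|download|send)\b", re.I)),
]


def classify(command: str) -> Risk:
    """Return the highest risk level whose pattern matches the command."""
    matched = (risk for risk, pattern in PATTERNS if pattern.search(command))
    return max(matched, default=Risk.SAFE)
```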

Post-filter (OpenClaw) — catches dangerous shell commands the LLM tries to execute:

When the OpenClaw agent calls the exec tool, the Gateway broadcasts an exec.approval.requested event
Kiwi subscribes to this event and announces the command to the owner via voice
Owner approves/denies by voice ("allow" / "deny") or via Telegram inline buttons
Decision is sent back to OpenClaw via exec.approval.resolve
Auto-deny on timeout (55s) if no response

This means even if a voice command passes the pre-filter, the actual shell execution still requires approval through OpenClaw's own security layer.
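
The auto-deny behaviour can be sketched with a blocking queue: whichever channel answers first (voice or Telegram) puts a decision on the queue, and silence for the full timeout counts as a denial. The function name and the queue-based design are assumptions; the actual message format for `exec.approval.resolve` is not shown here.

```python
import queue


def wait_for_decision(decisions: "queue.Queue[str]",
                      timeout_s: float = 55.0) -> str:
    """Block until the owner answers, or deny automatically on timeout."""
    try:
        answer = decisions.get(timeout=timeout_s)
    except queue.Empty:
        return "deny"  # no response in time: fail closed
    # Anything other than an explicit "allow" is treated as a denial.
    return "allow" if answer == "allow" else "deny"
```

Failing closed on timeout and on unrecognized answers keeps an unattended assistant from executing commands by default.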

Telegram Approval

When a non-owner speaker issues a potentially dangerous command, Kiwi sends a confirmation request to the owner via Telegram. The owner can approve or deny it from their phone. Telegram is also used as a secondary channel for OpenClaw exec approvals.

Set KIWI_TELEGRAM_BOT_TOKEN and KIWI_TELEGRAM_CHAT_ID in .env to enable.
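
For reference, an approval message with inline buttons can be built against the Telegram Bot API `sendMessage` method as sketched below. The message wording, the `callback_data` format, and the helper name are assumptions for illustration; Kiwi's own Telegram client may differ.

```python
import json
import urllib.request


def build_approval_request(bot_token: str, chat_id: str, command: str,
                           request_id: str) -> urllib.request.Request:
    """Build (but do not send) a sendMessage call with Allow/Deny buttons."""
    payload = {
        "chat_id": chat_id,
        "text": f"Approve this command?\n{command}",
        "reply_markup": {
            "inline_keyboard": [[
                # callback_data format is hypothetical
                {"text": "Allow", "callback_data": f"allow:{request_id}"},
                {"text": "Deny", "callback_data": f"deny:{request_id}"},
            ]]
        },
    }
    return urllib.request.Request(
        f"https://api.telegram.org/bot{bot_token}/sendMessage",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it is then a matter of `urllib.request.urlopen(request, timeout=10)` with the values from `KIWI_TELEGRAM_BOT_TOKEN` and `KIWI_TELEGRAM_CHAT_ID`.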

Environment Variables

Variable	Description
`KIWI_ELEVENLABS_API_KEY`	ElevenLabs API key
`RUNPOD_API_KEY`	RunPod API key (for Qwen3-TTS serverless)
`RUNPOD_TTS_ENDPOINT_ID`	RunPod endpoint ID
`KIWI_TELEGRAM_BOT_TOKEN`	Telegram bot token (voice security)
`KIWI_TELEGRAM_CHAT_ID`	Telegram chat ID for approval messages
`KIWI_TTS_PROVIDER`	Override TTS provider (`elevenlabs`, `piper`, `qwen3`, `kokoro`)
`KIWI_WAKE_ENGINE`	Override wake word engine (`text`, `openwakeword`)
`KIWI_WAKE_MODEL`	Override OpenWakeWord model name or path
`KIWI_WAKE_THRESHOLD`	Override OpenWakeWord detection threshold
`KIWI_STT_ENGINE`	Override STT engine (`faster-whisper`, `mlx-whisper`, `elevenlabs`)
`KIWI_FFMPEG_PATH`	Custom FFmpeg path
`KIWI_LANGUAGE`	Override language/locale (`ru`, `en`, etc.)
`KIWI_DEBUG`	Enable debug logging
`LLM_MODEL`	Override LLM model

See .env.example for the full list.

Development

# Run tests
pytest tests/

# Code conventions:
# - Logging: kiwi_log("TAG", "message", level="INFO") — never print()
# - Paths: PROJECT_ROOT from kiwi package
# - Optional modules: try/except + *_AVAILABLE flags
# - Threads: daemon threads + crash protection
# - GPU: auto-detect CUDA with CPU fallback
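
Two of these conventions (optional modules behind `*_AVAILABLE` flags, and CUDA auto-detection with CPU fallback) can be combined in a short sketch. Using PyTorch for the detection is an assumption here, and `pick_device` is a hypothetical helper rather than a function from the kiwi package.

```python
try:
    import torch  # optional dependency
    TORCH_AVAILABLE = True
except ImportError:
    torch = None
    TORCH_AVAILABLE = False


def pick_device(preferred: str = "cuda") -> str:
    """Return 'cuda' only if it was requested and is actually usable."""
    if preferred == "cuda" and TORCH_AVAILABLE and torch.cuda.is_available():
        return "cuda"
    return "cpu"
```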

Multi-Language Support

Kiwi uses YAML-based locale files in kiwi/locales/. All user-facing strings, voice commands, wake word variants, hallucination filters, and security patterns are loaded from locale files.

Switch language:

# config.yaml
language: "en"   # or "ru", etc.

Add a new language:

Copy kiwi/locales/en.yaml to kiwi/locales/{lang}.yaml
Translate all strings
Set language: "{lang}" in config.yaml
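
When translating, it helps to check that the new locale has every key the English one has. The sketch below compares two already-parsed locale dictionaries recursively; the key names in the example are invented, and `missing_keys` is not a script shipped with Kiwi.

```python
from typing import Any, List, Mapping


def missing_keys(reference: Mapping[str, Any], translation: Mapping[str, Any],
                 prefix: str = "") -> List[str]:
    """List dotted key paths present in reference but absent from translation."""
    missing: List[str] = []
    for key, value in reference.items():
        path = f"{prefix}{key}"
        if key not in translation:
            missing.append(path)
        elif isinstance(value, Mapping) and isinstance(translation[key], Mapping):
            # Recurse into nested sections.
            missing.extend(missing_keys(value, translation[key], prefix=f"{path}."))
    return missing


# Hypothetical locale fragments
en = {"wake_word": {"variants": ["kiwi"]}, "commands": {"block": "block them", "who": "who is speaking"}}
de = {"commands": {"block": "blockiere sie"}}
print(missing_keys(en, de))
```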

Currently shipped — 15 languages:


`ru` Russian	`en` English	`es` Spanish	`pt` Portuguese
`fr` French	`it` Italian	`de` German	`tr` Turkish
`pl` Polish	`zh` Chinese	`ja` Japanese	`ko` Korean
`hi` Hindi	`ar` Arabic	`id` Indonesian

Soul System (Personalities)

Kiwi supports dynamic personality switching via markdown-based "souls" in kiwi/souls/. Each soul defines a system prompt overlay that shapes how Kiwi responds.

Built-in souls: Mindful Companion (default), Storyteller, Comedian, Hype Person, Siren (NSFW)

The base system prompt is loaded from SOUL.md in the project root. Soul personalities are layered on top of it.

Switch via voice command, Web UI (click the card), or API:

curl -X POST http://localhost:7789/api/soul -d '{"soul_id": "comedian"}'
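
The same call from Python, using only the standard library, might look like this. The sketch sets a JSON `Content-Type` header, which is an assumption about what the server accepts, and `build_soul_request` is an illustrative helper name.

```python
import json
import urllib.request


def build_soul_request(soul_id: str,
                       base_url: str = "http://localhost:7789") -> urllib.request.Request:
    """Build a POST /api/soul request that activates the given soul."""
    return urllib.request.Request(
        f"{base_url}/api/soul",
        data=json.dumps({"soul_id": soul_id}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_soul_request("comedian")
# To actually send it while Kiwi is running:
# urllib.request.urlopen(request, timeout=5)
```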

The Siren (NSFW) soul routes to a separate OpenClaw agent with its own LLM model, configured in config.yaml:

souls:
  default: "mindful-companion"
  nsfw:
    model: "openrouter/mistralai/mistral-7b-instruct"
    session: "kiwi-nsfw"

REST API & Web Dashboard

Kiwi includes a built-in REST API and a real-time web dashboard.

# Starts automatically with the service
http://localhost:7789/

Dashboard features:

Live state orb — animated indicator that changes color and pulse speed with assistant state (idle / listening / thinking / speaking)
Real-time event log — terminal-style feed of all system events via WebSocket
Personality cards — horizontal carousel with holographic accents; click to activate, NSFW souls highlighted in ruby
Speaker management — table with voiceprint priority badges, block/unblock/delete actions
Controls — stop playback, reset context, restart/shutdown, TTS test
Language switcher — change locale on the fly
Web Microphone — talk to Kiwi directly from the browser via WebSocket audio streaming

API endpoints: /api/status, /api/config, /api/speakers, /api/languages, /api/souls, /api/soul, /api/tts/test, /api/stop, /api/reset-context, /api/restart, /api/shutdown, /api/audio (WebSocket for browser mic), plus /api/events for real-time event streaming.

Configure in config.yaml:

api:
  enabled: true
  host: "0.0.0.0"
  port: 7789

API Authentication

By default the API is open (no auth). To secure it, enable token-based authentication with scopes:

api:
  auth:
    enabled: true
    tokens:
      - token: "your-secret-token-here"
        name: "Home Assistant"
        scopes: ["read", "control", "tts"]
      - token: "admin-token-here"
        name: "Admin"
        scopes: ["read", "control", "tts", "speakers", "admin"]

Each token gets a set of scopes that control what it can access:

Scope	Allows
`read`	GET endpoints — status, config, speakers, languages, souls
`control`	POST stop, reset-context, language, soul; PATCH config
`tts`	POST tts/test — speak arbitrary text
`speakers`	DELETE/block/unblock speaker profiles
`admin`	POST restart, shutdown

Requests must include Authorization: Bearer <token>. Static files (/, /static/*) are always open.

Use GET /api/auth/scopes to discover what scopes a token has.
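
To make the scope model concrete, here is a sketch of how a bearer token could be checked against a required scope. It reuses the example tokens from the config above; the function names and the constant-time comparison are illustrative choices, not a description of Kiwi's internal auth code.

```python
import hmac
from typing import Optional, Set

# Example tokens from the config above, mapped to their scopes.
TOKENS = {
    "your-secret-token-here": {"read", "control", "tts"},
    "admin-token-here": {"read", "control", "tts", "speakers", "admin"},
}


def scopes_for(authorization: Optional[str]) -> Set[str]:
    """Return the scopes granted by an 'Authorization: Bearer <token>' header value."""
    if not authorization or not authorization.startswith("Bearer "):
        return set()
    presented = authorization[len("Bearer "):].encode("utf-8")
    for token, scopes in TOKENS.items():
        # compare_digest avoids leaking token contents through timing
        if hmac.compare_digest(presented, token.encode("utf-8")):
            return set(scopes)
    return set()


def is_allowed(authorization: Optional[str], required_scope: str) -> bool:
    return required_scope in scopes_for(authorization)
```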

Override via env var: KIWI_API_AUTH_ENABLED=true

CLI mode caveat: API scopes only apply to the REST API. When Kiwi falls back to CLI mode (direct openclaw subprocess calls during WebSocket outage), exec approval is not available — that's an architectural boundary, not a bug. WebSocket mode is recommended for full security coverage.

Web Audio Streaming

The dashboard includes a Web Microphone that lets you talk to Kiwi directly from the browser — no local microphone setup or pyaudio installation needed.

How it works:

Browser captures audio via AudioWorklet (low-latency, runs off main thread)
PCM Int16 audio is streamed to Kiwi over a WebSocket (/api/audio)
Kiwi processes it through the same STT → wake word → LLM → TTS pipeline
TTS responses are streamed back to the browser and played via AudioWorklet
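
The PCM Int16 format used on the wire is easy to illustrate. The sketch below converts float samples in the range -1.0 to 1.0 into 16-bit integers and back; little-endian byte order and the helper names are assumptions, not confirmed details of the `/api/audio` protocol.

```python
import struct
from typing import List, Sequence


def float_to_pcm16(samples: Sequence[float]) -> bytes:
    """Clip samples to [-1.0, 1.0] and pack them as little-endian Int16."""
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def pcm16_to_float(data: bytes) -> List[float]:
    """Unpack little-endian Int16 bytes into floats in [-1.0, 1.0]."""
    count = len(data) // 2  # two bytes per sample; ignore a trailing odd byte
    values = struct.unpack(f"<{count}h", data[: count * 2])
    return [v / 32767 for v in values]
```

At a 16000 Hz sample rate, one second of mono audio in this format is 32000 bytes.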

Click the microphone button in the dashboard to connect. Volume bars show live audio level.

Configure in config.yaml:

web_audio:
  enabled: true
  sample_rate: 16000
  max_clients: 3

Home Assistant Integration

Bidirectional integration: control Kiwi from the HA dashboard, and control your smart home by voice through Kiwi via the HA Conversation API.

Copy custom_components/kiwi_voice/ to your HA custom_components/ directory. Add the integration via the HA UI — enter host, port, and optionally an API token (if auth is enabled in config.yaml). The token can also be changed later in the integration options without re-pairing.

Entities: state sensor, language sensor, HA connection sensor, speakers count, uptime, listening switch, stop/reset/TTS buttons, TTS platform, voice control services. Entities are created based on token scopes — e.g. the stop button only appears if the token has control scope.

Voice control: Say "Kiwi, turn on the lights" — the command is routed to HA Conversation API and the response is spoken back. Configure in config.yaml:

homeassistant:
  enabled: true
  url: "http://homeassistant.local:8123"
  token: ""  # Long-Lived Access Token
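
For orientation, a voice command could be forwarded to Home Assistant's REST endpoint `/api/conversation/process` roughly as sketched below. The helper names are hypothetical, and whether Kiwi uses this exact endpoint and response path internally is an assumption.

```python
import json
import urllib.request
from typing import Any, Dict


def build_conversation_request(base_url: str, token: str, text: str,
                               language: str = "en") -> urllib.request.Request:
    """Build a POST to Home Assistant's conversation endpoint."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/conversation/process",
        data=json.dumps({"text": text, "language": language}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # Long-Lived Access Token
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_speech(response: Dict[str, Any]) -> str:
    """Pull the spoken reply out of a conversation response, or '' if absent."""
    return (
        response.get("response", {})
        .get("speech", {})
        .get("plain", {})
        .get("speech", "")
    )
```

The string returned by `extract_speech` is what would be passed to the TTS provider and spoken back.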

License

MIT — do whatever you want with it.

Built with 🥝 and too much coffee
