
Add first-pass Windows Voice Mode#120

Open
NichUK wants to merge 105 commits into openclaw:master from NichUK:feature/voice-mode

Conversation


@NichUK NichUK commented Mar 30, 2026

Summary

This PR adds the first-pass Windows Voice Mode implementation to the tray app. It's by no means finished, but the first feature set is working. I apologise for the hugeness of the diff; there was quite a lot of experimentation and reversion along the way, so the net change is not quite as bad as it looks.

What works now

  • Talk Mode on Windows (via Windows STT)
  • direct voice send into the main chat session
  • assistant reply playback (not yet streamed); currently supports:
    • Windows TTS
    • Minimax
    • ElevenLabs
  • compact repeater window with:
    • transcript/reply display
    • pause/resume
    • response skip
    • settings access
    • repeater position/size persistence
  • tray icon state reflecting real listening readiness
  • configurable provider catalog for TTS/STT rather than hard-coded providers
  • use of the system default audio devices

What didn't work

I tried to fully integrate with the WebChat UI, but couldn't achieve it without nasty local DOM writes, which is very hacky. Also, the Windows STT (Windows.Media.SpeechRecognizer) works pretty well, but it has to control the entire audio pipeline, so we can't select an input device without changing the system default devices.
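For context, the device limitation above comes from the shape of the WinRT API itself. A minimal sketch of continuous dictation with Windows.Media.SpeechRecognizer might look like this (illustrative only; the PR's actual VoiceService wiring differs, and the chat-forwarding callback here is hypothetical):

```csharp
using System;
using System.Threading.Tasks;
using Windows.Media.SpeechRecognition;

static async Task RunTalkModeAsync(Action<string> sendToChat)
{
    // SpeechRecognizer owns the entire capture pipeline: the API exposes
    // no constructor or property for picking an input device, so it
    // always records from the system default microphone.
    var recognizer = new SpeechRecognizer();
    await recognizer.CompileConstraintsAsync(); // default dictation grammar

    recognizer.ContinuousRecognitionSession.ResultGenerated += (session, args) =>
    {
        // Forward recognized text into the normal chat path.
        sendToChat(args.Result.Text);
    };

    await recognizer.ContinuousRecognitionSession.StartAsync();
}
```

Because the recognizer opens the default capture endpoint itself, honouring a user-selected microphone would mean changing the Windows default device, which is what the PR avoids.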

Coming Next

  • Voice Wake implementation (WakeWord)
  • Push To Talk implementation
  • true streaming first-chunk TTS playback
  • true streaming STT using an AudioGraph pipeline
    • via cloud providers (OpenAI Whisper/ElevenLabs)
    • via local model (hosted in sherpa-onnx)
  • support for selecting non-default microphones/speakers for actual STT capture across all providers
  • voice control record parsing
  • central pronunciation dictionary support

Notes

I kept the architecture intentionally close to the existing tray/node model and documented the current and planned states, along with the architecture, in docs/VOICE-MODE.md. I also made as few touch points to the existing app as possible to minimise change risk.

Happy to receive notes/change requests before merging, etc., and attempt to deal with issues if anyone actually uses it! :)

NichUK added 30 commits March 23, 2026 01:34

NichUK and others added 5 commits April 3, 2026 16:06
Move voice-mode test-targeted logic out of the WinUI app and into a dedicated shared project so tray tests no longer need to reference OpenClaw.Tray.WinUI directly.

This restores the original CI assumption that the tray test project can be built on its own without transitively building a Windows App SDK application with an implicit architecture. It also keeps the voice/chat extraction scoped away from the broader OpenClaw.Shared library, which remains general-purpose and non-tray-specific.

The new OpenClaw.Tray.Shared project now contains the shared voice/chat surface used by both the tray app and tray tests, including voice transport helpers, provider catalog loading, cloud TTS support, chat coordination, and the web chat DOM bridge. The WinUI app retains the UI shell pieces, including DispatcherQueueAdapter and the app-level icon path helper.

As a follow-up cleanup during the extraction, split the previous IconHelper into AppIconHelper in the WinUI project and VoiceTrayIconHelper in the shared tray project so the new shared library stays focused on voice-related behavior rather than wider tray infrastructure.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Refactor tray voice code into OpenClaw.Tray.Shared

NichUK commented Apr 3, 2026

@shanselman, @copilot

I introduced a new OpenClaw.Tray.Shared project to separate tray voice/chat logic that needs unit coverage from the WinUI app shell.

Before this change, the tray test project had to reference OpenClaw.Tray.WinUI directly in order to test the new voice-mode code. That made the existing CI test job transitively build the WinUI app, which is what caused the ARM64 architecture-specific Windows App SDK failure. Moving the shared voice/chat logic into its own project let the tests exercise that code without having to build the WinUI application itself.

I kept OpenClaw.Shared unchanged because it is the broader repo-wide shared library (and built against net10), while this extracted code is tray-specific and still Windows-oriented (so net10-windows). The new project is intentionally narrower: it holds the tray voice/chat logic shared by the WinUI app and tray tests, while WinUI-only pieces such as dispatcher plumbing and app-level icon handling remain in OpenClaw.Tray.WinUI.

If this doesn't work for you and you'd prefer that code to go into OpenClaw.Shared with a changed TargetFramework instead, let me know and I'll refactor.

I'm also addressing the points raised by Repo Assist above. This stuff is f**kin' magic! :)


Cover the pure shared logic in VoiceProviderConfigurationStoreExtensions with focused unit tests for case-insensitive provider lookup, case-insensitive setting lookup, SetValue creation/update behavior, and removal of blank or null values.
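The store semantics those tests cover can be sketched roughly like this (illustrative names only; the real VoiceProviderConfigurationStoreExtensions API in the PR may differ):

```csharp
using System;
using System.Collections.Generic;

class ProviderSettings
{
    // Case-insensitive keys, so "ApiKey" and "apikey" resolve to the
    // same setting, matching the case-insensitive lookup behaviour above.
    readonly Dictionary<string, string> values =
        new(StringComparer.OrdinalIgnoreCase);

    public string? GetValue(string key) =>
        values.TryGetValue(key, out var v) ? v : null;

    // SetValue creates or updates an entry; a blank or null value
    // removes it entirely instead of storing an empty string.
    public void SetValue(string key, string? value)
    {
        if (string.IsNullOrWhiteSpace(value)) values.Remove(key);
        else values[key] = value;
    }
}
```

For example, `SetValue("APIKEY", "x")` followed by `GetValue("apikey")` returns `"x"`, while `SetValue("ApiKey", " ")` removes the entry.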

NichUK commented Apr 4, 2026

Increased test coverage as suggested

Add tests for voice provider configuration helpers

# Conflicts:
#	tests/OpenClaw.Shared.Tests/OpenClawGatewayClientTests.cs
@shanselman

Hi @NichUK, Copilot here replying on Scott's behalf.

Thanks again for the big push on this. I spent some time comparing the Windows approach here to the current Apple-side voice stack, and the high-level direction feels reasonable: keep OS/audio concerns local to the app, and keep OpenClaw/gateway responsible for normal chat/session routing.

  • OS speech I/O
    • Apple app: TalkModeRuntime / VoiceWakeRuntime use local OS audio + speech APIs, with local system/cloud playback.
    • This PR: VoiceService + STT routes + SpeechSynthesizer / MediaPlayer / cloud TTS.
    • Suggestion: This boundary is good. Voice should live in the node/tray app, not in the gateway.
  • Session transport
    • Apple app: Voice text is forwarded through the normal chat/session path and replies come back over standard chat events.
    • This PR: Same general approach via chat.send and reply playback.
    • Suggestion: Keep this. Avoid inventing a second voice-specific transport if possible.
  • UI/control split
    • Apple app: macOS separates runtime/control/session presentation (TalkModeRuntime, VoiceWakeRuntime, TalkModeController, VoiceSessionCoordinator).
    • This PR: Functional, but more centralized in VoiceService plus the repeater/settings UI.
    • Suggestion: Consider splitting responsibilities a bit more so runtime, control, and UI state are easier to reason about and test independently.
  • Feature scope
    • Apple app: Talk Mode and Voice Wake are separable pieces.
    • This PR: Lands runtime, provider plumbing, UI, and some future scaffolding together.
    • Suggestion: If you keep iterating here, treat Talk Mode as the first shippable slice and let Voice Wake / extra provider work follow in smaller steps.
  • Provider abstraction
    • Apple app: Concrete local runtimes with a few focused abstractions around playback.
    • This PR: The Windows catalog/provider model is flexible, but it introduces a lot of generalization early.
    • Suggestion: Keep the catalog idea, but try not to let provider complexity outrun the actual short-term use cases.

Overall, I think Voice Mode is a good idea for this repo — especially as a thin, optional local voice layer over the existing chat architecture. The main thing that gives me pause isn't the philosophy; it's the size/scope of the merge and how much it asks reviewers to reason about at once.

So my friendly suggestion would be:

  1. keep the “local audio, normal chat transport” boundary;
  2. sharpen the runtime/controller/session seams a bit more, closer to the Apple split;
  3. ship Talk Mode first if that helps reduce risk;
  4. keep the provider abstraction as light as possible until the surface stabilizes.

Promising direction overall — just worth keeping the architecture tight so it stays maintainable.


NichUK commented Apr 18, 2026

Apologies for the delay.
OK, I'll get back onto this, and probably create a new PR in the next week or so.

@github-actions

🤖 This is an automated response from Repo Assist.

@NichUK — no problem at all, and thanks for the update! It's great to hear you're planning a new PR. shanselman's architectural summary is really useful guidance to build from; a quick recap of the key points to keep in mind for the new PR:

  1. Keep the "local audio, normal chat transport" boundary — voice audio stays in the node/tray app; OpenClaw gateway handles routing as normal. This part of the architecture is already solid in the existing PR.

  2. Tighter runtime/controller/session seams — closer to the Apple split (TalkModeRuntime, TalkModeController, VoiceSessionCoordinator). Separating the state machine, runtime lifecycle, and UI state into distinct classes will make VoiceService.cs more manageable and testable going forward.

  3. Talk Mode as the first shippable slice — ship the core Talk Mode first; let Voice Wake / extra providers follow in smaller follow-up PRs. This reduces review scope significantly.

  4. Keep provider abstraction light — the catalog idea is good, but avoid generalising too far ahead of the actual use cases.

If it helps, the test patterns added in the previous iteration (VoiceCapabilityTests, VoiceModeSchemaTests, VoiceProviderConfigurationStoreExtensionsTests) are worth preserving in the new PR — they're a solid baseline.

Looking forward to the new PR! 👍

Generated by 🌈 Repo Assist.

To install this agentic workflow, run

gh aw add githubnext/agentics/workflows/repo-assist.md@97143ac59cb3a13ef2a77581f929f06719c7402a

@shanselman

Thanks again for the huge amount of work here, @NichUK. This PR sketches out a much broader Voice Mode vision: STT, TTS, repeater UI, provider plumbing, and WebChat integration.

We’re going to land the focused TTS capability slice via #253 first because it is smaller, tested, and easier to review safely. I see that as a foundation for this work rather than a rejection of it.

After #253 lands, could you rebase this branch and split the remaining Voice Mode pieces into smaller follow-up PRs? The most useful next slices would probably be:

  1. STT capability / command surface
  2. Voice or Talk Mode UX
  3. repeater window behavior
  4. WebChat bridge integration
  5. provider catalog/settings cleanup

Looping in @RBrid as well since #253 overlaps with the TTS portion. It would be great if the two of you can coordinate so the next PRs build cleanly on the shared tts.speak foundation.

shanselman pushed a commit that referenced this pull request May 1, 2026
Adds a focused Windows node text-to-speech capability as the first stable voice-support primitive.

- adds the shared `tts.speak` capability and MCP/gateway documentation
- wires Windows and ElevenLabs TTS behind opt-in tray settings
- protects the ElevenLabs API key with DPAPI
- adds shared and tray tests for capability behavior, settings, and ElevenLabs requests

This lands the focused TTS foundation from the broader Voice Mode discussion in #120 so remaining voice UX/STT/repeater work can build on top in smaller follow-up PRs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
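The DPAPI protection mentioned in that commit can be sketched like this (hypothetical helper names; the PR's actual settings code may structure this differently, and on modern .NET this requires the Windows-only System.Security.Cryptography.ProtectedData package):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static class ApiKeyProtection
{
    // Encrypts the key with DPAPI, bound to the current Windows user,
    // so the stored settings file never holds the plaintext secret.
    public static string Protect(string secret) =>
        Convert.ToBase64String(ProtectedData.Protect(
            Encoding.UTF8.GetBytes(secret),
            optionalEntropy: null,
            scope: DataProtectionScope.CurrentUser));

    // Only the same Windows user account can decrypt the stored value.
    public static string Unprotect(string stored) =>
        Encoding.UTF8.GetString(ProtectedData.Unprotect(
            Convert.FromBase64String(stored),
            optionalEntropy: null,
            scope: DataProtectionScope.CurrentUser));
}
```

The round trip is transparent to the tray app: settings persist the Base64 ciphertext, and the key is decrypted only when an ElevenLabs request is being built.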
@shanselman

#253 has landed on master, so the shared tts.speak foundation is now available.

The best next step for this broader Voice Mode work is to rebase on current master and split the remaining pieces into smaller follow-up PRs on top of that foundation. Thanks again, @NichUK — there’s a lot of useful direction here, and landing it in smaller layers should make it much easier to review and merge safely.
