
Add first-pass Windows Voice Mode#120

Open
NichUK wants to merge 105 commits into openclaw:master from NichUK:feature/voice-mode

Conversation


@NichUK NichUK commented Mar 30, 2026

Summary

This PR adds the first-pass Windows Voice Mode implementation to the tray app. It's by no means finished, but the first feature set is working. I apologise for the hugeness of the diff; there was quite a lot of experimentation and reversion along the way, so the net change is not quite as bad as it looks.

What works now

  • Talk Mode on Windows (via Windows STT)
  • direct voice send into the main chat session
  • assistant reply playback (not yet streamed); currently supports:
    • Windows TTS
    • Minimax
    • ElevenLabs
  • compact repeater window with:
    • transcript/reply display
    • pause/resume
    • response skip
    • settings access
    • repeater position/size persistence
  • tray icon state reflecting real listening readiness
  • configurable provider catalog for TTS/STT rather than hard-coded providers
  • use of the system default audio devices

What didn't work

I tried to fully integrate with the WebChat UI, but couldn't achieve it without nasty local DOM writes, which is very hacky. Also, the Windows STT (Windows.Media.SpeechRecognizer) works pretty well, but it has to control the entire audio pipeline, so we can't select an input device without changing the system default devices.
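For context, the device limitation above comes from the shape of the WinRT API itself. A minimal sketch of continuous dictation with Windows.Media.SpeechRecognizer might look like this (illustrative only; the PR's actual VoiceService wiring differs, and the chat-forwarding callback here is hypothetical):

```csharp
using System;
using System.Threading.Tasks;
using Windows.Media.SpeechRecognition;

static async Task RunTalkModeAsync(Action<string> sendToChat)
{
    // SpeechRecognizer owns the entire capture pipeline: the API exposes
    // no constructor or property for picking an input device, so it
    // always records from the system default microphone.
    var recognizer = new SpeechRecognizer();
    await recognizer.CompileConstraintsAsync(); // default dictation grammar

    recognizer.ContinuousRecognitionSession.ResultGenerated += (session, args) =>
    {
        // Forward recognized text into the normal chat path.
        sendToChat(args.Result.Text);
    };

    await recognizer.ContinuousRecognitionSession.StartAsync();
}
```

Because the recognizer opens the default capture endpoint itself, honouring a user-selected microphone would mean changing the Windows default device, which is what the PR avoids.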

Coming Next

  • Voice Wake implementation (WakeWord)
  • Push To Talk implementation
  • true streaming first-chunk TTS playback
  • true streaming STT using an AudioGraph pipeline
    • via cloud providers (OpenAI Whisper/ElevenLabs)
    • via local model (hosted in sherpa-onnx)
  • support for selecting non-default microphones/speakers for actual STT capture across all providers
  • voice control record parsing
  • central pronunciation dictionary support

Notes

I kept the architecture intentionally close to the existing tray/node model and documented the current and planned states, along with the architecture, in docs/VOICE-MODE.md. I also made as few touch points to the existing app as possible to minimise change risk.

Happy to receive notes/change requests before merging, etc., and attempt to deal with issues if anyone actually uses it! :)

NichUK added 30 commits March 23, 2026 01:34

NichUK and others added 5 commits April 3, 2026 16:06
Move voice-mode test-targeted logic out of the WinUI app and into a dedicated shared project so tray tests no longer need to reference OpenClaw.Tray.WinUI directly.

This restores the original CI assumption that the tray test project can be built on its own without transitively building a Windows App SDK application with an implicit architecture. It also keeps the voice/chat extraction scoped away from the broader OpenClaw.Shared library, which remains general-purpose and non-tray-specific.

The new OpenClaw.Tray.Shared project now contains the shared voice/chat surface used by both the tray app and tray tests, including voice transport helpers, provider catalog loading, cloud TTS support, chat coordination, and the web chat DOM bridge. The WinUI app retains the UI shell pieces, including DispatcherQueueAdapter and the app-level icon path helper.

As a follow-up cleanup during the extraction, split the previous IconHelper into AppIconHelper in the WinUI project and VoiceTrayIconHelper in the shared tray project so the new shared library stays focused on voice-related behavior rather than wider tray infrastructure.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Refactor tray voice code into OpenClaw.Tray.Shared

NichUK commented Apr 3, 2026

@shanselman, @copilot

I introduced a new OpenClaw.Tray.Shared project to separate tray voice/chat logic that needs unit coverage from the WinUI app shell.

Before this change, the tray test project had to reference OpenClaw.Tray.WinUI directly in order to test the new voice-mode code. That made the existing CI test job transitively build the WinUI app, which is what caused the ARM64 architecture-specific Windows App SDK failure. Moving the shared voice/chat logic into its own project let the tests exercise that code without having to build the WinUI application itself.

I kept OpenClaw.Shared unchanged because it is the broader repo-wide shared library (and built against net10), while this extracted code is tray-specific and still Windows-oriented (so net10-windows). The new project is intentionally narrower: it holds the tray voice/chat logic shared by the WinUI app and tray tests, while WinUI-only pieces such as dispatcher plumbing and app-level icon handling remain in OpenClaw.Tray.WinUI.

If this doesn't work for you and you'd prefer that code to go into OpenClaw.Shared with a changed TargetFramework instead, let me know and I'll refactor.

I'm also addressing the points raised by Repo Assist above. This stuff is f**kin' magic! :)


Cover the pure shared logic in VoiceProviderConfigurationStoreExtensions with focused unit tests for case-insensitive provider lookup, case-insensitive setting lookup, SetValue creation/update behavior, and removal of blank or null values.
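The store semantics those tests cover can be sketched roughly like this (illustrative names only; the real VoiceProviderConfigurationStoreExtensions API in the PR may differ):

```csharp
using System;
using System.Collections.Generic;

class ProviderSettings
{
    // Case-insensitive keys, so "ApiKey" and "apikey" resolve to the
    // same setting, matching the case-insensitive lookup behaviour above.
    readonly Dictionary<string, string> values =
        new(StringComparer.OrdinalIgnoreCase);

    public string? GetValue(string key) =>
        values.TryGetValue(key, out var v) ? v : null;

    // SetValue creates or updates an entry; a blank or null value
    // removes it entirely instead of storing an empty string.
    public void SetValue(string key, string? value)
    {
        if (string.IsNullOrWhiteSpace(value)) values.Remove(key);
        else values[key] = value;
    }
}
```

For example, `SetValue("APIKEY", "x")` followed by `GetValue("apikey")` returns `"x"`, while `SetValue("ApiKey", " ")` removes the entry.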

NichUK commented Apr 4, 2026

Increased test coverage as suggested

Add tests for voice provider configuration helpers

# Conflicts:
#	tests/OpenClaw.Shared.Tests/OpenClawGatewayClientTests.cs
@shanselman

Hi @NichUK, Copilot here replying on Scott's behalf.

Thanks again for the big push on this. I spent some time comparing the Windows approach here to the current Apple-side voice stack, and the high-level direction feels reasonable: keep OS/audio concerns local to the app, and keep OpenClaw/gateway responsible for normal chat/session routing.

  • OS speech I/O
    • Apple app: TalkModeRuntime / VoiceWakeRuntime use local OS audio + speech APIs, with local system/cloud playback.
    • This PR: VoiceService + STT routes + SpeechSynthesizer / MediaPlayer / cloud TTS.
    • Suggestion: This boundary is good. Voice should live in the node/tray app, not in the gateway.
  • Session transport
    • Apple app: Voice text is forwarded through the normal chat/session path and replies come back over standard chat events.
    • This PR: Same general approach via chat.send and reply playback.
    • Suggestion: Keep this. Avoid inventing a second voice-specific transport if possible.
  • UI/control split
    • Apple app: macOS separates runtime/control/session presentation (TalkModeRuntime, VoiceWakeRuntime, TalkModeController, VoiceSessionCoordinator).
    • This PR: Functional, but more centralized in VoiceService plus the repeater/settings UI.
    • Suggestion: Consider splitting responsibilities a bit more so runtime, control, and UI state are easier to reason about and test independently.
  • Feature scope
    • Apple app: Talk Mode and Voice Wake are separable pieces.
    • This PR: Lands runtime, provider plumbing, UI, and some future scaffolding together.
    • Suggestion: If you keep iterating here, treat Talk Mode as the first shippable slice and let Voice Wake / extra provider work follow in smaller steps.
  • Provider abstraction
    • Apple app: Concrete local runtimes with a few focused abstractions around playback.
    • This PR: The Windows catalog/provider model is flexible, but it introduces a lot of generalization early.
    • Suggestion: Keep the catalog idea, but try not to let provider complexity outrun the actual short-term use cases.

Overall, I think Voice Mode is a good idea for this repo — especially as a thin, optional local voice layer over the existing chat architecture. The main thing that gives me pause isn't the philosophy; it's the size/scope of the merge and how much it asks reviewers to reason about at once.

So my friendly suggestion would be:

  1. keep the “local audio, normal chat transport” boundary;
  2. sharpen the runtime/controller/session seams a bit more, closer to the Apple split;
  3. ship Talk Mode first if that helps reduce risk;
  4. keep the provider abstraction as light as possible until the surface stabilizes.

Promising direction overall — just worth keeping the architecture tight so it stays maintainable.


NichUK commented Apr 18, 2026

Apologies for the delay.
OK, I'll get back onto this, and probably create a new PR in the next week or so.

@github-actions

🤖 This is an automated response from Repo Assist.

@NichUK — no problem at all, and thanks for the update! It's great to hear you're planning a new PR. shanselman's architectural summary is really useful guidance to build from; a quick recap of the key points to keep in mind for the new PR:

  1. Keep the "local audio, normal chat transport" boundary — voice audio stays in the node/tray app; OpenClaw gateway handles routing as normal. This part of the architecture is already solid in the existing PR.

  2. Tighter runtime/controller/session seams — closer to the Apple split (TalkModeRuntime, TalkModeController, VoiceSessionCoordinator). Separating the state machine, runtime lifecycle, and UI state into distinct classes will make VoiceService.cs more manageable and testable going forward.

  3. Talk Mode as the first shippable slice — ship the core Talk Mode first; let Voice Wake / extra providers follow in smaller follow-up PRs. This reduces review scope significantly.

  4. Keep provider abstraction light — the catalog idea is good, but avoid generalising too far ahead of the actual use cases.

If it helps, the test patterns added in the previous iteration (VoiceCapabilityTests, VoiceModeSchemaTests, VoiceProviderConfigurationStoreExtensionsTests) are worth preserving in the new PR — they're a solid baseline.

Looking forward to the new PR! 👍

Generated by 🌈 Repo Assist.

To install this agentic workflow, run

gh aw add githubnext/agentics/workflows/repo-assist.md@97143ac59cb3a13ef2a77581f929f06719c7402a

@shanselman

Thanks again for the huge amount of work here, @NichUK. This PR sketches out a much broader Voice Mode vision: STT, TTS, repeater UI, provider plumbing, and WebChat integration.

We’re going to land the focused TTS capability slice via #253 first because it is smaller, tested, and easier to review safely. I see that as a foundation for this work rather than a rejection of it.

After #253 lands, could you rebase this branch and split the remaining Voice Mode pieces into smaller follow-up PRs? The most useful next slices would probably be:

  1. STT capability / command surface
  2. Voice or Talk Mode UX
  3. repeater window behavior
  4. WebChat bridge integration
  5. provider catalog/settings cleanup

Looping in @RBrid as well since #253 overlaps with the TTS portion. It would be great if the two of you can coordinate so the next PRs build cleanly on the shared tts.speak foundation.

shanselman pushed a commit that referenced this pull request May 1, 2026
Adds a focused Windows node text-to-speech capability as the first stable voice-support primitive.

- adds the shared `tts.speak` capability and MCP/gateway documentation
- wires Windows and ElevenLabs TTS behind opt-in tray settings
- protects the ElevenLabs API key with DPAPI
- adds shared and tray tests for capability behavior, settings, and ElevenLabs requests

This lands the focused TTS foundation from the broader Voice Mode discussion in #120 so remaining voice UX/STT/repeater work can build on top in smaller follow-up PRs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
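The DPAPI protection mentioned in that commit can be sketched like this (hypothetical helper names; the PR's actual settings code may structure this differently, and on modern .NET this requires the Windows-only System.Security.Cryptography.ProtectedData package):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static class ApiKeyProtection
{
    // Encrypts the key with DPAPI, bound to the current Windows user,
    // so the stored settings file never holds the plaintext secret.
    public static string Protect(string secret) =>
        Convert.ToBase64String(ProtectedData.Protect(
            Encoding.UTF8.GetBytes(secret),
            optionalEntropy: null,
            scope: DataProtectionScope.CurrentUser));

    // Only the same Windows user account can decrypt the stored value.
    public static string Unprotect(string stored) =>
        Encoding.UTF8.GetString(ProtectedData.Unprotect(
            Convert.FromBase64String(stored),
            optionalEntropy: null,
            scope: DataProtectionScope.CurrentUser));
}
```

The round trip is transparent to the tray app: settings persist the Base64 ciphertext, and the key is decrypted only when an ElevenLabs request is being built.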
@shanselman

#253 has landed on master, so the shared tts.speak foundation is now available.

The best next step for this broader Voice Mode work is to rebase on current master and split the remaining pieces into smaller follow-up PRs on top of that foundation. Thanks again, @NichUK — there’s a lot of useful direction here, and landing it in smaller layers should make it much easier to review and merge safely.
