mojovoice Project Roadmap

Status: Active Development | Last Updated: 2026-02-08

Project roadmap and feature planning.

Version Roadmap

v0.6.0 - UI/UX Overhaul & Platform Maturity

Theme: Production-ready design and best-in-class user experience

Priority Order:

UI/UX Overhaul (Critical - Comprehensive redesign)

Settings Reorganization:
- Collapse advanced options (4 primary settings visible vs current 12)
- Voice Recognition section (model + language dropdown)
- Recording section (duration + audio device selection)
- Behavior section (text options)
- Appearance section (UI scale)
- Advanced section (collapsed by default for power users)
- Smart restart notifications (only show when daemon settings change)
Models Tab (New Navigation Item):
- Separate model management from Settings
- Model library browser with download/cancel/delete
- Visual download progress with real-time stats
- Model cards with speed/quality indicators
Dev Coffee Brand Consistency:
- Apply bracket frames, subtle glows, technical grids across all components
- Typography hierarchy using JetBrains Mono for headings
- Purposeful emerald green accents (not decorative)
- Near-black backgrounds with glass layers
- Scanlines and retro-tech aesthetic refinements
Dashboard Improvements:
- Better visual hierarchy and information density
- Enhanced status visualization
- Recording feedback with waveform visualization
- Clearer daemon/GPU/model status display
Model Setup Wizard (First-Run Experience):
- Hardware detection (GPU type, VRAM, disk space)
- Model recommendation based on capabilities
- Guided download with progress feedback
- Test recording/transcription in wizard
- Auto-configure environment
Reference Documents:
- docs/project/2026-01-14-settings-ui-redesign.md
- docs/context/dev-coffee-brand-style-guide.md
Linux Automated Testing (Critical - Reliability)
- CI testing on Ubuntu, Fedora, Arch
- Validate CUDA installation across distros
- Test audio capture on different audio servers (PipeWire, PulseAudio, ALSA)
- Integration tests for Waybar/systemd
- Distro-specific packaging validation
Cross-Platform Dictation Indicator (High - Consistency)
- System tray icon with recording state (all platforms)
- Visual feedback during recording (waveform/pulsing indicator)
- Status overlay (OSD) support
- Consistent behavior on Linux/macOS/Windows
CUDA Multi-Distro Support (High - Remove friction)
- Automated CUDA detection and fallback
- Distro-specific CUDA path handling
- Better error messages for missing CUDA
- Documentation for each major distro
macOS Unsigned .dmg (Medium - Good enough for v0.6.0)
- Improve .dmg with better first-run experience
- Handle permission prompts gracefully
- Clear instructions for Accessibility/Microphone permissions

Deferred to v0.6.1+:

macOS signed installer with code signing/notarization (requires $99/year Apple Developer account)

v0.7.0 - Platform Expansion

Theme: AMD and Windows support

Features:

AMD ROCm Support (Critical)
- ROCm GPU acceleration for AMD graphics cards
- Automated ROCm detection and fallback
- Testing on popular AMD GPUs
Windows Support (Critical)
- Full Windows 10/11 support
- Audio capture validation
- Text injection via Windows APIs
- Installer with proper permissions
Polybar Integration (High)
- Status module for X11/i3/bspwm users
- Mirrors Waybar integration pattern

v0.8.0 - Intelligence & Performance

Theme: Smarter transcription and quality-of-life improvements

Features:

Context-Aware Vocabulary
- Detect active file type (.rs, .py, .ts, .md)
- Auto-bias toward relevant technical terms
- Project-specific vocabulary learning
Speculative Decoding
- Draft model integration for 30-50% speedup
- Research correct Whisper API implementation
Mojo-Audio 0.1.2
- Performance improvements
- Better audio preprocessing
Dynamic Prompt Biasing
- UI for adjusting biasing prompts
- Preset vocabularies (Rust dev, Python dev, DevOps, etc.)
- Learn from correction history
General QoL Improvements
- Voice commands ("undo last", "new line", "format code")
- Better error recovery
- Performance optimizations

Completed Features (v0.5.x)

✅ Desktop App CI Builds - Automated Tauri builds for Linux/macOS
✅ Documentation Overhaul - Complete README rewrite
✅ Performance Benchmarking - mojovoice benchmark command
✅ Community Infrastructure - LICENSE, CONTRIBUTING.md
✅ Model Management UI - Download, switch, delete with visual meters
✅ Transcription History - Searchable persistent history
✅ Audio Device Selection - UI-based device picker
✅ Daemon Subcommands - up, down, restart, status, logs

Future Features Under Consideration

Features requiring further research, evaluation, or design decisions before commitment.

Speaker Diarization Integration

Status: Under Review Research Date: 2026-01-23

Overview: Microsoft's VibeVoice-ASR offers unified ASR + speaker diarization + timestamps, but has significant integration barriers:

VibeVoice-ASR Analysis:

✅ Unified pipeline (ASR + diarization + timestamps in one pass)
✅ Long-form support (up to 60 minutes continuous)
✅ Hotword support for domain-specific vocabulary
❌ 9B parameters (vs Whisper Large-v3 Turbo ~1.5B) - significantly higher resource requirements
❌ Python-only implementation (no Rust bindings)
❌ Incompatible architecture (Qwen2.5-based vs Whisper encoder-decoder)
❌ Uses 7.5 Hz speech tokenizers (incompatible with mojo-audio mel spectrogram pipeline)
❌ Limited language support (English/Chinese only vs Whisper's 99 languages)

Alternative Approaches:

pyannote-audio (Rust-friendly via ONNX export)
WhisperX (adds diarization to Whisper, similar integration pattern)
Keep Whisper for ASR, add separate lightweight diarization module

Decision Needed:

Is speaker diarization a core use case for dev-voice?
What performance/resource trade-offs are acceptable?
Should this wait for better Rust-native solutions?

Reference: https://huggingface.co/microsoft/VibeVoice-ASR

Text-to-Speech Integration (Qwen3 TTS)

Status: Under Review Research Date: 2026-01-23

Overview: Qwen's newly released Qwen3-TTS models offer text-to-speech capabilities that could enable voice feedback and accessibility features in mojovoice.

Qwen3-TTS Model Family:

✅ Multiple model sizes (0.6B, 1.7B parameters)
✅ Custom voice support (CustomVoice variants)
✅ Voice design capabilities (VoiceDesign variants)
✅ 12Hz sampling rate
✅ Natural-sounding speech synthesis
⚠️ Very new release (hours old - maturity unknown)
⚠️ Python-first implementation (Rust integration TBD)
⚠️ Resource requirements unclear for real-time use

Potential Use Cases:

Audio feedback (read back transcriptions for verification)
Accessibility features (voice confirmation of commands)
Testing infrastructure (generate synthetic audio for STT pipeline testing)
Future voice assistant features (bidirectional voice interaction)

Decision Needed:

Is TTS a core feature for dev-voice or scope creep?
What's the priority vs existing STT improvements?
Should we wait for more mature Rust bindings?
Which variant/size is optimal for developer workflow use?

Reference: https://huggingface.co/collections/Qwen/qwen3-tts

Ideas & Future Exploration

Advanced Features (Post v0.8.0)

Custom wake word for hands-free mode
Multi-language support testing (Spanish, French, German, etc.)
Noise cancellation preprocessing (DeepFilterNet)
IDE plugins (VSCode, Neovim, JetBrains)
Mobile companion app (trigger from phone)

Community Infrastructure (Ongoing)

🤖 Generated with Claude Code

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mojovoice Project Roadmap

Version Roadmap

v0.6.0 - UI/UX Overhaul & Platform Maturity

v0.7.0 - Platform Expansion

v0.8.0 - Intelligence & Performance

Completed Features (v0.5.x)

Future Features Under Consideration

Speaker Diarization Integration

Text-to-Speech Integration (Qwen3 TTS)

Ideas & Future Exploration

Advanced Features (Post v0.8.0)

Community Infrastructure (Ongoing)

FilesExpand file tree

roadmap.md

Latest commit

History

roadmap.md

File metadata and controls

mojovoice Project Roadmap

Version Roadmap

v0.6.0 - UI/UX Overhaul & Platform Maturity

v0.7.0 - Platform Expansion

v0.8.0 - Intelligence & Performance

Completed Features (v0.5.x)

Future Features Under Consideration

Speaker Diarization Integration

Text-to-Speech Integration (Qwen3 TTS)

Ideas & Future Exploration

Advanced Features (Post v0.8.0)

Community Infrastructure (Ongoing)