# Speech-to-Text Enhancements: Dual-Engine Voice Dictation Upgrade

- Upgraded to **dual-engine ASR**: Web Speech API + Moonshine Base ONNX with consensus scoring
- Added **auto-punctuation by default**: AI refinement (`aiRefineEnabled = true`) with a 5-second timeout
- Built-in **punctuation fallback** (capitalize + period) when no LLM is loaded
- Added a **streaming partial results** handler that shows `🧠 {text}` in the status bar during decoding
- Expanded voice commands from ~25 to **50+ patterns** with natural ASR-friendly phrases
- Added **"heading one"/"title"** as primary heading commands (more reliable than "hash")
- New formatting commands: **strikethrough**, **highlight** (start/end pairs)
- New structure commands: **"add table"**, **"add link"**, **"add image"**, **"divider"**
- New editing commands: **"undo"/"take that back"**, **"delete that"/"remove that"**
- New punctuation: **"ellipsis"**, **"open/close quote"**, **"at sign"**, **"bang"**
- Added a **stronger hallucination filter**: rejects output longer than 100 words or with more than 30% non-ASCII characters
- Improved **model loading progress**: step-by-step [1/4]…[4/4] with file sizes in MB
- Added the `onnxruntime-web` dependency and a Moonshine Medium worker (experimental, kept as reference)
- Fixed: WASM loading error by pointing `ort.env.wasm.wasmPaths` to the jsDelivr CDN
- Fixed: AI refinement blocking text insertion (added `Promise.race` 5s timeout + double safety net)
- Updated the cheat sheet to show natural phrases as primary commands
- Updated the voice dictation feature description in the README

---

## Summary

Enhanced the speech-to-text system with auto-punctuation, expanded voice commands with natural ASR-friendly phrases, stronger hallucination filtering, and streaming partial result display. The dual-engine consensus approach (Web Speech API + Moonshine Base ONNX) is retained with improved reliability.

---

## 1. Auto-Punctuation & AI Refinement
**Files:** `js/speechToText.js`
**What:** Enabled `aiRefineEnabled = true` by default. When a Qwen model is loaded, speech text is sent through the LLM for punctuation, capitalization, and grammar cleanup. Added a `Promise.race` with a 5-second timeout so that AI refinement cannot block text insertion. Added an `addBasicPunctuation()` fallback that capitalizes the first letter and appends a period when no LLM is available.
**Impact:** Every speech segment now receives at least basic punctuation and capitalization, either through LLM refinement or the built-in rules.

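
The timeout and fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code in `js/speechToText.js`: `refineWithTimeout` and its `refine` argument are hypothetical names, and the body of `addBasicPunctuation()` is an assumption based only on the description (capitalize the first letter, add a period).

```javascript
// Sketch only: the real implementation in js/speechToText.js may differ.

// Fallback rule: capitalize the first letter and append a period unless the
// segment already ends in terminal punctuation (that last check is an assumption).
function addBasicPunctuation(text) {
  const trimmed = text.trim();
  if (trimmed === '') return trimmed;
  const capitalized = trimmed[0].toUpperCase() + trimmed.slice(1);
  return /[.!?…]$/.test(capitalized) ? capitalized : capitalized + '.';
}

// Race a (hypothetical) async refiner against a timer so that text insertion
// never waits longer than timeoutMs. `refine` must return a Promise<string>.
async function refineWithTimeout(text, refine, timeoutMs = 5000) {
  if (typeof refine !== 'function') return addBasicPunctuation(text);
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(null), timeoutMs);
  });
  try {
    const refined = await Promise.race([refine(text), timeout]);
    // null means the timer won; an empty string means nothing usable came back.
    return refined ? refined : addBasicPunctuation(text);
  } catch {
    // The refiner rejected: still return something insertable.
    return addBasicPunctuation(text);
  } finally {
    clearTimeout(timer);
  }
}
```

Taking the timeout as a parameter keeps the sketch easy to exercise with short delays; the default matches the 5-second value stated above.
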
## 2. Expanded Voice Commands
**Files:** `js/speechToText.js`
**What:** Rewrote `applyMarkdownCommands()` with 50+ regex patterns organized by category (headings, formatting, structure, links/media, table, punctuation, editing). Added natural ASR-friendly aliases for every command: "heading one"/"title" instead of just "hash", "bullet"/"add bullet" instead of just "bullet point", "undo"/"take that back" for undo, and "delete that" to discard text. Multi-word patterns are processed before single-word patterns to prevent partial matches.
**Impact:** Voice commands are significantly more reliable, because ASR engines recognize natural phrases like "heading one" far better than "hash".

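
To illustrate why multi-word patterns are applied first, here is a small sketch with a handful of the phrases named above. The table, the Markdown each phrase maps to, and the name `applyCommandsSketch` are assumptions for illustration; the real `applyMarkdownCommands()` covers many more patterns and may be structured differently.

```javascript
// Hypothetical command table: spoken phrase -> inserted Markdown prefix.
const COMMANDS = [
  ['bullet', '- '],
  ['bullet point', '- '],
  ['add bullet', '- '],
  ['heading one', '# '],
  ['title', '# '],
];

// Escape regex metacharacters so a phrase can be embedded in a pattern.
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Sort phrases with more words first, so "bullet point" is consumed before
// the single-word "bullet" gets a chance to match part of it.
const PATTERNS = COMMANDS
  .slice()
  .sort((a, b) => b[0].split(' ').length - a[0].split(' ').length)
  .map(([phrase, output]) => {
    const body = escapeRegExp(phrase).replace(/ /g, '\\s+');
    // Case-insensitive, whole words only; also swallow trailing whitespace.
    return [new RegExp(`\\b${body}\\b\\s*`, 'gi'), output];
  });

function applyCommandsSketch(text) {
  return PATTERNS.reduce(
    (acc, [pattern, output]) => acc.replace(pattern, () => output),
    text
  );
}
```

Without the sort, "bullet point milk" would match "bullet" first and leave a stray "point" behind. A sketch this simple also rewrites every occurrence of a word like "title", so a real implementation presumably needs extra context rules.
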
## 3. Stronger Hallucination Filter
**Files:** `js/speechToText.js`
**What:** Enhanced `isHallucination()` to reject outputs longer than 100 words (likely gibberish) and outputs in which more than 30% of characters are non-ASCII (typical of garbage multilingual hallucinations).
**Impact:** Prevents garbage model outputs from being inserted into the editor.

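
The two thresholds can be expressed as a short predicate. This is a sketch under stated assumptions, not the actual `isHallucination()`: the function name `looksLikeHallucination`, the whitespace-based word split, and counting spaces in the character total are all choices made here for illustration.

```javascript
// Sketch of the two rejection rules: word-count cap and non-ASCII ratio.
// Thresholds default to the values stated in the changelog.
function looksLikeHallucination(text, maxWords = 100, maxNonAsciiRatio = 0.3) {
  const trimmed = text.trim();
  if (trimmed === '') return false; // nothing to judge

  // Rule 1: more than maxWords whitespace-separated words.
  const wordCount = trimmed.split(/\s+/).length;
  if (wordCount > maxWords) return true;

  // Rule 2: share of non-ASCII code points above the ratio.
  // Array.from splits by code point, so surrogate pairs count once.
  const chars = Array.from(trimmed);
  const nonAscii = chars.filter((ch) => ch.codePointAt(0) > 127).length;
  return nonAscii / chars.length > maxNonAsciiRatio;
}
```

A ratio-based rule like this also rejects legitimate dictation in non-Latin scripts, so it fits best when the expected output is mostly ASCII text.
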
## 4. Streaming Partial Results
**Files:** `js/speechToText.js`, `js/moonshine-medium-worker.js`
**What:** Added a handler for the `'partial'` message type from the worker; it displays `🧠 {text}` in the status bar as tokens are decoded. The worker sends a partial result every 3 tokens during autoregressive decoding.
**Impact:** Users see real-time feedback of what the model is transcribing (when a streaming-capable model is used).

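
The worker-to-page flow above might look something like the sketch below. Only the `'partial'` message type and the `🧠 {text}` display come from the description; the `text` field, the `'result'` message type, and both function names are hypothetical, and the functions take callbacks so they can run outside a real Worker.

```javascript
// Worker side (sketch): report progress after every `every` decoded tokens.
// `post` stands in for the worker's postMessage.
function emitPartials(tokens, post, every = 3) {
  const decoded = [];
  tokens.forEach((token, index) => {
    decoded.push(token);
    if ((index + 1) % every === 0) {
      post({ type: 'partial', text: decoded.join('') });
    }
  });
  // Hypothetical final message carrying the full transcript.
  post({ type: 'result', text: decoded.join('') });
}

// Page side (sketch): show partial text in the status bar.
// Returns true when the message was a partial and has been handled.
function handleWorkerMessage(message, setStatus) {
  if (message.type === 'partial') {
    setStatus(`🧠 ${message.text}`);
    return true;
  }
  return false;
}
```

Emitting every few tokens rather than on every token keeps message traffic between the worker and the page low while still feeling live.
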
## 5. Model Loading Progress
**Files:** `js/speechToText.js`, `js/moonshine-medium-worker.js`
**What:** Progress messages now show step numbers [1/4]…[4/4], download vs. initialization phases, device info (GPU/CPU), and file sizes in MB. The worker reports the detected execution provider (WebGPU/WASM) when it is ready.
**Impact:** Users see exactly what is happening during the ~300 MB model download.

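
A step-numbered progress line of the kind described could be built with a helper like this. The function name, the exact wording of the message, and the choice of 1 MB = 1024 × 1024 bytes are assumptions; the source only specifies the `[n/4]` step prefix and sizes in MB.

```javascript
// Hypothetical formatter for loading-progress messages.
function formatProgress(step, totalSteps, label, bytes) {
  const prefix = `[${step}/${totalSteps}] ${label}`;
  if (typeof bytes !== 'number') return prefix; // no size known for this step
  const megabytes = bytes / (1024 * 1024); // assumes binary megabytes
  return `${prefix} (${megabytes.toFixed(1)} MB)`;
}
```

For example, `formatProgress(1, 4, 'Downloading encoder', 52428800)` yields `[1/4] Downloading encoder (50.0 MB)`.
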
## 6. Moonshine Medium Worker (Experimental)
**Files:** `js/moonshine-medium-worker.js` [NEW], `package.json`
**What:** Created a custom ONNX Runtime worker for the Moonshine Medium Streaming model with a 3-model architecture (encoder, decoder, decoder_with_past), KV cache management, and WebGPU auto-detection with a WASM fallback. Added the `onnxruntime-web` dependency. The community model (`Mazino0/moonshine-streaming-medium-onnx`) produced garbage output, so the system continues to use the proven Moonshine Base model via the Transformers.js pipeline.
**Impact:** The worker is architecturally complete and is kept as a reference for when a reliable medium ONNX model becomes available.

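
One way to do the WebGPU auto-detection with WASM fallback is sketched below. It is not taken from `js/moonshine-medium-worker.js`: the function name is hypothetical, and the only platform call used is the standard `navigator.gpu.requestAdapter()`. The returned strings match the execution provider names `onnxruntime-web` uses (`'webgpu'`, `'wasm'`); how the worker passes them to the runtime is not shown in the source. The `nav` parameter exists so the sketch can be tried without a browser.

```javascript
// Sketch: prefer WebGPU when an adapter is actually available, else WASM.
async function pickExecutionProvider(nav = globalThis.navigator) {
  try {
    if (nav && nav.gpu) {
      // requestAdapter() resolves to null when no suitable GPU is found.
      const adapter = await nav.gpu.requestAdapter();
      if (adapter) return 'webgpu';
    }
  } catch {
    // Any failure while probing the GPU falls through to WASM.
  }
  return 'wasm';
}
```
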
---

## Files Changed (5 total)

| File | Lines Changed | Description |
|------|:---:|------|
| `js/speechToText.js` | +120 −50 | Auto-punctuation, expanded voice commands, hallucination filter, streaming handler |
| `js/moonshine-medium-worker.js` | +349 | New custom ONNX Runtime worker (experimental reference) |
| `js/moonshine-worker.js` | +0 −0 | Existing Moonshine Base worker (unchanged, re-added to git) |
| `package.json` | +1 | Added `onnxruntime-web` dependency |
| `README.md` | +2 −2 | Updated voice dictation description and release notes |