Textagent
diff --git a/CHANGELOG-speech-enhancements.md b/CHANGELOG-speech-enhancements.md
Lines changed: 69 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 3 additions & 3 deletions
diff --git a/ai-worker-gemini.js b/ai-worker-gemini.js
Lines changed: 26 additions & 7 deletions
diff --git a/ai-worker-groq.js b/ai-worker-groq.js
Lines changed: 22 additions & 3 deletions
diff --git a/ai-worker-openrouter.js b/ai-worker-openrouter.js
Lines changed: 21 additions & 3 deletions
@@ -0,0 +1,69 @@
+# Speech-to-Text Enhancements — Dual-Engine Voice Dictation Upgrade
+
+- Upgraded to **dual-engine ASR**: Web Speech API + Moonshine Base ONNX with consensus scoring
+- Added **auto-punctuation by default** — AI refinement (`aiRefineEnabled = true`) with 5-second timeout
+- Built-in **punctuation fallback** (capitalize + period) when no LLM is loaded
+- Added **streaming partial results** handler — shows `🧠 {text}` in status bar during decoding
+- Expanded voice commands from ~25 to **50+ patterns** with natural ASR-friendly phrases
+- Added **"heading one"/"title"** as primary heading commands (more reliable than "hash")
+- New formatting commands: **strikethrough**, **highlight** (start/end pairs)
+- New structure commands: **"add table"**, **"add link"**, **"add image"**, **"divider"**
+- New editing commands: **"undo"/"take that back"**, **"delete that"/"remove that"**
+- New punctuation: **"ellipsis"**, **"open/close quote"**, **"at sign"**, **"bang"**
+- Added **stronger hallucination filter**: rejects output longer than 100 words or with more than 30% non-ASCII characters
+- Improved **model loading progress**: step-by-step [1/4]…[4/4] with file sizes in MB
+- Added `onnxruntime-web` dependency and Moonshine Medium worker (experimental, kept as reference)
+- Fixed: WASM loading error by pointing `ort.env.wasm.wasmPaths` to jsDelivr CDN
+- Fixed: AI refinement blocking text insertion (added `Promise.race` 5s timeout + double safety net)
+- Updated cheat sheet to show natural phrases as primary commands
+- Updated voice dictation feature description in README
+
+---
+
+## Summary
+
+Enhanced the speech-to-text system with auto-punctuation, expanded voice commands with natural ASR-friendly phrases, stronger hallucination filtering, and streaming partial result display. The dual-engine consensus approach (Web Speech API + Moonshine Base ONNX) is retained with improved reliability.
+
+---
+
+## 1. Auto-Punctuation & AI Refinement
+**Files:** `js/speechToText.js`
+**What:** Enabled `aiRefineEnabled = true` by default. When a Qwen model is loaded, speech text is sent through the LLM for punctuation, capitalization, and grammar cleanup. Added a `Promise.race` with a 5-second timeout so that AI refinement cannot block text insertion. Added an `addBasicPunctuation()` fallback that capitalizes the first letter and appends a period when no LLM is available.
+**Impact:** Every speech segment is now capitalized and punctuated, either through LLM refinement or, at minimum, by the built-in capitalize-and-period rule.
+
+## 2. Expanded Voice Commands
+**Files:** `js/speechToText.js`
+**What:** Rewrote `applyMarkdownCommands()` with 50+ regex patterns organized by category (headings, formatting, structure, links/media, table, punctuation, editing). Added natural ASR-friendly aliases for every command: "heading one"/"title" instead of just "hash", "bullet"/"add bullet" instead of just "bullet point", "undo"/"take that back" for undo, and "delete that" to discard text. Multi-word patterns are processed before single-word patterns so that a short pattern cannot match part of a longer phrase.
+**Impact:** Voice commands are more reliable because ASR engines recognize natural phrases like "heading one" far better than "hash".
+
+## 3. Stronger Hallucination Filter
+**Files:** `js/speechToText.js`
+**What:** Enhanced `isHallucination()` to reject outputs with >100 words (likely gibberish) and outputs where >30% of characters are non-ASCII (garbage multilingual hallucination).
+**Impact:** Prevents garbage model outputs from being inserted into the editor.
+
+## 4. Streaming Partial Results
+**Files:** `js/speechToText.js`, `js/moonshine-medium-worker.js`
+**What:** Added a handler for the `'partial'` message type from the worker; it displays `🧠 {text}` in the status bar as tokens are decoded. The worker sends a partial result every 3 tokens during autoregressive decoding.
+**Impact:** Users see real-time feedback of what the model is transcribing (when a streaming-capable model is used).
+
+## 5. Model Loading Progress
+**Files:** `js/speechToText.js`, `js/moonshine-medium-worker.js`
+**What:** Progress messages now show step numbers [1/4]…[4/4], download vs. initialization phases, device info (GPU/CPU), and file sizes in MB. The worker reports the detected execution provider (WebGPU/WASM) when it is ready.
+**Impact:** Users see exactly what's happening during the ~300 MB model download process.
+
+## 6. Moonshine Medium Worker (Experimental)
+**Files:** `js/moonshine-medium-worker.js` [NEW], `package.json`
+**What:** Created a custom ONNX Runtime worker for the Moonshine Medium Streaming model with a 3-model architecture (encoder, decoder, decoder_with_past), KV cache management, and WebGPU auto-detection with WASM fallback. Added the `onnxruntime-web` dependency. The community model (`Mazino0/moonshine-streaming-medium-onnx`) produced garbage output, so the system continues to use the proven Moonshine Base model via the Transformers.js pipeline.
+**Impact:** The worker is architecturally complete and is kept as a reference for when a reliable Medium ONNX model becomes available.
+
+---
+
+## Files Changed (5 total)
+
+| File | Lines Changed | Description |
+|------|:---:|------|
+| `js/speechToText.js` | +120 −50 | Auto-punctuation, expanded voice commands, hallucination filter, streaming handler |
+| `js/moonshine-medium-worker.js` | +349 | New custom ONNX Runtime worker (experimental reference) |
+| `js/moonshine-worker.js` | +0 −0 | Existing Moonshine Base worker (unchanged, re-added to git) |
+| `package.json` | +1 | Added `onnxruntime-web` dependency |
+| `README.md` | +3 −3 | Updated voice dictation description and release notes |
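
Reviewer sketch for changelog sections 1 and 3: the refinement timeout, the punctuation fallback, and the hallucination thresholds could look roughly like the code below. This is a minimal illustration under stated assumptions, not the code in `js/speechToText.js`. Only `addBasicPunctuation`, the 5-second timeout, the 100-word limit, and the 30% non-ASCII limit come from the changelog. The names `refineWithTimeout`, `looksLikeHallucination`, and the `refine` callback are hypothetical, and the rule for what counts as terminal punctuation is an assumption.

```javascript
// Fallback from the changelog: capitalize the first letter and add a period.
// Assumption: text that already ends in . ! ? or … is left unterminated-free.
function addBasicPunctuation(text) {
  const trimmed = text.trim();
  if (trimmed === '') return trimmed;
  const capitalized = trimmed.charAt(0).toUpperCase() + trimmed.slice(1);
  return /[.!?…]$/.test(capitalized) ? capitalized : capitalized + '.';
}

// Hypothetical helper: race the LLM refiner against a timer so that a slow
// or failing refiner can never block text insertion.
async function refineWithTimeout(text, refine, timeoutMs = 5000) {
  if (typeof refine !== 'function') return addBasicPunctuation(text);
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(null), timeoutMs);
  });
  try {
    const refined = await Promise.race([refine(text), timeout]);
    // A null result means the timer won the race.
    return typeof refined === 'string' && refined.trim() !== ''
      ? refined
      : addBasicPunctuation(text);
  } catch (err) {
    // The refiner threw or rejected: fall back to the built-in rule.
    return addBasicPunctuation(text);
  } finally {
    clearTimeout(timer);
  }
}

// Hypothetical helper mirroring the two thresholds named in section 3.
function looksLikeHallucination(text, maxWords = 100, maxNonAsciiRatio = 0.3) {
  const words = text.trim().split(/\s+/).filter(Boolean);
  if (words.length > maxWords) return true;
  const chars = Array.from(text); // iterate by code point, not UTF-16 unit
  if (chars.length === 0) return false;
  const nonAscii = chars.filter((ch) => ch.codePointAt(0) > 127).length;
  return nonAscii / chars.length > maxNonAsciiRatio;
}
```

In this sketch the timer is always cleared in `finally`, so a fast refiner does not leave a pending 5-second timeout behind.
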
@@ -26,7 +26,7 @@
 | **Writing Modes** | Zen mode (distraction-free fullscreen), Focus mode (dimmed paragraphs), Dark mode, multiple preview themes (GitHub, GitLab, Notion, Dracula, Solarized, Evergreen) |
 | **Rendering** | GitHub-style Markdown, syntax highlighting (180+ languages), LaTeX math (MathJax), Mermaid diagrams (zoom/pan/export), PlantUML diagrams, callout blocks, footnotes, emoji, anchor links |
 | **🤖 AI Assistant** | 3 local Qwen 3.5 sizes (0.8B / 2B / 4B via WebGPU/WASM), Gemini 3.1 Flash Lite, Groq Llama 3.3 70B, OpenRouter — summarize, expand, rephrase, grammar-fix, explain, simplify, auto-complete; AI writing tags (Polish, Formalize, Elaborate, Shorten, Image); enhanced context menu; per-card model selection; concurrent block generation; inline review with accept/reject/regenerate; AI-powered image generation |
-| **🎤 Voice Dictation** | Speech-to-text with Markdown-aware commands — hash headings, bold, italic, lists, code blocks, links, and more |
+| **🎤 Voice Dictation** | Dual-engine speech-to-text (Web Speech API + Moonshine Base ONNX) with consensus scoring; 50+ Markdown-aware voice commands — natural phrases ("heading one", "bold…end bold", "add table", "undo"); auto-punctuation via AI refinement or built-in fallback; hallucination filtering; streaming partial results |
 | **Import** | MD, DOCX, XLSX/XLS, CSV, HTML, JSON, XML, PDF — drag & drop or click to import |
 | **Export** | Markdown, self-contained styled HTML, PDF (smart page-breaks, shared rendering pipeline), LLM Memory (5 formats: XML, JSON, Compact JSON, Markdown, Plain Text + shareable link) |
 | **Sharing** | AES-256-GCM encrypted sharing via Firebase; read-only shared links, optional passphrase protection — decryption key stays in URL fragment (never sent to server) |
@@ -244,7 +244,7 @@ Import files directly — they're auto-converted to Markdown client-side:
 <details open>
 <summary><strong>🎤 Voice Dictation — Speak Your Markdown</strong></summary>
 
-**Hands-free writing with Markdown awareness.** Dictate naturally and use voice commands for headings, bold, italic, lists, code blocks, and links. The speech engine understands Markdown — say "hash hash" for an H2 heading.
+**Hands-free writing with Markdown awareness.** Dual-engine ASR combines the Web Speech API and Moonshine Base ONNX with consensus scoring. More than 50 voice commands use natural phrases: say "heading one" or "title" for H1, "bold text end bold" for **text**, "add table" for a Markdown table, or "undo" to revert the last change. Auto-punctuation adds capitalization and periods, with LLM refinement when a model is loaded.
 
 <img src="public/assets/demos/14_voice_dictation.webp" alt="Voice Dictation — speech-to-text with Markdown-aware commands" width="100%">
 
@@ -443,7 +443,7 @@ TextAgent has undergone significant evolution since its inception. What started
 
 | Date | Commits | Feature / Update |
 |------|---------|-----------------|
-| **2026-03-10** | — | 🔌 **API Response UX & Stock Widget** — 📋 Copy button on API review panel + all preview code blocks (hover-reveal); scrollable review body and preview `pre` blocks (max-height: 400px); API→JS variable pipeline (`window.__API_VARS` auto-injected as parsed JS objects into sandbox); stock chart range expansion (1D/1W/1M/1Y/3Y/5Y); removed broken 52D/52W/52M EMA buttons; replaced CORS-blocked ticker search APIs with Yahoo Finance/TradingView links |
+| **2026-03-11** | — | 🎤 **Speech-to-Text Enhancements** — dual-engine voice dictation (Web Speech API + Moonshine Base ONNX) with consensus scoring; auto-punctuation enabled by default (AI refinement with 5s timeout + built-in capitalize/period fallback); 50+ voice commands with natural ASR-friendly aliases ("heading one"/"title" for H1, "undo"/"take that back", "add table"/"add link", "strikethrough…end strike", "ellipsis"/"open quote"); stronger hallucination filter (100-word max, non-ASCII rejection); streaming partial result display; improved model loading progress [1/4]…[4/4] with file sizes; experimental Moonshine Medium ONNX worker (kept as reference) |
 | **2026-03-10** | — | 📈 **Stock Dashboard** — new Finance template category (3 templates: Stock Watchlist, Crypto Tracker, Market Overview) with live TradingView Advanced Chart widgets and 52-period EMA overlay; dynamic `data-var-prefix` grid engine expands one `stock-card` per non-empty variable; configurable `chartRange`, `chartInterval`, `emaPeriod` via `@variables` table; interactive 1M/1Y/3Y range + 52D/52W/52M EMA toggle buttons; `@variables` block persists after ⚡ Vars for re-editing; JS code block dynamically reads `$(cname*)` variables to generate grid HTML; `data-range`, `data-interval`, `data-ema` forwarded through DOMPurify; 179 Playwright tests pass |
 | **2026-03-10** | — | 🛡️ **CSP Fix for Badges** — added `https://img.shields.io` to the `img-src` directive in `index.html` and `nginx.conf` Content-Security-Policy to allow GitHub license and version badges to render correctly; updated legacy domain to `textagent.github.io`. |
 | **2026-03-10** | — | 🧪 **Toolbar Tags Tests Fix** — fixed 4 failing Playwright tests in `toolbar-tags.spec.js` by updating expected tag syntaxes to the new `@` prefix format (`{{@AI:}}`, `{{@Image:}}`, `{{@Agent:}}`), removing the deprecated `Think` tag test, and resolving a race condition where the test suite executed too fast by explicitly waiting for Phase 3 lazy-loaded modules (`M.formattingActions`) to register; added JSDoc types to silence TypeScript execution errors. |
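
Reviewer sketch for the voice-command ordering mentioned in the changelog ("multi-word patterns processed before single-word"): the example below shows why the order matters. It is an assumption-laden illustration, not the real `applyMarkdownCommands()`. The command table, the replacement strings, and the names `escapeRegExp` and `applyCommands` are hypothetical; only the spoken phrases come from the changelog.

```javascript
// Escape a phrase so it can be embedded in a RegExp literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical command runner: longer phrases are applied first so that
// "bullet" cannot consume the first word of "bullet point".
function applyCommands(text, commands) {
  const ordered = [...commands].sort(
    (a, b) => b.phrase.split(' ').length - a.phrase.split(' ').length
  );
  let out = text;
  for (const { phrase, replacement } of ordered) {
    const pattern = new RegExp('\\b' + escapeRegExp(phrase) + '\\b\\s*', 'gi');
    // A function replacement avoids special handling of "$" in the string.
    out = out.replace(pattern, () => replacement);
  }
  return out;
}

// Illustrative table; the Markdown output for each phrase is an assumption.
const commands = [
  { phrase: 'bullet', replacement: '- ' },
  { phrase: 'bullet point', replacement: '- ' },
  { phrase: 'heading one', replacement: '# ' },
];

console.log(applyCommands('bullet point milk', commands));
console.log(applyCommands('Heading one Introduction', commands));
```

If the single-word `bullet` pattern ran first, the first input would become `- point milk`; sorting by word count avoids that.
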
 
@@ -38,7 +38,7 @@ async function validateApiKey() {
     }
 }
 
-async function generate(taskType, context, userPrompt, messageId, enableThinking = false) {
+async function generate(taskType, context, userPrompt, messageId, enableThinking = false, attachments = []) {
     if (!apiKey) {
         self.postMessage({ type: 'error', message: 'API key not set.', messageId });
         return;
@@ -53,10 +53,29 @@ async function generate(taskType, context, userPrompt, messageId, enableThinking
         const userMessages = messages.filter(m => m.role !== 'system');
 
         const requestBody = {
-            contents: userMessages.map(m => ({
-                role: m.role === 'assistant' ? 'model' : 'user',
-                parts: [{ text: m.content }],
-            })),
+            contents: userMessages.map((m, index) => {
+                const parts = [{ text: m.content }];
+                // For the last user message only, add image attachments as inlineData parts
+                if (index === userMessages.length - 1 && m.role === 'user' && attachments && attachments.length > 0) {
+                    attachments.forEach(att => {
+                        if (att.type === 'image' && att.data) {
+                            parts.push({
+                                inlineData: {
+                                    mimeType: att.mimeType || 'image/png',
+                                    data: att.data
+                                }
+                            });
+                        } else if (att.type === 'file' && att.textContent) {
+                            // Append text file content as additional context
+                            parts[0].text += '\n\n[Attached File: ' + (att.name || 'file') + ']\n' + att.textContent;
+                        }
+                    });
+                }
+                return {
+                    role: m.role === 'assistant' ? 'model' : 'user',
+                    parts: parts,
+                };
+            }),
             generationConfig: {
                 maxOutputTokens: maxTokens,
                 temperature: 0.7,
@@ -155,11 +174,11 @@ function buildMessages(taskType, context, userPrompt) {
 }
 
 self.addEventListener('message', async (event) => {
-    const { type, taskType, context, userPrompt, messageId, enableThinking } = event.data;
+    const { type, taskType, context, userPrompt, messageId, enableThinking, attachments } = event.data;
     switch (type) {
         case 'setApiKey': apiKey = event.data.apiKey; break;
         case 'load': await validateApiKey(); break;
-        case 'generate': await generate(taskType, context, userPrompt, messageId, enableThinking); break;
+        case 'generate': await generate(taskType, context, userPrompt, messageId, enableThinking, attachments); break;
         case 'ping': self.postMessage({ type: 'pong' }); break;
     }
 });
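
Reviewer sketch: the `contents` mapping in the hunk above can be exercised in isolation with a standalone function such as the one below. It is written so that attachments go only on the final user message, as the in-code comment describes. The function name `buildGeminiContents` is hypothetical; the attachment fields (`type`, `data`, `mimeType`, `name`, `textContent`) and the `inlineData` part shape are taken from the diff.

```javascript
// Hypothetical standalone version of the request-body mapping.
function buildGeminiContents(messages, attachments = []) {
  const userMessages = messages.filter((m) => m.role !== 'system');
  const lastIndex = userMessages.length - 1;

  return userMessages.map((m, index) => {
    const parts = [{ text: m.content }];
    if (index === lastIndex && m.role === 'user') {
      for (const att of attachments || []) {
        if (att.type === 'image' && att.data) {
          // Base64 image payload as an inlineData part.
          parts.push({
            inlineData: { mimeType: att.mimeType || 'image/png', data: att.data },
          });
        } else if (att.type === 'file' && att.textContent) {
          // Text files are appended to the prompt text as extra context.
          parts[0].text +=
            '\n\n[Attached File: ' + (att.name || 'file') + ']\n' + att.textContent;
        }
      }
    }
    return { role: m.role === 'assistant' ? 'model' : 'user', parts };
  });
}

const contents = buildGeminiContents(
  [
    { role: 'system', content: 'You are helpful.' },
    { role: 'user', content: 'First question' },
    { role: 'assistant', content: 'First answer' },
    { role: 'user', content: 'Describe this image' },
  ],
  [{ type: 'image', mimeType: 'image/jpeg', data: 'AAAA' }]
);
console.log(JSON.stringify(contents, null, 2));
```
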
@@ -84,7 +84,7 @@ async function validateApiKey() {
 /**
  * Generate text via Groq API with SSE streaming
  */
-async function generate(taskType, context, userPrompt, messageId, enableThinking = false) {
+async function generate(taskType, context, userPrompt, messageId, enableThinking = false, attachments = []) {
     if (!apiKey) {
         self.postMessage({
             type: 'error',
@@ -97,6 +97,25 @@ async function generate(taskType, context, userPrompt, messageId, enableThinking
     try {
         const messages = buildMessages(taskType, context, userPrompt);
 
+        // If there are attachments, convert the last user message to multipart content (images become image_url parts; text files are appended to the text part)
+        if (attachments && attachments.length > 0) {
+            const lastUserMsg = messages[messages.length - 1];
+            if (lastUserMsg && lastUserMsg.role === 'user') {
+                const parts = [{ type: 'text', text: lastUserMsg.content }];
+                attachments.forEach(att => {
+                    if (att.type === 'image' && att.data) {
+                        parts.push({
+                            type: 'image_url',
+                            image_url: { url: 'data:' + (att.mimeType || 'image/png') + ';base64,' + att.data }
+                        });
+                    } else if (att.type === 'file' && att.textContent) {
+                        parts[0].text += '\n\n[Attached File: ' + (att.name || 'file') + ']\n' + att.textContent;
+                    }
+                });
+                lastUserMsg.content = parts;
+            }
+        }
+
         let maxTokens = TOKEN_LIMITS[taskType] || 512;
         if (enableThinking) maxTokens = Math.max(maxTokens * 2, 1024);
 
@@ -263,7 +282,7 @@ function buildMessages(taskType, context, userPrompt) {
 
 // Listen for messages from the main thread
 self.addEventListener('message', async (event) => {
-    const { type, taskType, context, userPrompt, messageId, enableThinking } = event.data;
+    const { type, taskType, context, userPrompt, messageId, enableThinking, attachments } = event.data;
 
     switch (type) {
         case 'setApiKey':
@@ -273,7 +292,7 @@ self.addEventListener('message', async (event) => {
             await validateApiKey();
             break;
         case 'generate':
-            await generate(taskType, context, userPrompt, messageId, enableThinking);
+            await generate(taskType, context, userPrompt, messageId, enableThinking, attachments);
             break;
         case 'ping':
             self.postMessage({ type: 'pong' });
 
@@ -49,7 +49,7 @@ async function validateApiKey() {
     }
 }
 
-async function generate(taskType, context, userPrompt, messageId, enableThinking = false) {
+async function generate(taskType, context, userPrompt, messageId, enableThinking = false, attachments = []) {
     if (!apiKey) {
         self.postMessage({ type: 'error', message: 'API key not set.', messageId });
         return;
@@ -59,6 +59,24 @@ async function generate(taskType, context, userPrompt, messageId, enableThinking
         let maxTokens = TOKEN_LIMITS[taskType] || 512;
         if (enableThinking) maxTokens = Math.max(maxTokens * 2, 1024);
 
+        // If there are attachments, convert the last user message to multipart content (images become image_url parts; text files are appended to the text part)
+        if (attachments && attachments.length > 0) {
+            const lastUserMsg = messages[messages.length - 1];
+            if (lastUserMsg && lastUserMsg.role === 'user') {
+                const parts = [{ type: 'text', text: typeof lastUserMsg.content === 'string' ? lastUserMsg.content : '' }];
+                attachments.forEach(att => {
+                    if (att.type === 'image' && att.data) {
+                        parts.push({
+                            type: 'image_url',
+                            image_url: { url: 'data:' + (att.mimeType || 'image/png') + ';base64,' + att.data }
+                        });
+                    } else if (att.type === 'file' && att.textContent) {
+                        parts[0].text += '\n\n[Attached File: ' + (att.name || 'file') + ']\n' + att.textContent;
+                    }
+                });
+                lastUserMsg.content = parts;
+            }
+        }
         const response = await fetch(OPENROUTER_API_URL, {
             method: 'POST',
             headers: {
@@ -150,12 +168,12 @@ function buildMessages(taskType, context, userPrompt) {
 }
 
 self.addEventListener('message', async (event) => {
-    const { type, taskType, context, userPrompt, messageId, enableThinking } = event.data;
+    const { type, taskType, context, userPrompt, messageId, enableThinking, attachments } = event.data;
     switch (type) {
         case 'setApiKey': apiKey = event.data.apiKey; break;
         case 'setModelId': modelId = event.data.modelId; break;
         case 'load': await validateApiKey(); break;
-        case 'generate': await generate(taskType, context, userPrompt, messageId, enableThinking); break;
+        case 'generate': await generate(taskType, context, userPrompt, messageId, enableThinking, attachments); break;
         case 'ping': self.postMessage({ type: 'pong' }); break;
     }
 });
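
Reviewer sketch: the Groq and OpenRouter workers repeat the same conversion of the last user message into multipart content. One way to test that logic on its own is a small pure function like the one below. The name `withAttachments` is hypothetical, and unlike the code in the diff this version returns a new array instead of mutating `messages`; the part shapes (`type: 'text'`, `type: 'image_url'` with a base64 data URL) and the attachment fields follow the diff.

```javascript
// Hypothetical helper: returns a copy of `messages` whose last user message
// carries the attachments as multipart content.
function withAttachments(messages, attachments = []) {
  if (!attachments || attachments.length === 0) return messages;
  const last = messages[messages.length - 1];
  if (!last || last.role !== 'user') return messages;

  const parts = [
    { type: 'text', text: typeof last.content === 'string' ? last.content : '' },
  ];
  for (const att of attachments) {
    if (att.type === 'image' && att.data) {
      // Images are sent as data URLs built from the base64 payload.
      parts.push({
        type: 'image_url',
        image_url: { url: 'data:' + (att.mimeType || 'image/png') + ';base64,' + att.data },
      });
    } else if (att.type === 'file' && att.textContent) {
      // Text files are appended to the text part as extra context.
      parts[0].text +=
        '\n\n[Attached File: ' + (att.name || 'file') + ']\n' + att.textContent;
    }
  }
  return [...messages.slice(0, -1), { ...last, content: parts }];
}

const original = [
  { role: 'system', content: 'You are helpful.' },
  { role: 'user', content: 'hi' },
];
const converted = withAttachments(original, [
  { type: 'image', data: 'AAAA' },
  { type: 'file', name: 'a.txt', textContent: 'body' },
]);
console.log(JSON.stringify(converted[1].content, null, 2));
```

Returning a copy keeps the caller's `messages` array untouched, which makes the helper easy to share between both workers and to unit-test.
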