Feature/chat level up #32
Closed
cryptopoly wants to merge 83 commits into staging from
tauri-action 0.6.0 used 'uploadUpdaterJson', which only suppressed latest.json uploads but kept the per-bundle .sig sidecars. The 0.6.2 bump (#5b50bb2) renamed the input to 'includeUpdaterJson', which now also suppresses .sig uploads. Without sigs the publish-manifest job has nothing to combine into latest.json, so the auto-updater never sees new releases.

Set includeUpdaterJson: true so .sig files reach the release. The publish-manifest job still wins on the final latest.json because generate-updater-manifest.mjs deletes any existing latest.json asset before uploading its combined version.
…t levels
Twelve composer + thread improvements that close the gap with mainstream
local chat clients while keeping the substrate-first identity intact.
Frontend
- Shared RichMarkdown wrapper with Prism syntax highlighting (oneDark),
per-block copy button + language badge, GFM tables, and KaTeX math
- Collapsible left sidebar with localStorage persistence and expand chevron
- Per-thread session search filtering on title, message body, and reasoning
- Per-thread export to Markdown, JSON, and plain text via dropdown
- Slash-command menu in composer (/clear, /think on|off, /tools on|off,
/model, /cancel, /export md|json|txt) with arrow-key navigation
- Per-message temperature override chip with slider, numeric input,
reset, and per-session localStorage persistence
- Reasoning effort segmented control (Off / Low / Med / High) replacing
the binary thinking toggle, with per-session persistence
Backend
- POST /api/chat/generate/{session_id}/cancel endpoint flips an in-memory
cancel flag; the streaming loop checks between events, breaks early,
persists partial output, and emits a `cancelled: true` SSE chunk
- _build_history_with_reasoning() reattaches stored <think>...</think>
blocks to assistant turns when the thread is in auto thinking mode,
so reasoning-capable models retain context across follow-ups
- ThinkingTokenFilter accepts open_tag / close_tag constructor params;
reasoning_delimiters_for(model_ref) registry helper resolves per-model
overrides (default <think>...</think> preserved for all current models)
- GenerateRequest gains topP, seed, and reasoningEffort fields ready for
the Phase 2 sampler exposure pass to wire end-to-end
Tests
- 232 vitest cases (was 207); 768 pytest cases (was 765)
- New: exportThread.test.ts, sessionSearch.test.ts, slashCommands.test.ts,
test_history_with_reasoning.py, custom-tag ThinkingTokenFilter cases,
cancel-chat endpoint cases
- npx tsc --noEmit clean, npm test green, pytest green
Deferred to Phase 2.2 (full sampler exposure)
- top_p / seed inline chips (fields accepted, plumbing pending)
- reasoning_effort routing into llama-server / MLX worker payloads
- Per-chat reasoning delimiter override UI
The image and video catalogs surface variants like 'FLUX.1 Dev · mflux (MLX)' and 'LTX-2 · distilled (MLX)' that route through mflux or mlx-video on Apple Silicon. Both depend on the mlx wheel, which has no Linux or Windows builds, so picking one of those entries on the wrong OS is a guaranteed dead end.

Add backend_service/helpers/platform_filter.py with:
- is_apple_silicon(system, machine) — pure-function platform check (parameters exposed for tests without monkeypatching).
- is_mlx_only_variant(variant) — detects variants by explicit mlxOnly flag, engine == 'mflux' / 'mlx-video', or runtime strings ending in 'mflux (MLX native)' / 'mlx-video (MLX native)'.
- filter_mlx_only_families(families, on_apple_silicon=...) — drops MLX-only variants on non-Apple hosts and removes families whose entire variant set was MLX-only. Returns a new list, never mutates.

Wire the filter into _image_model_payloads (helpers/images.py) and _video_model_payloads (helpers/video.py) at the end of payload construction so the catalog routes (/api/images/catalog and /api/video/catalog) return only the variants that can run on the current host.

Surfaced by the v0.7.2 smoke test on a Windows / RTX 4090 box: FLUX.1 Dev mflux and the LTX-2 MLX variants showed up in the model dropdowns but failed at preload time because mlx isn't installable. Filtering server-side keeps the dropdowns honest without changing any frontend code.
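A minimal sketch of the three helper shapes described above, assuming each variant is a plain dict with optional `mlxOnly`, `engine`, and `runtime` keys (the field names are illustrative, not the shipped schema):

```python
# Hypothetical sketch of backend_service/helpers/platform_filter.py; the
# variant dict keys are assumptions drawn from the commit text.
from typing import Any


def is_apple_silicon(system: str, machine: str) -> bool:
    # Pure function: tests pass values directly instead of monkeypatching platform.*
    return system.lower() == "darwin" and machine.lower() in {"arm64", "aarch64"}


def is_mlx_only_variant(variant: dict[str, Any]) -> bool:
    # Explicit flag wins; otherwise fall back to engine / runtime-string sniffing.
    if variant.get("mlxOnly"):
        return True
    if variant.get("engine") in {"mflux", "mlx-video"}:
        return True
    runtime = str(variant.get("runtime", ""))
    return runtime.endswith(("mflux (MLX native)", "mlx-video (MLX native)"))


def filter_mlx_only_families(
    families: list[dict[str, Any]], *, on_apple_silicon: bool
) -> list[dict[str, Any]]:
    # Returns a new list; never mutates the catalog structures passed in.
    if on_apple_silicon:
        return list(families)
    filtered = []
    for family in families:
        variants = [v for v in family.get("variants", []) if not is_mlx_only_variant(v)]
        if variants:  # drop families whose entire variant set was MLX-only
            filtered.append({**family, "variants": variants})
    return filtered
```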
Two related Windows-only bugs surfaced by the v0.7.2 smoke test on an RTX 4090 box.

Bug #6 — RTX 4090 reported as 12 GB total
GPUMonitor._snapshot_nvidia() shells out to nvidia-smi, and on Windows boxes without it on PATH (driver installed but no CUDA toolkit) it fell through to _fallback_psutil(), which returns psutil.virtual_memory().total — system RAM, not VRAM. The image / video safety estimators then read that as the GPU budget and produced 'Likely to crash' warnings on a 24 GB card holding an 11 GB FLUX model.
Fix:
- Try torch.cuda.get_device_properties(0).total_memory first. When the GPU bundle is installed this is the most reliable source — it reads through the CUDA driver, no PATH needed.
- Fall back to nvidia-smi as before.
- Drop the psutil fallback. When neither answers we now return {'vram_total_gb': None}, which the TS estimators (utils/images.ts, utils/videos.ts) already treat as 'unknown' via the DEFAULT_*_MEMORY_GB fallbacks. Better an honest 'unknown' than a wrong 12 GB.

Bug #7 — Image gen produces gibberish placeholder after install
DiffusersImageEngine.probe() uses importlib.util.find_spec to decide between the placeholder engine and the real diffusers pipeline. Once the GPU bundle install lands new packages into the extras dir, importlib's negative-lookup cache still answers None for the new modules until invalidate_caches() is called. The probe kept reporting realGenerationAvailable=False and the generation pipeline returned the SVG placeholder, which lands as a gibberish image when the frontend renders it as data:image/svg+xml.
Fix:
- probe() now calls importlib.invalidate_caches() before find_spec so newly-installed packages are picked up without a backend restart.
- The GPU bundle worker (_gpu_bundle_job_worker) now also calls invalidate_caches and resets the VRAM total cache when it transitions to phase=done, so the immediately-following capabilities snapshot reflects the freshly-importable torch.

Tests
- tests/test_gpu_detection.py — 9 unit tests covering torch.cuda detection, nvidia-smi precedence, the new no-system-RAM fallback path, and the process-lifetime cache. All pass; existing pytest suite still green.
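A hedged sketch of the Bug #6 detection order (torch first, nvidia-smi second, honest None last); the function name and surrounding GPUMonitor plumbing are assumptions, the precedence follows the commit text:

```python
# Illustrative VRAM detection order: torch.cuda first, nvidia-smi second,
# and None (not system RAM) when neither answers.
import shutil
import subprocess


def detect_vram_total_gb() -> float | None:
    # 1. torch reads through the CUDA driver, so no PATH requirement.
    try:
        import torch
        if torch.cuda.is_available():
            props = torch.cuda.get_device_properties(0)
            return props.total_memory / (1024 ** 3)
    except Exception:
        pass
    # 2. nvidia-smi, when it is actually on PATH.
    if shutil.which("nvidia-smi"):
        try:
            lines = subprocess.run(
                ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
                capture_output=True, text=True, timeout=5, check=True,
            ).stdout.strip().splitlines()
            if lines:
                return float(lines[0]) / 1024  # MiB -> GiB
        except Exception:
            pass
    # 3. No psutil fallback: system RAM is not VRAM.
    return None
```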
The streaming chat surface previously showed a bare blinking cursor while the model was still ingesting the prompt. On large contexts that's indistinguishable from a hung generation. Surface the phase explicitly.

Backend
- generate_stream now emits a `phase: prompt_eval` SSE chunk before invoking the runtime, then a `phase: generating` chunk (with `ttftSeconds`) the moment the first token / reasoning fragment arrives.
- _stream_assistant_metrics_payload accepts a ttft_seconds kwarg and passes it through to the assistant message metrics so the value persists on the finalised turn.

Frontend
- New PromptPhaseIndicator component with elapsed-time tick (250 ms) and a phase-specific colour treatment (neutral while ingesting, accent-tinted once tokens flow).
- ChatStreamPhase type + StreamCallbacks.onPhase callback in api.ts parser.
- useChat seeds the optimistic assistant placeholder with streamPhase: "prompt_eval" so the indicator shows immediately on send, before the backend's first SSE chunk arrives. The phase advances on each onPhase event and clears via the onDone session refresh.
- ChatTab renders the indicator above the markdown content while the message is streaming and a phase is set, replacing the blinking cursor for that interval.

Tests
- src/__tests__/streamPhase.test.ts covers the SSE parser routing prompt_eval / generating events with optional ttftSeconds and ignoring unknown phase strings.

Verification: tsc --noEmit clean, npm test 236 / 236, pytest 795
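A minimal sketch of the phase-chunk ordering described above, assuming SSE chunks are plain JSON payloads (the shipped generate_stream carries far more per-turn state):

```python
# Illustrative ordering of the phase chunks around the runtime call.
import json
import time
from typing import Iterator


def generate_stream_sketch(runtime_tokens: Iterator[str]) -> Iterator[str]:
    def sse(payload: dict) -> str:
        return f"data: {json.dumps(payload)}\n\n"

    # Announce prompt ingestion before the runtime is invoked.
    started = time.monotonic()
    yield sse({"phase": "prompt_eval"})

    first_token_seen = False
    for token in runtime_tokens:
        if not first_token_seen:
            first_token_seen = True
            # Switch phase the moment the first token / reasoning fragment arrives.
            yield sse({"phase": "generating", "ttftSeconds": time.monotonic() - started})
        yield sse({"token": token})
```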
The Tauri 2 NSIS installer is configured with a custom installer hook
file at src-tauri/installer.nsh. The hook intentionally ships as
empty stubs that document the contract the GPU runtime depends on:
%LOCALAPPDATA%\ChaosEngineAI\extras\cp{major}{minor}\site-packages
This directory holds the GPU bundle (torch + diffusers + transformers,
~2.5 GB) that the Image Studio install button writes via pip. Tauri's
default uninstaller leaves the path alone today, but the explicit hook
file makes that intent visible — anyone adding RM logic in the future
gets the comment block as a guardrail.
Changes:
- Add src-tauri/installer.nsh with documented empty pre/post
install + uninstall hooks. NSIS_HOOK_POSTUNINSTALL carries the
preserve-extras contract in a comment so the rule can't drift.
- Wire the hook into src-tauri/tauri.conf.json via
bundle.windows.nsis.installerHooks: ./installer.nsh.
- Add a comment block in src-tauri/src/lib.rs::chaosengine_extras_root
pointing at the NSIS hook so a Rust-side path move doesn't silently
break the Windows-side contract.
- Add tests/test_extras_path.py pinning the
ChaosEngineAI/extras/cp{maj}{min}/site-packages shape so any
future move loud-fails the suite. The Python ABI tag pin matches
sys.version_info against the resolved path.
Surfaced by the v0.7.2 smoke test on Windows: the user reported the
GPU runtime had been wiped after an uninstall + reinstall cycle. The
default uninstaller path doesn't touch the extras tree on the bench
config we ship, but pinning the contract via these hooks + tests
makes the regression visible if anyone adds custom uninstall logic
later.
Tests:
- .venv/bin/python -m pytest tests/test_extras_path.py -v — 3/3 pass
- Pre-existing tests still pass
…ards

A laptop crash during chat (caused by an unattended generation hanging on prefill) prompted four watchdog layers that catch the most common runaway failure modes early — before the host wedges, swap-thrashes, or OOM-kills.

A. Stuck prompt-eval timeout (frontend)
useChat arms a 60-second timer when the backend announces the prompt_eval phase. If the timer fires before the generating phase transition, it calls cancelChatGeneration, aborts the local stream, and surfaces an actionable error explaining the likely causes (oversized context, OOM, thermal throttle). The timer is cleared on phase transition, onDone, onError, and manual cancel so a stale timer can't abort a follow-up turn.

B. Pre-flight memory gate (backend)
New backend_service/helpers/memory_gate.py with `gate_chat_generation` and `snapshot_memory_signals`. Refuses chat generations when free RAM is below 1 GB OR combined memory pressure exceeds 92%. The refusal is emitted as a regular SSE error chunk so the existing error-handling path renders the message — no new client wiring. Gate exceptions never block legitimate work: a psutil glitch logs a warning and falls through.

D. Output-length runaway guard (backend)
The streaming loop now aborts when accumulated assistant text exceeds max_tokens × 6 characters (1.5× the budget at ~4 chars/token average). Catches decoder loops on quantised models that ignore the EOS token. Logged separately from user-initiated cancellation.

E. Reasoning budget cap (backend)
ThinkingTokenFilter now accepts max_reasoning_chars (default 32_000 ≈ 8000 tokens). When inside <think> without a close tag and the cap is reached, the filter force-emits reasoning_done, stops appending to reasoning, and routes leftover bytes to text so the assistant turn finalises instead of streaming reasoning forever. Pass None to disable per-call.

Tests
- tests/test_memory_gate.py: 6 cases covering pass / refuse / boundary / custom-threshold paths.
- tests/test_mlx_worker.py: 4 new ThinkingTokenFilter cases for the budget cap (force-close, disabled-when-None, validation, normal close-tag still works).
- Verification: tsc --noEmit clean, npm test 236, pytest 805 (+10 new).
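A hedged sketch of the layer-B gate shape, assuming psutil is available and the thresholds are plain keyword defaults; the shipped helper returns richer signal data and psutil's percent stands in for the combined pressure figure:

```python
# Illustrative pre-flight memory gate: refuse when free RAM is too low or
# pressure is too high, but never let a psutil failure block legitimate work.
import logging

import psutil

log = logging.getLogger(__name__)


def gate_chat_generation(min_free_gb: float = 1.0, max_pressure_pct: float = 92.0) -> str | None:
    """Return a human-readable refusal message, or None when generation may proceed."""
    try:
        vm = psutil.virtual_memory()
    except Exception:
        log.warning("memory gate could not sample psutil; allowing generation")
        return None
    free_gb = vm.available / (1024 ** 3)
    pressure_pct = vm.percent  # approximation; the shipped gate combines several signals
    if free_gb < min_free_gb or pressure_pct > max_pressure_pct:
        return (
            f"Refusing to start generation: {free_gb:.1f} GB free RAM, "
            f"{pressure_pct:.0f}% memory pressure. Close other apps or unload a model."
        )
    return None
```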
… banners, image/video gates

Five additional watchdog layers on top of the Phase 2.0.5 baseline. Each covers a distinct runaway failure mode that was previously invisible to the user until the host became unresponsive.

C. Tok/s floor monitor (chat)
The streaming loop now samples decode rate over a 30-second rolling window. Falling below 0.3 tok/s for the full window aborts the generation with a thermal-throttle / GPU-stall / worker-deadlock diagnostic. Cheap — chunk count proxies for tokens, no per-tick psutil hit.

F. Repetition guard for the llama.cpp path
`RunawayGuard` was previously bound to the MLX subprocess. Lifted to `backend_service/runaway_guard.py`; the `mlx_worker` module re-exports the symbol so existing imports (and tests) continue to resolve. The chat stream loop in `state.py` now feeds each chunk through a fresh guard so identical-line repetition or near-duplicate reasoning loops abort the stream within a few hundred tokens.

G. Panic banner (chat)
The streaming loop samples memory every 5 seconds. When free RAM drops below 0.5 GB OR pressure tops 96%, a `panic` SSE event fires once per turn. The frontend renders a non-blocking red banner with the live numbers and a Cancel affordance. Generation is *not* auto-cancelled — the user decides.

H. Image / video pre-flight memory gates
Same shape as the chat gate from earlier in 2.0.5. Image gen refuses below 4 GB free / 88% pressure; video below 6 GB / 85% (strictest of the three because diffusion swap-thrash on Apple Silicon historically wedged the host). Routes raise 503 with the human-readable refusal message so existing modal error paths render it without new wiring.

I. Thermal pressure banner (chat, macOS)
New `helpers/thermal.py` parses `pmset -g therm` output (no sudo required) and classifies into nominal / moderate / critical. The stream loop emits a `thermalWarning` SSE event the first time the classifier returns "critical"; the frontend renders an amber banner distinguishing it from the red memory panic. Linux / Windows return None and the watcher is a no-op there until cross-platform thermal telemetry lands in Phase 3.5.

J. Worker liveness probe
Deferred — the C tok/s floor catches hung worker scenarios in practice. A dedicated ping/pong probe will land alongside Phase 3 substrate telemetry where worker heartbeats are first-class.

Tests
- `tests/test_thermal.py` — 8 cases covering the pmset classifier
- `tests/test_memory_gate.py` — 6 new cases for image/video gates
- `tests/test_runaway_guard.py` — 5 cases for the shared module + alias identity

Verification: tsc --noEmit clean, npm test 236, pytest 824 (+19 new).
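For layer I, a hypothetical classifier over `pmset -g therm` output. The commit does not show which fields the shipped helpers/thermal.py reads or where its thresholds sit; this sketch assumes classification off the CPU_Speed_Limit value pmset reports, with made-up cut-offs:

```python
# Hypothetical pmset-based thermal classifier; field choice and thresholds
# are assumptions, only the nominal/moderate/critical labels follow the text.
import re
import subprocess
import sys


def classify_thermal_pressure() -> str | None:
    """Return 'nominal' / 'moderate' / 'critical', or None off macOS or on failure."""
    if sys.platform != "darwin":
        return None
    try:
        out = subprocess.run(
            ["pmset", "-g", "therm"], capture_output=True, text=True, timeout=5, check=True
        ).stdout
    except Exception:
        return None
    match = re.search(r"CPU_Speed_Limit\s*=\s*(\d+)", out)
    if not match:
        return None
    speed_limit = int(match.group(1))  # 100 means no throttling
    if speed_limit >= 90:
        return "nominal"
    if speed_limit >= 60:
        return "moderate"
    return "critical"
```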
…hread / ChatComposer
ChatTab.tsx had grown to 1085 lines holding the sidebar, thread header,
message list, and composer all in one component. Phase 2.4 (conversation
branching), 2.5 (in-thread multi-model compare), and 2.9 (@-mention
system) need to swap out individual pieces of that surface — which
isn't tractable while everything is interleaved.
Split into four siblings under `src/features/chat/`:
- ChatSidebar.tsx (148 lines)
Session list with title/body search, pin / delete, warm-model
badges, and the collapse toggle. Owns its own filter call so
parent doesn't have to memo it.
- ChatHeader.tsx (223 lines)
Thread title editor, model picker, export-format dropdown,
runtime summary, document chips, optional sidebar-expand
toggle when the sidebar is collapsed.
- ChatThread.tsx (375 lines)
Message list with reasoning panels, prompt-phase indicator,
panic and thermal banners, tool-call cards, citations, and
the per-turn metrics fold-out. Drag-drop forwards files via
`onChatFileDrop`.
- ChatComposer.tsx (283 lines)
Image previews, slash-command popover, textarea (with arrow-
key + Tab + Esc handling), thinking-effort segmented control,
temperature chip, tools toggle, send / stop / clear buttons.
ChatTab.tsx (388 lines, -64%) is now a composition root that owns:
- sidebar collapse state
- session search query
- slash-command match list and selection cursor
- per-thread temperature override (with localStorage glue)
- per-thread reasoning effort level (with localStorage glue)
Children receive narrow prop slices. No behaviour change — every
existing flow (export menu, slash commands, temp chip, panic banner,
thermal warning, sidebar collapse) renders identically and exercises
the same handlers.
Verification: tsc --noEmit clean, npm test 236, pytest 824.
…ty / seed / mirostat / json_schema / reasoning_effort

Closes the Phase 1.10 + 1.12 deferrals. Per-thread sampler overrides now flow end-to-end from the SamplerPanel popover through GenerateRequest, RuntimeController, both engine implementations, and out to llama-server / mlx-lm.

Backend
- GenerateRequest gains topP, topK, minP, repeatPenalty, mirostatMode, mirostatTau, mirostatEta, seed, jsonSchema (already had reasoningEffort from Phase 1.12). Each defaults to None so the backend's defaults stay in force when the UI sends no override.
- New helper `_apply_sampler_kwargs` in inference.py merges the override dict + reasoning_effort + json_schema into a llama-server /v1/chat/completions payload. JSON schema is wrapped in the OpenAI structured-outputs `response_format` envelope.
- New helper `_build_sampler_overrides` in state.py projects a GenerateRequest into the snake_case dict the engines consume.
- BaseInferenceEngine, LlamaCppEngine, MLXWorkerEngine, RemoteOpenAIEngine, and RuntimeController gain `samplers`, `reasoning_effort`, `json_schema` kwargs end-to-end.
- mlx_worker `_build_mlx_sampler` calls `make_sampler` with whatever Phase 2.2 sampler subset the installed mlx-lm version supports (top_p, top_k, min_p), filtered via signature inspection so older mlx-lm builds fall back gracefully.

Frontend
- New `SamplerPanel` component: popover with numeric inputs for each Phase 2.2 sampler plus a mirostat mode selector that only reveals tau / eta when modes 1/2 are picked. Override badge shows count.
- `samplerOverrides.ts`: storage helpers (`readSamplerOverrides`, `writeSamplerOverrides`, `samplerPayload`) keyed by session id with defensive sanitisation against corrupt blobs.
- `ChatTab` owns the override state, persists on every change.
- `ChatComposer` renders the panel next to the temperature chip.
- `useChat` reads from the same localStorage key when assembling stream payloads — single source of truth.

Tests
- tests/test_sampler_payload.py: 9 cases across `_apply_sampler_kwargs` and `_build_sampler_overrides` covering pass-through, none-skip, unknown-key ignore, json_schema envelope, reasoning_effort plumbing.
- src/features/chat/__tests__/samplerOverrides.test.ts: 9 cases for storage round-trip, sanitisation, per-session scoping, payload projection. Inline localStorage shim works around the node-only vitest environment.
- tests/test_backend_service.py FakeRuntime fixture extended to accept the new kwargs so existing chat-completion tests still pass.

Verification: tsc --noEmit clean, vitest 245 (+9), pytest 833 (+9).

Deferred to follow-up sprint
- DRY sampler (llama-server supports it but the API is fiddly and benefits from per-context-length tuning)
- XTC sampler (still new; few models have published settings)
- Free-form GBNF grammars (json_schema covers the common case)
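A hedged sketch of the `_apply_sampler_kwargs` merge described in the Backend list above; the accepted key set and payload shape are assumptions drawn from the commit text, not the shipped signature:

```python
# Illustrative merge of sampler overrides + reasoning effort + JSON schema
# into a llama-server /v1/chat/completions payload.
from typing import Any

_LLAMA_SAMPLER_KEYS = {
    "top_p", "top_k", "min_p", "repeat_penalty",
    "mirostat", "mirostat_tau", "mirostat_eta", "seed",
}


def apply_sampler_kwargs(
    payload: dict[str, Any],
    samplers: dict[str, Any] | None = None,
    reasoning_effort: str | None = None,
    json_schema: dict[str, Any] | None = None,
) -> dict[str, Any]:
    merged = dict(payload)
    for key, value in (samplers or {}).items():
        # Unknown keys are ignored; None values leave the server default in force.
        if key in _LLAMA_SAMPLER_KEYS and value is not None:
            merged[key] = value
    if reasoning_effort is not None:
        merged["reasoning_effort"] = reasoning_effort
    if json_schema is not None:
        # OpenAI structured-outputs envelope that llama-server understands.
        merged["response_format"] = {
            "type": "json_schema",
            "json_schema": {"name": "response", "schema": json_schema},
        }
    return merged
```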
Loaded models now declare what they can do — vision, tools, reasoning,
coding, agents, audio, video, multilingual — and the chat surface uses
those declarations to render capability badges and gate composer
affordances. Picking a text-only model hides the image attach button;
picking a non-tool-capable model hides the Tools toggle; picking a
non-reasoning model hides the thinking effort segmented control.
Backend
- New `backend_service/catalog/capabilities.py`:
* `ModelCapabilities` dataclass with eight typed boolean flags plus
a free-form `tags` tuple preserving the catalog's original strings
so the UI can render badges without re-deriving them.
* `_CAPABILITY_TO_FLAG` maps catalog strings ("vision", "tool-use",
"thinking", "multilingual", etc.) to the typed fields.
* `resolve_capabilities(ref, canonical_repo)` walks the curated
`MODEL_FAMILIES` catalog. Variant match wins; falls back to
family-level entry when a quantised fork's ref doesn't match a
variant directly. A heuristic substring sniff covers refs the
catalog hasn't been updated for (vl/llava → vision; r1/think →
reasoning; coder → coding; instruct/-it/chat → tools); see the
sketch after this list.
- `LoadedModelInfo.to_dict()` now resolves capabilities lazily on
every snapshot. Lazy resolution avoids a migration on the dataclass
and the dict shape is stable across runtimes.
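An illustrative rendering of that heuristic fallback; the flag names mirror the commit text, and the shipped resolver consults the curated MODEL_FAMILIES catalog before ever reaching this path:

```python
# Illustrative ref-name heuristic; catalog matches take precedence in the
# shipped resolve_capabilities.
def heuristic_capabilities(ref: str) -> dict[str, bool]:
    lowered = (ref or "").lower()
    return {
        "supportsVision": any(tag in lowered for tag in ("vl", "llava")),
        "supportsReasoning": any(tag in lowered for tag in ("r1", "think")),
        "supportsCoding": "coder" in lowered,
        "supportsTools": any(tag in lowered for tag in ("instruct", "-it", "chat")),
    }
```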
Frontend
- `LoadedModel` type gains `capabilities?: ModelCapabilities | null`.
- `ChatHeader` renders capability badges next to the Ready pill when
the active thread's loaded model has resolved capabilities. Each
badge has a hover-title explaining the flag.
- `ChatComposer` reads `loadedModelCapabilities` and conditionally
renders the image-attach button, thinking effort control, and Tools
toggle. When capabilities are absent (unknown model), every
affordance stays visible — the gate never hides UI based on missing
data.
- App.tsx threads `workspace.runtime.loadedModel?.capabilities` down to
ChatTab, which forwards to ChatHeader and ChatComposer.
Tests
- tests/test_capabilities.py — 10 cases: empty fallback, catalog match,
canonical-repo match, three heuristic paths (vision/reasoning/coder),
instruct → tools, dataclass dict round-trip, family-level fallback
for non-variant forks, None inputs.
Verification: tsc --noEmit clean, vitest 245, pytest 843 (+10).
Unblocks Phase 2.12 (mid-thread model swap) — the capability deltas
between models become visible at swap time, so the user can see "this
model loses tools support" before committing.
The composer now exposes a "Send next via..." dropdown that lets the user pick a different warm model for the upcoming turn without changing the thread's default. After the turn finishes the dropdown clears and the next plain message reverts to the session's default model. Useful for quickly testing a theory on a smaller model, then having the larger one carry the conversation back.

Backend
- New `oneTurnOverride: bool = False` field on GenerateRequest. When True, state.generate_stream skips persisting the runtime's loaded model identity (`model`, `modelRef`, `canonicalRepo`, `modelSource`, `modelPath`, `modelBackend`) onto the session. Other fields (cache strategy, context, thinking mode, samplers) still persist so the picked model's runtime profile is reflected on this turn.
- Default False preserves existing behaviour where sending with a different model permanently switches the thread.
- Both call sites in state.py (the agent path and the streaming generate path) honour the flag.

Frontend
- New `MidThreadSwapMenu` component: dropdown of warm models excluding the session default, with a clear-override affordance and an inline "Cancel override" button on the trigger when active. Surfaces only warm models so the swap is instantaneous — cold model picks belong in the existing My Models flow.
- `useChat` owns the override state (`oneTurnOverride: WarmModel | null`) with public setter; clears in onDone after a successful turn so the one-turn semantics holds even if the user forgets to clear manually.
- Stream payload assembly: when the override differs from the session default, payload's modelRef / modelName / backend swap to the override's identity and `oneTurnOverride: true` is set so the backend doesn't persist the swap.
- ChatTab forwards override state + setter through to ChatComposer.
- ChatComposer renders the menu next to the Tools toggle, gated by the same busy-state predicate as other composer affordances.
- App.tsx wires `chat.oneTurnOverride` + `chat.setOneTurnOverride` to ChatTab.

Tests
- tests/test_one_turn_override.py — 6 cases covering default False, explicit True/False, coexistence with model field payload, and direct contract on the persist-guard reading the attribute.

Verification: tsc --noEmit clean, vitest 245, pytest 849 (+6).

Capability badges (from Phase 2.11) update automatically when the override loads — the user sees "this swap loses Vision" or "this swap gains Reasoning" via the existing badge row in ChatHeader.
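A small sketch of the persist guard the tests exercise; the session and loaded-model shapes are simplified stand-ins, the field names follow the commit text:

```python
# Illustrative persist guard for the one-turn override: the loaded-model
# identity only lands on the session when the turn was not an override.
def persist_turn_model_identity(session: dict, request, loaded_model: dict) -> None:
    if getattr(request, "oneTurnOverride", False):
        return  # one-off turn: the thread keeps its default model identity
    for field in ("model", "modelRef", "canonicalRepo", "modelSource", "modelPath", "modelBackend"):
        if field in loaded_model:
            session[field] = loaded_model[field]
```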
Two regressions reported after 0793282 shipped to a real user.

1. Memory gate too aggressive
The Phase 2.0.5-B gate refused chat at 92%+ pressure, but macOS unified memory routinely sits at 90-97% during normal use because the kernel aggressively compresses pages. Models that ran fine on the previous build were being blocked. Pressure ceilings raised: chat 92→98%, image 88→95%, video 85→92%. `available_gb` is now the primary signal — pressure is a backstop for genuine OOM-imminent scenarios. Tests updated.

2. Image attachment silently dropped on MLX path
The MLX worker subprocess never wired vision input through — `request.get("images")` is unreferenced in mlx_worker.py. Pre-existing limitation, surfaced because the user attached an image to a Gemma-4 turn routed via TurboQuant (MLX), and the model hallucinated an answer about a different image entirely.
Fix has two layers:
- `resolve_capabilities` now accepts an `engine` argument and demotes `supportsVision` to False for MLX / TurboQuant routes. The composer's auto-gate (Phase 2.11) then hides the image-attach button on MLX-loaded threads, so the UI can't create a misleading "attached but ignored" state.
- `state.generate_stream` strips `request.images` and logs a loud warning when the active engine is MLX, even if a legacy client somehow bypassed the composer gate. Belt-and-braces.
When the catalog says a model supports vision but the engine demotes the flag, the original "vision" tag still appears in `capabilities.tags` so the badge row can show "vision via llama.cpp" once that path is wired (Phase 2.6 follow-up).

Tests
- tests/test_memory_gate.py: 4 cases updated for new thresholds + 1 new "high-pressure with headroom passes" regression guard.
- tests/test_capabilities.py: 4 new cases — engine demotes vision for MLX / TurboQuant, llama.cpp keeps vision, no-engine kwarg preserves catalog defaults.

Verification: pytest 854, vitest 245, tsc --noEmit clean.

Note on PDF attach (also reported): drag-drop → uploadSessionDocument → chunked + indexed flow inspected end-to-end and the wiring looks intact. Likely a different bug surfacing under the same memory-gate refusal; if it persists after this hotfix, capture backend logs at upload time and we'll trace from there.
User retest with `gemma-3-27b-it-qat-4bit` on the native llama.cpp path showed the model still hallucinating about an unattached image. Tracing surfaced a second pre-existing limitation: `_resolve_gguf_path` in inference.py explicitly excludes mmproj projector files when picking which GGUF to launch llama-server with, so the server never receives `--mmproj` and silently drops image_url parts.

Vision is therefore broken on every runtime today — MLX (no image plumbing in worker subprocess) and llama.cpp (mmproj never loaded). The previous engine-only gate from 72ab7c4 didn't catch the llama.cpp path, so the regression report was correct.

Fix
- New `LoadedModelInfo.visionEnabled: bool = False` field. Stays False on every load until proper mmproj wiring lands (Phase 2.6+ work).
- `resolve_capabilities` now takes a `vision_enabled` kwarg (default False). Even when the catalog says a model supports vision, the typed `supportsVision` flag is False unless the runtime confirms mmproj is actually loaded. Catalog `tags` keep "vision" so the badge row can render "vision via mmproj (not yet wired)" later.
- `state.generate_stream` strip-and-warn check now keys on `visionEnabled` rather than the engine name, so the same protection applies regardless of route.
- `LoadedModelInfo.to_dict` now emits `visionEnabled` so the frontend can read the runtime ground truth.

Tests
- tests/test_capabilities.py: 6 cases updated / added — catalog-match cases now opt in via `vision_enabled=True`, llama.cpp-without-runtime-proof and engine-unset cases now assert False, MLX engine demotion still fires even when vision_enabled=True (belt-and-braces for any future engine bug).

Verification: pytest 855, vitest 245, tsc clean.

Composer auto-gating now hides the paperclip on every loaded model until the mmproj loader lands, so the silent-hallucination class of bug is closed end-to-end. Vision restoration is a separate piece of work (probe for mmproj sibling at GGUF resolve time, pass --mmproj, flip visionEnabled=True at load).
…+ cosine retrieval
Replaces the keyword-only TF-IDF retrieval with semantic embedding
ranking when a llama.cpp embedding GGUF is available. llama-embedding
ships in every llama.cpp build (macOS / Linux / Windows), so the same
runtime serves chat and embeddings — no MLX-only path, no
cross-platform fork.
Backend
- New `backend_service/rag/` module:
* `embedding_client.py` — subprocess wrapper around the llama.cpp
`llama-embedding` CLI. Discovers the binary via env override
(`CHAOSENGINE_LLAMA_EMBEDDING`) or PATH. Discovers the model via
env override (`CHAOSENGINE_EMBEDDING_MODEL`) or
`<dataDir>/embeddings/*.gguf` convention. Passes
`--embd-output-format json --embd-normalize 2 -f /dev/stdin` and
parses the OpenAI-shaped envelope. `parse_embedding_output` is a
pure helper so the parser is unit-testable without subprocess
fixtures.
* `vector_store.py` — append + cosine-similarity search.
No new dep beyond numpy (already part of the chat runtime, but
the inner loop is plain Python so even environments without
numpy work). JSON-round-trippable so DocumentIndex can persist
it alongside its existing TF-IDF state.
- `helpers/documents.DocumentIndex`:
* `add_document(..., embedding_client=...)` — when the client is
supplied, embeds each chunk and appends to a parallel
`_embeddings` VectorStore. Embedding failures fall back silently
so the lexical path always succeeds.
* `search(..., embedding_client=...)` — when the embedding store
is populated and a query embedding succeeds, ranking blends
semantic 70% / BM25 30%. When the client is missing or the query
embed errors out, the search transparently falls back to the
legacy TF-IDF + BM25 60/40 hybrid. Either way the public shape
of the returned dict is identical.
* `remove_document` keeps the dense store in lockstep so chunk
deletion stays consistent.
- `state._retrieve_session_context` — resolves the embedding client
per call (picks up new models without a restart), passes it to
`add_document` + `search`. Existing TF-IDF behaviour is preserved
for users without an embedding model installed; first-class
semantic kicks in the moment one is dropped into
`<dataDir>/embeddings/`.
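A minimal sketch of the `vector_store.py` append + cosine-similarity search described above; the JSON round-trip and per-chunk metadata of the shipped store are omitted, and the inner loop stays plain Python as the commit notes:

```python
# Illustrative append + cosine-similarity store.
import math


class VectorStoreSketch:
    def __init__(self) -> None:
        self._vectors: list[list[float]] = []

    def add(self, vector: list[float]) -> None:
        if self._vectors and len(vector) != len(self._vectors[0]):
            raise ValueError("embedding dimension mismatch")
        self._vectors.append(list(vector))

    def search(self, query: list[float], top_k: int = 5) -> list[tuple[int, float]]:
        """Return (index, cosine similarity) pairs, best first."""
        q_norm = math.sqrt(sum(x * x for x in query))
        if q_norm == 0.0:
            return []  # zero-query handling: nothing meaningful to rank
        scored = []
        for idx, vec in enumerate(self._vectors):
            v_norm = math.sqrt(sum(x * x for x in vec))
            if v_norm == 0.0:
                continue
            dot = sum(a * b for a, b in zip(query, vec))
            scored.append((idx, dot / (q_norm * v_norm)))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]
```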
Tests
- `tests/test_rag_embeddings.py`:
* 9 cases on `parse_embedding_output` covering every realistic
malformed-output path so `EmbeddingClientUnavailable` fires
instead of returning bogus vectors.
* 9 cases on `VectorStore` covering identical / orthogonal /
ranked search, dim mismatch, empty input, remove-indices
lockstep, dict round-trip, and zero-query handling.
* 4 cases on `resolve_embedding_client` covering env override,
data-dir convention, and the no-binary fallback.
Verification: tsc --noEmit clean, vitest 245, pytest 877 (+22 new).
Cross-platform notes
- macOS: ships llama-embedding via Homebrew alongside llama-server,
picked up automatically on PATH.
- Linux: same llama.cpp build path; the binary lives next to
llama-server in any local build directory.
- Windows: same again — the bundled llama.cpp release zip includes
llama-embedding.exe. The env override exists for users with
custom builds.
Embedding model
- This commit deliberately does NOT bundle an embedding GGUF in the
app payload (~30-80 MB depending on model). Users drop one into
`<dataDir>/embeddings/` (e.g. `bge-small-en-v1.5.Q4_K_M.gguf`) and
semantic retrieval lights up automatically. A bundled default is a
separate distribution decision that lives in `scripts/stage-runtime.mjs`.
Deferred for follow-up
- Settings UI affordance to download / pick an embedding model.
- Workspace-scoped indexing (Phase 3.7) — RAG docs shared across
threads in a workspace rather than per-session.
… flag flip
Closes the silent-image-drop limitation that the hotfix v2 commit
gated against. When the user loads a vision-capable GGUF whose
mmproj projector lives alongside the main weights, llama-server now
gets `--mmproj <file>` on startup and the runtime sets
`LoadedModelInfo.visionEnabled = True`. The capability resolver picks
up the flag, the composer's image-attach button reappears, and image
input flows end-to-end through llama-server's native multimodal path.
Backend
- New `_resolve_mmproj_path` helper: scans the main GGUF's parent
directory (and one level up, for snapshot-style HF caches) for
`*mmproj*.gguf` siblings. Picks the largest match — the
full-precision projector outperforms a quantised one when both
are present.
- `LlamaCppEngine._build_command` returns a 4-tuple now —
`(command, runtime_note, fell_back_to_native, mmproj_path)`. When
the binary advertises `--mmproj` support (help-text gate) and a
sibling projector is found, the flag is appended to the command
and the path is propagated up to `load_model`.
- `load_model` flips `LoadedModelInfo.visionEnabled` based on the
resolved mmproj path. Models without a sibling projector load
unchanged with `visionEnabled=False`, preserving the hotfix's
protective behaviour.
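A hedged sketch of the `_resolve_mmproj_path` scan described above; the directory layout assumptions (same dir, one level up for HF snapshot caches) and the largest-file tie-break follow the commit text, while the bounded non-recursive walk mirrors the rglob avoidance noted in a later commit:

```python
# Illustrative mmproj resolver: look for *mmproj*.gguf next to the main GGUF,
# then one level up, and pick the largest match.
from pathlib import Path


def resolve_mmproj_path(main_gguf: Path | None) -> Path | None:
    if main_gguf is None or not main_gguf.exists():
        return None
    search_dirs = [main_gguf.parent]
    if main_gguf.parent.parent != main_gguf.parent:
        search_dirs.append(main_gguf.parent.parent)
    candidates: list[Path] = []
    for directory in search_dirs:
        try:
            candidates.extend(
                p for p in directory.glob("*mmproj*.gguf") if p.is_file() and p != main_gguf
            )
        except OSError:
            continue  # bounded scan; never recurse into unrelated trees
    if not candidates:
        return None
    # Largest projector wins: the full-precision file beats a quantised sibling.
    return max(candidates, key=lambda p: p.stat().st_size)
```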
Tests
- tests/test_mmproj_vision.py — 10 new cases:
* 7 cover the resolver: None / nonexistent inputs return None,
same-dir match wins, descriptive filenames match by substring,
largest projector wins on ties, sibling-directory walker fires
one level up.
* 3 cover the capability flip: vision_enabled=False keeps the flag
demoted, vision_enabled=True promotes when the catalog has the
"vision" tag, MLX engine demotes regardless (mlx-vlm not yet
wired).
- tests/test_inference.py — every existing `_build_command` mock
updated to match the new 4-tuple signature (4 fixtures across 8
test sites).
Verification: tsc --noEmit clean, vitest 245, pytest 887 (+10 new).
User experience
- Drop a vision-capable GGUF (e.g. gemma-3-27b-it-qat-4bit with its
matching mmproj into the same folder) and load via My Models.
The Vision badge in ChatHeader turns green, the paperclip
reappears in the composer, and attached images now reach the
model. No regression for text-only models — they continue to
load with `visionEnabled=False` and the gate stays in force.
Deferred
- mlx-vlm wiring for MLX-routed vision (separate effort; needs the
vision projector loaded on the worker subprocess side).
- Auto-download of the matching mmproj when a user loads a vision
model whose projector isn't local yet.
Adds first-class Model Context Protocol support so the chat agent
loop can dispatch tools provided by external MCP servers alongside
the in-tree built-ins. Stdio transport only for first ship; SSE /
WebSocket transports are forward-compatible extensions.
Backend
- New `backend_service/mcp/` package:
* `client.McpClient` — JSON-RPC 2.0 over a subprocess pipe.
Supports the bare-minimum slice of MCP needed for tool work
(`initialize` / `notifications/initialized` / `tools/list` /
`tools/call`); resources, prompts, sampling, and roots are
accepted but not surfaced. Stdout drained in a worker thread
so reads never block the calling thread on a busy server.
Tolerates non-JSON log lines servers occasionally emit on
stdout. Configurable per-RPC timeout (default 30 s) plus a
longer initialize timeout (default 15 s).
* `client.McpServerConfig` — frozen dataclass mirroring the
standard mcp-clients config blob (`id`, `command`, `args`,
`env`, `enabled`). Round-trips through dict for settings
persistence with strict validation.
* `client._parse_json_rpc_line` and `_flatten_tool_result` are
pure helpers exported for unit testing — the round-trip
fixture also covers them via subprocess. A sketch of both
helpers appears below, just before the Tests section.
* `tool_adapter.McpTool` wraps one remote tool as a `BaseTool`
so the existing agent loop dispatches it without changes.
Tool names are munged through `_safe_name` (`mcp__<server>__<tool>`)
to satisfy OpenAI function-calling identifier rules. Errors
from `client.call_tool` are converted to text so the agent
loop's existing tool-result path handles them — no exception
surface change.
* `loader.load_mcp_tools(configs, log=...)` is the high-level
entry point. Spawns each enabled server, runs the handshake,
enumerates tools, and returns `(list[McpTool], list[McpClient])`
for the caller to register and own respectively. A misbehaving
server is isolated — its client is closed and skipped, the
log callback fires, the rest proceed normally.
- `tools.BaseTool` gains a `provenance` property defaulting to
`"builtin"`. `McpTool` overrides it to `"mcp:<server-id>"`.
- `tools.ToolRegistry.replace_mcp_tools(tools)` replaces only the
MCP-sourced registrations, leaving built-ins untouched. Called
whenever the user updates `mcpServers` in settings or the app
starts up.
- `models.UpdateSettingsRequest` gains an `mcpServers` field
(`list[McpServerConfigRequest]`) so the existing settings
patch route persists configs without new endpoints. Each entry
carries `id`, `command`, `args`, `env`, `enabled`.
- `/api/tools` route now emits a `provenance` field per tool so
the upcoming UI badge can render Built-in vs MCP source.
Stability fix bundled
- `_resolve_mmproj_path` now uses bounded directory iteration
instead of `Path.rglob`. macOS test rigs exposed a case where
the GGUF's grandparent dir was a system-cache root; rglob
raised `OSError: Result too large` mid-scandir and broke the
full pytest run. Bounded scan covers the same HF snapshot
layouts (parent dir + immediate sibling dirs of grandparent)
without recursing into unrelated trees.
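The sketch referenced above; the exact return conventions of the shipped `_parse_json_rpc_line` / `_flatten_tool_result` are assumptions, while the intent (tolerate stdout log noise, flatten text parts, prefix errors) follows the commit text:

```python
# Hedged sketch of the two pure MCP parsing helpers.
import json
from typing import Any


def parse_json_rpc_line(line: str) -> dict[str, Any] | None:
    """Return a JSON-RPC 2.0 object, or None for log noise a server printed to stdout."""
    line = line.strip()
    if not line:
        return None
    try:
        parsed = json.loads(line)
    except json.JSONDecodeError:
        return None  # tolerate non-JSON log lines
    if not isinstance(parsed, dict) or parsed.get("jsonrpc") != "2.0":
        return None
    return parsed


def flatten_tool_result(result: Any) -> str:
    """Concatenate text content parts; prefix when the server flags isError."""
    if not isinstance(result, dict):
        return str(result)
    parts = [
        item.get("text", "")
        for item in result.get("content", [])
        if isinstance(item, dict) and item.get("type") == "text"
    ]
    text = "\n".join(p for p in parts if p)
    return f"[tool error] {text}" if result.get("isError") else text
```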
Tests
- tests/test_mcp_client.py — 30 cases covering:
* 6 cases on `_parse_json_rpc_line` (valid response, empty,
log lines, bad JSON, non-JSON-RPC objects, arrays).
* 5 cases on `_flatten_tool_result` (text concat, isError
prefix, mixed content, empty list, non-dict input).
* 6 cases on `McpServerConfig.from_dict` (round-trip + every
rejection path).
* 3 cases on `_safe_name` (basic format, sanitisation, empty
placeholders).
* 4 cases on `McpTool` (proxy to client, error → text,
provenance tag format, fallback description).
* 3 cases on `McpClient` round-trip via a Python `-c`
fake-server fixture (initialize → list → call,
pre-init-list raises, unknown command raises).
* 3 cases on `load_mcp_tools` (healthy server, disabled
server, isolation across one bad + one good server).
Verification: tsc --noEmit clean, vitest 245, pytest 917 (+30 new).
Deferred follow-ups
- Settings UI for managing mcpServers (drag-drop config import,
per-server enable toggle, status pill). Backend payload field
ready; routes already accept; just needs frontend.
- Auto-spawn at app startup. The infrastructure is in place
(`loader.load_mcp_tools`); plugging into `state.startup` is a
small follow-up that needs a settings-load + lifecycle decision
on hot-reload behaviour.
- SSE / WebSocket transports for hosted MCP servers. Stdio
covers every local server published today.
…down / image
ToolCallCard previously dumped every result as a collapsible JSON
block. Tools now opt in to a typed output protocol so the UI renders
web search hits as a clickable table, file reads as syntax-
highlighted code, and MCP image responses inline. Tools that haven't
migrated keep the JSON fallback unchanged.
Backend
- New `StructuredToolOutput` dataclass + `BaseTool.execute_structured`
optional method returning `(text, render_as, data)`. Default impl
returns None — legacy `execute(...) -> str` path stays active for
every tool that hasn't opted in.
- `agent._execute_tool_call` calls `execute_structured` first; on
None falls back to `execute`. The structured payload is captured
on `ToolCallResult.render_as` + `data` and propagated through
both the streaming `tool_call_result` event and the final
`metrics["toolCalls"]` payload.
- Built-in tool migrations:
* `WebSearchTool` returns a `table` with columns `["#", "Title",
"URL", "Snippet"]` and rows derived from the same DDG results
the legacy text summary uses. Empty queries / no results /
network failures render as `markdown` so the user sees an
actionable error.
* `FileReaderTool` renders `.md` / `.markdown` / `.rst` files as
rendered markdown and every other supported extension as
syntax-highlighted code with the file's extension as the
language hint. Errors render as markdown.
* `CalculatorTool` renders the `expr = result` line as a code
block (text language) so it sits in monospace alongside other
code outputs.
* `CodeExecutorTool` renders the captured stdout/stderr as code
plus carries the source code separately under
`data.sourceCode` for a future "show what was executed" UI.
- New `McpClient.call_tool_raw` — returns the unflattened MCP
`tools/call` envelope so adapters can inspect content parts.
`McpTool.execute_structured` now uses it to render single-image
MCP responses inline (`renderAs: "image"` with a base64 data
URI) and multi-part responses as markdown.
- Backend payload through to the SSE stream gains `renderAs` +
`data` fields per tool call; legacy clients ignore them.
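A hedged sketch of the opt-in protocol described in the Backend list above; the dataclass field names follow the commit text, and the base class shown is a stand-in for the real `BaseTool`:

```python
# Illustrative structured-output protocol: migrated tools return a typed
# payload, everything else keeps the legacy text path.
from dataclasses import dataclass, field
from typing import Any


@dataclass
class StructuredToolOutput:
    text: str                      # legacy text summary, always present
    render_as: str = "json"        # "table" | "code" | "markdown" | "image" | "json"
    data: dict[str, Any] = field(default_factory=dict)


class BaseToolSketch:
    def execute(self, **kwargs: Any) -> str:
        raise NotImplementedError

    def execute_structured(self, **kwargs: Any) -> StructuredToolOutput | None:
        # Default None keeps non-migrated tools on the legacy text path.
        return None


def execute_tool_call(tool: BaseToolSketch, **kwargs: Any) -> StructuredToolOutput:
    structured = tool.execute_structured(**kwargs)
    if structured is not None:
        return structured
    # Fall back to plain text rendered as the legacy collapsible JSON block.
    return StructuredToolOutput(text=tool.execute(**kwargs), render_as="json")
```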
Frontend
- `ToolCallInfo` type gains `renderAs?: ToolRenderAs | null` and
`data?: Record<string, unknown> | null`. Old payloads (no
`renderAs`) keep working — the renderer falls back to the
plain-text pre block.
- `ToolCallCard` switches on `renderAs`:
* `table` → HTML `<table>` with header row + URL columns
rendered as clickable links.
* `code` → existing `CodeBlock` (Phase 1 syntax highlighter)
with the language hint from `data.language`.
* `markdown` → `RichMarkdown` for nicely-rendered prose / errors.
* `image` → inline `<img>` with `src` from `data.src`.
* `json` (default / fallback) → legacy collapsible pre block.
- New CSS for `.tool-output-table` (with title + clickable links),
`.tool-output-markdown`, `.tool-output-image`.
Tests
- tests/test_structured_tool_output.py — 11 cases:
* Calculator structured `code` render + markdown error path +
legacy `execute` text unchanged.
* FileReader Python → code with language=py, markdown → markdown,
unknown extension → code with language=ext, error → markdown.
* WebSearch table with 4-column header + 2-row body, empty
query → markdown error, no results → markdown message.
* Default `BaseTool.execute_structured` returns None so non-
migrated tools take the legacy text path.
Verification: tsc --noEmit clean, vitest 245, pytest 928 (+11 new).
Pairs naturally with Phase 2.10 — MCP servers that return image
content parts now render inline rather than getting stringified to
JSON. Future tools can declare `renderAs: "chart"` to plug into a
plotting helper without disturbing the dispatch logic.
Adds Msty-style fork-from-here. Each assistant message now carries a
fork action; clicking it deep-copies the thread up to that point
into a new session and lands the user there for divergent
continuation. Parent linkage is preserved on the fork so the sidebar
can show a relationship hint and future merge / diff features have
the tie.
Backend
- New `ChaosEngineState.fork_session(source_id, fork_at, title?)`:
* Looks up the source session under the same lock that owns the
sessions list, raises `ValueError` for unknown id or out-of-
range index.
* Deep-copies messages [0..forkAtMessageIndex] so mutating the
fork's messages can never bleed into the parent.
* Carries the source's runtime profile (model, cache strategy,
cache bits, fp16 layers, fused attention, fit-in-memory,
context tokens, speculative decoding, dflash draft model,
tree budget, thinking mode) so the fork resumes on the same
config the parent was using.
* Tags `parentSessionId` + `forkedAtMessageIndex` for sidebar
rendering and downstream features.
* Inserts at the top of the sessions list so the user sees the
fork immediately.
* Persists via the existing `_persist_sessions` path.
- New `ForkSessionRequest` Pydantic model
(`forkAtMessageIndex >= 0`, optional `title <= 200 chars`).
- New route `POST /api/chat/sessions/{session_id}/fork` returning
the same shape as `create_session`.
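An illustrative core of the fork operation described above; session fields and locking are simplified relative to the shipped `ChaosEngineState`, but the deep-copy slice, parent linkage, and top-of-list insert follow the commit text:

```python
# Illustrative fork: copy the message prefix, tag the parent, insert on top.
import copy
import uuid


def fork_session(sessions: list[dict], source_id: str, fork_at: int, title: str | None = None) -> dict:
    source = next((s for s in sessions if s["id"] == source_id), None)
    if source is None:
        raise ValueError(f"unknown session id: {source_id}")
    if not 0 <= fork_at < len(source["messages"]):
        raise ValueError(f"forkAtMessageIndex out of range: {fork_at}")
    fork = {
        "id": uuid.uuid4().hex,
        "title": title or f"{source['title']} (fork)",
        # Deep copy so edits to the fork can never bleed into the parent.
        "messages": copy.deepcopy(source["messages"][: fork_at + 1]),
        "parentSessionId": source_id,
        "forkedAtMessageIndex": fork_at,
    }
    sessions.insert(0, fork)  # fork appears at the top of the sidebar
    return fork
```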
Frontend
- `ChatSession` type gains `parentSessionId?` + `forkedAtMessageIndex?`.
- `api.ts` adds `forkChatSession(sourceSessionId, forkAtMessageIndex, title?)`.
- `useChat.handleForkAtMessage(index)`:
* Calls the API, upserts the new session into workspace state,
swaps `activeChatId` to the fork, and sets the title draft.
* Errors surface via the standard chat error path so the user
sees a clear message if the backend rejects.
- `ChatThread` adds a fork-icon button next to retry on assistant
message hover actions. Branch-shaped SVG icon, monochrome.
- `ChatTab` + `App.tsx` thread the new prop down (`onForkAtMessage`).
- `ChatSidebar` renders a `⑂ fork` purple badge on sessions that
carry `parentSessionId`, with a hover tooltip showing the
forked-at message index. CSS lands in `styles.css`.
Tests
- tests/test_fork_session.py — 10 cases:
* Messages copied up to (and including) the chosen index.
* Parent linkage (`parentSessionId` + `forkedAtMessageIndex`)
preserved.
* Runtime profile (model, modelRef, thinkingMode) carries.
* Default title combines source title + " (fork)"; explicit
title overrides cleanly.
* Fork inserts at the top of the session list.
* Deep-copy isolation: mutating fork messages doesn't touch
the parent.
* Unknown source id raises `ValueError`.
* Out-of-range index (positive + negative) raises.
Verification: tsc --noEmit clean, vitest 245, pytest 938 (+10 new).
Pairs naturally with Phase 2.5 (in-thread multi-model compare):
forking lets users branch the same prompt to two different models
and continue both threads in parallel. The compare view can then
show side-by-side rendering keyed off `parentSessionId`.
Adds an instant compare affordance: pick another warm model from the
assistant message's action bar, get a sibling response for the same
prompt rendered as a card under the original answer. The override
model must already be loaded; we never auto-reload to avoid surprises.
Backend
- state.add_message_variant: re-runs the user prompt at index-1 against
the currently-loaded model, attaches to messages[index].variants
- POST /api/chat/sessions/{id}/variants route + AddVariantRequest
- Tests cover happy path + index/role/runtime guards (8 cases)
Frontend
- ChatMessageVariant type + ChatMessage.variants
- addMessageVariant in api.ts + useChat.handleAddVariant exported
- VariantPickerButton (warm-model dropdown, current ref excluded)
- VariantCard inside ChatThread renders model name, tok/s, response
time, optional reasoning panel, markdown body
- Props wired through ChatTab + App.tsx
- CSS for picker popover and variant stack
…ckers

Capabilities are already resolved server-side for the loaded model; this surfaces the same flags before-load so users can tell which options support vision / tools / reasoning / code etc. at a glance. Frontend resolver mirrors the backend one-to-one: catalog tags win, ref-name heuristics fill in for non-catalog entries. No backend change needed — catalog tags ship on featuredModels already.

Changes
- utils/capabilities.ts: resolveCapabilities + emptyCapabilities
- ChatModelOption.capabilities populated from matched catalog variant
- ModelLaunchModal renders capability badges on selected card + every list option
- VariantPickerButton renders Vision / Tools / Reasoning / Code hints next to each warm model
- 7 unit tests cover catalog precedence, heuristic fallback, case normalisation, unknown-tag preservation, empty input
The sampler chain (top_p / top_k / min_p / repeat_penalty / seed / mirostat) shipped in earlier work; the backend already accepts a `jsonSchema` field that llama-server enforces via `response_format: json_schema`. This commit lights it up in the UI.

- SamplerOverrides.jsonSchemaText: raw textarea content, persisted per session so mid-type drafts survive remounts
- SamplerPanel renders a JSON-schema textarea with live parse validation (red error / muted ok hint)
- samplerPayload + useChat readSamplerPayload parse the schema text at send-time; malformed input drops out silently rather than blocking the request
- 7 new round-trip tests (parse / drop array / empty handling / unparseable text preserved across writes)

llama.cpp applies the schema; mlx-lm ignores it (out of scope for the worker subprocess). DRY / XTC / GBNF stay deferred per existing comment in models/__init__.py.
Templates can now declare variables and seed presets. The Use in
Chat button on a variable-bearing template opens a fill-form that
substitutes {{name}} placeholders before the prompt reaches the
composer. Preset model ref + preset samplers persist alongside the
template and are surfaced as badges in the detail view (composer
auto-apply lands in a follow-up).
Backend
- helpers/prompts.py: variables / presetSamplers / presetModelRef
fields on create + update; _normalise_variables drops malformed
entries and dedupes by name
- extract_placeholders + apply_variables for {{name}} substitution
with bool / number / None coercion and unknown-name preservation
- PromptTemplateRequest extended; existing CRUD routes accept the
new fields without breaking older clients
- 9 new tests: extraction order, substitution coercion, missing
names preserved, preset persistence, update preserves untouched
preset fields, malformed variable entries dropped
Frontend
- PromptVariable type + PromptTemplate.variables / presetModelRef /
presetSamplers
- Editor: variables (JSON array), preset model ref, preset
samplers (JSON object) with placeholders
- Detail view shows preset model + variable count badges
- Fill form renders typed inputs (textarea / number / checkbox),
live preview of resolved prompt, Apply to chat hands the
substituted text to the composer
- applyVariables mirror of backend helper (bool / null / unknown
semantics identical)
The /v1/chat/completions stub auto-loaded a model and accepted only temperature + max_tokens; external scripts couldn't tune sampling. This commit lights up the standard OpenAI sampler fields end-to-end and adds /v1/embeddings via the bundled Phase 2.6 GGUF model.

Backend
- OpenAIChatCompletionRequest: top_p, top_k (extension), frequency_penalty, presence_penalty, seed, stop, response_format
- _LLAMA_SAMPLER_KEYS extended with frequency_penalty / presence_penalty / stop so _apply_sampler_kwargs forwards them on the llama path
- state.openai_chat_completion builds a samplers dict + extracts json_schema from response_format.json_schema.schema; passes both to runtime.generate / stream_generate
- New OpenAIEmbeddingsRequest + state.openai_embeddings:
  - Routes through resolve_embedding_client (Phase 2.6)
  - Returns 503 with actionable detail when no model is wired
  - Honours `dimensions` parameter for truncation
- POST /v1/embeddings registered alongside existing /v1/* routes

Tests (3 new — 958 passing total)
- Sampler fields reach the runtime via last_generate_kwargs
- Empty sampler set → samplers=None, json_schema=None
- /v1/embeddings 503s cleanly with no embedding client wired
The plan's catalog browser entry asked for size + arch + VRAM-fit
hints in a built-in HF browser. The HF search backend already
exists at /api/models/search; this commit lights up the per-variant
fit-vs-available-memory hint so users know whether a model will
load before clicking Download.
Three buckets:
- Fits (estimate ≤ 70% available — comfortable, green)
- Tight (estimate ≤ 100% available — yellow, may need to free RAM)
- Too big (estimate > available — red, suggest a smaller quant)
The hint is optimistic by design: TurboQuant / ChaosEngine cache
compression can reclaim ~50% of the listed estimate, so "Tight" is
still a usable signal rather than a hard block. The detailed tooltip
spells out the exact numbers and remediation.
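The bucket arithmetic rendered in Python for illustration; the shipped memoryFitBucket helper lives in OnlineModelsTab.tsx and returns the same three labels used by the CSS classes below:

```python
# Illustrative fit-bucket classification; thresholds follow the commit text.
def memory_fit_bucket(estimate_gb: float | None, available_gb: float | None) -> str | None:
    if estimate_gb is None or available_gb is None or available_gb <= 0:
        return None  # no hint when either number is unknown
    if estimate_gb <= 0.70 * available_gb:
        return "comfortable"   # Fits: green
    if estimate_gb <= available_gb:
        return "tight"         # Tight: yellow, may need to free RAM
    return "over"              # Too big: red, suggest a smaller quant
```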
Changes
- OnlineModelsTab: memoryFitBucket helper exported for testing;
per-row badge inside the existing memory cell
- App.tsx threads workspace.system.availableMemoryGb through
- styles.css: memory-fit-badge--{comfortable,tight,over}
- 7 unit tests cover bucket boundaries + null-safety
…h gap

User-reported regressions:
1. First reasoning paragraph appeared visually separated from the rest — reasoning models tend to emit "First thought.\n\nMore..." which the markdown renderer turns into two paragraphs with a tall margin between them.
2. Wanted a collapsible streaming view that shows only 1-2 lines of the running thought rather than the whole panel auto-opening.

Changes
- ReasoningPanel defaults to collapsed during streaming; the user can expand explicitly. The expand decision sticks until streaming ends.
- Multi-line preview when collapsed mid-stream: last 2 non-empty lines joined with " · ", clamped to 2 visual lines via CSS.
- tidyReasoningForDisplay strips leading whitespace and collapses the *first* `\n\n` to a single newline so the first thought sits flush against subsequent content. Mid-stream paragraph breaks preserved.
- CSS tightens .reasoning-panel__content paragraph margins from the default ~16px to 6px, making the trace read as one continuous stream without losing structure.
- Chevron tints accent-strong while streaming so users notice the panel is interactive.

10 new unit tests for tidyReasoningForDisplay + lastLines covering boundary conditions: empty input, leading whitespace, first-gap collapse, mid-stream gap preservation, single-line passthrough.
Surfaces the substrate decisions the runtime made for each assistant turn — engine, cache strategy, DDTree budget, accepted-token rate, runtime warnings — as a strip of inline chips above the existing collapsible Model Details fold-out. Operators can now tell at a glance whether a turn went MLX vs llama.cpp, ChaosEngine vs TurboQuant, and how aggressively speculative decoding ran.

The data already lands on every assistant message via inference.py and mlx_worker.py; this commit just renders it. No backend change.

Changes
- SubstrateRoutingBadge component: builds chips from GenerationMetrics with separate keys for engine / cache / spec / acceptance / warn
- ChatThread renders the badge above the metrics <details> for any assistant message that has metrics
- styles.css: substrate-chip + tone variants (default / accent / warn)
- 9 unit tests cover empty input, engine fallback to backend, cache label synthesis, DDTree on/off, acceptance rate gating, runtime note truncation
Signature differentiator: lets operators flip cache compression strategy (TurboQuant / ChaosEngine / Native) and bit width per turn without touching launch settings. Backend already accepts the fields on every GenerateRequest and reloads the runtime transparently when the requested strategy / bits don't match what's loaded — no engine-side change needed.

Frontend
- KvStrategyChip: composer popover listing all advertised cache strategies with bit-range buttons. Active strategy highlighted; unavailable strategies render greyed with a tooltip explaining the gap.
- kvStrategyOverride helper: read / write per-session blob to localStorage, mirrored from samplerOverrides shape.
- ChatTab owns the override state with cross-session persistence; ChatComposer renders the chip alongside SamplerPanel + temp.
- useChat reads the override at send-time; falls through to the active runtime profile when no override is set.
- App.tsx threads workspace.system.availableCacheStrategies through.
- styles.css: kv-chip + popover variants.

8 unit tests cover round-trip, malformed-input handling, null clearing, per-session scoping.
Adds a structured inspection helper that runs at prompt-render time and detects known chat-template quirks:
- Gemma family (Gemma-1 → Gemma-4) rejects the system role entirely; the helper flags this and the fold-system-into-first-user fix is now applied automatically by mlx_worker before apply_chat_template fires
- ChatML templates that omit add_generation_prompt handling get surfaced as a runtime warning (the template renders truncated prompts, and the model continues the user turn instead of replying)
- Templates that hard-code an assistant prefix while also branching on add_generation_prompt get flagged for double-prefix

The report's `to_runtime_note()` returns a single line that threads through the existing runtime_note channel and shows up on the Phase 3.4 substrate badge, so users see "auto-fixed: Gemma family — fold system into first user" without poking around.

Tests
- 15 unit tests cover Gemma family detection, fold idempotency, preservation of conversation order across the fold, missing / empty templates, ChatML detection, runtime-note formatting

mlx_worker._build_prompt_text now takes an optional model_ref so the inspection runs only when we know which family we're rendering for. The llama.cpp side is opaque (the template is parsed inside llama-server), so detection there is a follow-up.
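A hedged sketch of the fold-system-into-first-user fix applied before apply_chat_template on Gemma-family models; the message shapes are the usual {"role", "content"} dicts and the exact merge formatting is an assumption:

```python
# Illustrative fold: merge a leading system turn into the first user turn so
# templates that reject the system role still see the instructions.
def fold_system_into_first_user(messages: list[dict]) -> list[dict]:
    """Return a copy with the leading system turn merged into the first user turn."""
    if not messages or messages[0].get("role") != "system":
        return list(messages)  # idempotent: nothing to fold on a second pass
    system_text = messages[0].get("content", "")
    folded = [dict(m) for m in messages[1:]]
    for msg in folded:
        if msg.get("role") == "user":
            if system_text:
                msg["content"] = f"{system_text}\n\n{msg['content']}"
            break
    else:
        # No user turn yet: carry the system text as the opening user turn instead.
        folded.insert(0, {"role": "user", "content": system_text})
    return folded
```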
Captures CPU %, GPU %, available RAM, and thermal state at each turn's stream finalisation. Renders below the substrate routing badge as a compact perf-chip strip with tone variants (warn for high CPU / low RAM, alert for tok/s under 1 or thermal critical).

Backend
- helpers/perf.py: snapshot_perf_telemetry() returns a typed PerfTelemetry blob, all fields optional. CPU + memory via psutil, thermal via existing pmset reader (Phase 2.0.5-I), GPU via the dashboard's _detect_gpu_utilization
- _stream_assistant_metrics_payload attaches `perfTelemetry` when any field samples non-null; samplers fail silently so a sampler bug never blocks turn finalisation
- 6 unit tests cover the dataclass shape + psutil/thermal failure fallthrough

Frontend
- GenerationMetrics.perfTelemetry typed
- ChatPerfStrip component renders chips: tok/s, CPU, GPU, free RAM, thermal — each with tone classification (default / warn / alert) so users glance at colour for hot spots
- ChatThread renders the strip below the substrate badge for any assistant message that has metrics
- styles.css: perf-chip + tone variants
- 10 unit tests cover chip composition + tone thresholds + null handling

macOS gets the full set today (thermal works); Windows / Linux fall through to None on thermal until per-OS samplers land.
Replaces the deterministic per-family template-suffix enhancer with a
small MLX-native instruction model that auto-rewrites short prompts
into the structured format each image / video DiT was trained on.
Backend (backend_service/helpers/prompt_enhancer.py):
- _EnhancerSingleton caches the loaded mlx_lm model + tokenizer in a
process-level RLock-guarded singleton so the first call pays the
~3s load cost and subsequent calls reuse it.
- Default model: mlx-community/Qwen2.5-0.5B-Instruct-4bit (~700 MB).
Already cached on most dev boxes (FU-002 spike used it). Override
via the ``modelId`` field on POST /api/prompt/enhance.
- Per-family system prompts anchor the rewrite to the DiT's
training distribution: wan / ltx / hunyuan / flux / sdxl / sd3 /
default. ``family_for(repo)`` matches longest-prefix-wins (sketched below).
- Failure modes (non-Apple platform, mlx_lm missing, model not
cached, generation crash, shorter-than-input rewrite) all return
the original prompt + a runtimeNote so the user sees why. Caller
falls back to the legacy template suffix.
- Strips surrounding quotes from the model output (some 0.5B chat
models wrap responses in quotation marks).
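Illustrative sketch of the longest-prefix-wins lookup behind family_for(repo); the prefix table here is a stand-in, not the exact mapping in prompt_enhancer.py.

```python
# Hypothetical prefix table -- the real module keys these off the catalog repos.
_FAMILY_PREFIXES = {
    "Wan-AI/": "wan",
    "Lightricks/LTX": "ltx",
    "black-forest-labs/FLUX": "flux",
    "stabilityai/stable-diffusion-xl": "sdxl",
}

def family_for(repo: str) -> str:
    best_prefix, family = "", "default"
    for prefix, name in _FAMILY_PREFIXES.items():
        if repo.startswith(prefix) and len(prefix) > len(best_prefix):
            best_prefix, family = prefix, name
    return family
```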
Endpoint (backend_service/routes/prompts.py):
- POST /api/prompt/enhance with body {prompt, repo, modelId?,
maxTokens?} → {enhanced, note, modelUsed, family}.
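Minimal client-side call shape for the endpoint, using the request / response fields listed above; the localhost port is an assumption.

```python
import requests

resp = requests.post(
    "http://127.0.0.1:8000/api/prompt/enhance",  # port is an assumption
    json={"prompt": "a fluffy cat on a windowsill",
          "repo": "black-forest-labs/FLUX.1-dev"},
    timeout=60,
)
payload = resp.json()
print(payload["enhanced"], payload["family"], payload.get("note"))
```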
Frontend (src/components/PromptEnhanceButton.tsx + Studio tabs):
- "Enhance" pill button next to the Prompt label in both
ImageStudioTab + VideoStudioTab. Click → POST /api/prompt/enhance
with the current prompt + the selected variant's repo. On success
replaces the textarea via the parent setter; on fallback (no
rewrite) surfaces the runtimeNote as a button tooltip.
- Disabled when no prompt typed or no model selected.
Live smoke 2026-05-04 against cached Qwen2.5-0.5B-4bit:
- 6-word "a fluffy cat on a windowsill" prompt:
FLUX.1-dev: 16-word cinematic rewrite (3.2s cold load)
LTX-Video: 8-word with "tracked shot" (0.11s warm)
Wan2.2: 13-word with "soft lighting" (0.12s warm)
SDXL-Turbo: 171-word verbose rewrite (0.78s warm) — model
doesn't always honour the "<50 words" instruction
but output is still usable; user can edit
- Empty-prompt input → empty output + no note (graceful no-op)
- Singleton hits warm cache after first load (verified)
Tests: 16 unit tests (family mapping × 7 + enhance happy path +
load-failure fallback + crash fallback + shorter-than-input reject +
quote stripping + dataclass frozen). 1252 pytest pass / 1 skip,
zero regressions. tsc clean, 331 vitest pass.
The original WanInstallPanel.tsx that listed every supported Wan repo
in Discover was removed when the catalog tabs were rolled back to v0.7.2,
which orphaned the FU-025 backend endpoints (POST /api/setup/install-mlx-video-wan,
GET /api/setup/install-mlx-video-wan/status, GET /api/setup/mlx-video-wan/inventory)
plus the api.ts client funcs (startWanInstall / getWanInstallStatus /
getWanInventory).
This restores the install surface as a contextual single-repo panel
inside VideoStudioTab — when the user picks a Wan-AI variant on Apple
Silicon, the panel checks if the converted MLX dir for THAT specific
repo is on disk and either shows a "Ready" chip or an "Install" button.
Self-contained component owns its own polling so VideoStudioTab's state
hook stays clean.
Files:
- new src/components/WanRuntimeInstaller.tsx — fetches inventory on
mount, scoped to a single repo, polls /status at 1.5 Hz while a
job is running, mirrors the LongLive install pattern. Renders a
minimal log line (phase + percent + message) inline rather than
pulling in the full InstallLogPanel — the LongLive variant doesn't
accept WanInstallJobState in its union.
- src/features/video/VideoStudioTab.tsx — renders the installer
when ``isWanRepo && isAppleSiliconHost && !mlxVideoMissing``.
Gating order: install mlx-video pip package first (existing
flow), THEN convert the Wan checkpoint.
- src/styles.css — terminal-style log panel, "Ready" chip, install
button styling matched to the surrounding Studio actions.
tsc clean, 331 vitest pass. Backend endpoints unchanged. The
WanRuntimeInstaller fires the same convert pipeline that the live
2026-05-04 smoke validated end-to-end (FU-009 close-out commit
bcf88de).
… v0.1.5.1)
Adapts ddtree.py to the new target_ops adapter pattern dflash-mlx
introduced in v0.1.5. The old runtime top-level primitives map to the new adapter methods as follows:
target_forward_with_hidden_states -> target_ops.forward_with_hidden_capture
extract_context_feature_from_dict -> target_ops.extract_context_feature
make_target_cache -> target_ops.make_cache
_target_embed_tokens -> target_ops.embed_tokens
_target_text_model -> target_ops.text_model
_lm_head_logits -> target_ops.logits_from_hidden
ContextOnlyDraftKVCache moved off ``dflash_mlx.runtime`` onto
``dflash_mlx.model``. ``create_attention_mask`` re-imported from
``mlx_lm.models.base`` (dflash dropped the runtime re-export).
``trim_cache_to`` was removed entirely — replaced with a local
``_trim_cache_to`` shim in ddtree.py that calls each cache entry's
own ``.rollback()`` / ``.trim()`` / ``.crop()`` based on what the
type exposes.
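Hedged sketch of the local _trim_cache_to shim: dispatch on whichever trim-style method each cache entry exposes. Treating the argument as a target length is an assumption for illustration -- the real shim maps it to whatever argument convention each cache type expects.

```python
def _trim_cache_to(cache_entries, target_len: int) -> None:
    for entry in cache_entries:
        for method_name in ("rollback", "trim", "crop"):
            method = getattr(entry, method_name, None)
            if callable(method):
                method(target_len)
                break  # first matching method wins; entries with none are left untouched
```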
Adapter resolved once at the top of ``generate_ddtree_mlx`` via
``resolve_target_ops(target_model)`` so we don't pay repeated
backend lookups in the decode loop.
Live smoke against ``mlx-community/Qwen2.5-0.5B-Instruct-4bit``:
- target_ops backend = qwen_gdn, family = pure_attention
- forward+capture, embed_tokens, text_model, logits_from_hidden,
extract_context_feature, _trim_cache_to all working
Tests: 1252 pass / 1 skip, zero regressions vs the 0.1.4.1 baseline.
Gains over 0.1.4.1:
* draft model quantization with Metal MMA kernels
* branchless Metal kernels + fused draft KV projections
* long-context runtime diagnostics
The Apple Silicon dev box can't exercise these live; the wiring is in place
so a Windows / Linux CUDA pull can validate them end-to-end.
FU-023 Nunchaku / SVDQuant 4-bit weight quant on CUDA:
- _try_load_nunchaku_transformer helper in image_runtime.py preferred
over NF4 / int8wo when device == "cuda" + nunchakuRepo pinned +
nunchaku importable. Falls back cleanly otherwise.
- _nunchaku_transformer_class_for_repo registry maps base repo to
NunchakuFluxTransformer2dModel / NunchakuQwenImageTransformer2DModel
/ NunchakuSD3Transformer2DModel / NunchakuSanaTransformer2DModel /
NunchakuPixArtSigmaTransformer2DModel (sketched after this list).
- ImageGenerationConfig / Request: nunchakuRepo, nunchakuFile fields.
VideoGenerationConfig / Request: same fields parked for upstream
Wan / HunyuanVideo / LTX wrappers (FLUX + Qwen-Image only in v1.2.1).
- Catalog rows: FLUX.1 Dev × svdq-int4-flux.1-dev, FLUX.1 Schnell ×
svdq-int4-flux.1-schnell. ~3× over NF4 on a 4090 with quality near
bf16; sub-second 4-step gen on Schnell.
- Setup install: nunchaku>=1.2.1 in _INSTALLABLE_PIP_PACKAGES.
- Variant key extends with nunchaku=... so toggling rebuilds.
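Hedged sketch of the repo-to-class lookup behind _nunchaku_transformer_class_for_repo; the repo prefixes are assumptions, and the real helper resolves the imported nunchaku classes rather than their names.

```python
_NUNCHAKU_CLASS_BY_REPO_PREFIX = {
    "black-forest-labs/FLUX": "NunchakuFluxTransformer2dModel",
    "Qwen/Qwen-Image": "NunchakuQwenImageTransformer2DModel",
    "stabilityai/stable-diffusion-3": "NunchakuSD3Transformer2DModel",
}

def _nunchaku_transformer_class_name_for_repo(repo: str):
    for prefix, class_name in _NUNCHAKU_CLASS_BY_REPO_PREFIX.items():
        if repo.startswith(prefix):
            return class_name
    return None  # no Nunchaku transformer for this base repo -> fall back to NF4 / int8wo
```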
FU-024 FP8 layerwise casting (CUDA SM 8.9+ Ada / Hopper / Blackwell):
- _maybe_enable_fp8_layerwise helper calls transformer.
enable_layerwise_casting(storage_dtype=…, compute_dtype=bf16)
post-load (sketched after this list).
- Family-correct fp8: E5M2 for HunyuanVideo (matches upstream model
card), E4M3 for FLUX / Wan / Qwen-Image / SD3 / LTX.
- Compute-capability gate refuses pre-Ada GPUs since hardware fp8
isn't there + cast slows wall-time vs bf16.
- Graceful no-op when transformer.enable_layerwise_casting missing
(UNet pipelines / old diffusers); error → runtimeNote.
- Fields wired through both ImageGenerationConfig + VideoGenerationConfig
  + Request models + frontend hooks + types. Default off.
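Hedged sketch of the FP8 gate described above; the helper name and the enable_layerwise_casting call come from the commit, while the family string and note text are simplified stand-ins.

```python
import torch

def _maybe_enable_fp8_layerwise(transformer, family: str):
    """Enable diffusers layerwise FP8 storage; return a runtime note on skip/failure."""
    if not torch.cuda.is_available():
        return "FP8 layerwise casting skipped: CUDA not available."
    if torch.cuda.get_device_capability() < (8, 9):  # Ada (SM 8.9) is the floor
        return "FP8 layerwise casting skipped: needs Ada / Hopper / Blackwell."
    if not hasattr(transformer, "enable_layerwise_casting"):
        return "FP8 layerwise casting skipped: pipeline does not support it."
    # HunyuanVideo's model card uses E5M2; the other families use E4M3.
    storage = torch.float8_e5m2 if family == "hunyuanvideo" else torch.float8_e4m3fn
    try:
        transformer.enable_layerwise_casting(storage_dtype=storage,
                                             compute_dtype=torch.bfloat16)
    except Exception as exc:
        return f"FP8 layerwise casting failed: {exc}"
    return None
```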
FU-027 NVIDIA/kvpress (foundation only):
- kvpress>=0.5.3 added to _INSTALLABLE_PIP_PACKAGES so the Setup tab
can pre-stage the wheel.
- Integration code lands separately under cache_compression/kvpress.py
once we pick an adapter shape — upstream exposes ``presses`` per
technique (SnapKV / TOVA / KIVI / pyramid) + a ``Pipeline`` wrapper.
Tests: 1250 pass / 1 skip / 2 deselected (pre-existing memory-pressure
flakes unrelated to this change), 331 vitest pass, tsc clean.
Mirrors the FU-018 previewVae checkbox pattern. Backend + hooks + types already plumbed in bc12d5c — only the UI render was missing. Image Studio: checkbox under previewVae, copy explains CUDA Ada+ gate. Video Studio: checkbox after previewVae with the same gate explanation. App.tsx threads the hook setters through. tsc clean, 331 vitest pass.
Mirrors the macOS shell scripts. Same env-var contracts, same install
destination ($HOME\.chaosengine\bin\), same version-tracking shape so
the existing Setup-page detector works on both platforms.
scripts/build-llama-turbo.ps1
- clones TheTom/llama-cpp-turboquant @ feature/turboquant-kv-cache
- cmake configure with GGML_CUDA=ON when nvcc is on PATH
- builds llama-server + llama-cli
- installs as llama-server-turbo.exe
- probes both Release\ and root build output dirs (multi-config
vs Ninja generator)
scripts/build-sdcpp.ps1
- clones leejet/stable-diffusion.cpp @ master
- cmake configure with SD_CUBLAS=ON when nvcc is on PATH
- builds sd-cli target (upstream renamed sd -> sd-cli around master-590)
- installs as sd.exe (legacy filename so the runtime resolver keeps
working without a rename)
Both honor CHAOSENGINE_BIN_DIR / *_NO_CUDA env-var overrides for CI.
Static link (BUILD_SHARED_LIBS=OFF) so installed binaries don't drag
a .dll trail.
Windows PowerShell reads .ps1 files as Windows-1252 by default. The em-dash (U+2014) bytes encoded as UTF-8 (0xE2 0x80 0x94) get mis-decoded to "â€”", which the parser sees as an ``unexpected token``. Stripping the em-dashes from throw messages avoids the encoding pitfall and keeps the script working without a BOM. Reported by Cryptopoly running .\scripts\build-llama-turbo.ps1 on a fresh Windows pull.
Without -G, cmake defaulted to "NMake Makefiles" which only works inside a Visual Studio Developer Command Prompt. Vanilla PowerShell runs died with "Running 'nmake' '-?' failed" before any compile started, even with VS 2022 Build Tools installed. Probe in order: CHAOSENGINE_LLAMA_TURBO_GENERATOR override, then Ninja if on PATH, then "Visual Studio 17 2022" + -A x64 (cmake locates VS via vswhere from outside a developer prompt as long as the build tools are installed -- which the script header already lists as a prerequisite).
CMake refuses to switch generators in an existing build directory: "Does not match the generator used previously: NMake Makefiles". Users who hit the earlier default-NMake failure on Windows, then re-ran the script after the generator-selection fix, got blocked by their own stale build/CMakeCache.txt and were left with only a delete-it-by-hand instruction. Detect the cached CMAKE_GENERATOR line, compare it to the generator we picked this run, and wipe build/ when they differ. Same-generator re-runs keep their incremental cache.
Select-String -SimpleMatch disables regex, which made the leading ^ in '^CMAKE_GENERATOR:INTERNAL=' a literal character. The pattern never matched any line, the if block silently skipped, and users re-running the script after the previous failed NMake attempt still hit "Does not match the generator used previously: NMake Makefiles". Drop -SimpleMatch so the regex anchor works, take only the first match (CMAKE_GENERATOR_INSTANCE etc. share the prefix), and trim trailing whitespace from the cached value before comparing.
CMake's "could not find any instance of Visual Studio" error is technically correct but easy to misread as a script bug, especially on CUDA hosts: nvcc was detected successfully, so users assume the toolchain is fine. nvcc proxies to cl.exe on Windows, so CUDA without MSVC cannot compile anything regardless. Probe via vswhere for a VC.Tools.x86.x64 installation before kicking off cmake configure. When missing, throw a clear message with both download links (full Community and the smaller Build Tools), the required workload name, and the next-step instruction. Successful detection logs the resolved install path so users see which VS copy CMake will actually pick.
Microsoft's installer often flags VS 2022 Build Tools installs as isComplete=0 (some optional component is missing) even when cl.exe works fine. vswhere -latest WITHOUT -all silently excludes those, and so does CMake's own internal probe -- which is why a fully functional install can still produce "could not find any instance of Visual Studio" from cmake configure. Switch the pre-flight probe from a -requires component filter to a -find for cl.exe under VC\Tools\MSVC, with -all so isComplete=0 installs come back. Pick the highest cl.exe version, walk back to the install root, and pass it to CMake explicitly via -DCMAKE_GENERATOR_INSTANCE=<path> so cmake doesn't repeat the same filter and reject the same install.
CMake's "Visual Studio 17 2022" generator rejects a path-only CMAKE_GENERATOR_INSTANCE when the install isn't in the Visual Studio Installer's known-instances registry, with: "instance is not known to the Visual Studio Installer, and no version= field was given" isComplete=0 installs are filtered out of that registry, so the fix from the previous commit (pass the cl.exe-derived path) still landed in the same wall. Pull installationVersion from `vswhere ... -format json` for the matched install, format the value as "<path>,version=<x>", and hand that to CMake. Falls back to bare path when the version lookup fails.
CMake's CUDA detection ("ggml/src/ggml-cuda/CMakeLists.txt:58
enable_language") fails with "No CUDA toolset found" when the
CUDA installer's MSBuild integration files
(CUDA <ver>.props/.targets/.xml + Nvda.Build.CudaTasks.<ver>.dll)
aren't present in the Visual Studio BuildCustomizations directory.
This is the default state when CUDA was installed before Visual
Studio, or when "Visual Studio Integration" was unticked during
the CUDA install.
Add a Sync-CudaVsIntegration helper that:
- locates the CUDA source via $env:CUDA_PATH
- resolves the VS BuildCustomizations target from the install
root we already detected via vswhere
- skips when up to date
- copies missing files directly, falling back to a UAC-elevated
Start-Process powershell -Verb RunAs when the target dir
refuses our writes (Program Files is admin-only)
- prints a manual one-liner if even the elevated copy fails
Called between VS detection and cmake configure when GGML_CUDA is
on, so the build no longer dies on the first CUDA-language probe.
Two follow-on bugs from the previous CUDA sync commit:
1. Copy-Item -LiteralPath does NOT support wildcards. The elevated "Copy-Item -LiteralPath '...\*' ..." treated * as a literal filename, silently copied nothing, and exited 0 -- so the script reported "files copied (elevated)" while the target dir stayed empty. Switched the elevated payload to per-file Copy-Item commands built from the missing list, and added a verify step inside the elevated session plus a re-check from the parent shell so a no-op success can no longer slip through.
2. CMake caches CUDA-language detection in build/CMakeCache.txt. When the integration files are installed AFTER a failed configure, CMake re-runs enable_language(CUDA) but its compiler ID test result was cached and not re-tested -- so even the second run with files in place still printed "No CUDA toolset found." Sync-CudaVsIntegration now returns $true when it actually copied something, and the cache-invalidation block wipes build/ for that reason in addition to a generator change.
build-sdcpp.ps1 hit the same NMake-default + isComplete=0 + "No CUDA toolset" wall as build-llama-turbo.ps1 did. Rather than duplicate ~150 lines of toolchain plumbing, lift the four helpers (generator selection, VS install probe, CUDA VS-integration sync, stale-cache wipe) into scripts/lib/windows-msvc-cuda.ps1 and dot-source from both builders.
Both scripts now share:
- Resolve-CmakeWindowsBuildContext (env override -> Ninja -> VS 2022)
- Sync-CudaVsIntegration (UAC-elevated copy of CUDA .props/.targets)
- Get-CmakeWindowsConfigureArgs (-G/-A/-DCMAKE_GENERATOR_INSTANCE)
- Invoke-CmakeStaleCacheWipe (generator change + post-CUDA-install)
build-sdcpp.ps1 picks up: the NMake-fallback fix, isComplete=0 install acceptance, version=<x> on CMAKE_GENERATOR_INSTANCE, automatic CUDA integration copy with UAC fallback, and stale-cache invalidation. Per-script overrides keep their distinct env names: CHAOSENGINE_LLAMA_TURBO_GENERATOR vs CHAOSENGINE_SDCPP_GENERATOR.
On Windows, pip.exe refuses to upgrade itself with:
ERROR: To modify pip, please run the following command: <python> -m pip install --upgrade pip
because it cannot overwrite its own running .exe shim. The bare `.venv\Scripts\pip install --upgrade pip` call in build.ps1 hit this every time and aborted the whole build before any other Python deps installed. Switch all four pip invocations in build.ps1 to `python -m pip` via a $VenvPython variable. python.exe holds the file handle and can replace pip cleanly. No behavior change beyond unblocking the upgrade step.
Two related fixes for the "CogVideoX 2B won't load on a 24 GB 4090"
report.
1. Diffusers' lazy-import wrapper hides the real cause of T5
encoder failures. The user saw:
"Failed to import diffusers.pipelines.cogvideo.pipeline_cogvideox
because of the following error (look up to see its traceback):
Could not import module 'T5EncoderModel'."
The actual underlying chain on this user's machine was:
transformers.quantizers -> torchao.utils -> torch.utils._pytree.
register_constant attribute missing (torch 2.6.0+cpu, torchao
wants >= torch 2.11)
plus the broader signal that the GPU bundle ended up installing
the +cpu torch wheel on a CUDA host.
Add backend_service/helpers/video_runtime_diagnostics.py with
diagnose_diffusers_lazy_import_error(). Probes the dep chain
(torch, sentencepiece, protobuf, transformers.quantizers,
transformers) and surfaces the first concrete failure with a
Setup-page hint. Two specialised paths come first:
* +cpu torch on a CUDA host -> "Install CUDA torch"
* torchao + torch < 2.11 mismatch -> "re-run Install GPU runtime
or uninstall torchao"
Wire it into the /api/video/preload route so the row banner gets
actionable text instead of the diffusers wrapper. Also log the
full traceback at backend so future diagnostics aren't lost.
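Illustrative sketch of the probe-the-chain idea behind diagnose_diffusers_lazy_import_error(); the module list and hint text are trimmed and not the exact ones in video_runtime_diagnostics.py.

```python
import importlib
import shutil

def first_broken_dependency():
    # Specialised path first: a +cpu torch wheel on a host with an NVIDIA GPU.
    try:
        torch = importlib.import_module("torch")
        if "+cpu" in getattr(torch, "__version__", "") and shutil.which("nvidia-smi"):
            return "torch is a CPU-only wheel on a CUDA host -- install the CUDA torch build."
    except Exception as exc:
        return f"torch: {type(exc).__name__}: {exc}"
    # Then walk the rest of the chain and surface the first concrete failure.
    for name in ("sentencepiece", "google.protobuf", "transformers.quantizers", "transformers"):
        try:
            importlib.import_module(name)
        except Exception as exc:
            return f"{name}: {type(exc).__name__}: {exc}"
    return None
```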
2. CogVideoX 2B's catalog runtimeFootprintGb of 19.0 was the
worst-case fp32 figure. bf16 + standard placement is ~13 GB on
CUDA, ~15 GB on MPS. The 24 GB 4090 case (budget = 24 * 0.7 =
16.8 GB) was tripping "danger -- would crash" on a config that
actually fits. Right-size CogVideoX 2B / 5B / 1.5-5b with explicit
runtimeFootprintCudaGb + runtimeFootprintMpsGb numbers reflecting
the real bf16 path.
Also rewrite the assessVideoGenerationSafety message for the
"model footprint > budget" branch. The runtime auto-engages
sequential CPU offload when .to(device) OOMs (see
video_runtime.py::_ensure_pipeline), so "would crash the backend"
was wrong -- generation succeeds but each step is a few times
slower. Match the test on the stable bits ("resident",
"sequential CPU offload", "smaller model") so future copy edits
don't keep breaking it.
Two unrelated UX fixes that share a root pattern: defaults that lie.
1. The frontend default ``emptyLaunchPreferences.maxTokens = 512`` was
wildly out of sync with the backend default of 4096 (matched in
models/__init__.py LaunchPreferencesRequest, GenerateRequest, and
the runaway guard in state.py at maxTokens * 6 chars). A user who
sent their first chat message before opening Settings got their
answer cut off mid-output around 3000 chars -- exactly what the
"JS solar system, last property reads `diameter: '50,72`" report
showed. Bump the seed value to 4096; the slider range was already
256-32768 so power users could already opt up but new users were
silently capped 8x lower than the backend expected.
2. The Studio chips lit up "Real engine ready" + "Device: cuda
(expected)" purely from nvidia-smi presence, with no check that
the installed torch wheel was actually CUDA-built. A user with a
broken install (4090 + ``torch 2.6.0+cpu``) saw nothing but green
while every generation silently ran on CPU at a fraction of GPU
speed. The torchInstallWarning probe in helpers/gpu.py reads
``torch/version.py`` directly -- not dist-info, because pip
leaves stale ``torch-X.Y.Z+cu124.dist-info`` next to a later
``+cpu`` install -- and reports a one-line warning when:
* torch is +cpu but nvidia-smi present (the user's case)
* torch missing entirely on a CUDA host
* torch missing entirely on Apple Silicon
Plumbed through VideoRuntimeStatus and ImageRuntimeStatus
(without importing torch -- safe to call from probe() despite
Windows DLL-lock concerns). Studios render it as a red callout
above the chip row plus a "CPU fallback" danger badge so the
warning is visible before any model loads.
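Hedged sketch of the read-version.py-not-dist-info idea; function names are illustrative. importlib.util.find_spec only locates the package, so torch is never imported (hence no Windows DLL-lock concern).

```python
import importlib.util
import pathlib
import re
import shutil

def torch_wheel_flavor():
    """Return torch's version string (e.g. '2.6.0+cpu') without importing torch."""
    spec = importlib.util.find_spec("torch")
    if spec is None or not spec.submodule_search_locations:
        return None
    version_py = pathlib.Path(next(iter(spec.submodule_search_locations))) / "version.py"
    if not version_py.is_file():
        return None
    match = re.search(r"__version__\s*=\s*['\"]([^'\"]+)['\"]", version_py.read_text())
    return match.group(1) if match else None

def torch_install_warning():
    flavor = torch_wheel_flavor()
    if flavor and "+cpu" in flavor and shutil.which("nvidia-smi"):
        return f"torch {flavor} is a CPU-only wheel but an NVIDIA GPU is present."
    if flavor is None and shutil.which("nvidia-smi"):
        return "torch is not installed on a CUDA host."
    return None
```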
Tests: src/utils/__tests__/videos.test.ts (60/60), tsc clean. The
3 image and 1 setup-route test failures are pre-existing on this
branch (Windows path separators + image footprint estimator) and
not touched by this change.
Bug: persistent launch-settings panel pushed only ``paramsB`` into ``previewControls`` when ``previewVariant`` changed; ``numLayers`` / ``numHeads`` / ``numKvHeads`` / ``hiddenSize`` stayed at the ``emptyPreview`` defaults (all zero). Native f16 cache estimate is ``2 * num_layers * num_kv_heads * head_dim * ctx * 2 bytes`` (quick check below) -- with any factor at 0 the result collapses to ~0 GB. The Studio's "Performance Preview" then showed Cache 0.0 GB / Speed 0.0 tok/s / Quality 0.0% and the "Fits Easily" badge fired on models that don't actually fit (e.g. Qwen3.6-27B-GGUF Q4_K_M at 256K context, which needs ~32 GB KV cache + 16 GB weights = ~48 GB on a 64 GB box).
Fix: when ``previewVariant.paramsB`` changes, also derive ``numLayers / hiddenSize / numHeads / numKvHeads`` via ``estimateArchFromParams`` and push the full set into ``previewControls``. Mirrors the existing launch-modal effect at App.tsx line 822.
Reported by Cryptopoly: load failure on Qwen3.6-27B-GGUF despite GUI claiming "Fits Easily" -- root cause was the false Fits Easily badge hiding the actual context-cache pressure.
Tests: 1252 pytest pass / 1 skip / 2 deselected (pre-existing flakes), 331 vitest pass, tsc clean.
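Quick arithmetic check of the f16 cache formula above; the architecture numbers here are illustrative, not the real Qwen3.6-27B config.

```python
def f16_kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int, ctx: int) -> float:
    # K + V tensors (factor 2), 2 bytes per f16 element.
    return 2 * num_layers * num_kv_heads * head_dim * ctx * 2 / (1024 ** 3)

# 64 layers x 4 KV heads x head_dim 128 at a 256K context -> 32.0 GB,
# and with any factor stuck at 0 the whole product collapses to 0 GB,
# which is exactly how the false "Fits Easily" badge fired.
print(f16_kv_cache_gb(num_layers=64, num_kv_heads=4, head_dim=128, ctx=262_144))
```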
The "FULL CONTEXT MAY NOT FIT" preview compared the optimized KV
cache against system RAM (totalMemoryGb) only. On a CUDA host that's
the wrong constraint: llama.cpp puts the KV cache on the GPU when
ngl=999 (the default for offload-capable models), so a 60 GB f16
cache on a 24 GB 4090 OOMs the GPU long before system RAM (64 GB)
starts to matter. The user reported this directly -- "is the warning
measuring the limit on the system memory 64 GB instead of GPU
memory?"
Changes:
* helpers/system.py: include ``gpuVramTotalGb`` in the system
snapshot. Reuses the existing get_device_vram_total_gb() probe
in helpers/gpu.py. Stays None on Apple Silicon (unified memory
is already in totalMemoryGb -- reporting it again would
double-count and produce nonsense like "60 GB > 24 GB VRAM" on
a 64 GB Mac).
* SystemStats.gpuVramTotalGb in src/types.ts.
* getCacheFitStatus(optimizedCacheGb, diskGb, totalGb, bits,
gpuVramGb?): when a discrete GPU is reported, compare the
cache against 0.85 * gpuVramGb FIRST. If it overflows VRAM,
return a "Cache won't fit GPU" warning that names the actual
VRAM ceiling and recommends RotorQuant / TurboQuant or lower
context. The system-RAM check still runs as a fallback for
Apple Silicon and CPU-only hosts (decision order sketched after this list).
* Plumb gpuVramTotalGb through every PerformancePreview consumer:
PerformancePreview, RuntimeControls, ModelLaunchModal,
LaunchModal, App.tsx, CompareView, ConversionTab,
BenchmarkRunTab.
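Sketch of the decision order described above, written in Python here for consistency with the other sketches; the real helper is the TS getCacheFitStatus. The 0.85 headroom factor matches the commit, the message text is illustrative.

```python
def cache_fit_status(optimized_cache_gb: float, total_ram_gb: float,
                     gpu_vram_gb=None) -> str:
    # Discrete-GPU VRAM is checked first, with ~15% headroom left for everything else.
    if gpu_vram_gb is not None and optimized_cache_gb > 0.85 * gpu_vram_gb:
        return (f"Cache won't fit GPU: {optimized_cache_gb:.0f} GB KV cache vs "
                f"{gpu_vram_gb:.0f} GB VRAM -- pick a compressed strategy or lower context.")
    # Fallback for Apple Silicon / CPU-only hosts: system RAM is the ceiling.
    if optimized_cache_gb > total_ram_gb:
        return "Full context may not fit in system memory."
    return "Fits."
```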
For the user's exact case (Qwen3.6-35B-A3B GGUF, 256K context,
native f16, 24 GB 4090): the warning now correctly reads "60 GB
KV cache larger than 24 GB GPU VRAM ... pick a compressed
strategy" instead of pointing at the 64 GB system RAM ceiling that
isn't actually the binding constraint.
Tests: cache.test.ts (20/20), tsc clean, python services +
backend tests pass. Snapshot smoke confirms gpuVramTotalGb=23.99
on the dev host.
…/ChaosEngineAI into feature/chat-level-up
The previous commit wired diagnose_diffusers_lazy_import_error() into /api/video/preload only. The user's report shows the same diffusers wrapper firing on the GENERATE path (CogVideoX 2B was already preloaded; the lazy import only triggered when generate() actually invoked the T5 text encoder). The route re-raised the opaque "Could not import module 'T5EncoderModel'" message unchanged.
Wire the same diagnostic into:
* /api/video/generate (both RuntimeError and Exception branches)
* /api/images/generate (Exception branch)
Also bump the logged traceback from the last 500 chars to the last 2000 -- the chain that breaks T5 (transformers.quantizers -> torchao.utils -> torch.utils._pytree.register_constant) goes deeper than 500 chars and was getting truncated mid-frame in the log.
For the user's exact runtime state, the diagnostic now surfaces:
"torch is installed as a CPU-only wheel (2.6.0+cpu) even though an NVIDIA GPU is present. Generation will run on CPU at a fraction of GPU speed. Open Settings > Setup and click Install CUDA torch, then Restart Backend."
instead of the opaque T5EncoderModel wrapper.
Two issues with the GPU acceleration warning the user just spotted:
1. Image Studio showed the red "GPU acceleration not active" banner;
Video Studio did not -- both have an NVIDIA GPU + +cpu torch, so
both should warn.
Root cause: my earlier replace_all Edit on video_runtime.py only
matched the *placeholder* return path (16-space indent) and
missed the success-path return (12-space indent) at line 961.
On a host where torch was importable but +cpu, the success path
ran with realGenerationAvailable=True and never set
torchInstallWarning -- so the field came back null and the
banner silently dropped. Add it explicitly to the success-path
VideoRuntimeStatus return so both code paths emit the warning.
2. Both warnings just told the user to "Open Settings > Setup and
click Install CUDA torch", which works but requires navigation.
Add an inline "Install CUDA torch" button right inside the
warning callout that fires the existing handleInstallCudaTorch
handler from App.tsx (already wired to /api/setup/install-cuda-torch).
Button only renders when the warning is specifically the "+cpu
wheel" case; for "torch missing entirely", the existing larger
"Install GPU runtime" primary action below the chip row covers
it without duplicating buttons.
Plumbed onInstallCudaTorch + installingCudaTorch as new optional
props through ImageStudioTab and VideoStudioTab. Spinner state
("Installing CUDA torch...") replaces the button text while the
~30-60s install runs.
Tests: vitest video + cache (80/80), tsc clean.
The inline "Install CUDA torch" button I added in 25bbe0c spun and showed a one-line success/failure summary, but no terminal output for debugging. Users hitting "No CUDA wheel for this Python" or pip resolver clashes had no way to see which CUDA index (cu124 / cu126 / cu128 / cu121) was tried and what pip actually said -- they had to open the backend Logs tab and grep. Add a CudaTorchLogPanel component that mirrors the visual shape of InstallLogPanel (single scrollable terminal, [ OK ] / [FAIL] markers per attempt, target-dir / Python / index-url meta line) but is keyed off the CudaTorchInstallResult shape returned by /api/setup/install-cuda-torch -- the endpoint is synchronous and returns the full attempts array on completion, so the panel only needs to show a final state, not stream a phase lifecycle. Behaviour: * Collapsed by default on success, auto-opens on failure * Same pip-noise filter and 80-line tail cap as InstallLogPanel (resolver complaints from unrelated installed packages get dropped from the displayed log but stay in the raw output for backend support) * Suppresses itself when there's nothing to render Plumb the raw CudaTorchInstallResult from App.tsx down through ImageStudioTab and VideoStudioTab as a new optional prop. The existing reduced ``cudaTorchResult`` summary shape stays as-is so the App-level diagnostic banner doesn't need to change. tsc clean. The 3 failing tests in src/utils/__tests__/images.test.ts are pre-existing on this branch, unrelated to this change (they fail on origin/feature/chat-level-up too).
The user clicked the inline Install CUDA torch button and saw the spinner stop, the warning text stay the same, and no log panel appear. Backend logs (chaosengine-backend-8876.log) confirm the /api/setup/install-cuda-torch endpoint never logged a request -- the network call either failed silently or never went out, and our catch path threw away the raw result so CudaTorchLogPanel had nothing to render. The user couldn't tell whether the install was running, finished, or never reached the backend.
Four fixes that share the goal of making this self-explanatory:
1. Always synthesize a CudaTorchInstallResult on exception. Build a minimal failed-attempt result carrying the catch's error message so CudaTorchLogPanel renders a [FAIL] entry instead of an empty collapse. Whatever went wrong (network error, 5xx, timeout, CORS) now appears in the panel verbatim.
2. Auto-refresh image + video runtime status after the install handler returns (success OR failure). The pre-install probe is cached and the warning text stayed stale -- "Install CUDA torch" reappeared next to a button that just ran, making it look like the click did nothing. The probe re-run flips torchInstallWarning to its current value and the banner self-updates.
3. Detect "module 'torch' has no attribute 'cuda'" in the lazy-import diagnostic. This shows up when torch is half-installed (the wheel swap purged torch's C extension but the new install failed mid-way, leaving torch importable but with torch.cuda missing). The new pattern translates to "torch is partially broken -- re-run Install CUDA torch and Restart Backend".
4. Morph the warning callout into a "Restart Backend to activate" prompt when the install succeeds and requiresRestart=true. Same single-banner slot, just three states (post-install restart / GPU acceleration warning / nothing) so we never stack two banners. The Restart Backend button reuses the existing onRestartServer handler. CudaTorchLogPanel rides along in the restart prompt so the user can still inspect what pip actually did before clicking restart.
Also two adjacent fixes that shipped in the same pass:
5. Image runtime's generate-failure demotion path now preserves torchInstallWarning. Previously the moment a generation failed, activeEngine flipped to "placeholder" and the fresh ImageRuntimeStatus dropped the warning -- so the user saw the "Install GPU runtime" callout (wrong remedy when torch IS installed but +cpu) instead of "Install CUDA torch" (right remedy). Recompute the warning in the fallback status so the banner stays accurate through demotion.
6. Add .gitattributes pinning text-file line endings (Cargo.toml / tauri.conf.json / *.json / *.toml / *.ts / *.py to LF; *.ps1 to CRLF for Windows-native authoring). Stops Windows users on default core.autocrlf from seeing phantom Cargo.toml / tauri.conf.json modifications every checkout (which is what prompted "do we need to add these to gitignore?" -- no, they should stay tracked, the CRLF diff was the noise).
Tests: vitest cache + videos (80/80), tsc clean, python video + backend tests pass. Diagnostic helper smoke-tests both new and existing patterns correctly.
Drops the convert-to-MLX button from the chat My Models page (action no longer relevant on Windows builds) and adds 32px of right padding to .library-row-actions so the remaining chat / server / reveal / delete icons don't sit flush against the panel edge.
cryptopoly added a commit that referenced this pull request on May 6, 2026:
Merge pull request #32 from cryptopoly/feature/chat-level-up