
Merge/pr 32 into staging redo #34

Merged
cryptopoly merged 85 commits into staging from
merge/pr-32-into-staging-redo
May 6, 2026

Conversation

@cryptopoly
Owner

No description provided.

cryptopoly added 30 commits May 1, 2026 08:05
tauri-action 0.6.0 used 'uploadUpdaterJson' which only suppressed
latest.json uploads but kept the per-bundle .sig sidecars. The 0.6.2
bump (#5b50bb2) renamed the input to 'includeUpdaterJson', which now
also suppresses .sig uploads. Without sigs the publish-manifest job
has nothing to combine into latest.json, so the auto-updater never
sees new releases.

Set includeUpdaterJson: true so .sig files reach the release. The
publish-manifest job still wins on the final latest.json because
generate-updater-manifest.mjs deletes any existing latest.json
asset before uploading its combined version.
…t levels

Twelve composer + thread improvements that close the gap with mainstream
local chat clients while keeping the substrate-first identity intact.

Frontend
- Shared RichMarkdown wrapper with Prism syntax highlighting (oneDark),
  per-block copy button + language badge, GFM tables, and KaTeX math
- Collapsible left sidebar with localStorage persistence and expand chevron
- Per-thread session search filtering on title, message body, and reasoning
- Per-thread export to Markdown, JSON, and plain text via dropdown
- Slash-command menu in composer (/clear, /think on|off, /tools on|off,
  /model, /cancel, /export md|json|txt) with arrow-key navigation
- Per-message temperature override chip with slider, numeric input,
  reset, and per-session localStorage persistence
- Reasoning effort segmented control (Off / Low / Med / High) replacing
  the binary thinking toggle, with per-session persistence

Backend
- POST /api/chat/generate/{session_id}/cancel endpoint flips an in-memory
  cancel flag; the streaming loop checks between events, breaks early,
  persists partial output, and emits a `cancelled: true` SSE chunk
- _build_history_with_reasoning() reattaches stored <think>...</think>
  blocks to assistant turns when the thread is in auto thinking mode,
  so reasoning-capable models retain context across follow-ups
- ThinkingTokenFilter accepts open_tag / close_tag constructor params;
  reasoning_delimiters_for(model_ref) registry helper resolves per-model
  overrides (default <think>...</think> preserved for all current models)
- GenerateRequest gains topP, seed, and reasoningEffort fields ready for
  the Phase 2 sampler exposure pass to wire end-to-end

Tests
- 232 vitest cases (was 207); 768 pytest cases (was 765)
- New: exportThread.test.ts, sessionSearch.test.ts, slashCommands.test.ts,
  test_history_with_reasoning.py, custom-tag ThinkingTokenFilter cases,
  cancel-chat endpoint cases
- npx tsc --noEmit clean, npm test green, pytest green

Deferred to Phase 2.2 (full sampler exposure)
- top_p / seed inline chips (fields accepted, plumbing pending)
- reasoning_effort routing into llama-server / MLX worker payloads
- Per-chat reasoning delimiter override UI
The image and video catalogs surface variants like 'FLUX.1 Dev · mflux
(MLX)' and 'LTX-2 · distilled (MLX)' that route through mflux or
mlx-video on Apple Silicon. Both depend on the mlx wheel which has no
Linux or Windows builds, so picking one of those entries on the wrong
OS is a guaranteed dead end.

Add backend_service/helpers/platform_filter.py with:

- is_apple_silicon(system, machine) — pure-function platform check
  (parameters exposed for tests without monkeypatching).
- is_mlx_only_variant(variant) — detects variants by explicit mlxOnly
  flag, engine == 'mflux' / 'mlx-video', or runtime strings ending
  in 'mflux (MLX native)' / 'mlx-video (MLX native)'.
- filter_mlx_only_families(families, on_apple_silicon=...) — drops
  MLX-only variants on non-Apple hosts and removes families whose
  entire variant set was MLX-only. Returns a new list, never mutates.
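
A minimal sketch of that filtering logic, assuming the catalog entries are dicts carrying the keys named above (field names are illustrative, not the exact payload shape):

```python
from typing import Any


def is_mlx_only_variant(variant: dict[str, Any]) -> bool:
    """Detect variants that can only run through the MLX stack."""
    if variant.get("mlxOnly"):
        return True
    if variant.get("engine") in ("mflux", "mlx-video"):
        return True
    runtime = str(variant.get("runtime", ""))
    return runtime.endswith(("mflux (MLX native)", "mlx-video (MLX native)"))


def filter_mlx_only_families(
    families: list[dict[str, Any]], on_apple_silicon: bool
) -> list[dict[str, Any]]:
    """Return a new list; never mutate the catalog passed in."""
    if on_apple_silicon:
        return list(families)
    filtered = []
    for family in families:
        variants = [v for v in family.get("variants", []) if not is_mlx_only_variant(v)]
        if variants:  # drop families whose entire variant set was MLX-only
            filtered.append({**family, "variants": variants})
    return filtered
```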

Wire the filter into _image_model_payloads (helpers/images.py) and
_video_model_payloads (helpers/video.py) at the end of payload
construction so the catalog routes (/api/images/catalog and
/api/video/catalog) return only the variants that can run on the
current host.

Surfaced by the v0.7.2 smoke test on a Windows / RTX 4090 box: FLUX.1
Dev mflux and the LTX-2 MLX variants showed up in the model dropdowns
but failed at preload time because mlx isn't installable. Filtering
server-side keeps the dropdowns honest without changing any frontend
code.
Two related Windows-only bugs surfaced by the v0.7.2 smoke test on
an RTX 4090 box:

Bug #6 — RTX 4090 reported as 12 GB total
  GPUMonitor._snapshot_nvidia() shells out to nvidia-smi, and on
  Windows boxes without it on PATH (driver installed but no CUDA
  toolkit) it fell through to _fallback_psutil() which returns
  psutil.virtual_memory().total — system RAM, not VRAM. The image /
  video safety estimators then read that as the GPU budget and
  produced 'Likely to crash' warnings on a 24 GB card holding an
  11 GB FLUX model.

  Fix:
  - Try torch.cuda.get_device_properties(0).total_memory first.
    When the GPU bundle is installed this is the most reliable
    source — it reads through the CUDA driver, no PATH needed.
  - Fall back to nvidia-smi as before.
  - Drop the psutil fallback. When neither answers we now return
    {'vram_total_gb': None}, which the TS estimators
    (utils/images.ts, utils/videos.ts) already treat as 'unknown'
    via the DEFAULT_*_MEMORY_GB fallbacks. Better an honest
    'unknown' than a wrong 12 GB.
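
A rough sketch of that detection order, assuming the usual torch / nvidia-smi calls (the exact query flags and error handling in the real monitor may differ):

```python
import shutil
import subprocess
from typing import Optional


def detect_vram_total_gb() -> Optional[float]:
    """torch.cuda first, nvidia-smi second, honest None otherwise."""
    try:
        import torch  # reads through the CUDA driver, no PATH dependency

        if torch.cuda.is_available():
            props = torch.cuda.get_device_properties(0)
            return props.total_memory / (1024 ** 3)
    except Exception:
        pass

    if shutil.which("nvidia-smi"):
        try:
            out = subprocess.run(
                ["nvidia-smi", "--query-gpu=memory.total",
                 "--format=csv,noheader,nounits"],
                capture_output=True, text=True, timeout=5, check=True,
            ).stdout.strip().splitlines()
            if out:
                return float(out[0]) / 1024  # MiB -> GiB
        except Exception:
            pass

    return None  # no psutil fallback: better an honest unknown than system RAM
```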

Bug #7 — Image gen produces gibberish placeholder after install
  DiffusersImageEngine.probe() uses importlib.util.find_spec to
  decide between the placeholder engine and the real diffusers
  pipeline. Once the GPU bundle install lands new packages into
  the extras dir, importlib's negative-lookup cache still answers
  None for the new modules until invalidate_caches() is called.
  The probe kept reporting realGenerationAvailable=False and the
  generation pipeline returned the SVG placeholder, which lands as
  a gibberish image when the frontend renders it as data:image/svg+xml.

  Fix:
  - probe() now calls importlib.invalidate_caches() before
    find_spec so newly-installed packages are picked up without a
    backend restart.
  - The GPU bundle worker (_gpu_bundle_job_worker) now also calls
    invalidate_caches and resets the VRAM total cache when it
    transitions to phase=done, so the immediately-following
    capabilities snapshot reflects the freshly-importable torch.
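
The probe pattern is roughly the following, with the module names standing in for whatever the real diffusers-stack check imports:

```python
import importlib
import importlib.util


def real_generation_available() -> bool:
    # Packages pip-installed into the extras dir are invisible to importlib's
    # negative-lookup cache until this call.
    importlib.invalidate_caches()
    return all(
        importlib.util.find_spec(name) is not None
        for name in ("torch", "diffusers", "transformers")
    )
```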

Tests
  tests/test_gpu_detection.py — 9 unit tests covering
  torch.cuda detection, nvidia-smi precedence, the new
  no-system-RAM fallback path, and the process-lifetime cache.
  All pass; existing pytest suite still green.
The streaming chat surface previously showed a bare blinking cursor while
the model was still ingesting the prompt. On large contexts that's
indistinguishable from a hung generation. Surface the phase explicitly.

Backend
- generate_stream now emits a `phase: prompt_eval` SSE chunk before
  invoking the runtime, then a `phase: generating` chunk (with
  `ttftSeconds`) the moment the first token / reasoning fragment arrives.
- _stream_assistant_metrics_payload accepts a ttft_seconds kwarg and
  passes it through to the assistant message metrics so the value
  persists on the finalised turn.

Frontend
- New PromptPhaseIndicator component with elapsed-time tick (250 ms)
  and a phase-specific colour treatment (neutral while ingesting,
  accent-tinted once tokens flow).
- ChatStreamPhase type + StreamCallbacks.onPhase callback in api.ts
  parser.
- useChat seeds the optimistic assistant placeholder with
  streamPhase: "prompt_eval" so the indicator shows immediately on send,
  before the backend's first SSE chunk arrives. The phase advances on
  each onPhase event and clears via the onDone session refresh.
- ChatTab renders the indicator above the markdown content while the
  message is streaming and a phase is set, replacing the blinking
  cursor for that interval.

Tests
- src/__tests__/streamPhase.test.ts covers the SSE parser routing
  prompt_eval / generating events with optional ttftSeconds and
  ignoring unknown phase strings.

Verification: tsc --noEmit clean, npm test 236 / 236, pytest 795
The Tauri 2 NSIS installer is configured with a custom installer hook
file at src-tauri/installer.nsh. The hook intentionally ships as
empty stubs that document the contract the GPU runtime depends on:

    %LOCALAPPDATA%\ChaosEngineAI\extras\cp{major}{minor}\site-packages

This directory holds the GPU bundle (torch + diffusers + transformers,
~2.5 GB) that the Image Studio install button writes via pip. Tauri's
default uninstaller leaves the path alone today, but the explicit hook
file makes that intent visible — anyone adding RM logic in the future
gets the comment block as a guardrail.

Changes:
- Add src-tauri/installer.nsh with documented empty pre/post
  install + uninstall hooks. NSIS_HOOK_POSTUNINSTALL carries the
  preserve-extras contract in a comment so the rule can't drift.
- Wire the hook into src-tauri/tauri.conf.json via
  bundle.windows.nsis.installerHooks: ./installer.nsh.
- Add a comment block in src-tauri/src/lib.rs::chaosengine_extras_root
  pointing at the NSIS hook so a Rust-side path move doesn't silently
  break the Windows-side contract.
- Add tests/test_extras_path.py pinning the
  ChaosEngineAI/extras/cp{maj}{min}/site-packages shape so any
  future move loud-fails the suite. The Python ABI tag pin matches
  sys.version_info against the resolved path.

Surfaced by the v0.7.2 smoke test on Windows: the user reported the
GPU runtime had been wiped after an uninstall + reinstall cycle. The
default uninstaller path doesn't touch the extras tree on the bench
config we ship, but pinning the contract via these hooks + tests
makes the regression visible if anyone adds custom uninstall logic
later.

Tests:
- .venv/bin/python -m pytest tests/test_extras_path.py -v — 3/3 pass
- Pre-existing tests still pass
…ards

A laptop crash during chat (caused by an unattended generation hanging on
prefill) prompted four watchdog layers that catch the most common runaway
failure modes early — before the host wedges, swap-thrashes, or OOM-kills.

A. Stuck prompt-eval timeout (frontend)
   useChat arms a 60-second timer when the backend announces the
   prompt_eval phase. If the timer fires before the generating phase
   transition, it calls cancelChatGeneration, aborts the local stream,
   and surfaces an actionable error explaining the likely causes
   (oversized context, OOM, thermal throttle). The timer is cleared on
   phase transition, onDone, onError, and manual cancel so a stale
   timer can't abort a follow-up turn.

B. Pre-flight memory gate (backend)
   New backend_service/helpers/memory_gate.py with `gate_chat_generation`
   and `snapshot_memory_signals`. Refuses chat generations when free
   RAM is below 1 GB OR combined memory pressure exceeds 92%. The
   refusal is emitted as a regular SSE error chunk so the existing
   error-handling path renders the message — no new client wiring.
   Gate exceptions never block legitimate work: a psutil glitch logs a
   warning and falls through.
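
   A minimal sketch of layer B's decision rule, assuming psutil-style signals and the thresholds quoted above (the real helper's pressure signal may combine more sources, and its refusal message carries the live numbers):

```python
from typing import Optional

import psutil


def snapshot_memory_signals() -> tuple[float, float]:
    vm = psutil.virtual_memory()
    return vm.available / (1024 ** 3), vm.percent  # free GB, pressure %


def gate_chat_generation(min_free_gb: float = 1.0,
                         max_pressure_pct: float = 92.0) -> Optional[str]:
    """Return a human-readable refusal, or None when generation may proceed."""
    try:
        free_gb, pressure = snapshot_memory_signals()
    except Exception:
        return None  # a psutil glitch must never block legitimate work
    if free_gb < min_free_gb or pressure > max_pressure_pct:
        return (f"Refusing to start generation: {free_gb:.1f} GB free RAM, "
                f"{pressure:.0f}% memory pressure.")
    return None
```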

D. Output-length runaway guard (backend)
   The streaming loop now aborts when accumulated assistant text
   exceeds max_tokens × 6 characters (1.5× the budget at ~4 chars/token
   average). Catches decoder loops on quantised models that ignore the
   EOS token. Logged separately from user-initiated cancellation.
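
   Layer D's abort condition is effectively a one-line check inside the streaming loop; a sketch under the same constants:

```python
CHARS_PER_TOKEN_ESTIMATE = 4
RUNAWAY_FACTOR = 1.5  # 1.5x the character budget == max_tokens * 6 chars


def output_is_runaway(accumulated_text: str, max_tokens: int) -> bool:
    budget_chars = max_tokens * CHARS_PER_TOKEN_ESTIMATE * RUNAWAY_FACTOR
    return len(accumulated_text) > budget_chars
```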

E. Reasoning budget cap (backend)
   ThinkingTokenFilter now accepts max_reasoning_chars (default
   32_000 ≈ 8000 tokens). When inside <think> without a close tag and
   the cap is reached, the filter force-emits reasoning_done, stops
   appending to reasoning, and routes leftover bytes to text so the
   assistant turn finalises instead of streaming reasoning forever.
   Pass None to disable per-call.

Tests
- tests/test_memory_gate.py: 6 cases covering pass / refuse / boundary
  / custom-threshold paths.
- tests/test_mlx_worker.py: 4 new ThinkingTokenFilter cases for the
  budget cap (force-close, disabled-when-None, validation, normal
  close-tag still works).
- Verification: tsc --noEmit clean, npm test 236, pytest 805 (+10 new).
… banners, image/video gates

Five additional watchdog layers on top of the Phase 2.0.5 baseline. Each
covers a distinct runaway failure mode that was previously invisible to
the user until the host became unresponsive.

C. Tok/s floor monitor (chat)
   The streaming loop now samples decode rate over a 30-second rolling
   window. Falling below 0.3 tok/s for the full window aborts the
   generation with a thermal-throttle / GPU-stall / worker-deadlock
   diagnostic. Cheap — chunk count proxies for tokens, no per-tick
   psutil hit.
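
   A sketch of layer C's rolling-window check, using chunk arrival times as the token proxy described above (window and floor values as quoted; everything else is illustrative):

```python
import time
from collections import deque


class TokRateFloor:
    """Signals an abort when decode rate stays under a floor for a full window."""

    def __init__(self, floor_tok_s: float = 0.3, window_s: float = 30.0):
        self.floor = floor_tok_s
        self.window = window_s
        self.samples: deque[float] = deque()  # arrival times of chunks
        self.started = time.monotonic()

    def record_chunk(self) -> None:
        self.samples.append(time.monotonic())

    def is_stalled(self) -> bool:
        now = time.monotonic()
        if now - self.started < self.window:
            return False  # not enough history yet
        while self.samples and now - self.samples[0] > self.window:
            self.samples.popleft()
        return len(self.samples) / self.window < self.floor
```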

F. Repetition guard for the llama.cpp path
   `RunawayGuard` was previously bound to the MLX subprocess. Lifted to
   `backend_service/runaway_guard.py`; the `mlx_worker` module re-
   exports the symbol so existing imports (and tests) continue to
   resolve. The chat stream loop in `state.py` now feeds each chunk
   through a fresh guard so identical-line repetition or near-duplicate
   reasoning loops abort the stream within a few hundred tokens.

G. Panic banner (chat)
   The streaming loop samples memory every 5 seconds. When free RAM
   drops below 0.5 GB OR pressure tops 96%, a `panic` SSE event fires
   once per turn. The frontend renders a non-blocking red banner with
   the live numbers and a Cancel affordance. Generation is *not* auto-
   cancelled — the user decides.

H. Image / video pre-flight memory gates
   Same shape as the chat gate from earlier in 2.0.5. Image gen
   refuses below 4 GB free / 88% pressure; video below 6 GB / 85%
   (strictest of the three because diffusion swap-thrash on Apple
   Silicon historically wedged the host). Routes raise 503 with the
   human-readable refusal message so existing modal error paths
   render it without new wiring.

I. Thermal pressure banner (chat, macOS)
   New `helpers/thermal.py` parses `pmset -g therm` output (no sudo
   required) and classifies into nominal / moderate / critical. The
   stream loop emits a `thermalWarning` SSE event the first time the
   classifier returns "critical"; the frontend renders an amber banner
   distinguishing it from the red memory panic. Linux / Windows return
   None and the watcher is a no-op there until cross-platform thermal
   telemetry lands in Phase 3.5.
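
   A sketch of the layer I classifier, assuming `pmset -g therm` emits `Key = value` lines such as `CPU_Speed_Limit = 100` (as seen on Intel Macs); the fields and thresholds the real helper parses may differ:

```python
import subprocess
from typing import Optional


def classify_thermal_pressure() -> Optional[str]:
    """Return 'nominal' / 'moderate' / 'critical', or None where pmset is absent."""
    try:
        out = subprocess.run(["pmset", "-g", "therm"],
                             capture_output=True, text=True, timeout=5).stdout
    except (OSError, subprocess.SubprocessError):
        return None  # Linux / Windows: watcher stays a no-op
    speed_limit = None
    for line in out.splitlines():
        if "CPU_Speed_Limit" in line and "=" in line:
            try:
                speed_limit = int(line.split("=")[1].strip())
            except ValueError:
                pass
    if speed_limit is None:
        return "nominal"  # no throttling information reported
    if speed_limit < 60:
        return "critical"
    if speed_limit < 100:
        return "moderate"
    return "nominal"
```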

J. Worker liveness probe
   Deferred — the C tok/s floor catches hung worker scenarios in
   practice. A dedicated ping/pong probe will land alongside Phase 3
   substrate telemetry where worker heartbeats are first-class.

Tests
- `tests/test_thermal.py` — 8 cases covering the pmset classifier
- `tests/test_memory_gate.py` — 6 new cases for image/video gates
- `tests/test_runaway_guard.py` — 5 cases for the shared module +
  alias identity

Verification: tsc --noEmit clean, npm test 236, pytest 824 (+19 new).
…hread / ChatComposer

ChatTab.tsx had grown to 1085 lines holding the sidebar, thread header,
message list, and composer all in one component. Phase 2.4 (conversation
branching), 2.5 (in-thread multi-model compare), and 2.9 (@-mention
system) need to swap out individual pieces of that surface — which
isn't tractable while everything is interleaved.

Split into four siblings under `src/features/chat/`:

  - ChatSidebar.tsx (148 lines)
    Session list with title/body search, pin / delete, warm-model
    badges, and the collapse toggle. Owns its own filter call so
    parent doesn't have to memo it.

  - ChatHeader.tsx (223 lines)
    Thread title editor, model picker, export-format dropdown,
    runtime summary, document chips, optional sidebar-expand
    toggle when the sidebar is collapsed.

  - ChatThread.tsx (375 lines)
    Message list with reasoning panels, prompt-phase indicator,
    panic and thermal banners, tool-call cards, citations, and
    the per-turn metrics fold-out. Drag-drop forwards files via
    `onChatFileDrop`.

  - ChatComposer.tsx (283 lines)
    Image previews, slash-command popover, textarea (with arrow-
    key + Tab + Esc handling), thinking-effort segmented control,
    temperature chip, tools toggle, send / stop / clear buttons.

ChatTab.tsx (388 lines, -64%) is now a composition root that owns:
  - sidebar collapse state
  - session search query
  - slash-command match list and selection cursor
  - per-thread temperature override (with localStorage glue)
  - per-thread reasoning effort level (with localStorage glue)

Children receive narrow prop slices. No behaviour change — every
existing flow (export menu, slash commands, temp chip, panic banner,
thermal warning, sidebar collapse) renders identically and exercises
the same handlers.

Verification: tsc --noEmit clean, npm test 236, pytest 824.
…ty / seed / mirostat / json_schema / reasoning_effort

Closes the Phase 1.10 + 1.12 deferrals. Per-thread sampler overrides
now flow end-to-end from the SamplerPanel popover through
GenerateRequest, RuntimeController, both engine implementations,
and out to llama-server / mlx-lm.

Backend
- GenerateRequest gains topP, topK, minP, repeatPenalty, mirostatMode,
  mirostatTau, mirostatEta, seed, jsonSchema (already had reasoningEffort
  from Phase 1.12). Each defaults to None so the backend's defaults stay
  in force when the UI sends no override.
- New helper `_apply_sampler_kwargs` in inference.py merges the override
  dict + reasoning_effort + json_schema into a llama-server
  /v1/chat/completions payload. JSON schema is wrapped in the
  OpenAI structured-outputs `response_format` envelope.
- New helper `_build_sampler_overrides` in state.py projects a
  GenerateRequest into the snake_case dict the engines consume.
- BaseInferenceEngine, LlamaCppEngine, MLXWorkerEngine,
  RemoteOpenAIEngine, and RuntimeController gain `samplers`,
  `reasoning_effort`, `json_schema` kwargs end-to-end.
- mlx_worker `_build_mlx_sampler` calls `make_sampler` with whatever
  Phase 2.2 sampler subset the installed mlx-lm version supports
  (top_p, top_k, min_p), filtered via signature inspection so older
  mlx-lm builds fall back gracefully.
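
A rough sketch of the `_apply_sampler_kwargs` merge described above, assuming snake_case override keys and the OpenAI structured-outputs envelope (names and the exact key list are illustrative):

```python
from typing import Any, Optional

_LLAMA_SAMPLER_KEYS = ("top_p", "top_k", "min_p", "repeat_penalty",
                       "mirostat", "mirostat_tau", "mirostat_eta", "seed")


def apply_sampler_kwargs(payload: dict[str, Any],
                         samplers: Optional[dict[str, Any]] = None,
                         reasoning_effort: Optional[str] = None,
                         json_schema: Optional[dict[str, Any]] = None) -> dict[str, Any]:
    """Merge per-turn overrides into a llama-server /v1/chat/completions body."""
    merged = dict(payload)
    for key, value in (samplers or {}).items():
        if key in _LLAMA_SAMPLER_KEYS and value is not None:
            merged[key] = value  # unknown keys ignored, None means 'no override'
    if reasoning_effort is not None:
        merged["reasoning_effort"] = reasoning_effort
    if json_schema is not None:
        merged["response_format"] = {
            "type": "json_schema",
            "json_schema": {"name": "response", "schema": json_schema, "strict": True},
        }
    return merged
```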

Frontend
- New `SamplerPanel` component: popover with numeric inputs for each
  Phase 2.2 sampler plus a mirostat mode selector that only reveals
  tau / eta when modes 1/2 are picked. Override badge shows count.
- `samplerOverrides.ts`: storage helpers (`readSamplerOverrides`,
  `writeSamplerOverrides`, `samplerPayload`) keyed by session id with
  defensive sanitisation against corrupt blobs.
- `ChatTab` owns the override state, persists on every change.
- `ChatComposer` renders the panel next to the temperature chip.
- `useChat` reads from the same localStorage key when assembling
  stream payloads — single source of truth.

Tests
- tests/test_sampler_payload.py: 9 cases across `_apply_sampler_kwargs`
  and `_build_sampler_overrides` covering pass-through, none-skip,
  unknown-key ignore, json_schema envelope, reasoning_effort plumbing.
- src/features/chat/__tests__/samplerOverrides.test.ts: 9 cases for
  storage round-trip, sanitisation, per-session scoping, payload
  projection. Inline localStorage shim works around the node-only
  vitest environment.
- tests/test_backend_service.py FakeRuntime fixture extended to accept
  the new kwargs so existing chat-completion tests still pass.

Verification: tsc --noEmit clean, vitest 245 (+9), pytest 833 (+9).

Deferred to follow-up sprint
- DRY sampler (llama-server supports it but the API is fiddly and
  benefits from per-context-length tuning)
- XTC sampler (still new; few models have published settings)
- Free-form GBNF grammars (json_schema covers the common case)
Loaded models now declare what they can do — vision, tools, reasoning,
coding, agents, audio, video, multilingual — and the chat surface uses
those declarations to render capability badges and gate composer
affordances. Picking a text-only model hides the image attach button;
picking a non-tool-capable model hides the Tools toggle; picking a
non-reasoning model hides the thinking effort segmented control.

Backend
- New `backend_service/catalog/capabilities.py`:
  * `ModelCapabilities` dataclass with eight typed boolean flags plus
    a free-form `tags` tuple preserving the catalog's original strings
    so the UI can render badges without re-deriving them.
  * `_CAPABILITY_TO_FLAG` maps catalog strings ("vision", "tool-use",
    "thinking", "multilingual", etc.) to the typed fields.
  * `resolve_capabilities(ref, canonical_repo)` walks the curated
    `MODEL_FAMILIES` catalog. Variant match wins; falls back to
    family-level entry when a quantised fork's ref doesn't match a
    variant directly. Heuristic substring sniff covers refs the
    catalog hasn't been updated for (vl/llava → vision; r1/think →
    reasoning; coder → coding; instruct/-it/chat → tools).
- `LoadedModelInfo.to_dict()` now resolves capabilities lazily on
  every snapshot. Lazy resolution avoids a migration on the dataclass
  and the dict shape is stable across runtimes.
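
The heuristic substring sniff mentioned above amounts to something like this (dataclass trimmed to four of the eight flags; substrings as listed in the bullet):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCapabilities:
    supports_vision: bool = False
    supports_tools: bool = False
    supports_reasoning: bool = False
    supports_coding: bool = False
    tags: tuple[str, ...] = ()


def sniff_capabilities(ref: str) -> ModelCapabilities:
    """Heuristic fallback for refs the curated catalog doesn't know about."""
    name = ref.lower()
    return ModelCapabilities(
        supports_vision=any(tok in name for tok in ("vl", "llava")),
        supports_reasoning=any(tok in name for tok in ("r1", "think")),
        supports_coding="coder" in name,
        supports_tools=any(tok in name for tok in ("instruct", "-it", "chat")),
    )
```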

Frontend
- `LoadedModel` type gains `capabilities?: ModelCapabilities | null`.
- `ChatHeader` renders capability badges next to the Ready pill when
  the active thread's loaded model has resolved capabilities. Each
  badge has a hover-title explaining the flag.
- `ChatComposer` reads `loadedModelCapabilities` and conditionally
  renders the image-attach button, thinking effort control, and Tools
  toggle. When capabilities are absent (unknown model), every
  affordance stays visible — the gate never hides UI based on missing
  data.
- App.tsx threads `workspace.runtime.loadedModel?.capabilities` down to
  ChatTab, which forwards to ChatHeader and ChatComposer.

Tests
- tests/test_capabilities.py — 10 cases: empty fallback, catalog match,
  canonical-repo match, three heuristic paths (vision/reasoning/coder),
  instruct → tools, dataclass dict round-trip, family-level fallback
  for non-variant forks, None inputs.

Verification: tsc --noEmit clean, vitest 245, pytest 843 (+10).

Unblocks Phase 2.12 (mid-thread model swap) — the capability deltas
between models become visible at swap time, so the user can see "this
model loses tools support" before committing.
The composer now exposes a "Send next via..." dropdown that lets the
user pick a different warm model for the upcoming turn without
changing the thread's default. After the turn finishes the dropdown
clears and the next plain message reverts to the session's default
model. Useful for quickly testing a theory on a smaller model, then
having the larger one carry the conversation back.

Backend
- New `oneTurnOverride: bool = False` field on GenerateRequest. When
  True, state.generate_stream skips persisting the runtime's loaded
  model identity (`model`, `modelRef`, `canonicalRepo`, `modelSource`,
  `modelPath`, `modelBackend`) onto the session. Other fields (cache
  strategy, context, thinking mode, samplers) still persist so the
  picked model's runtime profile is reflected on this turn.
- Default False preserves existing behaviour where sending with a
  different model permanently switches the thread.
- Both call sites in state.py (the agent path and the streaming
  generate path) honour the flag.

Frontend
- New `MidThreadSwapMenu` component: dropdown of warm models excluding
  the session default, with a clear-override affordance and an
  inline "Cancel override" button on the trigger when active. Surfaces
  only warm models so the swap is instantaneous — cold model picks
  belong in the existing My Models flow.
- `useChat` owns the override state (`oneTurnOverride: WarmModel | null`)
  with public setter; clears in onDone after a successful turn so the
  one-turn semantics holds even if the user forgets to clear manually.
- Stream payload assembly: when the override differs from the session
  default, payload's modelRef / modelName / backend swap to the
  override's identity and `oneTurnOverride: true` is set so the
  backend doesn't persist the swap.
- ChatTab forwards override state + setter through to ChatComposer.
- ChatComposer renders the menu next to the Tools toggle, gated by
  the same busy-state predicate as other composer affordances.
- App.tsx wires `chat.oneTurnOverride` + `chat.setOneTurnOverride`
  to ChatTab.

Tests
- tests/test_one_turn_override.py — 6 cases covering default
  False, explicit True/False, coexistence with model field
  payload, and direct contract on the persist-guard reading the
  attribute.

Verification: tsc --noEmit clean, vitest 245, pytest 849 (+6).

Capability badges (from Phase 2.11) update automatically when the
override loads — the user sees "this swap loses Vision" or "this
swap gains Reasoning" via the existing badge row in ChatHeader.
Two regressions reported after 0793282 shipped to a real user.

1. Memory gate too aggressive
   The Phase 2.0.5-B gate refused chat at 92%+ pressure, but macOS
   unified memory routinely sits at 90-97% during normal use because
   the kernel aggressively compresses pages. Models that ran fine on
   the previous build were being blocked.

   Pressure ceilings raised: chat 92→98%, image 88→95%, video 85→92%.
   `available_gb` is now the primary signal — pressure is a backstop
   for genuine OOM-imminent scenarios. Tests updated.

2. Image attachment silently dropped on MLX path
   The MLX worker subprocess never wired vision input through —
   `request.get("images")` is unreferenced in mlx_worker.py. Pre-
   existing limitation, surfaced because the user attached an image
   to a Gemma-4 turn routed via TurboQuant (MLX), and the model
   hallucinated an answer about a different image entirely.

   Fix has two layers:
     - `resolve_capabilities` now accepts an `engine` argument and
       demotes `supportsVision` to False for MLX / TurboQuant
       routes. The composer's auto-gate (Phase 2.11) then hides the
       image-attach button on MLX-loaded threads, so the UI can't
       create a misleading "attached but ignored" state.
     - `state.generate_stream` strips `request.images` and logs a
       loud warning when the active engine is MLX, even if a legacy
       client somehow bypassed the composer gate. Belt-and-braces.

   When the catalog says a model supports vision but the engine
   demotes the flag, the original "vision" tag still appears in
   `capabilities.tags` so the badge row can show "vision via
   llama.cpp" once that path is wired (Phase 2.6 follow-up).

Tests
- tests/test_memory_gate.py: 4 cases updated for new thresholds + 1
  new "high-pressure with headroom passes" regression guard.
- tests/test_capabilities.py: 4 new cases — engine demotes vision
  for MLX / TurboQuant, llama.cpp keeps vision, no-engine kwarg
  preserves catalog defaults.

Verification: pytest 854, vitest 245, tsc --noEmit clean.

Note on PDF attach (also reported): drag-drop → uploadSessionDocument
→ chunked + indexed flow inspected end-to-end and the wiring looks
intact. Likely a different bug surfacing under the same memory-gate
refusal; if it persists after this hotfix, capture backend logs at
upload time and we'll trace from there.
User retest with `gemma-3-27b-it-qat-4bit` on the native llama.cpp
path showed the model still hallucinating about an unattached image.
Tracing surfaced a second pre-existing limitation: `_resolve_gguf_path`
in inference.py explicitly excludes mmproj projector files when
picking which GGUF to launch llama-server with, so the server never
receives `--mmproj` and silently drops image_url parts.

Vision is therefore broken on every runtime today — MLX (no image
plumbing in worker subprocess) and llama.cpp (mmproj never loaded).
The previous engine-only gate from 72ab7c4 didn't catch the
llama.cpp path, so the regression report was correct.

Fix
- New `LoadedModelInfo.visionEnabled: bool = False` field. Stays False
  on every load until proper mmproj wiring lands (Phase 2.6+ work).
- `resolve_capabilities` now takes a `vision_enabled` kwarg (default
  False). Even when the catalog says a model supports vision, the
  typed `supportsVision` flag is False unless the runtime confirms
  mmproj is actually loaded. Catalog `tags` keep "vision" so the
  badge row can render "vision via mmproj (not yet wired)" later.
- `state.generate_stream` strip-and-warn check now keys on
  `visionEnabled` rather than the engine name, so the same protection
  applies regardless of route.
- `LoadedModelInfo.to_dict` now emits `visionEnabled` so the
  frontend can read the runtime ground truth.

Tests
- tests/test_capabilities.py: 6 cases updated / added —
  catalog-match cases now opt in via `vision_enabled=True`,
  llama.cpp-without-runtime-proof and engine-unset cases now
  assert False, MLX engine demotion still fires even when
  vision_enabled=True (belt-and-braces for any future engine bug).

Verification: pytest 855, vitest 245, tsc clean.

Composer auto-gating now hides the paperclip on every loaded model
until the mmproj loader lands, so the silent-hallucination class of
bug is closed end-to-end. Vision restoration is a separate piece of
work (probe for mmproj sibling at GGUF resolve time, pass --mmproj,
flip visionEnabled=True at load).
…+ cosine retrieval

Replaces the keyword-only TF-IDF retrieval with semantic embedding
ranking when a llama.cpp embedding GGUF is available. llama-embedding
ships in every llama.cpp build (macOS / Linux / Windows), so the same
runtime serves chat and embeddings — no MLX-only path, no
cross-platform fork.

Backend
- New `backend_service/rag/` module:
  * `embedding_client.py` — subprocess wrapper around the llama.cpp
    `llama-embedding` CLI. Discovers the binary via env override
    (`CHAOSENGINE_LLAMA_EMBEDDING`) or PATH. Discovers the model via
    env override (`CHAOSENGINE_EMBEDDING_MODEL`) or
    `<dataDir>/embeddings/*.gguf` convention. Passes
    `--embd-output-format json --embd-normalize 2 -f /dev/stdin` and
    parses the OpenAI-shaped envelope. `parse_embedding_output` is a
    pure helper so the parser is unit-testable without subprocess
    fixtures.
  * `vector_store.py` — append + cosine-similarity search.
    No new dep beyond numpy (already part of the chat runtime, but
    the inner loop is plain Python so even environments without
    numpy work). JSON-round-trippable so DocumentIndex can persist
    it alongside its existing TF-IDF state.

- `helpers/documents.DocumentIndex`:
  * `add_document(..., embedding_client=...)` — when the client is
    supplied, embeds each chunk and appends to a parallel
    `_embeddings` VectorStore. Embedding failures fall back silently
    so the lexical path always succeeds.
  * `search(..., embedding_client=...)` — when the embedding store
    is populated and a query embedding succeeds, ranking blends
    semantic 70% / BM25 30%. When the client is missing or the query
    embed errors out, the search transparently falls back to the
    legacy TF-IDF + BM25 60/40 hybrid. Either way the public shape
    of the returned dict is identical.
  * `remove_document` keeps the dense store in lockstep so chunk
    deletion stays consistent.

- `state._retrieve_session_context` — resolves the embedding client
  per call (pickup new models without a restart), passes it to
  `add_document` + `search`. Existing TF-IDF behaviour is preserved
  for users without an embedding model installed; first-class
  semantic kicks in the moment one is dropped into
  `<dataDir>/embeddings/`.
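
A minimal sketch of the vector_store.py append + cosine-search core in plain Python, as described above (the real store also keeps chunk metadata and round-trips to JSON):

```python
import math


class VectorStore:
    def __init__(self) -> None:
        self.vectors: list[list[float]] = []

    def add(self, embedding: list[float]) -> None:
        if self.vectors and len(embedding) != len(self.vectors[0]):
            raise ValueError("embedding dimension mismatch")
        self.vectors.append(list(embedding))

    def search(self, query: list[float], top_k: int = 5) -> list[tuple[int, float]]:
        """Return (index, cosine similarity) pairs, best first."""
        q_norm = math.sqrt(sum(x * x for x in query))
        if q_norm == 0.0:
            return []  # zero-query handling: nothing to rank against
        scored = []
        for idx, vec in enumerate(self.vectors):
            v_norm = math.sqrt(sum(x * x for x in vec))
            if v_norm == 0.0:
                continue
            dot = sum(a * b for a, b in zip(query, vec))
            scored.append((idx, dot / (q_norm * v_norm)))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]
```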

Tests
- `tests/test_rag_embeddings.py`:
  * 9 cases on `parse_embedding_output` covering every realistic
    malformed-output path so `EmbeddingClientUnavailable` fires
    instead of returning bogus vectors.
  * 9 cases on `VectorStore` covering identical / orthogonal /
    ranked search, dim mismatch, empty input, remove-indices
    lockstep, dict round-trip, and zero-query handling.
  * 4 cases on `resolve_embedding_client` covering env override,
    data-dir convention, and the no-binary fallback.

Verification: tsc --noEmit clean, vitest 245, pytest 877 (+22 new).

Cross-platform notes
- macOS: ships llama-embedding via Homebrew alongside llama-server,
  picked up automatically on PATH.
- Linux: same llama.cpp build path; the binary lives next to
  llama-server in any local build directory.
- Windows: same again — the bundled llama.cpp release zip includes
  llama-embedding.exe. The env override exists for users with
  custom builds.

Embedding model
- This commit deliberately does NOT bundle an embedding GGUF in the
  app payload (~30-80 MB depending on model). Users drop one into
  `<dataDir>/embeddings/` (e.g. `bge-small-en-v1.5.Q4_K_M.gguf`) and
  semantic retrieval lights up automatically. A bundled default is a
  separate distribution decision that lives in `scripts/stage-runtime.mjs`.

Deferred for follow-up
- Settings UI affordance to download / pick an embedding model.
- Workspace-scoped indexing (Phase 3.7) — RAG docs shared across
  threads in a workspace rather than per-session.
… flag flip

Closes the silent-image-drop limitation that the hotfix v2 commit
gated against. When the user loads a vision-capable GGUF whose
mmproj projector lives alongside the main weights, llama-server now
gets `--mmproj <file>` on startup and the runtime sets
`LoadedModelInfo.visionEnabled = True`. The capability resolver picks
up the flag, the composer's image-attach button reappears, and image
input flows end-to-end through llama-server's native multimodal path.

Backend
- New `_resolve_mmproj_path` helper: scans the main GGUF's parent
  directory (and one level up, for snapshot-style HF caches) for
  `*mmproj*.gguf` siblings. Picks the largest match — the
  full-precision projector outperforms a quantised one when both
  are present.
- `LlamaCppEngine._build_command` returns a 4-tuple now —
  `(command, runtime_note, fell_back_to_native, mmproj_path)`. When
  the binary advertises `--mmproj` support (help-text gate) and a
  sibling projector is found, the flag is appended to the command
  and the path is propagated up to `load_model`.
- `load_model` flips `LoadedModelInfo.visionEnabled` based on the
  resolved mmproj path. Models without a sibling projector load
  unchanged with `visionEnabled=False`, preserving the hotfix's
  protective behaviour.

Tests
- tests/test_mmproj_vision.py — 10 new cases:
  * 7 cover the resolver: None / nonexistent inputs return None,
    same-dir match wins, descriptive filenames match by substring,
    largest projector wins on ties, sibling-directory walker fires
    one level up.
  * 3 cover the capability flip: vision_enabled=False keeps the flag
    demoted, vision_enabled=True promotes when the catalog has the
    "vision" tag, MLX engine demotes regardless (mlx-vlm not yet
    wired).
- tests/test_inference.py — every existing `_build_command` mock
  updated to match the new 4-tuple signature (4 fixtures across 8
  test sites).

Verification: tsc --noEmit clean, vitest 245, pytest 887 (+10 new).

User experience
- Drop a vision-capable GGUF (e.g. gemma-3-27b-it-qat-4bit with its
  matching mmproj into the same folder) and load via My Models.
  The Vision badge in ChatHeader turns green, the paperclip
  reappears in the composer, and attached images now reach the
  model. No regression for text-only models — they continue to
  load with `visionEnabled=False` and the gate stays in force.

Deferred
- mlx-vlm wiring for MLX-routed vision (separate effort; needs the
  vision projector loaded on the worker subprocess side).
- Auto-download of the matching mmproj when a user loads a vision
  model whose projector isn't local yet.
Adds first-class Model Context Protocol support so the chat agent
loop can dispatch tools provided by external MCP servers alongside
the in-tree built-ins. Stdio transport only for first ship; SSE /
WebSocket transports are forward-compatible extensions.

Backend
- New `backend_service/mcp/` package:
  * `client.McpClient` — JSON-RPC 2.0 over a subprocess pipe.
    Supports the bare-minimum slice of MCP needed for tool work
    (`initialize` / `notifications/initialized` / `tools/list` /
    `tools/call`); resources, prompts, sampling, and roots are
    accepted but not surfaced. Stdout drained in a worker thread
    so reads never block the calling thread on a busy server.
    Tolerates non-JSON log lines servers occasionally emit on
    stdout. Configurable per-RPC timeout (default 30 s) plus a
    separate initialize timeout (default 15 s).
  * `client.McpServerConfig` — frozen dataclass mirroring the
    standard mcp-clients config blob (`id`, `command`, `args`,
    `env`, `enabled`). Round-trips through dict for settings
    persistence with strict validation.
  * `client._parse_json_rpc_line` and `_flatten_tool_result` are
    pure helpers exported for unit testing — the round-trip
    fixture also covers them via subprocess.
  * `tool_adapter.McpTool` wraps one remote tool as a `BaseTool`
    so the existing agent loop dispatches it without changes.
    Tool names are munged through `_safe_name` (`mcp__<server>__<tool>`)
    to satisfy OpenAI function-calling identifier rules. Errors
    from `client.call_tool` are converted to text so the agent
    loop's existing tool-result path handles them — no exception
    surface change.
  * `loader.load_mcp_tools(configs, log=...)` is the high-level
    entry point. Spawns each enabled server, runs the handshake,
    enumerates tools, and returns `(list[McpTool], list[McpClient])`
    for the caller to register and own respectively. A misbehaving
    server is isolated — its client is closed and skipped, the
    log callback fires, the rest proceed normally.

- `tools.BaseTool` gains a `provenance` property defaulting to
  `"builtin"`. `McpTool` overrides it to `"mcp:<server-id>"`.
- `tools.ToolRegistry.replace_mcp_tools(tools)` replaces only the
  MCP-sourced registrations, leaving built-ins untouched. Called
  whenever the user updates `mcpServers` in settings or the app
  starts up.
- `models.UpdateSettingsRequest` gains an `mcpServers` field
  (`list[McpServerConfigRequest]`) so the existing settings
  patch route persists configs without new endpoints. Each entry
  carries `id`, `command`, `args`, `env`, `enabled`.
- `/api/tools` route now emits a `provenance` field per tool so
  the upcoming UI badge can render Built-in vs MCP source.
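
A sketch of the tolerant `_parse_json_rpc_line` behaviour described above — log lines, bad JSON, arrays, and non-JSON-RPC objects all map to None (exact signature is illustrative):

```python
import json
from typing import Any, Optional


def parse_json_rpc_line(line: str) -> Optional[dict[str, Any]]:
    """Return a JSON-RPC response dict, or None for anything else on stdout."""
    line = line.strip()
    if not line:
        return None
    try:
        obj = json.loads(line)
    except json.JSONDecodeError:
        return None  # servers occasionally log plain text to stdout
    if not isinstance(obj, dict) or obj.get("jsonrpc") != "2.0":
        return None  # arrays and non-JSON-RPC objects are ignored
    return obj
```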

Stability fix bundled
- `_resolve_mmproj_path` now uses bounded directory iteration
  instead of `Path.rglob`. macOS test rigs exposed a case where
  the GGUF's grandparent dir was a system-cache root; rglob
  raised `OSError: Result too large` mid-scandir and broke the
  full pytest run. Bounded scan covers the same HF snapshot
  layouts (parent dir + immediate sibling dirs of grandparent)
  without recursing into unrelated trees.

Tests
- tests/test_mcp_client.py — 30 cases covering:
  * 6 cases on `_parse_json_rpc_line` (valid response, empty,
    log lines, bad JSON, non-JSON-RPC objects, arrays).
  * 5 cases on `_flatten_tool_result` (text concat, isError
    prefix, mixed content, empty list, non-dict input).
  * 6 cases on `McpServerConfig.from_dict` (round-trip + every
    rejection path).
  * 3 cases on `_safe_name` (basic format, sanitisation, empty
    placeholders).
  * 4 cases on `McpTool` (proxy to client, error → text,
    provenance tag format, fallback description).
  * 3 cases on `McpClient` round-trip via a Python `-c`
    fake-server fixture (initialize → list → call,
    pre-init-list raises, unknown command raises).
  * 3 cases on `load_mcp_tools` (healthy server, disabled
    server, isolation across one bad + one good server).

Verification: tsc --noEmit clean, vitest 245, pytest 917 (+30 new).

Deferred follow-ups
- Settings UI for managing mcpServers (drag-drop config import,
  per-server enable toggle, status pill). Backend payload field
  ready; routes already accept; just needs frontend.
- Auto-spawn at app startup. The infrastructure is in place
  (`loader.load_mcp_tools`); plugging into `state.startup` is a
  small follow-up that needs a settings-load + lifecycle decision
  on hot-reload behaviour.
- SSE / WebSocket transports for hosted MCP servers. Stdio
  covers every local server published today.
…down / image

ToolCallCard previously dumped every result as a collapsible JSON
block. Tools now opt in to a typed output protocol so the UI renders
web search hits as a clickable table, file reads as syntax-
highlighted code, and MCP image responses inline. Tools that haven't
migrated keep the JSON fallback unchanged.

Backend
- New `StructuredToolOutput` dataclass + `BaseTool.execute_structured`
  optional method returning `(text, render_as, data)`. Default impl
  returns None — legacy `execute(...) -> str` path stays active for
  every tool that hasn't opted in.
- `agent._execute_tool_call` calls `execute_structured` first; on
  None falls back to `execute`. The structured payload is captured
  on `ToolCallResult.render_as` + `data` and propagated through
  both the streaming `tool_call_result` event and the final
  `metrics["toolCalls"]` payload.
- Built-in tool migrations:
  * `WebSearchTool` returns a `table` with columns `["#", "Title",
    "URL", "Snippet"]` and rows derived from the same DDG results
    the legacy text summary uses. Empty queries / no results /
    network failures render as `markdown` so the user sees an
    actionable error.
  * `FileReaderTool` renders `.md` / `.markdown` / `.rst` files as
    rendered markdown and every other supported extension as
    syntax-highlighted code with the file's extension as the
    language hint. Errors render as markdown.
  * `CalculatorTool` renders the `expr = result` line as a code
    block (text language) so it sits in monospace alongside other
    code outputs.
  * `CodeExecutorTool` renders the captured stdout/stderr as code
    plus carries the source code separately under
    `data.sourceCode` for a future "show what was executed" UI.
- New `McpClient.call_tool_raw` — returns the unflattened MCP
  `tools/call` envelope so adapters can inspect content parts.
  `McpTool.execute_structured` now uses it to render single-image
  MCP responses inline (`renderAs: "image"` with a base64 data
  URI) and multi-part responses as markdown.
- Backend payload through to the SSE stream gains `renderAs` +
  `data` fields per tool call; legacy clients ignore them.
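
The structured-first dispatch described above reduces to "try execute_structured, fall back to execute"; a sketch with trimmed type names:

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class StructuredToolOutput:
    text: str
    render_as: str            # "table" | "code" | "markdown" | "image" | "json"
    data: Optional[dict[str, Any]] = None


class BaseTool:
    def execute(self, **kwargs: Any) -> str:
        raise NotImplementedError

    def execute_structured(self, **kwargs: Any) -> Optional[StructuredToolOutput]:
        return None  # legacy tools keep the plain-text path


def run_tool(tool: BaseTool, **kwargs: Any) -> tuple[str, Optional[str], Optional[dict]]:
    """Structured output wins; None falls back to the legacy string path."""
    structured = tool.execute_structured(**kwargs)
    if structured is not None:
        return structured.text, structured.render_as, structured.data
    return tool.execute(**kwargs), None, None  # JSON-fallback rendering downstream
```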

Frontend
- `ToolCallInfo` type gains `renderAs?: ToolRenderAs | null` and
  `data?: Record<string, unknown> | null`. Old payloads (no
  `renderAs`) keep working — the renderer falls back to the
  plain-text pre block.
- `ToolCallCard` switches on `renderAs`:
  * `table` → HTML `<table>` with header row + URL columns
    rendered as clickable links.
  * `code` → existing `CodeBlock` (Phase 1 syntax highlighter)
    with the language hint from `data.language`.
  * `markdown` → `RichMarkdown` for nicely-rendered prose / errors.
  * `image` → inline `<img>` with `src` from `data.src`.
  * `json` (default / fallback) → legacy collapsible pre block.
- New CSS for `.tool-output-table` (with title + clickable links),
  `.tool-output-markdown`, `.tool-output-image`.

Tests
- tests/test_structured_tool_output.py — 11 cases:
  * Calculator structured `code` render + markdown error path +
    legacy `execute` text unchanged.
  * FileReader Python → code with language=py, markdown → markdown,
    unknown extension → code with language=ext, error → markdown.
  * WebSearch table with 4-column header + 2-row body, empty
    query → markdown error, no results → markdown message.
  * Default `BaseTool.execute_structured` returns None so non-
    migrated tools take the legacy text path.

Verification: tsc --noEmit clean, vitest 245, pytest 928 (+11 new).

Pairs naturally with Phase 2.10 — MCP servers that return image
content parts now render inline rather than getting stringified to
JSON. Future tools can declare `renderAs: "chart"` to plug into a
plotting helper without disturbing the dispatch logic.
Adds Msty-style fork-from-here. Each assistant message now carries a
fork action; clicking it deep-copies the thread up to that point
into a new session and lands the user there for divergent
continuation. Parent linkage is preserved on the fork so the sidebar
can show a relationship hint and future merge / diff features have
the tie.

Backend
- New `ChaosEngineState.fork_session(source_id, fork_at, title?)`:
  * Looks up the source session under the same lock that owns the
    sessions list, raises `ValueError` for unknown id or out-of-
    range index.
  * Deep-copies messages [0..forkAtMessageIndex] so mutating the
    fork's messages can never bleed into the parent.
  * Carries the source's runtime profile (model, cache strategy,
    cache bits, fp16 layers, fused attention, fit-in-memory,
    context tokens, speculative decoding, dflash draft model,
    tree budget, thinking mode) so the fork resumes on the same
    config the parent was using.
  * Tags `parentSessionId` + `forkedAtMessageIndex` for sidebar
    rendering and downstream features.
  * Inserts at the top of the sessions list so the user sees the
    fork immediately.
  * Persists via the existing `_persist_sessions` path.

- New `ForkSessionRequest` Pydantic model
  (`forkAtMessageIndex >= 0`, optional `title <= 200 chars`).
- New route `POST /api/chat/sessions/{session_id}/fork` returning
  the same shape as `create_session`.
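
A trimmed sketch of the fork logic, assuming dict-shaped sessions (the real method also copies the runtime-profile fields listed above and runs under the sessions lock):

```python
import copy
import uuid
from typing import Any, Optional


def fork_session(sessions: list[dict[str, Any]], source_id: str,
                 fork_at: int, title: Optional[str] = None) -> dict[str, Any]:
    source = next((s for s in sessions if s["id"] == source_id), None)
    if source is None:
        raise ValueError(f"unknown session {source_id}")
    if not (0 <= fork_at < len(source["messages"])):
        raise ValueError("forkAtMessageIndex out of range")
    fork = {
        "id": uuid.uuid4().hex,
        "title": title or f'{source["title"]} (fork)',
        # deep copy so edits to the fork never bleed into the parent
        "messages": copy.deepcopy(source["messages"][: fork_at + 1]),
        "parentSessionId": source_id,
        "forkedAtMessageIndex": fork_at,
    }
    sessions.insert(0, fork)  # fork appears at the top of the sidebar
    return fork
```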

Frontend
- `ChatSession` type gains `parentSessionId?` + `forkedAtMessageIndex?`.
- `api.ts` adds `forkChatSession(sourceSessionId, forkAtMessageIndex, title?)`.
- `useChat.handleForkAtMessage(index)`:
  * Calls the API, upserts the new session into workspace state,
    swaps `activeChatId` to the fork, and sets the title draft.
  * Errors surface via the standard chat error path so the user
    sees a clear message if the backend rejects.
- `ChatThread` adds a fork-icon button next to retry on assistant
  message hover actions. Branch-shaped SVG icon, monochrome.
- `ChatTab` + `App.tsx` thread the new prop down (`onForkAtMessage`).
- `ChatSidebar` renders a `⑂ fork` purple badge on sessions that
  carry `parentSessionId`, with a hover tooltip showing the
  forked-at message index. CSS lands in `styles.css`.

Tests
- tests/test_fork_session.py — 10 cases:
  * Messages copied up to (and including) the chosen index.
  * Parent linkage (`parentSessionId` + `forkedAtMessageIndex`)
    preserved.
  * Runtime profile (model, modelRef, thinkingMode) carries.
  * Default title combines source title + " (fork)"; explicit
    title overrides cleanly.
  * Fork inserts at the top of the session list.
  * Deep-copy isolation: mutating fork messages doesn't touch
    the parent.
  * Unknown source id raises `ValueError`.
  * Out-of-range index (positive + negative) raises.

Verification: tsc --noEmit clean, vitest 245, pytest 938 (+10 new).

Pairs naturally with Phase 2.5 (in-thread multi-model compare):
forking lets users branch the same prompt to two different models
and continue both threads in parallel. The compare view can then
show side-by-side rendering keyed off `parentSessionId`.
Adds an instant compare affordance: pick another warm model from the
assistant message's action bar, get a sibling response for the same
prompt rendered as a card under the original answer. The override
model must already be loaded; we never auto-reload to avoid surprises.

Backend
- state.add_message_variant: re-runs the user prompt at index-1 against
  the currently-loaded model, attaches to messages[index].variants
- POST /api/chat/sessions/{id}/variants route + AddVariantRequest
- Tests cover happy path + index/role/runtime guards (8 cases)

Frontend
- ChatMessageVariant type + ChatMessage.variants
- addMessageVariant in api.ts + useChat.handleAddVariant exported
- VariantPickerButton (warm-model dropdown, current ref excluded)
- VariantCard inside ChatThread renders model name, tok/s, response
  time, optional reasoning panel, markdown body
- Props wired through ChatTab + App.tsx
- CSS for picker popover and variant stack
…ckers

Capabilities are already resolved server-side for the loaded model;
this surfaces the same flags before-load so users can tell which
options support vision / tools / reasoning / code etc. at a glance.

Frontend resolver mirrors the backend one-to-one: catalog tags win,
ref-name heuristics fill in for non-catalog entries. No backend
change needed — catalog tags ship on featuredModels already.

Changes
- utils/capabilities.ts: resolveCapabilities + emptyCapabilities
- ChatModelOption.capabilities populated from matched catalog variant
- ModelLaunchModal renders capability badges on selected card +
  every list option
- VariantPickerButton renders Vision / Tools / Reasoning / Code
  hints next to each warm model
- 7 unit tests cover catalog precedence, heuristic fallback,
  case normalisation, unknown-tag preservation, empty input
The sampler chain (top_p / top_k / min_p / repeat_penalty / seed /
mirostat) shipped in earlier work; backend already accepts a
`jsonSchema` field that llama-server enforces via
`response_format: json_schema`. This commit lights it up in the UI.

- SamplerOverrides.jsonSchemaText: raw textarea content, persisted
  per session so mid-type drafts survive remounts
- SamplerPanel renders a JSON-schema textarea with live parse
  validation (red error / muted ok hint)
- samplerPayload + useChat readSamplerPayload parse the schema text
  at send-time; malformed input drops out silently rather than
  blocking the request
- 7 new round-trip tests (parse / drop array / empty handling /
  unparseable text preserved across writes)

llama.cpp applies the schema; mlx-lm ignores it (out of scope for
the worker subprocess). DRY / XTC / GBNF stay deferred per
existing comment in models/__init__.py.
Templates can now declare variables and seed presets. The Use in
Chat button on a variable-bearing template opens a fill-form that
substitutes {{name}} placeholders before the prompt reaches the
composer. Preset model ref + preset samplers persist alongside the
template and are surfaced as badges in the detail view (composer
auto-apply lands in a follow-up).

Backend
- helpers/prompts.py: variables / presetSamplers / presetModelRef
  fields on create + update; _normalise_variables drops malformed
  entries and dedupes by name
- extract_placeholders + apply_variables for {{name}} substitution
  with bool / number / None coercion and unknown-name preservation
- PromptTemplateRequest extended; existing CRUD routes accept the
  new fields without breaking older clients
- 9 new tests: extraction order, substitution coercion, missing
  names preserved, preset persistence, update preserves untouched
  preset fields, malformed variable entries dropped
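
A sketch of the placeholder helpers, assuming `{{name}}` syntax with the bool / None coercion and unknown-name preservation described above:

```python
import re
from typing import Any

_PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def extract_placeholders(template: str) -> list[str]:
    """Placeholder names in first-appearance order, deduplicated."""
    seen: list[str] = []
    for match in _PLACEHOLDER.finditer(template):
        if match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def apply_variables(template: str, values: dict[str, Any]) -> str:
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            return match.group(0)  # unknown names preserved verbatim
        value = values[name]
        if value is None:
            return ""
        if isinstance(value, bool):
            return "true" if value else "false"
        return str(value)

    return _PLACEHOLDER.sub(substitute, template)
```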

Frontend
- PromptVariable type + PromptTemplate.variables / presetModelRef /
  presetSamplers
- Editor: variables (JSON array), preset model ref, preset
  samplers (JSON object) with placeholders
- Detail view shows preset model + variable count badges
- Fill form renders typed inputs (textarea / number / checkbox),
  live preview of resolved prompt, Apply to chat hands the
  substituted text to the composer
- applyVariables mirror of backend helper (bool / null / unknown
  semantics identical)
The /v1/chat/completions stub auto-loaded a model and accepted only
temperature + max_tokens; external scripts couldn't tune sampling.
This commit lights up the standard OpenAI sampler fields end-to-end
and adds /v1/embeddings via the bundled Phase 2.6 GGUF model.

Backend
- OpenAIChatCompletionRequest: top_p, top_k (extension),
  frequency_penalty, presence_penalty, seed, stop, response_format
- _LLAMA_SAMPLER_KEYS extended with frequency_penalty / presence_penalty
  / stop so _apply_sampler_kwargs forwards them on the llama path
- state.openai_chat_completion builds a samplers dict + extracts
  json_schema from response_format.json_schema.schema; passes both
  to runtime.generate / stream_generate
- New OpenAIEmbeddingsRequest + state.openai_embeddings:
  - Routes through resolve_embedding_client (Phase 2.6)
  - Returns 503 with actionable detail when no model is wired
  - Honours `dimensions` parameter for truncation
- POST /v1/embeddings registered alongside existing /v1/* routes

Tests (3 new — 958 passing total)
- Sampler fields reach the runtime via last_generate_kwargs
- Empty sampler set → samplers=None, json_schema=None
- /v1/embeddings 503s cleanly with no embedding client wired
The plan's catalog browser entry asked for size + arch + VRAM-fit
hints in a built-in HF browser. The HF search backend already
exists at /api/models/search; this commit lights up the per-variant
fit-vs-available-memory hint so users know whether a model will
load before clicking Download.

Three buckets:
- Fits     (estimate ≤ 70% available — comfortable, green)
- Tight    (estimate ≤ 100% available — yellow, may need to free RAM)
- Too big  (estimate > available — red, suggest a smaller quant)
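
The bucketing is a pure threshold check; a Python transcription of the same logic (the shipped helper is the TypeScript memoryFitBucket in OnlineModelsTab):

```python
from typing import Optional


def memory_fit_bucket(estimate_gb: Optional[float],
                      available_gb: Optional[float]) -> Optional[str]:
    """'comfortable' / 'tight' / 'over', or None when either number is unknown."""
    if estimate_gb is None or available_gb is None or available_gb <= 0:
        return None
    if estimate_gb <= 0.7 * available_gb:
        return "comfortable"
    if estimate_gb <= available_gb:
        return "tight"
    return "over"
```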

The hint is optimistic by design: TurboQuant / ChaosEngine cache
compression can reclaim ~50% of the listed estimate, so "Tight" is
still a usable signal rather than a hard block. The detailed tooltip
spells out the exact numbers and remediation.

Changes
- OnlineModelsTab: memoryFitBucket helper exported for testing;
  per-row badge inside the existing memory cell
- App.tsx threads workspace.system.availableMemoryGb through
- styles.css: memory-fit-badge--{comfortable,tight,over}
- 7 unit tests cover bucket boundaries + null-safety
…h gap

User-reported regressions:
1. First reasoning paragraph appeared visually separated from the
   rest — reasoning models tend to emit "First thought.\n\nMore..."
   which the markdown renderer turns into two paragraphs with a tall
   margin between them.
2. Wanted a collapsible streaming view that shows only 1-2 lines of
   the running thought rather than the whole panel auto-opening.

Changes
- ReasoningPanel defaults to collapsed during streaming; the user can
  expand explicitly. The expand decision sticks until streaming ends.
- Multi-line preview when collapsed mid-stream: last 2 non-empty
  lines joined with " · ", clamped to 2 visual lines via CSS.
- tidyReasoningForDisplay strips leading whitespace and collapses
  the *first* `\n\n` to a single newline so the first thought sits
  flush against subsequent content. Mid-stream paragraph breaks
  preserved.
- CSS tightens .reasoning-panel__content paragraph margins from the
  default ~16px to 6px, making the trace read as one continuous
  stream without losing structure.
- Chevron tints accent-strong while streaming so users notice the
  panel is interactive.

10 new unit tests for tidyReasoningForDisplay + lastLines covering
boundary conditions: empty input, leading whitespace, first-gap
collapse, mid-stream gap preservation, single-line passthrough.
Surfaces the substrate decisions the runtime made for each assistant
turn — engine, cache strategy, DDTree budget, accepted-token rate,
runtime warnings — as a strip of inline chips above the existing
collapsible Model Details fold-out. Operators can now tell at a
glance whether a turn went MLX vs llama.cpp, ChaosEngine vs
TurboQuant, and how aggressively speculative decoding ran.

The data already lands on every assistant message via inference.py
and mlx_worker.py; this commit just renders it. No backend change.

Changes
- SubstrateRoutingBadge component: builds chips from GenerationMetrics
  with separate keys for engine / cache / spec / acceptance / warn
- ChatThread renders the badge above the metrics <details> for any
  assistant message that has metrics
- styles.css: substrate-chip + tone variants (default / accent / warn)
- 9 unit tests cover empty input, engine fallback to backend, cache
  label synthesis, DDTree on/off, acceptance rate gating, runtime
  note truncation
Signature differentiator: lets operators flip cache compression
strategy (TurboQuant / ChaosEngine / Native) and bit width per
turn without touching launch settings. Backend already accepts the
fields on every GenerateRequest and reloads the runtime
transparently when the requested strategy / bits don't match
what's loaded — no engine-side change needed.

Frontend
- KvStrategyChip: composer popover listing all advertised cache
  strategies with bit-range buttons. Active strategy highlighted;
  unavailable strategies render greyed with a tooltip explaining
  the gap.
- kvStrategyOverride helper: read / write per-session blob to
  localStorage, mirrored from samplerOverrides shape.
- ChatTab owns the override state with cross-session persistence;
  ChatComposer renders the chip alongside SamplerPanel + temp.
- useChat reads the override at send-time; falls through to the
  active runtime profile when no override is set.
- App.tsx threads workspace.system.availableCacheStrategies through.
- styles.css: kv-chip + popover variants.

8 unit tests cover round-trip, malformed-input handling, null
clearing, per-session scoping.
Adds a structured inspection helper that runs at prompt-render
time and detects known chat-template quirks:

- Gemma family (Gemma-1 → Gemma-4) rejects the system role entirely;
  the helper flags this, and the fold-system-into-first-user fix
  (sketched below) is now applied automatically by mlx_worker before
  apply_chat_template fires
- ChatML templates that omit add_generation_prompt handling get
  surfaced as a runtime warning (template renders truncated
  prompts, model continues the user turn instead of replying)
- Templates that hard-code an assistant prefix while also
  branching on add_generation_prompt get flagged for double-prefix

The report's `to_runtime_note()` returns a single line that
threads through the existing runtime_note channel and shows up on
the Phase 3.4 substrate badge so users see "auto-fixed: Gemma
family — fold system into first user" without poking around.

Tests
- 15 unit tests cover Gemma family detection, fold idempotency,
  preservation of conversation order across the fold, missing /
  empty templates, ChatML detection, runtime-note formatting

mlx_worker._build_prompt_text now takes an optional model_ref so
the inspection runs only when we know which family we're rendering
for. The llama.cpp side is opaque (the template is parsed inside
llama-server), so detection there is a follow-up.
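
A minimal Python sketch of the fold-system-into-first-user fix
referenced above, assuming a plain role/content message list; the
helper name and message shape are illustrative, not the exact
mlx_worker internals:

    from typing import Dict, List

    def fold_system_into_first_user(messages: List[Dict[str, str]]) -> List[Dict[str, str]]:
        """Gemma templates reject the system role, so prepend the system text
        to the first user turn. A second pass is a no-op because no system
        message survives the first (idempotent), and conversation order is
        preserved."""
        system_text = "\n\n".join(m["content"] for m in messages if m["role"] == "system")
        if not system_text:
            return messages
        folded: List[Dict[str, str]] = []
        prefixed = False
        for message in messages:
            if message["role"] == "system":
                continue  # its content moves into the first user turn instead
            if message["role"] == "user" and not prefixed:
                folded.append({"role": "user",
                               "content": f"{system_text}\n\n{message['content']}"})
                prefixed = True
            else:
                folded.append(message)
        if not prefixed:  # degenerate case: no user turn to fold into
            folded.insert(0, {"role": "user", "content": system_text})
        return folded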
Captures CPU %, GPU %, available RAM, and thermal state at each
turn's stream finalisation. Renders below the substrate routing
badge as a compact perf-chip strip with tone variants (warn for
high CPU / low RAM, alert for tok/s under 1 or thermal critical).

Backend
- helpers/perf.py: snapshot_perf_telemetry() returns a typed
  PerfTelemetry blob, all fields optional. CPU + memory via
  psutil, thermal via existing pmset reader (Phase 2.0.5-I), GPU
  via the dashboard's _detect_gpu_utilization
- _stream_assistant_metrics_payload attaches `perfTelemetry` when
  any field samples non-null; samplers fail silently so a sampler
  bug never blocks turn finalisation
- 6 unit tests cover the dataclass shape + psutil/thermal failure
  fallthrough

Frontend
- GenerationMetrics.perfTelemetry typed
- ChatPerfStrip component renders chips: tok/s, CPU, GPU, free
  RAM, thermal — each with tone classification (default / warn /
  alert) so users glance at colour for hot spots
- ChatThread renders the strip below the substrate badge for any
  assistant message that has metrics
- styles.css: perf-chip + tone variants
- 10 unit tests cover chip composition + tone thresholds + null
  handling

macOS gets the full set today (thermal works); Windows / Linux
fall through to None on thermal until per-OS samplers land.
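
A minimal Python sketch of the snapshot shape and the fail-silent
sampling pattern; the psutil calls are real, while the GPU and thermal
probes are stand-ins for the existing readers:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class PerfTelemetry:
        cpu_percent: Optional[float] = None
        gpu_percent: Optional[float] = None
        available_ram_gb: Optional[float] = None
        thermal_state: Optional[str] = None

    def snapshot_perf_telemetry() -> PerfTelemetry:
        """Every sampler fails silently so a telemetry bug never blocks turn
        finalisation; fields that fail to sample simply stay None."""
        snapshot = PerfTelemetry()
        try:
            import psutil
            snapshot.cpu_percent = psutil.cpu_percent(interval=None)
            snapshot.available_ram_gb = psutil.virtual_memory().available / (1024 ** 3)
        except Exception:
            pass
        # GPU utilisation and thermal state would be sampled here via the
        # existing probes, wrapped in the same try/except-and-continue pattern.
        return snapshot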
cryptopoly added 29 commits May 5, 2026 08:13
The original WanInstallPanel.tsx that listed every supported Wan repo
in Discover was removed when the catalog tabs were rolled back to v0.7.2,
which orphaned the FU-025 backend endpoints (POST /api/setup/install-mlx-video-wan,
GET /api/setup/install-mlx-video-wan/status, GET /api/setup/mlx-video-wan/inventory)
plus the api.ts client funcs (startWanInstall / getWanInstallStatus /
getWanInventory).

This restores the install surface as a contextual single-repo panel
inside VideoStudioTab — when the user picks a Wan-AI variant on Apple
Silicon, the panel checks if the converted MLX dir for THAT specific
repo is on disk and either shows a "Ready" chip or an "Install" button.
Self-contained component owns its own polling so VideoStudioTab's state
hook stays clean.

Files:
  - new src/components/WanRuntimeInstaller.tsx — fetches inventory on
    mount, scoped to a single repo, polls /status at 1.5 Hz while a
    job is running, mirrors the LongLive install pattern. Renders a
    minimal log line (phase + percent + message) inline rather than
    pulling in the full InstallLogPanel — the LongLive variant doesn't
    accept WanInstallJobState in its union.
  - src/features/video/VideoStudioTab.tsx — renders the installer
    when ``isWanRepo && isAppleSiliconHost && !mlxVideoMissing``.
    Gating order: install mlx-video pip package first (existing
    flow), THEN convert the Wan checkpoint.
  - src/styles.css — terminal-style log panel, "Ready" chip, install
    button styling matched to the surrounding Studio actions.

tsc clean, 331 vitest pass. Backend endpoints unchanged. The
WanRuntimeInstaller fires the same convert pipeline that the live
2026-05-04 smoke validated end-to-end (FU-009 close-out commit
bcf88de).
… v0.1.5.1)

Adapts ddtree.py to the new target_ops adapter pattern dflash-mlx
introduced in v0.1.5. The old runtime top-level primitives map to the
new adapter methods as follows:

  target_forward_with_hidden_states  -> target_ops.forward_with_hidden_capture
  extract_context_feature_from_dict  -> target_ops.extract_context_feature
  make_target_cache                  -> target_ops.make_cache
  _target_embed_tokens               -> target_ops.embed_tokens
  _target_text_model                 -> target_ops.text_model
  _lm_head_logits                    -> target_ops.logits_from_hidden

ContextOnlyDraftKVCache moved off ``dflash_mlx.runtime`` onto
``dflash_mlx.model``. ``create_attention_mask`` re-imported from
``mlx_lm.models.base`` (dflash dropped the runtime re-export).
``trim_cache_to`` was removed entirely — replaced with a local
``_trim_cache_to`` shim in ddtree.py that calls each cache entry's
own ``.rollback()`` / ``.trim()`` / ``.crop()`` based on what the
type exposes.
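
A minimal Python sketch of that shim; the dispatch order mirrors the
description, while the argument semantics (tokens to drop vs. target
length for ``crop``) are assumptions about the upstream cache types:

    def _trim_cache_to(cache, num_to_drop: int) -> None:
        """Roll every cache entry back by num_to_drop tokens, using whichever
        trim-style method the entry type exposes."""
        if num_to_drop <= 0:
            return
        for entry in cache:
            if hasattr(entry, "rollback"):
                entry.rollback(num_to_drop)
            elif hasattr(entry, "trim"):
                entry.trim(num_to_drop)
            elif hasattr(entry, "crop"):
                # crop() is assumed to take a target length rather than a delta
                entry.crop(max(getattr(entry, "offset", num_to_drop) - num_to_drop, 0))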

Adapter resolved once at the top of ``generate_ddtree_mlx`` via
``resolve_target_ops(target_model)`` so we don't pay repeated
backend lookups in the decode loop.

Live smoke against ``mlx-community/Qwen2.5-0.5B-Instruct-4bit``:
  - target_ops backend = qwen_gdn, family = pure_attention
  - forward+capture, embed_tokens, text_model, logits_from_hidden,
    extract_context_feature, _trim_cache_to all working

Tests: 1252 pass / 1 skip, zero regressions vs the 0.1.4.1 baseline.

Gains over 0.1.4.1:
  * draft model quantization with Metal MMA kernels
  * branchless Metal kernels + fused draft KV projections
  * long-context runtime diagnostics
The Apple Silicon dev box can't exercise these live — the wiring is in
place so a Windows / Linux CUDA pull can validate it end-to-end.

FU-023 Nunchaku / SVDQuant 4-bit weight quant on CUDA:
  - _try_load_nunchaku_transformer helper in image_runtime.py preferred
    over NF4 / int8wo when device == "cuda" + nunchakuRepo pinned +
    nunchaku importable. Falls back cleanly otherwise.
  - _nunchaku_transformer_class_for_repo registry maps base repo to
    NunchakuFluxTransformer2dModel / NunchakuQwenImageTransformer2DModel
    / NunchakuSD3Transformer2DModel / NunchakuSanaTransformer2DModel /
    NunchakuPixArtSigmaTransformer2DModel.
  - ImageGenerationConfig / Request: nunchakuRepo, nunchakuFile fields.
    VideoGenerationConfig / Request: same fields parked for upstream
    Wan / HunyuanVideo / LTX wrappers (FLUX + Qwen-Image only in v1.2.1).
  - Catalog rows: FLUX.1 Dev × svdq-int4-flux.1-dev, FLUX.1 Schnell ×
    svdq-int4-flux.1-schnell. Roughly 3× faster than NF4 on a 4090 with
    quality near bf16; sub-second 4-step generation on Schnell.
  - Setup install: nunchaku>=1.2.1 in _INSTALLABLE_PIP_PACKAGES.
  - Variant key extends with nunchaku=... so toggling rebuilds.

FU-024 FP8 layerwise casting (CUDA SM 8.9+ Ada / Hopper / Blackwell):
  - _maybe_enable_fp8_layerwise helper calls transformer.
    enable_layerwise_casting(storage_dtype=…, compute_dtype=bf16)
    post-load.
  - Family-correct fp8: E5M2 for HunyuanVideo (matches upstream model
    card), E4M3 for FLUX / Wan / Qwen-Image / SD3 / LTX.
  - Compute-capability gate refuses pre-Ada GPUs since hardware fp8
    support isn't there and the cast slows wall-time vs bf16.
  - Graceful no-op when transformer.enable_layerwise_casting missing
    (UNet pipelines / old diffusers); error → runtimeNote.
  - Fields wired through both ImageGenerationConfig and
    VideoGenerationConfig + Request models + frontend hooks + types.
    Default off.
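
A minimal Python sketch of the FU-024 gating, assuming the
enable_layerwise_casting hook described above; family detection and
note plumbing are simplified:

    from typing import Optional
    import torch

    def maybe_enable_fp8_layerwise(transformer, family: str) -> Optional[str]:
        """Enable FP8 layerwise storage on SM 8.9+ GPUs; return a runtime note
        when the cast is skipped or fails instead of breaking the load."""
        if not torch.cuda.is_available():
            return "fp8 layerwise casting skipped: no CUDA device"
        if torch.cuda.get_device_capability() < (8, 9):
            return "fp8 layerwise casting skipped: needs SM 8.9+ (Ada or newer)"
        if not hasattr(transformer, "enable_layerwise_casting"):
            return "fp8 layerwise casting skipped: pipeline does not support it"
        # HunyuanVideo's model card uses E5M2 storage; the other families use E4M3.
        storage = torch.float8_e5m2 if family == "hunyuanvideo" else torch.float8_e4m3fn
        try:
            transformer.enable_layerwise_casting(storage_dtype=storage,
                                                 compute_dtype=torch.bfloat16)
        except Exception as exc:
            return f"fp8 layerwise casting failed: {exc}"
        return None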

FU-027 NVIDIA/kvpress (foundation only):
  - kvpress>=0.5.3 added to _INSTALLABLE_PIP_PACKAGES so the Setup tab
    can pre-stage the wheel.
  - Integration code lands separately under cache_compression/kvpress.py
    once we pick an adapter shape — upstream exposes ``presses`` per
    technique (SnapKV / TOVA / KIVI / pyramid) + a ``Pipeline`` wrapper.

Tests: 1250 pass / 1 skip / 2 deselected (pre-existing memory-pressure
flakes unrelated to this change), 331 vitest pass, tsc clean.
Mirrors the FU-018 previewVae checkbox pattern. Backend + hooks +
types already plumbed in bc12d5c — only the UI render was missing.

Image Studio: checkbox under previewVae, copy explains CUDA Ada+ gate.
Video Studio: checkbox after previewVae with the same gate explanation.

App.tsx threads the hook setters through. tsc clean, 331 vitest pass.
Mirrors the macOS shell scripts. Same env-var contracts, same install
destination ($HOME\.chaosengine\bin\), same version-tracking shape so
the existing Setup-page detector works on both platforms.

  scripts/build-llama-turbo.ps1
    - clones TheTom/llama-cpp-turboquant @ feature/turboquant-kv-cache
    - cmake configure with GGML_CUDA=ON when nvcc is on PATH
    - builds llama-server + llama-cli
    - installs as llama-server-turbo.exe
    - probes both Release\ and root build output dirs (multi-config
      vs Ninja generator)

  scripts/build-sdcpp.ps1
    - clones leejet/stable-diffusion.cpp @ master
    - cmake configure with SD_CUBLAS=ON when nvcc is on PATH
    - builds sd-cli target (upstream renamed sd -> sd-cli around master-590)
    - installs as sd.exe (legacy filename so the runtime resolver keeps
      working without a rename)

Both honor CHAOSENGINE_BIN_DIR / *_NO_CUDA env-var overrides for CI.
Static link (BUILD_SHARED_LIBS=OFF) so installed binaries don't drag
a .dll trail.
Windows PowerShell reads .ps1 files as Windows-1252 by default. The
em-dash (U+2014) bytes encoded as UTF-8 (0xE2 0x80 0x94) get mis-decoded
to "âEUR"" which the parser sees as ``unexpected token``. Stripping the
em-dashes from throw messages avoids the encoding pitfall and keeps the
script working without a BOM.

Reported by Cryptopoly running .\scripts\build-llama-turbo.ps1 on a
fresh Windows pull.
Without -G, cmake defaulted to "NMake Makefiles" which only works
inside a Visual Studio Developer Command Prompt. Vanilla PowerShell
runs died with "Running 'nmake' '-?' failed" before any compile
started, even with VS 2022 Build Tools installed.

Probe in order: CHAOSENGINE_LLAMA_TURBO_GENERATOR override, then
Ninja if on PATH, then "Visual Studio 17 2022" + -A x64 (cmake
locates VS via vswhere from outside a developer prompt as long as
the build tools are installed -- which the script header already
lists as a prerequisite).
CMake refuses to switch generators in an existing build directory:
"Does not match the generator used previously: NMake Makefiles".
Users who hit the previous default-NMake failure on Windows, then
re-ran the script after the generator-selection fix, got blocked by
their own stale build/CMakeCache.txt and were left with only a
hand-delete-the-build-directory instruction.

Detect the cached CMAKE_GENERATOR line, compare it to the generator
we picked this run, and wipe build/ when they differ. Same-generator
re-runs keep their incremental cache.
Select-String -SimpleMatch disables regex, which made the leading ^
in '^CMAKE_GENERATOR:INTERNAL=' a literal character. The pattern
never matched any line, the if block silently skipped, and users
re-running the script after the previous failed NMake attempt still
hit "Does not match the generator used previously: NMake Makefiles".

Drop -SimpleMatch so the regex anchor works, take only the first
match (CMAKE_GENERATOR_INSTANCE etc. share the prefix), and trim
trailing whitespace from the cached value before comparing.
CMake's "could not find any instance of Visual Studio" error is
technically correct but easy to misread as a script bug, especially
on CUDA hosts: nvcc was detected successfully, so users assume the
toolchain is fine. nvcc proxies to cl.exe on Windows, so CUDA
without MSVC cannot compile anything regardless.

Probe via vswhere for a VC.Tools.x86.x64 installation before kicking
off cmake configure. When missing, throw a clear message with both
download links (full Community and the smaller Build Tools), the
required workload name, and the next-step instruction. Successful
detection logs the resolved install path so users see which VS
copy CMake will actually pick.
Microsoft's installer often flags VS 2022 Build Tools installs as
isComplete=0 (some optional component is missing) even when cl.exe
works fine. vswhere -latest WITHOUT -all silently excludes those,
and so does CMake's own internal probe -- which is why a fully
functional install can still produce "could not find any instance
of Visual Studio" from cmake configure.

Switch the pre-flight probe from a -requires component filter to a
-find for cl.exe under VC\Tools\MSVC, with -all so isComplete=0
installs come back. Pick the highest cl.exe version, walk back to
the install root, and pass it to CMake explicitly via
-DCMAKE_GENERATOR_INSTANCE=<path> so cmake doesn't repeat the same
filter and reject the same install.
CMake's "Visual Studio 17 2022" generator rejects a path-only
CMAKE_GENERATOR_INSTANCE when the install isn't in the Visual
Studio Installer's known-instances registry, with:
  "instance is not known to the Visual Studio Installer, and no
   version= field was given"
isComplete=0 installs are filtered out of that registry, so the
fix from the previous commit (pass the cl.exe-derived path) still
landed in the same wall.

Pull installationVersion from `vswhere ... -format json` for the
matched install, format the value as "<path>,version=<x>", and
hand that to CMake. Falls back to bare path when the version
lookup fails.
CMake's CUDA detection ("ggml/src/ggml-cuda/CMakeLists.txt:58
enable_language") fails with "No CUDA toolset found" when the
CUDA installer's MSBuild integration files
(CUDA <ver>.props/.targets/.xml + Nvda.Build.CudaTasks.<ver>.dll)
aren't present in the Visual Studio BuildCustomizations directory.
This is the default state when CUDA was installed before Visual
Studio, or when "Visual Studio Integration" was unticked during
the CUDA install.

Add a Sync-CudaVsIntegration helper that:
  - locates the CUDA source via $env:CUDA_PATH
  - resolves the VS BuildCustomizations target from the install
    root we already detected via vswhere
  - skips when up to date
  - copies missing files directly, falling back to a UAC-elevated
    Start-Process powershell -Verb RunAs when the target dir
    refuses our writes (Program Files is admin-only)
  - prints a manual one-liner if even the elevated copy fails

Called between VS detection and cmake configure when GGML_CUDA is
on, so the build no longer dies on the first CUDA-language probe.
Two follow-on bugs from the previous CUDA sync commit:

1. Copy-Item -LiteralPath does NOT support wildcards. The elevated
   "Copy-Item -LiteralPath '...\*' ..." treated * as a literal
   filename, silently copied nothing, and exited 0 -- so the script
   reported "files copied (elevated)" while the target dir stayed
   empty. Switched the elevated payload to per-file Copy-Item
   commands built from the missing list, and added a verify step
   inside the elevated session plus a re-check from the parent
   shell so a no-op success can no longer slip through.

2. CMake caches CUDA-language detection in build/CMakeCache.txt.
   When the integration files are installed AFTER a failed
   configure, CMake re-runs enable_language(CUDA) but its compiler
   ID test result was cached and not re-tested -- so even the
   second run with files in place still printed "No CUDA toolset
   found." Sync-CudaVsIntegration now returns $true when it
   actually copied something, and the cache-invalidation block
   wipes build/ for that reason in addition to a generator change.
build-sdcpp.ps1 hit the same NMake-default + isComplete=0 +
"No CUDA toolset" wall as build-llama-turbo.ps1 did. Rather than
duplicate ~150 lines of toolchain plumbing, lift the four helpers
(generator selection, VS install probe, CUDA VS-integration sync,
stale-cache wipe) into scripts/lib/windows-msvc-cuda.ps1 and
dot-source from both builders.

Both scripts now share:
  - Resolve-CmakeWindowsBuildContext (env override -> Ninja -> VS 2022)
  - Sync-CudaVsIntegration (UAC-elevated copy of CUDA .props/.targets)
  - Get-CmakeWindowsConfigureArgs (-G/-A/-DCMAKE_GENERATOR_INSTANCE)
  - Invoke-CmakeStaleCacheWipe (generator change + post-CUDA-install)

build-sdcpp.ps1 picks up: NMake-fallback fix, isComplete=0 install
acceptance, version=<x> on CMAKE_GENERATOR_INSTANCE, automatic CUDA
integration copy with UAC fallback, and stale-cache invalidation.

Per-script overrides keep their distinct env names:
CHAOSENGINE_LLAMA_TURBO_GENERATOR vs CHAOSENGINE_SDCPP_GENERATOR.
On Windows, pip.exe refuses to upgrade itself with:

  ERROR: To modify pip, please run the following command:
  <python> -m pip install --upgrade pip

because it cannot overwrite its own running .exe shim. The bare
`.venv\Scripts\pip install --upgrade pip` call in build.ps1 hit
this every time and aborted the whole build before any other
Python deps installed.

Switch all four pip invocations in build.ps1 to `python -m pip`
via a $VenvPython variable. python.exe holds the file handle and
can replace pip cleanly. No behavior change beyond unblocking the
upgrade step.
Two related fixes for the "CogVideoX 2B won't load on a 24 GB 4090"
report.

1. Diffusers' lazy-import wrapper hides the real cause of T5
   encoder failures. The user saw:

     "Failed to import diffusers.pipelines.cogvideo.pipeline_cogvideox
      because of the following error (look up to see its traceback):
      Could not import module 'T5EncoderModel'."

   The actual underlying chain on this user's machine was:

     transformers.quantizers -> torchao.utils ->
     torch.utils._pytree.register_constant attribute missing
     (torch 2.6.0+cpu, torchao wants >= torch 2.11)

   plus the broader signal that the GPU bundle ended up installing
   the +cpu torch wheel on a CUDA host.

   Add backend_service/helpers/video_runtime_diagnostics.py with
   diagnose_diffusers_lazy_import_error(). Probes the dep chain
   (torch, sentencepiece, protobuf, transformers.quantizers,
   transformers) and surfaces the first concrete failure with a
   Setup-page hint. Two specialised paths come first:
     * +cpu torch on a CUDA host -> "Install CUDA torch"
     * torchao + torch < 2.11 mismatch -> "re-run Install GPU runtime
       or uninstall torchao"

   Wire it into the /api/video/preload route so the row banner gets
   actionable text instead of the diffusers wrapper. Also log the
   full traceback at backend so future diagnostics aren't lost.

2. CogVideoX 2B's catalog runtimeFootprintGb of 19.0 was the
   worst-case fp32 figure. bf16 + standard placement is ~13 GB on
   CUDA, ~15 GB on MPS. The 24 GB 4090 case (budget = 24 * 0.7 =
   16.8 GB) was tripping "danger -- would crash" on a config that
   actually fits. Right-size CogVideoX 2B / 5B / 1.5-5b with explicit
   runtimeFootprintCudaGb + runtimeFootprintMpsGb numbers reflecting
   the real bf16 path.

   Also rewrite the assessVideoGenerationSafety message for the
   "model footprint > budget" branch. The runtime auto-engages
   sequential CPU offload when .to(device) OOMs (see
   video_runtime.py::_ensure_pipeline), so "would crash the backend"
   was wrong -- generation succeeds but each step is a few times
   slower. Match the test on the stable bits ("resident",
   "sequential CPU offload", "smaller model") so future copy edits
   don't keep breaking it.
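
A minimal Python sketch of the item-1 probe order (CPU-wheel check
first, then the import chain); the helper name, hint strings, and
module list are illustrative:

    import importlib
    import shutil
    from typing import Optional

    def diagnose_lazy_import_error() -> Optional[str]:
        """Surface the first concrete dependency failure behind diffusers'
        opaque "Could not import module 'T5EncoderModel'" wrapper."""
        try:
            import torch
            if "+cpu" in torch.__version__ and shutil.which("nvidia-smi"):
                return ("torch is installed as a CPU-only wheel even though an NVIDIA "
                        "GPU is present; open Settings > Setup and click Install CUDA torch.")
        except Exception as exc:
            return f"torch failed to import: {exc}"
        for name in ("sentencepiece", "google.protobuf", "transformers"):
            try:
                importlib.import_module(name)
            except Exception as exc:
                return f"{name} failed to import: {exc}. Re-run the runtime install from Setup."
        return None  # chain looks healthy; re-raise the original error upstream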
Two unrelated UX fixes that share a root pattern: defaults that lie.

1. The frontend default ``emptyLaunchPreferences.maxTokens = 512`` was
   wildly out of sync with the backend default of 4096 (matched in
   models/__init__.py LaunchPreferencesRequest, GenerateRequest, and
   the runaway guard in state.py at maxTokens * 6 chars). A user who
   sent their first chat message before opening Settings got their
   answer cut off mid-output around 3000 chars -- exactly what the
   "JS solar system, last property reads `diameter: '50,72`" report
   showed. Bump the seed value to 4096; the slider range was already
   256-32768 so power users could already opt up but new users were
   silently capped 8x lower than the backend expected.

2. The Studio chips lit up "Real engine ready" + "Device: cuda
   (expected)" purely from nvidia-smi presence, with no check that
   the installed torch wheel was actually CUDA-built. A user with a
   broken install (4090 + ``torch 2.6.0+cpu``) saw nothing but green
   while every generation silently ran on CPU at a fraction of GPU
   speed. The torchInstallWarning probe in helpers/gpu.py reads
   ``torch/version.py`` directly -- not dist-info, because pip
   leaves stale ``torch-X.Y.Z+cu124.dist-info`` next to a later
   ``+cpu`` install -- and reports a one-line warning when:

     * torch is +cpu but nvidia-smi present (the user's case)
     * torch missing entirely on a CUDA host
     * torch missing entirely on Apple Silicon

   Plumbed through VideoRuntimeStatus and ImageRuntimeStatus
   (without importing torch -- safe to call from probe() despite
   Windows DLL-lock concerns). Studios render it as a red callout
   above the chip row plus a "CPU fallback" danger badge so the
   warning is visible before any model loads.

Tests: src/utils/__tests__/videos.test.ts (60/60), tsc clean. The
3 image and 1 setup-route test failures are pre-existing on this
branch (Windows path separators + image footprint estimator) and
not touched by this change.
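
A minimal Python sketch of the item-2 probe, reduced to the NVIDIA
cases; reading torch/version.py off disk avoids both importing torch
(Windows DLL locks) and trusting stale dist-info:

    import importlib.util
    import re
    import shutil
    from pathlib import Path
    from typing import Optional

    def torch_install_warning() -> Optional[str]:
        """Warn when the installed torch wheel cannot drive the GPU that
        nvidia-smi says is present."""
        has_nvidia_gpu = shutil.which("nvidia-smi") is not None
        spec = importlib.util.find_spec("torch")  # locates torch without importing it
        if spec is None or spec.origin is None:
            return "torch is not installed on a CUDA host" if has_nvidia_gpu else None
        version_py = Path(spec.origin).parent / "version.py"
        if not version_py.exists():
            return None
        match = re.search(r"__version__\s*=\s*['\"]([^'\"]+)['\"]", version_py.read_text())
        if match and "+cpu" in match.group(1) and has_nvidia_gpu:
            return (f"torch {match.group(1)} is a CPU-only wheel but an NVIDIA GPU is "
                    "present; generation will run on CPU until CUDA torch is installed.")
        return None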
Bug: persistent launch-settings panel pushed only ``paramsB`` into
``previewControls`` when ``previewVariant`` changed; ``numLayers`` /
``numHeads`` / ``numKvHeads`` / ``hiddenSize`` stayed at the
``emptyPreview`` defaults (all zero). Native f16 cache estimate is
``2 * num_layers * num_kv_heads * head_dim * ctx * 2 bytes`` -- with
any factor at 0 the result collapses to ~0 GB. The Studio's
"Performance Preview" then showed Cache 0.0 GB / Speed 0.0 tok/s /
Quality 0.0% and the "Fits Easily" badge fired on models that don't
actually fit (e.g. Qwen3.6-27B-GGUF Q4_K_M at 256K context, which
needs ~32 GB KV cache + 16 GB weights = ~48 GB on a 64 GB box).
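
The same estimate as a small Python function; deriving head_dim as
hiddenSize / numHeads is a standard assumption, and any factor left at
zero collapses the result to ~0 GB, which is exactly the reported
symptom:

    def native_f16_kv_cache_gb(num_layers: int, num_kv_heads: int, hidden_size: int,
                               num_heads: int, context_length: int) -> float:
        head_dim = hidden_size / num_heads if num_heads else 0.0
        # 2 (K and V) * layers * kv heads * head_dim * context * 2 bytes (f16)
        cache_bytes = 2 * num_layers * num_kv_heads * head_dim * context_length * 2
        return cache_bytes / (1024 ** 3)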

Fix: when ``previewVariant.paramsB`` changes, also derive
``numLayers / hiddenSize / numHeads / numKvHeads`` via
``estimateArchFromParams`` and push the full set into
``previewControls``. Mirrors the existing launch-modal effect at
App.tsx line 822.

Reported by Cryptopoly: load failure on Qwen3.6-27B-GGUF despite GUI
claiming "Fits Easily" -- root cause was the false Fits Easily badge
hiding the actual context-cache pressure.

Tests: 1252 pytest pass / 1 skip / 2 deselected (pre-existing flakes),
331 vitest pass, tsc clean.
The "FULL CONTEXT MAY NOT FIT" preview compared the optimized KV
cache against system RAM (totalMemoryGb) only. On a CUDA host that's
the wrong constraint: llama.cpp puts the KV cache on the GPU when
ngl=999 (the default for offload-capable models), so a 60 GB f16
cache on a 24 GB 4090 OOMs the GPU long before system RAM (64 GB)
starts to matter. The user reported this directly -- "is the warning
measuring the limit on the system memory 64 GB instead of GPU
memory?"

Changes:

  * helpers/system.py: include ``gpuVramTotalGb`` in the system
    snapshot. Reuses the existing get_device_vram_total_gb() probe
    in helpers/gpu.py. Stays None on Apple Silicon (unified memory
    is already in totalMemoryGb -- reporting it again would
    double-count and produce nonsense like "60 GB > 24 GB VRAM" on
    a 64 GB Mac).

  * SystemStats.gpuVramTotalGb in src/types.ts.

  * getCacheFitStatus(optimizedCacheGb, diskGb, totalGb, bits,
    gpuVramGb?): when a discrete GPU is reported, compare the
    cache against 0.85 * gpuVramGb FIRST. If it overflows VRAM,
    return a "Cache won't fit GPU" warning that names the actual
    VRAM ceiling and recommends RotorQuant / TurboQuant or lower
    context. The system-RAM check still runs as a fallback for
    Apple Silicon and CPU-only hosts.

  * Plumb gpuVramTotalGb through every PerformancePreview consumer:
    PerformancePreview, RuntimeControls, ModelLaunchModal,
    LaunchModal, App.tsx, CompareView, ConversionTab,
    BenchmarkRunTab.

For the user's exact case (Qwen3.6-35B-A3B GGUF, 256K context,
native f16, 24 GB 4090): the warning now correctly reads "60 GB
KV cache larger than 24 GB GPU VRAM ... pick a compressed
strategy" instead of pointing at the 64 GB system RAM ceiling that
isn't actually the binding constraint.
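
A minimal Python sketch of the decision order (the shipped helper is
the TypeScript getCacheFitStatus; the 0.85 headroom factor comes from
the description above, the message text is illustrative):

    from typing import Optional

    def cache_fit_warning(optimized_cache_gb: float, total_memory_gb: float,
                          gpu_vram_gb: Optional[float]) -> Optional[str]:
        if gpu_vram_gb:  # discrete GPU reported: VRAM is the binding constraint
            if optimized_cache_gb > 0.85 * gpu_vram_gb:
                return (f"{optimized_cache_gb:.0f} GB KV cache larger than "
                        f"{gpu_vram_gb:.0f} GB GPU VRAM; pick a compressed strategy "
                        "or lower the context length")
            return None
        # Apple Silicon and CPU-only hosts fall back to the system-RAM check
        if optimized_cache_gb > total_memory_gb:
            return (f"{optimized_cache_gb:.0f} GB KV cache exceeds "
                    f"{total_memory_gb:.0f} GB system RAM; full context may not fit")
        return None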

Tests: cache.test.ts (20/20), tsc clean, python services +
backend tests pass. Snapshot smoke confirms gpuVramTotalGb=23.99
on the dev host.
The previous commit wired diagnose_diffusers_lazy_import_error()
into /api/video/preload only. The user's report shows the same
diffusers wrapper firing on the GENERATE path (CogVideoX 2B was
already preloaded, the lazy import only triggered when generate()
actually invoked the T5 text encoder). The route re-raised the
opaque "Could not import module 'T5EncoderModel'" message
unchanged.

Wire the same diagnostic into:
  * /api/video/generate (both RuntimeError and Exception branches)
  * /api/images/generate (Exception branch)

Also bump the logged traceback from the last 500 chars to the last
2000 -- the chain that breaks T5 (transformers.quantizers ->
torchao.utils -> torch.utils._pytree.register_constant) goes deeper
than 500 chars and was getting truncated mid-frame in the log.

For the user's exact runtime state, the diagnostic now surfaces:
  "torch is installed as a CPU-only wheel (2.6.0+cpu) even though
   an NVIDIA GPU is present. Generation will run on CPU at a
   fraction of GPU speed. Open Settings > Setup and click Install
   CUDA torch, then Restart Backend."
instead of the opaque T5EncoderModel wrapper.
Two issues with the GPU acceleration warning the user just spotted:

1. Image Studio showed the red "GPU acceleration not active" banner;
   Video Studio did not -- both have an NVIDIA GPU + +cpu torch, so
   both should warn.

   Root cause: my earlier replace_all Edit on video_runtime.py only
   matched the *placeholder* return path (16-space indent) and
   missed the success-path return (12-space indent) at line 961.
   On a host where torch was importable but +cpu, the success path
   ran with realGenerationAvailable=True and never set
   torchInstallWarning -- so the field came back null and the
   banner silently dropped. Add it explicitly to the success-path
   VideoRuntimeStatus return so both code paths emit the warning.

2. Both warnings just told the user to "Open Settings > Setup and
   click Install CUDA torch", which works but requires navigation.
   Add an inline "Install CUDA torch" button right inside the
   warning callout that fires the existing handleInstallCudaTorch
   handler from App.tsx (already wired to /api/setup/install-cuda-torch).
   Button only renders when the warning is specifically the "+cpu
   wheel" case; for "torch missing entirely", the existing larger
   "Install GPU runtime" primary action below the chip row covers
   it without duplicating buttons.

   Plumbed onInstallCudaTorch + installingCudaTorch as new optional
   props through ImageStudioTab and VideoStudioTab. Spinner state
   ("Installing CUDA torch...") replaces the button text while the
   ~30-60s install runs.

Tests: vitest video + cache (80/80), tsc clean.
The inline "Install CUDA torch" button I added in 25bbe0c spun and
showed a one-line success/failure summary, but no terminal output
for debugging. Users hitting "No CUDA wheel for this Python" or pip
resolver clashes had no way to see which CUDA index (cu124 / cu126 /
cu128 / cu121) was tried and what pip actually said -- they had to
open the backend Logs tab and grep.

Add a CudaTorchLogPanel component that mirrors the visual shape of
InstallLogPanel (single scrollable terminal, [ OK ] / [FAIL] markers
per attempt, target-dir / Python / index-url meta line) but is
keyed off the CudaTorchInstallResult shape returned by
/api/setup/install-cuda-torch -- the endpoint is synchronous and
returns the full attempts array on completion, so the panel only
needs to show a final state, not stream a phase lifecycle.

Behaviour:
  * Collapsed by default on success, auto-opens on failure
  * Same pip-noise filter and 80-line tail cap as InstallLogPanel
    (resolver complaints from unrelated installed packages get
    dropped from the displayed log but stay in the raw output for
    backend support)
  * Suppresses itself when there's nothing to render

Plumb the raw CudaTorchInstallResult from App.tsx down through
ImageStudioTab and VideoStudioTab as a new optional prop. The
existing reduced ``cudaTorchResult`` summary shape stays as-is so
the App-level diagnostic banner doesn't need to change.

tsc clean. The 3 failing tests in src/utils/__tests__/images.test.ts
are pre-existing on this branch, unrelated to this change (they
fail on origin/feature/chat-level-up too).
The user clicked the inline Install CUDA torch button and saw the
spinner stop, the warning text stay the same, and no log panel
appear. Backend logs (chaosengine-backend-8876.log) confirm the
/api/setup/install-cuda-torch endpoint never logged a request -- the
network call either failed silently or never went out, and our
catch path threw away the raw result so CudaTorchLogPanel had
nothing to render. The user couldn't tell whether the install was
running, finished, or never reached the backend.

Four fixes that share the goal of making this self-explanatory:

1. Always synthesize a CudaTorchInstallResult on exception. Build a
   minimal failed-attempt result carrying the catch's error message
   so CudaTorchLogPanel renders a [FAIL] entry instead of an empty
   collapse. Whatever went wrong (network error, 5xx, timeout, CORS)
   now appears in the panel verbatim.

2. Auto-refresh image + video runtime status after the install
   handler returns (success OR failure). The pre-install probe is
   cached and the warning text stayed stale -- "Install CUDA torch"
   reappeared next to a button that just ran, making it look like
   the click did nothing. The probe re-run flips torchInstallWarning
   to its current value and the banner self-updates.

3. Detect "module 'torch' has no attribute 'cuda'" in the lazy-import
   diagnostic. This shows up when torch is half-installed (the wheel
   swap purged torch's C extension but the new install failed mid-way,
   leaving torch importable but with torch.cuda missing). The new
   pattern translates to "torch is partially broken -- re-run Install
   CUDA torch and Restart Backend".

4. Morph the warning callout into a "Restart Backend to activate"
   prompt when the install succeeds and requiresRestart=true. Same
   single-banner slot, just three states (post-install restart /
   GPU acceleration warning / nothing) so we never stack two
   banners. The Restart Backend button reuses the existing
   onRestartServer handler. CudaTorchLogPanel rides along in the
   restart prompt so the user can still inspect what pip actually
   did before clicking restart.

Also two adjacent fixes that shipped in the same pass:

5. Image runtime's generate-failure demotion path now preserves
   torchInstallWarning. Previously the moment a generation failed,
   activeEngine flipped to "placeholder" and the fresh
   ImageRuntimeStatus dropped the warning -- so the user saw the
   "Install GPU runtime" callout (wrong remedy when torch IS
   installed but +cpu) instead of "Install CUDA torch" (right
   remedy). Recompute the warning in the fallback status so the
   banner stays accurate through demotion.

6. Add .gitattributes pinning text-file line endings (Cargo.toml /
   tauri.conf.json / *.json / *.toml / *.ts / *.py to LF; *.ps1 to
   CRLF for Windows-native authoring). Stops Windows users on
   default core.autocrlf from seeing phantom Cargo.toml /
   tauri.conf.json modifications every checkout (which is what
   prompted "do we need to add these to gitignore?" -- no, they
   should stay tracked, the CRLF diff was the noise).

Tests: vitest cache + videos (80/80), tsc clean, python video +
backend tests pass. Diagnostic helper smoke-tests both new and
existing patterns correctly.
Drops the convert-to-MLX button from the chat My Models page (action no
longer relevant on Windows builds) and adds 32px of right padding to
.library-row-actions so the remaining chat / server / reveal / delete
icons don't sit flush against the panel edge.
Brings the chat / image / video phase 2-3 work into staging on top of
the recent Windows runtime fixes. Conflict resolution notes:

- gpu.py / test_gpu_detection.py: kept staging's subprocess-based torch
  probe (fixes a Windows DLL lock that PR 32's in-process import would
  re-introduce). Ported PR 32's torch_install_warning() in alongside it
  so image_runtime / video_runtime imports keep working.
- setup.py: kept staging's reset_torch_status_cache + reset_vram_total_cache
  (superset of PR 32's reset).
- pyproject.toml: bumped version to 0.7.4.
- VideoStudioTab.tsx: combined PR 32's CPU-fallback danger badge with
  staging's stricter restart-required condition; dropped staging's
  duplicate LTX-2 distilled callout (PR 32 has it elsewhere).
- styles.css: kept both library-row-actions layout properties; kept
  staging's contain: layout on install-log-panel.
- videos.ts: kept staging's estimateVideoRequestPeakGb +
  nf4RuntimeFootprintForRepo helpers; widened useNf4 type to
  boolean | null per PR 32; extended NF4 lookup to cover Wan 2.2-T2V-A14B
  and LTX-Video; PR 32's relaxed risk thresholds (0.85/0.95) carried
  through. Tests aligned with PR 32's relationship-style assertions.

This redo of PR #33 fixes a single-parent commit graph that lost the
link to PR 32's individual commits. Now landing as a real merge so
git log staging shows the full feature history.
@cryptopoly cryptopoly merged commit cea9084 into staging May 6, 2026
2 checks passed