Problem
Direct chat has drifted into a brittle control loop built around retries, deterministic guardrails, and post-hoc rewrites. Live validation has shown several recurring failure modes:
- answers that call tools but still fail to answer the user's actual question
- answers that mix incompatible evidence sources without reconciling them
- hallucinated UI affordances such as nonexistent dashboard panels, views, or controls
- generic fallback text that blocks obviously wrong answers but still produces a poor operator experience
- a growing set of regex/string guards that are hard to reason about and increasingly likely to trip unrelated prompts
The issue is not just prompt wording. The architecture currently optimizes for avoiding known bad strings instead of optimizing for a direct, evidence-grounded answer.
Goal
Replace most of the current deterministic post-processing with a structured reflection/audit pass that evaluates both:
- whether the right tools were called for the user's question
- whether the final answer is grounded in the actual tool results and directly answers the question
The target behavior is a thinner, more natural chat flow:
- answer the question
- call the right tools
- answer directly from the evidence
- perform one semantic audit pass
- either accept, rewrite once, or gather one missing evidence bundle
Non-goals
- Do not solve every chat quality problem in this issue.
- Do not redesign the MCP tool catalog itself.
- Do not expand deterministic guards except as a temporary thin safety net for clearly unacceptable outputs.
- Do not couple this work to spatial inference or other unrelated backend reasoning changes.
Relationship to existing issues
Proposed design
1. Split the current loop into explicit phases
Refactor the direct chat runtime into these phases:
- intent framing
- tool planning and execution
- candidate answer generation
- audit/reflection
- optional repair step
That should replace the current pattern where retries, evaluator prompts, and deterministic fallbacks are interleaved through one large loop.
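As a rough illustration of the phase split, the sketch below runs each phase exactly once, in order. `TurnState`, `run_turn`, and the stub phase functions are hypothetical names invented here; the real runtime's types and model calls are not shown in this issue.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class TurnState:
    """Everything one chat turn accumulates as it moves through the phases."""
    question: str
    intent: str = ""
    tool_trace: list = field(default_factory=list)
    candidate: str = ""
    verdict: dict = field(default_factory=dict)
    final_answer: str = ""
    phases_run: list = field(default_factory=list)


Phase = Callable[[TurnState], None]


def run_turn(state: TurnState, phases: list[Phase]) -> TurnState:
    """Run each phase exactly once, in order; no phase re-enters an earlier one."""
    for phase in phases:
        phase(state)
        state.phases_run.append(phase.__name__)
    return state


# Stub phases standing in for the real model and tool calls (assumptions).
def frame_intent(state: TurnState) -> None:
    state.intent = "overall_health"


def plan_and_run_tools(state: TurnState) -> None:
    state.tool_trace.append({"tool": "get_mesh_state", "result": {"nodes_online": 12}})


def generate_candidate(state: TurnState) -> None:
    state.candidate = "12 nodes are online."


def audit(state: TurnState) -> None:
    state.verdict = {"repair_action": "accept"}


def repair(state: TurnState) -> None:
    # Optional step: with an "accept" verdict the candidate passes through unchanged.
    if state.verdict["repair_action"] == "accept":
        state.final_answer = state.candidate


PHASES = [frame_intent, plan_and_run_tools, generate_candidate, audit, repair]
state = run_turn(TurnState(question="Is the mesh healthy?"), PHASES)
```

Because the phase list is linear, a retry can only exist as an explicit step inside the repair phase, not as a hidden branch elsewhere in the loop.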
2. Introduce a structured audit contract
The audit model should receive:
- the system prompt
- the user question
- rendered page context if present
- tool metadata for tools that were available to the model
- the actual tool trace: tool name, description, arguments, and compact result
- the candidate answer
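One possible shape for that audit input, assuming it is assembled as a plain dictionary before being rendered into the audit prompt. `build_audit_input`, `compact_result`, the key names, and the 400-character cap are illustrative assumptions, not an existing API.

```python
import json


def compact_result(result: object, max_chars: int = 400) -> str:
    """Serialize a tool result deterministically and cap its length."""
    text = json.dumps(result, sort_keys=True, separators=(",", ":"), default=str)
    if len(text) <= max_chars:
        return text
    return text[: max_chars - 3] + "..."


def build_audit_input(system_prompt, question, page_context, tool_catalog,
                      tool_trace, candidate) -> dict:
    """Bundle everything the audit model should see for one candidate answer."""
    return {
        "system_prompt": system_prompt,
        "user_question": question,
        "page_context": page_context,      # None when no page was rendered
        "available_tools": tool_catalog,   # compact catalog summary
        "tool_trace": [
            {
                "tool": step["tool"],
                "description": step["description"],
                "arguments": step["arguments"],
                "result": compact_result(step["result"]),
            }
            for step in tool_trace
        ],
        "candidate_answer": candidate,
    }


# Hypothetical trace entry; the description text is made up for the example.
trace = [{
    "tool": "get_mesh_state",
    "description": "Return overall mesh status.",
    "arguments": {},
    "result": {"nodes_online": 12, "nodes_total": 14},
}]
audit_input = build_audit_input(
    "You are a mesh operations assistant.", "Is the mesh healthy?",
    None, "get_mesh_state(): Return overall mesh status.", trace,
    "12 of 14 nodes are online.",
)
```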
The audit response should be strict JSON, for example:
    {
      "answered_question": true,
      "grounded_in_evidence": true,
      "hallucinated_ui_or_actions": false,
      "tool_choice_ok": true,
      "missing_tool_opportunities": [],
      "contains_extraneous_content": false,
      "rewrite_needed": false,
      "repair_action": "accept",
      "critique": ""
    }
Or on failure:
    {
      "answered_question": false,
      "grounded_in_evidence": false,
      "hallucinated_ui_or_actions": true,
      "tool_choice_ok": false,
      "missing_tool_opportunities": ["analyze_node"],
      "contains_extraneous_content": true,
      "rewrite_needed": true,
      "repair_action": "gather_missing_evidence_once",
      "critique": "The answer talked generally about weak links but never gathered the node or edge evidence needed to identify chokepoints."
    }
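A minimal sketch of strict parsing for this contract, using only the standard library. The field names come from the examples above; `RepairAction`, `AuditVerdict`, and `parse_verdict` are hypothetical names, and the sketch assumes the three bounded action names this issue uses elsewhere (`accept`, `rewrite_once`, `gather_missing_evidence_once`).

```python
import json
from dataclasses import dataclass
from enum import Enum


class RepairAction(str, Enum):
    ACCEPT = "accept"
    REWRITE_ONCE = "rewrite_once"
    GATHER_MISSING_EVIDENCE_ONCE = "gather_missing_evidence_once"


_BOOL_FIELDS = (
    "answered_question", "grounded_in_evidence", "hallucinated_ui_or_actions",
    "tool_choice_ok", "contains_extraneous_content", "rewrite_needed",
)
_ALL_FIELDS = set(_BOOL_FIELDS) | {"missing_tool_opportunities", "repair_action", "critique"}


@dataclass(frozen=True)
class AuditVerdict:
    answered_question: bool
    grounded_in_evidence: bool
    hallucinated_ui_or_actions: bool
    tool_choice_ok: bool
    contains_extraneous_content: bool
    rewrite_needed: bool
    missing_tool_opportunities: tuple
    repair_action: RepairAction
    critique: str


def parse_verdict(raw: str) -> AuditVerdict:
    """Parse the audit model's reply; raise ValueError on any contract violation."""
    data = json.loads(raw)  # JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict) or set(data) != _ALL_FIELDS:
        raise ValueError("audit reply must be an object with exactly the contract keys")
    for name in _BOOL_FIELDS:
        if not isinstance(data[name], bool):
            raise ValueError(f"{name} must be a boolean")
    tools = data["missing_tool_opportunities"]
    if not isinstance(tools, list) or not all(isinstance(t, str) for t in tools):
        raise ValueError("missing_tool_opportunities must be a list of strings")
    if not isinstance(data["critique"], str) or not isinstance(data["repair_action"], str):
        raise ValueError("critique and repair_action must be strings")
    return AuditVerdict(
        **{name: data[name] for name in _BOOL_FIELDS},
        missing_tool_opportunities=tuple(tools),
        repair_action=RepairAction(data["repair_action"]),  # ValueError if unknown
        critique=data["critique"],
    )


raw_ok = json.dumps({
    "answered_question": True, "grounded_in_evidence": True,
    "hallucinated_ui_or_actions": False, "tool_choice_ok": True,
    "missing_tool_opportunities": [], "contains_extraneous_content": False,
    "rewrite_needed": False, "repair_action": "accept", "critique": "",
})
verdict = parse_verdict(raw_ok)
```

Rejecting unknown keys and unknown action values up front keeps a malformed audit reply from silently steering the repair step.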
3. Let the audit choose between only three outcomes
The runtime should support only three repair actions after the first answer:
- accept
- rewrite_once
- gather_missing_evidence_once
Do not allow open-ended recursive repair loops.
If the audit says the tool choice was wrong and identifies one or more missing evidence opportunities, the runtime may perform one additional tool-gather step and then one final answer synthesis.
If the audit says the tools were sufficient but the wording was weak, do one rewrite using the critique.
If the audit says the evidence is insufficient, answer that directly instead of trying to fabricate a more confident response.
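The bounded behavior can be sketched as a single dispatch with no loop. `resolve_verdict` and its two callbacks are hypothetical; the point is only that whatever the chosen branch returns is final and is never fed back into the audit.

```python
from typing import Callable


def resolve_verdict(
    candidate: str,
    action: str,
    critique: str,
    missing_tools: list,
    rewrite: Callable[[str, str], str],
    gather_and_answer: Callable[[list], str],
) -> str:
    """Apply exactly one repair action and return the final answer."""
    if action == "accept":
        return candidate
    if action == "rewrite_once":
        # One rewrite guided by the critique; the result is not re-audited.
        return rewrite(candidate, critique)
    if action == "gather_missing_evidence_once":
        # One extra tool-gather step followed by one final synthesis.
        return gather_and_answer(missing_tools)
    raise ValueError(f"unknown repair action: {action!r}")


# Stub callbacks standing in for model and tool calls (assumptions).
def stub_rewrite(answer: str, critique: str) -> str:
    return f"[rewritten per: {critique}] {answer}"


def stub_gather(tools: list) -> str:
    return "answer using " + ", ".join(tools)


final = resolve_verdict(
    "Links look weak.", "gather_missing_evidence_once",
    "No node evidence gathered.", ["analyze_node"],
    stub_rewrite, stub_gather,
)
```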
4. Audit tool choice, not just final prose
The current evaluator mostly checks whether the answer text is grounded. The new design should also ask:
- given the user question, were the chosen tools appropriate?
- was there an obviously better available tool that was skipped?
- did the model overuse broad tools when a narrower tool existed?
- did the model fail to gather page-visible evidence before answering?
This is the key architectural shift: correctness should be evaluated over the execution trace, not just the generated text.
5. Make tool metadata part of the audit input
If tool descriptions are the basis for whether the correct tool should have been selected, the audit prompt must include those descriptions. The tool metadata should be compact but explicit enough for the audit pass to reason about whether get_mesh_state, list_all_nodes, analyze_node, start_triage, etc. were the right choices.
That likely means introducing a compact audit-oriented tool catalog summary rather than dumping the full raw tool schema into the audit prompt.
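A compact summary could look roughly like the sketch below: one line per tool with its name, parameter names, and a truncated description. It assumes each tool is available as a dictionary with `name`, `description`, and `inputSchema` keys, as in MCP tool listings; `summarize_tools` and the sample descriptions are invented for illustration.

```python
def summarize_tools(tools: list, max_desc_chars: int = 120) -> str:
    """Render one 'name(params): description' line per tool, sorted by name."""
    lines = []
    for tool in sorted(tools, key=lambda t: t["name"]):
        # Collapse whitespace so multi-line descriptions stay on one line.
        desc = " ".join(tool.get("description", "").split())
        if len(desc) > max_desc_chars:
            desc = desc[: max_desc_chars - 3].rstrip() + "..."
        properties = tool.get("inputSchema", {}).get("properties", {})
        params = ", ".join(sorted(properties))
        lines.append(f"{tool['name']}({params}): {desc}")
    return "\n".join(lines)


# Hypothetical catalog entries; the real descriptions are not given in this issue.
catalog = [
    {
        "name": "list_all_nodes",
        "description": "List  every node\n in the mesh.",
        "inputSchema": {"type": "object", "properties": {}},
    },
    {
        "name": "analyze_node",
        "description": "Inspect one node.",
        "inputSchema": {"type": "object", "properties": {"node_id": {"type": "string"}}},
    },
]
summary = summarize_tools(catalog)
```

Since the output depends only on the catalog, it could be computed once and reused across audit prompts if the catalog is stable for the life of the process.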
6. Narrow deterministic guards to a last-resort safety net
Keep only a minimal deterministic layer for clearly unacceptable output classes such as:
- internal tool names being pushed back to the operator
- obviously nonexistent dashboard controls
- malformed response structures
But the main correctness path should move to the semantic audit contract above.
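For the first of those classes, the remaining hard stop can stay very small. The sketch below checks the answer for exact identifier tokens that match known tool names, which avoids a per-phrase regex list; `leaked_tool_names` is a hypothetical helper, not existing code.

```python
import re


def leaked_tool_names(answer: str, tool_names: set) -> list:
    """Return internal tool names that appear as whole identifiers in the answer."""
    tokens = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", answer))
    return sorted(tokens & tool_names)


KNOWN_TOOLS = {"get_mesh_state", "list_all_nodes", "analyze_node", "start_triage"}
leaks = leaked_tool_names("Run analyze_node on the gateway to confirm.", KNOWN_TOOLS)
```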
Suggested implementation plan
Phase 1: runtime scaffolding
- extract the current evaluator into a dedicated audit module with structured JSON parsing
- define an AuditVerdict model and repair_action enum
- introduce a compact tool-catalog summary for audit prompts
- thread rendered page context, tool descriptions, and tool results into the audit input
Phase 2: control-loop simplification
- remove most ad hoc retry branches from the main loop
- replace them with a single audit decision point after candidate answer generation
- allow at most one gather_missing_evidence_once and one rewrite_once
Phase 3: intent-aware evidence bundles
For a small initial set of high-value prompt classes, define explicit expected evidence bundles:
- overall health
- offline/stale nodes
- chokepoints / bottlenecks
- partition comparison / history questions
- node-specific investigation
The audit pass can then judge not only whether the answer sounds grounded, but whether the minimum evidence bundle for that question class was actually collected.
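One way to make that check mechanical is a lookup from prompt class to the minimum set of tools whose results must appear in the trace. The mapping below is purely hypothetical: the tool names appear in this issue, but which tools belong to which bundle is an assumption made for the example.

```python
# Hypothetical bundles: minimum tools expected per prompt class.
EXPECTED_BUNDLES = {
    "overall_health": {"get_mesh_state"},
    "offline_nodes": {"list_all_nodes"},
    "chokepoints": {"get_mesh_state", "analyze_node"},
    "node_investigation": {"analyze_node"},
}


def missing_evidence(intent: str, tool_trace: list) -> list:
    """Return the tools the bundle expects that the trace never called."""
    called = {step["tool"] for step in tool_trace}
    expected = EXPECTED_BUNDLES.get(intent, set())  # unknown intents expect nothing
    return sorted(expected - called)


trace = [{"tool": "get_mesh_state", "result": {"nodes_online": 12}}]
gaps = missing_evidence("chokepoints", trace)
```

A non-empty result maps naturally onto missing_tool_opportunities in the audit verdict.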
Phase 4: trim old guardrails
- delete deterministic guards that are superseded by the audit contract
- keep only the thin hard-stop safety rails that still provide unique value
- add regression tests proving the new audit flow catches the prior live failures without regex-based branching
Acceptance criteria
- direct chat no longer depends on a growing set of regex/string rewrites as the main correctness layer
- audit pass can reject both bad tool choice and bad answer grounding
- runtime supports only bounded repair paths (accept, rewrite_once, gather_missing_evidence_once)
- live failures already observed in health, offline-node, partition/history, and chokepoint prompts are covered by regression tests
- chat flow remains direct and natural when the first answer is already good
Open questions
- Should the audit pass use the same provider as the main answer path, or a cheaper secondary model by default?
- Should tool metadata be summarized once per process and cached for audit prompts?
- Should we classify prompt intent before tool planning, or let the audit infer missing bundles from the question text alone?
- How much compacted tool-result detail is enough for the audit model to judge grounding without causing prompt bloat?
Why this is worth doing
The current deterministic-guard path is already showing diminishing returns: each fix blocks one observed bad response but makes the control flow harder to reason about and the chat experience less natural. A structured tool-audit reflection loop is more aligned with the real problem: selecting the right evidence and answering directly from it.