Refactor direct chat around tool-audit reflection instead of deterministic guards

## Problem

Direct chat has drifted into a brittle control loop built around retries, deterministic guardrails, and post-hoc rewrites. Live validation has shown several recurring failure modes:

- answers that call tools but still fail to answer the user's actual question
- answers that mix incompatible evidence sources without reconciling them
- hallucinated UI affordances such as nonexistent dashboard panels, views, or controls
- generic fallback text that blocks obviously wrong answers but still gives the operator a poor experience
- a growing set of regex/string guards that are hard to reason about and increasingly likely to trip unrelated prompts

The issue is not just prompt wording. The architecture currently optimizes for avoiding known bad strings instead of optimizing for a direct, evidence-grounded answer.

## Goal

Replace most of the current deterministic post-processing with a structured reflection/audit pass that evaluates both:

1. whether the right tools were called for the user's question
2. whether the final answer is grounded in the actual tool results and directly answers the question

The target behavior is a thinner, more natural chat flow:

1. understand the question
2. call the right tools
3. answer directly from the evidence
4. perform one semantic audit pass
5. either accept, rewrite once, or gather one missing evidence bundle

## Non-goals

- Do not solve every chat quality problem in this issue.
- Do not redesign the MCP tool catalog itself.
- Do not expand deterministic guards except as a temporary thin safety net for clearly unacceptable outputs.
- Do not couple this work to spatial inference or other unrelated backend reasoning changes.

## Relationship to existing issues

- Should follow #73 (and may depend on it), which already tracks modularization of `direct_chat` orchestration. This issue covers the runtime control model once that modularization boundary exists.
- Independent of #96.

## Proposed design

### 1. Split the current loop into explicit phases

Refactor the direct chat runtime into these phases:

1. intent framing
2. tool planning and execution
3. candidate answer generation
4. audit/reflection
5. optional repair step

That should replace the current pattern where retries, evaluator prompts, and deterministic fallbacks are interleaved through one large loop.
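
As a rough illustration of the phase split, the sketch below runs each phase exactly once and in order, with no retries interleaved. Everything in it is hypothetical: `ChatTurn`, `run_turn`, the phase function names, and the stubbed tool result are placeholders, not the existing `direct_chat` code.

```python
from dataclasses import dataclass, field


@dataclass
class ChatTurn:
    """State carried through one direct-chat turn (hypothetical shape)."""
    question: str
    intent: str = ""
    tool_trace: list = field(default_factory=list)
    candidate: str = ""
    repair_action: str = ""
    final: str = ""
    phases_run: list = field(default_factory=list)


# Stub phases: each one reads and writes only the shared turn state.
def frame_intent(turn):
    turn.intent = "overall_health"


def run_tools(turn):
    # Assumed trace entry shape; the result payload is made up.
    turn.tool_trace.append({"tool": "get_mesh_state", "result": {"online": 12}})


def draft_answer(turn):
    online = turn.tool_trace[-1]["result"]["online"]
    turn.candidate = f"{online} nodes are online."


def audit(turn):
    turn.repair_action = "accept"


def repair(turn):
    # This stub only handles "accept"; a real repair step would branch here.
    turn.final = turn.candidate


PHASES = (
    ("intent_framing", frame_intent),
    ("tool_planning_and_execution", run_tools),
    ("candidate_answer_generation", draft_answer),
    ("audit_reflection", audit),
    ("optional_repair", repair),
)


def run_turn(question, phases=PHASES):
    """Run every phase once, in order, recording which phases ran."""
    turn = ChatTurn(question=question)
    for name, phase in phases:
        phase(turn)
        turn.phases_run.append(name)
    return turn
```

Keeping the phases in one ordered table makes the control flow inspectable in a single place, which is the property the current interleaved loop lacks.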

### 2. Introduce a structured audit contract

The audit model should receive:

- the system prompt
- the user question
- rendered page context if present
- tool metadata for tools that were available to the model
- the actual tool trace: tool name, description, arguments, and compact result
- the candidate answer

The audit response should be strict JSON, for example:

```json
{
  "answered_question": true,
  "grounded_in_evidence": true,
  "hallucinated_ui_or_actions": false,
  "tool_choice_ok": true,
  "missing_tool_opportunities": [],
  "contains_extraneous_content": false,
  "rewrite_needed": false,
  "repair_action": "accept",
  "critique": ""
}
```

Or on failure:

```json
{
  "answered_question": false,
  "grounded_in_evidence": false,
  "hallucinated_ui_or_actions": true,
  "tool_choice_ok": false,
  "missing_tool_opportunities": ["analyze_node"],
  "contains_extraneous_content": true,
  "rewrite_needed": true,
  "repair_action": "gather_missing_evidence_once",
  "critique": "The answer talked generally about weak links but never gathered the node or edge evidence needed to identify chokepoints."
}
```
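
One way the "strict JSON" requirement could be enforced is sketched below, using only the standard library. The field names come from the examples above and the allowed action names follow the list in section 3. The `AuditVerdict` dataclass layout and the `parse_verdict` helper are assumptions for illustration, not an existing module.

```python
import json
from dataclasses import dataclass

# Action names as listed in section 3 of this issue.
ALLOWED_ACTIONS = {"accept", "rewrite_once", "gather_missing_evidence_once"}

BOOL_FIELDS = (
    "answered_question",
    "grounded_in_evidence",
    "hallucinated_ui_or_actions",
    "tool_choice_ok",
    "contains_extraneous_content",
    "rewrite_needed",
)
ALL_FIELDS = set(BOOL_FIELDS) | {
    "missing_tool_opportunities", "repair_action", "critique",
}


@dataclass(frozen=True)
class AuditVerdict:
    answered_question: bool
    grounded_in_evidence: bool
    hallucinated_ui_or_actions: bool
    tool_choice_ok: bool
    missing_tool_opportunities: list
    contains_extraneous_content: bool
    rewrite_needed: bool
    repair_action: str
    critique: str


def parse_verdict(raw: str) -> AuditVerdict:
    """Parse audit output, raising ValueError on any contract violation."""
    data = json.loads(raw)  # JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("verdict must be a JSON object")
    if set(data) != ALL_FIELDS:
        raise ValueError(f"unexpected or missing keys: {set(data) ^ ALL_FIELDS}")
    for name in BOOL_FIELDS:
        if not isinstance(data[name], bool):
            raise ValueError(f"{name} must be a boolean")
    missing = data["missing_tool_opportunities"]
    if not isinstance(missing, list) or not all(isinstance(m, str) for m in missing):
        raise ValueError("missing_tool_opportunities must be a list of strings")
    if not isinstance(data["critique"], str):
        raise ValueError("critique must be a string")
    if data["repair_action"] not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown repair_action: {data['repair_action']!r}")
    return AuditVerdict(**data)
```

Rejecting unknown keys and unknown actions up front means a malformed audit response fails loudly rather than silently steering the repair step.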

### 3. Let the audit choose between only three outcomes

The runtime should support only three repair actions after the first answer:

- `accept`
- `rewrite_once`
- `gather_missing_evidence_once`

Do not allow open-ended recursive repair loops.

If the audit says the tool choice was wrong and identifies one or more missing evidence opportunities, the runtime may perform one additional tool-gather step and then one final answer synthesis.

If the audit says the tools were sufficient but the wording was weak, do one rewrite using the critique.

If the audit says the evidence is insufficient, say so directly instead of fabricating a more confident response.
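
The bounded behavior described above could look roughly like the following. This is a sketch under assumptions: `apply_repair`, its callback parameters, and the fallback wording are hypothetical, and treating "gather requested but no missing tool named" as the insufficient-evidence case is one possible reading of the contract rather than a settled rule.

```python
INSUFFICIENT_EVIDENCE = (
    "I don't have enough evidence from the available data to answer that."
)


def apply_repair(verdict, candidate, rewrite, gather_and_answer):
    """Apply exactly one repair action; the result is never re-audited.

    verdict: dict with "repair_action", "critique", "missing_tool_opportunities"
    rewrite: callable(candidate, critique) -> str          (hypothetical)
    gather_and_answer: callable(list_of_tool_names) -> str (hypothetical)
    """
    action = verdict["repair_action"]
    if action == "accept":
        return candidate
    if action == "rewrite_once":
        # Tools were sufficient; only the wording is repaired.
        return rewrite(candidate, verdict["critique"])
    if action == "gather_missing_evidence_once":
        missing = verdict["missing_tool_opportunities"]
        if not missing:
            # Assumption: nothing concrete to gather means we admit the gap.
            return INSUFFICIENT_EVIDENCE
        # One extra tool-gather step followed by one final synthesis.
        return gather_and_answer(missing)
    raise ValueError(f"unsupported repair_action: {action!r}")
```

Because the function returns a final string on every path and contains no loop, recursive repair is ruled out by construction rather than by a retry counter.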

### 4. Audit tool choice, not just final prose

The current evaluator mostly checks whether the answer text is grounded. The new design should also ask:

- given the user question, were the chosen tools appropriate?
- was there an obviously better available tool that was skipped?
- did the model overuse broad tools when a narrower tool existed?
- did the model fail to gather page-visible evidence before answering?

This is the key architectural shift: correctness should be evaluated over the execution trace, not just the generated text.

### 5. Make tool metadata part of the audit input

If tool descriptions are the basis for whether the correct tool should have been selected, the audit prompt must include those descriptions. The tool metadata should be compact but explicit enough for the audit pass to reason about whether `get_mesh_state`, `list_all_nodes`, `analyze_node`, `start_triage`, etc. were the right choices.

That likely means introducing a compact audit-oriented tool catalog summary rather than dumping the full raw tool schema into the audit prompt.
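
A compact summary might be produced along these lines. The sketch assumes each tool is available as a dict with `name`, `description`, and `inputSchema` keys (the shape MCP tool listings use). The `summarize_tools` helper, the one-line output format, and the 120-character cap are hypothetical choices.

```python
def summarize_tools(tools, max_desc_chars=120):
    """Render one line per tool: name, parameter names, trimmed description."""
    lines = []
    for tool in sorted(tools, key=lambda t: t["name"]):
        # Collapse newlines and repeated spaces so each tool stays on one line.
        desc = " ".join(tool.get("description", "").split())
        if len(desc) > max_desc_chars:
            desc = desc[: max_desc_chars - 3].rstrip() + "..."
        # Keep only parameter names; full JSON Schema detail is dropped.
        properties = tool.get("inputSchema", {}).get("properties", {})
        params = ", ".join(sorted(properties))
        lines.append(f"- {tool['name']}({params}): {desc}")
    return "\n".join(lines)
```

Sorting by name keeps the summary stable across runs, which would also make it easy to cache if that open question is answered in favor of caching.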

### 6. Narrow deterministic guards to a last-resort safety net

Keep only a minimal deterministic layer for clearly unacceptable output classes such as:

- internal tool names being pushed back to the operator
- obviously nonexistent dashboard controls
- malformed response structures

But the main correctness path should move to the semantic audit contract above.
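
For scale, the remaining deterministic layer could be as small as the sketch below. It covers only two of the classes above (malformed structure and leaked tool names). The response shape (`{"text": ...}`), the `hard_stop_reasons` helper, and the reason strings are assumptions, and the tool-name list simply reuses the names mentioned in this issue.

```python
import re

# Tool names mentioned in this issue; a real list would come from the catalog.
INTERNAL_TOOL_NAMES = (
    "get_mesh_state", "list_all_nodes", "analyze_node", "start_triage",
)
_TOOL_NAME_RE = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in INTERNAL_TOOL_NAMES) + r")\b"
)


def hard_stop_reasons(response):
    """Return the list of hard-stop violations; an empty list means pass."""
    # Assumed response shape: a dict with a string "text" field.
    if not isinstance(response, dict) or not isinstance(response.get("text"), str):
        return ["malformed_response"]
    text = response["text"]
    reasons = []
    if not text.strip():
        reasons.append("empty_answer")
    if _TOOL_NAME_RE.search(text):
        reasons.append("internal_tool_name_leak")
    return reasons
```

Anything subtler than these checks, such as whether a mentioned dashboard control actually exists, is probably better judged by the audit pass than by more patterns here.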

## Suggested implementation plan

### Phase 1: runtime scaffolding

- extract the current evaluator into a dedicated audit module with structured JSON parsing
- define an `AuditVerdict` model and `repair_action` enum
- introduce a compact tool-catalog summary for audit prompts
- thread rendered page context, tool descriptions, and tool results into the audit input

### Phase 2: control-loop simplification

- remove most ad hoc retry branches from the main loop
- replace them with a single audit decision point after candidate answer generation
- allow at most one `gather_missing_evidence_once` and one `rewrite_once`

### Phase 3: intent-aware evidence bundles

For a small initial set of high-value prompt classes, define explicit expected evidence bundles:

- overall health
- offline/stale nodes
- chokepoints / bottlenecks
- partition comparison / history questions
- node-specific investigation

The audit pass can then judge not only whether the answer sounds grounded, but whether the minimum evidence bundle for that question class was actually collected.
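
A minimal sketch of the bundle check follows. The mapping from prompt class to required tools is purely illustrative: the tool names appear elsewhere in this issue, but which tools belong in which bundle is an assumption, as are the `EVIDENCE_BUNDLES` table, the trace entry shape, and the `missing_evidence` helper.

```python
# Hypothetical bundles: prompt class -> tools that must appear in the trace.
EVIDENCE_BUNDLES = {
    "overall_health": frozenset({"get_mesh_state"}),
    "offline_nodes": frozenset({"list_all_nodes"}),
    "node_investigation": frozenset({"get_mesh_state", "analyze_node"}),
}


def missing_evidence(intent, tool_trace):
    """Return required tools for this intent that were not successfully called.

    tool_trace entries are assumed to look like {"tool": name, "error": ...}.
    Unknown intents have no required bundle, so nothing is reported missing.
    """
    required = EVIDENCE_BUNDLES.get(intent, frozenset())
    called = {step["tool"] for step in tool_trace if not step.get("error")}
    return sorted(required - called)
```

The returned list could feed `missing_tool_opportunities` directly, or be passed to the audit model as a hint so it does not have to infer the bundle from the question text alone.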

### Phase 4: trim old guardrails

- delete deterministic guards that are superseded by the audit contract
- keep only the thin hard-stop safety rails that still provide unique value
- add regression tests proving the new audit flow catches the prior live failures without regex-based branching

## Acceptance criteria

- direct chat no longer depends on a growing set of regex/string rewrites as the main correctness layer
- audit pass can reject both bad tool choice and bad answer grounding
- runtime supports only bounded repair paths (`accept`, `rewrite_once`, `gather_missing_evidence_once`)
- live failures already observed in health, offline-node, partition/history, and chokepoint prompts are covered by regression tests
- chat flow remains direct and natural when the first answer is already good

## Open questions

- Should the audit pass use the same provider as the main answer path, or a cheaper secondary model by default?
- Should tool metadata be summarized once per process and cached for audit prompts?
- Should we classify prompt intent before tool planning, or let the audit infer missing bundles from the question text alone?
- How much compacted tool-result detail is enough for the audit model to judge grounding without causing prompt bloat?

## Why this is worth doing

The current deterministic-guard path is already showing diminishing returns: each fix blocks one observed bad response but makes the control flow harder to reason about and the chat experience less natural. A structured tool-audit reflection loop is more aligned with the real problem: selecting the right evidence and answering directly from it.

