[WIP] Track redesign of issue definitions for Thread Observability #117

Draft
Codex wants to merge 2 commits into main from codex/redesign-issue-definitions

Conversation

@Codex Codex AI commented May 16, 2026

Thanks for asking me to work on this. I will get started on it and keep this PR's description up to date as I form a plan and make progress.


This section details the original issue you should resolve.

<issue_title>Redesign issue definitions (tracking)</issue_title>
<issue_description>## Summary

Track the redesign of the Thread Observability "issues" concept. While this is open, issue detection is paused and the API returns a placeholder envelope (see #4).

This issue is the design home. No code lands here directly — each accepted rule should ship as its own PR referencing this issue.

Problem with the previous rules

  • Most rules restated state already visible on the dashboard (partition count, phantom node count, stale-link count). Surfacing these as "issues" adds noise without insight.
  • The rules were opinionated about diagnosis. An AI consumer reading list_active_issues would anchor on the labeled framing rather than re-deriving from data. Wrong framing -> wrong investigation path.
  • Severity did not reflect actionability. Long-standing benign artifacts ranked alongside fresh anomalies.
  • Several rules were not falsifiable — there was no defined "this is cleared" condition.

Design principles for the new rules

A rule may ship into the issues system only if it satisfies all of the following:

  1. Not a restatement of state. State already exposed by another endpoint (/v1/topology, /v1/partitions, /v1/links/stale, /v1/nodes, etc.) is not an issue. An issue is a claim about state.
  2. Narrows diagnosis. A Thread-literate engineer would agree the issue points at a smaller set of causes than the raw data does. If the rule does not reduce the search space, it does not ship.
  3. Falsifiable. The rule must include an explicit "cleared when ..." condition. If you cannot write the SQL/predicate that closes the issue, the rule is too vague.
  4. Severity = actionability x freshness. Not novelty. Not noise level. A 6-month-old stale link is lower severity than the same link appearing in the last ingest.
  5. Evidence travels with the issue. Each issue row carries the EUI64s involved, the observation that triggered it, first_seen, last_seen, and the supporting metric value(s). Consumers must be able to re-evaluate without trusting the label.
  6. Cap on rule count. Target 6-10 total. If a candidate rule cannot justify its slot against the others, it does not ship.
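
Principles 4 and 5 can be made concrete with a small sketch. This is not the project's implementation: the names `IssueEvidence`, `freshness`, and `severity`, the 0-to-1 actionability scale, the 7-day half-life, and the choice to measure age from `first_seen` are all assumptions made for illustration. Only the evidence fields themselves (EUI64s, triggering observation, `first_seen`, `last_seen`, supporting metrics) come from the text above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class IssueEvidence:
    """Evidence carried on every issue row (principle 5). Hypothetical shape."""
    eui64s: tuple[str, ...]      # devices involved
    observation: str             # the observation that triggered the rule
    first_seen: datetime
    last_seen: datetime
    metrics: dict[str, float] = field(default_factory=dict)


def freshness(first_seen: datetime, now: datetime,
              half_life: timedelta = timedelta(days=7)) -> float:
    """Exponential decay in (0, 1]: 1.0 at first sight, 0.5 after one half-life.

    The half-life value is an assumption, not a project setting.
    """
    age = max(now - first_seen, timedelta(0))
    return 0.5 ** (age / half_life)


def severity(actionability: float, evidence: IssueEvidence, now: datetime) -> float:
    """Severity = actionability x freshness (principle 4)."""
    if not 0.0 <= actionability <= 1.0:
        raise ValueError("actionability must be in [0, 1]")
    return actionability * freshness(evidence.first_seen, now)


now = datetime(2026, 5, 16, tzinfo=timezone.utc)
fresh = IssueEvidence(("aa" * 8,), "stale link", now, now, {"lqi": 0.0})
old = IssueEvidence(("aa" * 8,), "stale link", now - timedelta(days=180), now)
print(severity(0.8, fresh, now) > severity(0.8, old, now))  # fresh link ranks higher
```

Because the evidence travels with the row, a consumer can recompute the score or ignore it and re-derive a diagnosis from the raw fields.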

Candidate rules (not yet accepted)

Each candidate needs: trigger predicate, clearing predicate, evidence shape, severity calculation, and a short justification against the principles above. None ship until reviewed against the bar.
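
One way to keep candidates honest about these five requirements is to make them fields of a rule record, so a rule without a clearing predicate cannot even be constructed. The sketch below is illustrative only: `RuleSpec`, `Snapshot`, `validate_registry`, and the snapshot keys `child_count` and `attach_attempts` are hypothetical names, and the example rule uses a toy severity constant.

```python
from dataclasses import dataclass
from typing import Any, Callable, Mapping

Snapshot = Mapping[str, Any]  # hypothetical: state observed in one ingest

MAX_RULES = 10  # upper end of the 6-10 target in principle 6


@dataclass(frozen=True)
class RuleSpec:
    name: str
    trigger: Callable[[Snapshot], bool]                 # opens the issue
    cleared: Callable[[Snapshot], bool]                 # closes the issue
    evidence: Callable[[Snapshot], Mapping[str, Any]]   # data shipped with the row
    severity: Callable[[Snapshot], float]
    justification: str                                  # argument against the principles


def validate_registry(rules: list[RuleSpec]) -> list[str]:
    """Return human-readable problems; an empty list means the registry passes."""
    problems = []
    if len(rules) > MAX_RULES:
        problems.append(f"rule cap exceeded: {len(rules)} > {MAX_RULES}")
    names = [r.name for r in rules]
    for name in sorted({n for n in names if names.count(n) > 1}):
        problems.append(f"duplicate rule name: {name}")
    for rule in rules:
        if not rule.justification.strip():
            problems.append(f"{rule.name}: missing justification")
    return problems


# Toy version of the child-cap saturation candidate (snapshot keys are assumed).
child_cap = RuleSpec(
    name="router_child_cap_saturation",
    trigger=lambda s: s["child_count"] >= 10 and s["attach_attempts"] > 0,
    cleared=lambda s: s["child_count"] < 10 or s["attach_attempts"] == 0,
    evidence=lambda s: {"child_count": s["child_count"],
                        "attach_attempts": s["attach_attempts"]},
    severity=lambda s: 0.7,  # placeholder constant, not a reviewed value
    justification="Points at parent capacity rather than radio or credentials.",
)
print(validate_registry([child_cap]))  # no problems for a single valid rule
```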

  • Real partition split — multiple partitions present and at least one device seen in more than one partition within a recent window and no router-router link bridging them. (NOT: "two partitions exist.")
  • Dead-link reference — a router NeighborTable references an EUI64 the registry has never seen, and the reference has persisted across N ingests.
  • Routing loop / unreachable next hop — walk_route_to_otbr terminates in a loop, partition mismatch, or unknown next hop.
  • Router child-cap saturation — a parent router is at or above the practical 10-child cap and another node is trying to attach.
  • OTBR isolation — OTBR has no neighbors above an LQI threshold and no routes inbound, over a sustained window (rule out boot artifacts).
  • (add more — each must justify its slot)
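
As an illustration of what a trigger predicate for the routing-loop candidate might evaluate, here is a minimal sketch of a next-hop walk that reports which of the three failure modes ended it. It is not the project's `walk_route_to_otbr`, whose signature is not shown in this issue; `classify_route_walk`, `WalkOutcome`, and the two dictionary inputs are hypothetical.

```python
from enum import Enum


class WalkOutcome(Enum):
    REACHED_OTBR = "reached_otbr"
    LOOP = "loop"
    PARTITION_MISMATCH = "partition_mismatch"
    UNKNOWN_NEXT_HOP = "unknown_next_hop"


def classify_route_walk(start, otbr, next_hop, partition_of):
    """Follow next hops from `start` toward `otbr`.

    next_hop:     node id -> next-hop node id (assumed input shape)
    partition_of: node id -> partition id for every known node (assumed)
    Returns (outcome, path) so the path can be attached as evidence.
    """
    if start not in partition_of:
        raise ValueError(f"unknown start node: {start!r}")
    path = [start]
    visited = {start}
    current = start
    while current != otbr:
        nxt = next_hop.get(current)
        if nxt is None or nxt not in partition_of:
            return WalkOutcome.UNKNOWN_NEXT_HOP, path
        if partition_of[nxt] != partition_of[current]:
            return WalkOutcome.PARTITION_MISMATCH, path + [nxt]
        if nxt in visited:
            return WalkOutcome.LOOP, path + [nxt]
        visited.add(nxt)
        path.append(nxt)
        current = nxt
    return WalkOutcome.REACHED_OTBR, path


outcome, path = classify_route_walk(
    "a", "otbr",
    next_hop={"a": "b", "b": "a"},
    partition_of={"a": 1, "b": 1, "otbr": 1},
)
print(outcome.value, path)
```

Returning the path alongside the outcome keeps the rule consistent with principle 5: the consumer sees the exact hops that formed the loop or hit the unknown node, not just the label. A matching clearing predicate could be that the same walk returns `REACHED_OTBR` on a later ingest.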

For each accepted candidate, open a child PR titled issues: add rule <name> linking back here.

Non-goals

Definition of done for this tracking issue

  • At least 3 rules accepted and shipped behind the principles above.
  • Placeholder note in /v1/issues and the dashboard card removed (or replaced with real content).
  • list_active_issues MCP tool documented with the new rule taxonomy.

Related: #4 (placeholder implementation).
</issue_description>

Comments on the Issue (you are @codex[agent] in this section)

@DarinShapiro Status update: The new backlog roadmap schedules the issue-definition redesign last, following transport, documentation, and HA integration work. While stronger backend evidence is now available from graph, diagnostics, and chat work, the issue/rule redesign is deferred until #21 and the remaining sprint integration surfaces settle.

@Codex Codex AI linked an issue May 16, 2026 that may be closed by this pull request
6 tasks
Co-authored-by: DarinShapiro <23219821+DarinShapiro@users.noreply.github.com>
@Codex Codex AI requested a review from DarinShapiro May 16, 2026 06:51