Redesign issue definitions (tracking) #5

@DarinShapiro

Summary

Track the redesign of the Thread Observability "issues" concept. While this is open, issue detection is paused and the API returns a placeholder envelope (see #4).

This issue is the design home. No code lands here directly — each accepted rule should ship as its own PR referencing this issue.

Problem with the previous rules

  • Most rules restated state already visible on the dashboard (partition count, phantom node count, stale-link count). Surfacing these as "issues" adds noise without insight.
  • The rules were opinionated about diagnosis. An AI consumer reading list_active_issues would anchor on the labeled framing rather than re-deriving from data. Wrong framing -> wrong investigation path.
  • Severity did not reflect actionability. Long-standing benign artifacts ranked alongside fresh anomalies.
  • Several rules were not falsifiable — there was no defined "this is cleared" condition.

Design principles for the new rules

A rule may ship into the issues system only if it satisfies all of the following:

  1. Not a restatement of state. State already exposed by another endpoint (/v1/topology, /v1/partitions, /v1/links/stale, /v1/nodes, etc.) is not an issue. An issue is a claim about state.
  2. Narrows diagnosis. A Thread-literate engineer would agree the issue points at a smaller set of causes than the raw data does. If the rule does not reduce the search space, it does not ship.
  3. Falsifiable. The rule must include an explicit "cleared when ..." condition. If you cannot write the SQL/predicate that closes the issue, the rule is too vague.
  4. Severity = actionability x freshness. Not novelty. Not noise level. A 6-month-old stale link is lower severity than the same link appearing in the last ingest.
  5. Evidence travels with the issue. Each issue row carries the EUI64s involved, the observation that triggered it, first_seen, last_seen, and the supporting metric value(s). Consumers must be able to re-evaluate without trusting the label.
  6. Cap on rule count. Target 6-10 total. If a candidate rule cannot justify its slot against the others, it does not ship.
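Principles 4 and 5 can be sketched in a few lines. This is a minimal illustration, not the shipped schema: the `IssueEvidence` class, its field names beyond `first_seen` and `last_seen`, the exponential decay, and the 7-day half-life are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class IssueEvidence:
    """Hypothetical issue row: enough data to re-evaluate without trusting the label."""
    rule: str
    eui64s: tuple[str, ...]          # devices involved
    observation: str                 # the observation that triggered the rule
    first_seen: datetime
    last_seen: datetime
    metrics: dict[str, float] = field(default_factory=dict)  # supporting values


def freshness(first_seen: datetime, now: datetime,
              half_life: timedelta = timedelta(days=7)) -> float:
    """Decay from 1.0 toward 0.0 as the issue ages (assumed exponential model)."""
    age = max(now - first_seen, timedelta(0))
    return 0.5 ** (age / half_life)


def severity(actionability: float, issue: IssueEvidence, now: datetime) -> float:
    """Principle 4: severity = actionability x freshness."""
    return actionability * freshness(issue.first_seen, now)


# Usage with made-up values: the same stale link, 180 days old vs. just ingested.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
old_link = IssueEvidence("stale_link", ("00124b0001abcdef",), "link not refreshed",
                         first_seen=now - timedelta(days=180), last_seen=now)
new_link = IssueEvidence("stale_link", ("00124b0001abcdef",), "link not refreshed",
                         first_seen=now, last_seen=now)
assert severity(0.8, old_link, now) < severity(0.8, new_link, now)
```

Basing freshness on `first_seen` rather than `last_seen` is one possible choice: it lets a long-standing artifact decay even while it keeps being observed, which matches the 6-month-old stale link example.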

Candidate rules (not yet accepted)

Each candidate needs: trigger predicate, clearing predicate, evidence shape, severity calculation, and a short justification against the principles above. None ship until reviewed against the bar.
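One way to make this checklist mechanical is a small record whose fields mirror the five required parts. The `CandidateRule` class and `missing_parts` helper below are hypothetical names for illustration only; nothing like them is assumed to exist in the codebase.

```python
from dataclasses import dataclass, fields


@dataclass
class CandidateRule:
    """Hypothetical proposal record; one field per required part of a candidate."""
    name: str
    trigger_predicate: str = ""
    clearing_predicate: str = ""
    evidence_shape: str = ""
    severity_calculation: str = ""
    justification: str = ""


def missing_parts(rule: CandidateRule) -> list[str]:
    """Return the required parts that are still empty, in declaration order."""
    return [
        f.name
        for f in fields(rule)
        if f.name != "name" and not getattr(rule, f.name).strip()
    ]


# Usage: a half-written proposal is not ready for review.
draft = CandidateRule(
    name="otbr_isolation",
    trigger_predicate="no neighbors above LQI threshold over a sustained window",
)
print(missing_parts(draft))
```

A reviewer could refuse to look at a candidate until `missing_parts` returns an empty list.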

  • Real partition split — multiple partitions are present, at least one device has been seen in more than one partition within a recent window, and no router-router link bridges them. (NOT: "two partitions exist.")
  • Dead-link reference — a router NeighborTable references an EUI64 the registry has never seen, and the reference has persisted across N ingests.
  • Routing loop / unreachable next hop — walk_route_to_otbr terminates in a loop, partition mismatch, or unknown next hop.
  • Router child-cap saturation — a parent router is at or above the practical 10-child cap and another node is trying to attach.
  • OTBR isolation — OTBR has no neighbors above an LQI threshold and no routes inbound, over a sustained window (rule out boot artifacts).
  • (add more — each must justify its slot)
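To show what a trigger/clearing pair might look like, here is a rough sketch of the dead-link reference candidate. The data shapes are assumptions: each ingest is modeled as a dict from router EUI64 to the set of EUI64s in its neighbor table, and the registry as a set of known EUI64s. The function names and the clearing condition are hypothetical, not an accepted rule.

```python
Ingest = dict[str, set[str]]  # assumed shape: router EUI64 -> neighbor EUI64s


def unknown_refs(ingest: Ingest, registry: set[str]) -> set[tuple[str, str]]:
    """(router, neighbor) pairs whose neighbor is not in the registry."""
    return {
        (router, neighbor)
        for router, neighbors in ingest.items()
        for neighbor in neighbors
        if neighbor not in registry
    }


def dead_link_references(ingests: list[Ingest], registry: set[str],
                         n: int) -> set[tuple[str, str]]:
    """Trigger: the unknown reference appears in each of the last n ingests."""
    if n <= 0 or len(ingests) < n:
        return set()  # not enough history to claim persistence
    recent = ingests[-n:]
    persistent = unknown_refs(recent[0], registry)
    for ingest in recent[1:]:
        persistent &= unknown_refs(ingest, registry)
    return persistent


def is_cleared(ref: tuple[str, str], latest: Ingest, registry: set[str]) -> bool:
    """Assumed clearing predicate: the EUI64 is now known, or the reference is gone."""
    router, neighbor = ref
    return neighbor in registry or neighbor not in latest.get(router, set())


# Usage with made-up identifiers.
registry = {"r1", "r2", "n1"}
ingests = [
    {"r1": {"n1", "ghost"}},
    {"r1": {"ghost"}, "r2": {"other"}},
    {"r1": {"ghost", "n1"}, "r2": {"other"}},
]
flagged = dead_link_references(ingests, registry, n=3)
```

Because the clearing predicate is an explicit function of the latest ingest and the registry, the rule satisfies the falsifiability requirement by construction.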

For each accepted candidate, open a child PR titled issues: add rule <name> linking back here.

Non-goals

Definition of done for this tracking issue

  • At least 3 rules accepted and shipped behind the principles above.
  • Placeholder note in /v1/issues and the dashboard card removed (or replaced with real content).
  • list_active_issues MCP tool documented with the new rule taxonomy.

Related: #4 (placeholder implementation).
