api: emit canonical heading path on treewalk citations (HAL-70)#40
Conversation
Reviewer's GuideAdds canonical structural heading paths to TreeWalk citations by resolving TOC-derived section heading paths and emitting them as title_path on citations, with safe degradation when TOC data is unavailable and tests covering both presence and omission cases. Sequence diagram for TreeWalk citation building with title_pathsequenceDiagram
actor User
participant API
participant Deps
participant TOCProvider as TOCProvider
participant Tree as tree
User->>API: /v1/answer/treewalk
API->>Deps: buildTreeWalkCitations(ctx, tree, res)
activate Deps
Deps->>Deps: headingPathsForDoc(ctx, tree)
alt TOC available
Deps->>TOCProvider: GetTOC(ctx, tree.DocumentID)
TOCProvider-->>Deps: raw TOC JSON
Deps->>Tree: BuildHeadingPaths(tree.Root, nodes)
Tree-->>Deps: headingPaths (sectionID→path)
else TOC unavailable or invalid
Deps-->>Deps: headingPaths = nil
end
loop for each source in CitationSources(res)
Deps->>Deps: build citation map
alt len(sectionIDs) > 0 and headingPaths[primary] exists
Deps-->>Deps: set c[title_path] = headingPaths[sectionIDs[0]]
else no heading path
Deps-->>Deps: omit title_path
end
end
Deps-->>API: citations
deactivate Deps
API-->>User: response with citations (possibly with title_path)
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- In headingPathsForDoc, all TOC fetch errors (including transient or unexpected ones) are silently treated as nil paths; consider differentiating retrieval.ErrNoTOC from other errors so non-"no TOC" failures are logged or surfaced for easier diagnosis while still degrading gracefully for the expected missing-TOC case.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In headingPathsForDoc, all TOC fetch errors (including transient or unexpected ones) are silently treated as nil paths; consider differentiating retrieval.ErrNoTOC from other errors so non-"no TOC" failures are logged or surfaced for easier diagnosis while still degrading gracefully for the expected missing-TOC case.
## Individual Comments
### Comment 1
<location path="internal/api/treewalk.go" line_range="390-391" />
<code_context>
+ if t == nil || t.Root == nil || d.TreeWalkStrategy == nil || d.TreeWalkStrategy.TOC == nil {
+ return nil
+ }
+ raw, err := d.TreeWalkStrategy.TOC.GetTOC(ctx, t.DocumentID)
+ if err != nil || len(raw) == 0 {
+ return nil
+ }
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Distinguish between "no TOC" and other GetTOC errors to avoid silently hiding operational issues.
Currently any `GetTOC` error (including infra/transport issues) is treated as "no TOC" and dropped with no logging, which can mask real problems. It’d be better to special‑case a sentinel like `retrieval.ErrNoTOC` (if available) and log other errors, e.g. at debug level:
```go
raw, err := d.TreeWalkStrategy.TOC.GetTOC(ctx, t.DocumentID)
if err != nil {
if !errors.Is(err, retrieval.ErrNoTOC) {
d.Logger.Debug("answer/treewalk: TOC fetch failed; citations omit heading paths",
"err", err, "document_id", t.DocumentID)
}
return nil
}
if len(raw) == 0 {
return nil
}
```
This keeps graceful degradation while exposing unexpected failures for observability.
Suggested implementation:
```golang
raw, err := d.TreeWalkStrategy.TOC.GetTOC(ctx, t.DocumentID)
if err != nil {
// Distinguish between "no TOC" and other errors so we don't silently hide
// operational issues. retrieval.ErrNoTOC is expected; everything else is logged.
if !errors.Is(err, retrieval.ErrNoTOC) {
d.Logger.Debug("answer/treewalk: TOC fetch failed; citations omit heading paths",
"err", err,
"document_id", t.DocumentID,
)
}
return nil
}
if len(raw) == 0 {
return nil
}
```
To make this compile, ensure:
1. `errors` is imported in this file:
- Add `import "errors"` to the import block if it's not already present.
2. The `retrieval` package (which defines `ErrNoTOC`) is imported:
- Add the correct import path for `retrieval` (for example `github.com/yourorg/yourrepo/internal/retrieval` or whatever is used elsewhere in the codebase).
3. If the sentinel error is named slightly differently (e.g. `retrieval.ErrNoToc` or similar), adjust the constant name in the `errors.Is` call accordingly.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
Citations carried only {start_page, end_page, section_ids, quote} — no
structural heading path. The bench's path_correct@1 reads a title_path it
could only reconstruct from the parser's chunk titles, which are the wrong
vocabulary, so the metric was structurally 0%.
buildTreeWalkCitations now resolves the document's LLM TOC tree (via the
strategy's TOC provider) into a section-ID -> heading-path map
(tree.BuildHeadingPaths, HAL-109) and attaches the primary section's
heading path as title_path on each citation. Degrades cleanly: no TOC
persisted -> field omitted, prior behaviour unchanged.
Part of HAL-70.
GetTOC errors were all silently treated as 'no heading paths'. Now retrieval.ErrNoTOC (the expected missing-TOC case) stays silent while any other error is logged at debug for observability — still degrading gracefully. Per Sourcery review on #40.
4d97465 to
44c8753
Compare
|
Thanks @sourcery-ai — fixed in 44c8753: |
GetTOC errors were all silently treated as 'no heading paths'. Now retrieval.ErrNoTOC (the expected missing-TOC case) stays silent while any other error is logged at debug for observability — still degrading gracefully. Per Sourcery review on #40.
* api: emit canonical heading path (title_path) on treewalk citations
Citations carried only {start_page, end_page, section_ids, quote} — no
structural heading path. The bench's path_correct@1 reads a title_path it
could only reconstruct from the parser's chunk titles, which are the wrong
vocabulary, so the metric was structurally 0%.
buildTreeWalkCitations now resolves the document's LLM TOC tree (via the
strategy's TOC provider) into a section-ID -> heading-path map
(tree.BuildHeadingPaths, HAL-109) and attaches the primary section's
heading path as title_path on each citation. Degrades cleanly: no TOC
persisted -> field omitted, prior behaviour unchanged.
Part of HAL-70.
* api: log non-ErrNoTOC failures in headingPathsForDoc (review)
GetTOC errors were all silently treated as 'no heading paths'. Now
retrieval.ErrNoTOC (the expected missing-TOC case) stays silent while any
other error is logged at debug for observability — still degrading
gracefully. Per Sourcery review on #40.
What
The engine now emits a real structural heading path (
title_path) on every TreeWalk citation. This is the fix forpath_correct@1 = 0%(root-cause analysis on HAL-70).Before: a citation was
{start_page, end_page, section_ids, quote}. The "structure-aware citation" differentiator was not in the response payload — the consumer had to reverse-engineer a heading path from the parser's chunk titles, which are the wrong vocabulary, so the bench's path metric was structurally 0% (whilespan_in_top1was 20% — content sometimes right, label always wrong).How
buildTreeWalkCitationsresolves the document's LLM TOC tree (through the strategy's existingTOCprovider) into a section-ID → heading-path map viatree.BuildHeadingPaths(HAL-109, page-range containment), then attaches the primary (first/earliest-page) section's heading path astitle_pathon each citation — exactly the field the path-correctness metric reads.Both the non-streaming and SSE paths build citations through this one chokepoint, so both gain the field.
Degradation
headingPathsForDocreturns nil on every failure mode (no provider,ErrNoTOC, fetch error, unparseable TOC). A nil map is safe to index →title_pathis simply omitted and the citation is byte-for-byte what it was before. Never worse than today.Tests
TestBuildTreeWalkCitations_EmitsHeadingPath(citation carries["Setup","Install"]for its primary section) and..._NoTOCOmitsHeadingPath(graceful omission onErrNoTOC).go build ./...,go test ./internal/api/... ./pkg/tree/...,gofmt,go vetall clean.Stacked on HAL-109
Base branch is the HAL-109 PR (#39) because this consumes
tree.BuildHeadingPaths. The companion bench change (point the retriever at/v1/answer/treewalkand consume the enginetitle_path) ships invectorless-bench. The before/after fixture-bench numbers are pending a live run against an engine carrying both PRs.Closes HAL-70
Summary by Sourcery
Emit canonical structural heading paths on TreeWalk citations using the document TOC while preserving previous behavior when TOC data is unavailable.
New Features:
Bug Fixes:
Enhancements:
Tests: