tree: reconcile section tree + LLM TOC into a canonical heading-path map (HAL-109)#39
Conversation
Ingestion builds two independent structures — the parser's Section tree (content, summaries; the IDs citations resolve to) and the LLM-built TOC tree (the logical outline with clean headings + page anchors). They are never reconciled, so the map a citation resolves against and the map that holds the real headings can diverge. BuildHeadingPaths closes that gap without merging the trees: for every section it returns the canonical heading path it belongs under, matched by page-range containment (deepest containing TOC node wins; best overlap when a section straddles a boundary). Sections with no page range, and every section when the TOC is empty, are absent so callers fall back to existing behaviour. This is the reconciliation map HAL-70 needs to emit a real structural title_path on citations.
Reviewer's GuideAdds a new canonical mapping from parser sections to TOC-derived heading paths based on page-range matching, including TOC flattening, overlap/containment scoring, and comprehensive tests for edge cases and degradation behavior. Flow diagram for BuildHeadingPaths reconciliation processflowchart TD
A[Section root] --> B[BuildHeadingPaths]
C[TOCNode slice] --> B
B -->|root is nil or TOC empty| D[Return empty map]
B -->|valid root and TOC| E[documentMaxPage]
E --> F[flattenTOC]
F --> G[tocEntry list]
G --> H[Walk sections]
H -->|section has valid page range| I[bestHeadingPath]
H -->|no page range| H
I -->|match found| J[Store SectionID -> heading path]
I -->|no match| H
J --> H
H -->|walk complete| K["Return map[SectionID][]string"]
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
|
Warning Review limit reached
More reviews will be available in 38 minutes and 31 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly. Please see our Fair Usage Limits Policy for further information. 📝 WalkthroughWalkthroughAdds ChangesBuildHeadingPaths implementation and tests
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Hey - I've left some high level feedback:
- In tocEntry.span, consider replacing the hard-coded
1 << 30sentinel with a named constant (ormath.MaxInt) to make the intent clearer and avoid surprising behaviour on different architectures.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In tocEntry.span, consider replacing the hard-coded `1 << 30` sentinel with a named constant (or `math.MaxInt`) to make the intent clearer and avoid surprising behaviour on different architectures.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
Replaces the 1<<30 magic sentinel in tocEntry.span with math.MaxInt so the intent is explicit and it doesn't depend on int width. Per Sourcery review on #39.
|
Thanks @sourcery-ai — fixed in b027956: replaced the |
What
Reconciles the two structures ingestion builds — the parser's
Sectiontree (content, summaries; the IDs every citation resolves to) and the LLM-built TOC tree (documents.toc_tree, the logical outline with clean headings + page anchors) — into one canonical lookup: section ID → logical heading path.The two were never reconciled, so the map a citation resolves against (parser chunk titles) and the map that holds the real headings (the TOC) diverge. That divergence is the root cause of the bench's
path_correct@1 = 0%(see HAL-70): citations carry parser-tree IDs whose titles are not the"Item 8" → "Balance Sheet"vocabulary the gold anchors expect.How
tree.BuildHeadingPaths(root, toc) map[SectionID][]stringmatches each section to the TOC by page-range containment:Open-ended TOC nodes (
EndPage = 0) resolve their effective end from the next sibling, then the parent's end, then the document's last page — so a trailing "Notes" section can't leak past its parent.Degradation
Pure, additive function — no schema change, no migration, nothing else wired in this PR. Sections with no page range, and every section when the TOC is empty/nil, are simply absent from the map, so the consumer (HAL-70) falls back to today's behaviour. It can never make a citation worse than now.
Tests
10 cases: deepest-containing wins, open-ended-last-child bounding, top-level-only sections, straddling sections (best overlap), no-page-range skip, empty/nil TOC degradation, nil root, out-of-range absence, empty-title wrapper skipped, and defensive-copy isolation.
go build ./...,go test ./pkg/tree/...,gofmt,go vetall clean.Scope
This is the reconciliation map slice of the issue — it removes the drift by giving every consumer one canonical heading path to resolve against, which is what unblocks HAL-70. Collapsing the two trees into a single struct (and dropping the second tree entirely) remains a larger follow-up; this lands the unblocking value with the smallest blast radius.
Closes HAL-109
Summary by Sourcery
Add a canonical mapping from parser sections to TOC-based heading paths to reconcile citations with logical document structure.
Enhancements:
Tests:
Summary by CodeRabbit
New Features
Tests