tree: reconcile section tree + LLM TOC into a canonical heading-path map (HAL-109) by hallelx2 · Pull Request #39 · hallelx2/vectorless-engine

hallelx2 · 2026-06-17T22:34:07Z

What

Reconciles the two structures ingestion builds — the parser's Section tree (content, summaries; the IDs every citation resolves to) and the LLM-built TOC tree (documents.toc_tree, the logical outline with clean headings + page anchors) — into one canonical lookup: section ID → logical heading path.

The two were never reconciled, so the map a citation resolves against (parser chunk titles) and the map that holds the real headings (the TOC) diverge. That divergence is the root cause of the bench's path_correct@1 = 0% (see HAL-70): citations carry parser-tree IDs whose titles are not the "Item 8" → "Balance Sheet" vocabulary the gold anchors expect.

How

tree.BuildHeadingPaths(root, toc) map[SectionID][]string matches each section to the TOC by page-range containment, applying these rules in priority order:

- containment: a TOC node that fully contains the section beats one that only overlaps it;
- depth: the deeper (more specific) heading wins, so a section under Item 8 → Balance Sheet gets the full path [Item 8, Balance Sheet], ending at Balance Sheet;
- overlap → span → start page: deterministic tie-breaks, applied in that order (overlap amount, then span tightness, then start page).
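The precedence described above can be sketched as a comparator. This is an illustration only, not the PR's code: the `entry` type, its fields, and the `better` and `pick` helpers are hypothetical names, and the direction of the final start-page tie-break (earlier start wins) is an assumption the description does not state.

```go
package main

import "fmt"

// entry is a hypothetical flattened TOC record with an inclusive page span.
type entry struct {
	Path       []string
	Start, End int
	Depth      int
}

// contains reports whether e fully covers the inclusive range [start, end].
func (e entry) contains(start, end int) bool { return e.Start <= start && end <= e.End }

// overlap returns the number of pages e shares with [start, end], 0 if disjoint.
func (e entry) overlap(start, end int) int {
	lo, hi := e.Start, e.End
	if start > lo {
		lo = start
	}
	if end < hi {
		hi = end
	}
	if hi < lo {
		return 0
	}
	return hi - lo + 1
}

// better reports whether a is a more specific match than b for [start, end]:
// containment, then depth, then overlap, then tighter span, then start page.
func better(a, b entry, start, end int) bool {
	if ac, bc := a.contains(start, end), b.contains(start, end); ac != bc {
		return ac
	}
	if a.Depth != b.Depth {
		return a.Depth > b.Depth
	}
	if ao, bo := a.overlap(start, end), b.overlap(start, end); ao != bo {
		return ao > bo
	}
	if aSpan, bSpan := a.End-a.Start, b.End-b.Start; aSpan != bSpan {
		return aSpan < bSpan
	}
	return a.Start < b.Start // assumption: earlier start wins the last tie
}

// pick returns the path of the best overlapping entry, or nil if none overlaps.
func pick(entries []entry, start, end int) []string {
	var best *entry
	for i := range entries {
		e := &entries[i]
		if e.overlap(start, end) == 0 {
			continue
		}
		if best == nil || better(*e, *best, start, end) {
			best = e
		}
	}
	if best == nil {
		return nil
	}
	return best.Path
}

func main() {
	toc := []entry{
		{Path: []string{"Item 8"}, Start: 40, End: 90, Depth: 0},
		{Path: []string{"Item 8", "Balance Sheet"}, Start: 45, End: 47, Depth: 1},
		{Path: []string{"Item 8", "Notes"}, Start: 48, End: 90, Depth: 1},
	}
	fmt.Println(pick(toc, 45, 46)) // inside Balance Sheet: the deeper container wins
	fmt.Println(pick(toc, 47, 49)) // straddles two children: only Item 8 contains it
}
```

In this sketch a section that straddles two sibling headings resolves to their shared parent, because containment is checked before depth.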

Open-ended TOC nodes (EndPage = 0) resolve their effective end from the next sibling, then the parent's end, then the document's last page — so a trailing "Notes" section can't leak past its parent.
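A minimal sketch of that fallback chain, under stated assumptions: the `tocNode` type and `effectiveEnd` helper are hypothetical, and taking "the next sibling's start page minus one" as the inferred end is an assumption (the description only says the end is resolved "from the next sibling"). Handling of malformed input, such as a sibling starting on the same page, is omitted.

```go
package main

import "fmt"

// tocNode is a hypothetical TOC node; EndPage == 0 means open-ended.
type tocNode struct {
	Title     string
	StartPage int
	EndPage   int
	Children  []tocNode
}

// effectiveEnd resolves the inclusive end page of siblings[i].
// parentEnd is the already-resolved end of the enclosing node; for
// top-level nodes the caller passes the document's last page instead,
// which folds the last two fallbacks into one parameter.
func effectiveEnd(siblings []tocNode, i, parentEnd int) int {
	n := siblings[i]
	if n.EndPage > 0 {
		return n.EndPage // explicit end page: nothing to infer
	}
	if i+1 < len(siblings) {
		return siblings[i+1].StartPage - 1 // assumption: end just before the next sibling
	}
	return parentEnd // last child: bounded by the parent (or the document)
}

func main() {
	const lastPage = 120 // hypothetical document length
	top := []tocNode{
		{Title: "Item 8", StartPage: 40, Children: []tocNode{
			{Title: "Balance Sheet", StartPage: 45},
			{Title: "Notes", StartPage: 48},
		}},
		{Title: "Item 9", StartPage: 91},
	}
	item8End := effectiveEnd(top, 0, lastPage)
	kids := top[0].Children
	// The trailing "Notes" child stops at Item 8's end rather than the document's.
	fmt.Println(item8End, effectiveEnd(kids, 0, item8End), effectiveEnd(kids, 1, item8End))
}
```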

Degradation

Pure, additive function — no schema change, no migration, nothing else wired in this PR. Sections with no page range, and every section when the TOC is empty/nil, are simply absent from the map, so the consumer (HAL-70) falls back to today's behaviour. It can never make a citation worse than now.
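How a consumer might use the "absent means fall back" contract can be sketched as follows. The `titlePath` helper is hypothetical, and declaring `SectionID` as a string type is an assumption made so the example compiles on its own; the real consumer is the HAL-70 work, which is not part of this PR.

```go
package main

import "fmt"

// SectionID is assumed to be string-based here for illustration only.
type SectionID string

// titlePath prefers the canonical heading path and falls back to the
// parser's chunk title when the section is absent from the map.
// Indexing a nil map is safe in Go, so an empty result needs no special case.
func titlePath(paths map[SectionID][]string, id SectionID, parserTitle string) []string {
	if p, ok := paths[id]; ok && len(p) > 0 {
		return p
	}
	return []string{parserTitle}
}

func main() {
	paths := map[SectionID][]string{
		"sec-1": {"Item 8", "Balance Sheet"},
	}
	fmt.Println(titlePath(paths, "sec-1", "chunk 17")) // canonical path
	fmt.Println(titlePath(paths, "sec-2", "chunk 18")) // absent: parser title
	fmt.Println(titlePath(nil, "sec-1", "chunk 17"))   // no map at all: parser title
}
```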

Tests

10 cases: deepest-containing wins, open-ended-last-child bounding, top-level-only sections, straddling sections (best overlap), no-page-range skip, empty/nil TOC degradation, nil root, out-of-range absence, empty-title wrapper skipped, and defensive-copy isolation. go build ./..., go test ./pkg/tree/..., gofmt, go vet all clean.

Scope

This is the reconciliation map slice of the issue — it removes the drift by giving every consumer one canonical heading path to resolve against, which is what unblocks HAL-70. Collapsing the two trees into a single struct (and dropping the second tree entirely) remains a larger follow-up; this lands the unblocking value with the smallest blast radius.

Closes HAL-109

Summary by Sourcery

Add a canonical mapping from parser sections to TOC-based heading paths to reconcile citations with logical document structure.

Enhancements:

Introduce BuildHeadingPaths to derive logical heading paths for sections by reconciling the section tree with the LLM-generated TOC using page-range matching.
Handle open-ended and wrapper TOC nodes, page-range gaps, and out-of-range sections to ensure robust, non-destructive heading resolution.
Ensure returned heading paths are defensively copied to avoid external mutation of internal state.
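The defensive-copy guarantee can be illustrated with a small sketch. The `store` type and `lookup` method are hypothetical stand-ins; they only show the general Go idiom of copying a slice before handing it to a caller, not how the PR structures its internals.

```go
package main

import "fmt"

// store is a hypothetical holder of internally cached heading paths.
type store struct {
	paths map[string][]string
}

// lookup returns a fresh copy so callers cannot mutate the cached slice.
func (s store) lookup(id string) []string {
	p, ok := s.paths[id]
	if !ok {
		return nil
	}
	out := make([]string, len(p))
	copy(out, p)
	return out
}

func main() {
	s := store{paths: map[string][]string{"sec-1": {"Item 8", "Balance Sheet"}}}
	got := s.lookup("sec-1")
	got[1] = "mutated" // only the caller's copy changes
	fmt.Println(s.lookup("sec-1"))
}
```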

Tests:

Add unit tests covering containment depth precedence, open-ended TOC bounds, top-level-only mappings, straddling sections, non-paginated degradation, empty/nil TOC and root handling, out-of-range sections, empty-title wrappers, and defensive copy behavior.

Summary by CodeRabbit

New Features
- Implemented improved table of contents reconciliation with document sections, providing more accurate heading path mapping with enhanced edge-case handling for document structure variations.
Tests
- Added comprehensive test coverage for heading path resolution, including boundary conditions, overlap scenarios, and defensive copying validation.

Ingestion builds two independent structures — the parser's Section tree (content, summaries; the IDs citations resolve to) and the LLM-built TOC tree (the logical outline with clean headings + page anchors). They are never reconciled, so the map a citation resolves against and the map that holds the real headings can diverge. BuildHeadingPaths closes that gap without merging the trees: for every section it returns the canonical heading path it belongs under, matched by page-range containment (deepest containing TOC node wins; best overlap when a section straddles a boundary). Sections with no page range, and every section when the TOC is empty, are absent so callers fall back to existing behaviour. This is the reconciliation map HAL-70 needs to emit a real structural title_path on citations.

sourcery-ai · 2026-06-17T22:34:13Z

Reviewer's Guide

Adds a new canonical mapping from parser sections to TOC-derived heading paths based on page-range matching, including TOC flattening, overlap/containment scoring, and comprehensive tests for edge cases and degradation behavior.

Flow diagram for BuildHeadingPaths reconciliation process

flowchart TD
    A[Section root] --> B[BuildHeadingPaths]
    C[TOCNode slice] --> B
    B -->|root is nil or TOC empty| D[Return empty map]
    B -->|valid root and TOC| E[documentMaxPage]
    E --> F[flattenTOC]
    F --> G[tocEntry list]
    G --> H[Walk sections]
    H -->|section has valid page range| I[bestHeadingPath]
    H -->|no page range| H
    I -->|match found| J[Store SectionID -> heading path]
    I -->|no match| H
    J --> H
    H -->|walk complete| K["Return map[SectionID][]string"]

File-Level Changes

Change: Introduce BuildHeadingPaths to reconcile Section tree and TOC tree into a canonical heading-path map using page-range containment and deterministic tie-breaking.
Files: `pkg/tree/heading_path.go`
Details:
- Add BuildHeadingPaths to walk the Section tree and map each paginated section ID to the best-matching TOC heading path
- Compute documentMaxPage to bound open-ended TOC nodes based on both sections and TOC metadata
- Flatten the hierarchical TOC into tocEntry records with resolved page spans and accumulated title paths, skipping empty-title nodes in paths
- Resolve effective EndPage for open-ended TOC nodes using next sibling, parent bound, or document max page
- Select the best heading via bestHeadingPath and lessSpecific using containment, depth, overlap, span, and start-page tiebreakers, and defensively copy returned paths

Change: Add targeted unit tests covering matching rules, TOC edge cases, degradation semantics, and defensive copying guarantees for heading paths.
Files: `pkg/tree/heading_path_test.go`
Details:
- Introduce helpers and a representative financialTOC fixture mirroring SEC-filing TOC structure
- Test deepest-containing heading wins, open-ended last child bounded by parent, and sections under top-level-only TOC nodes
- Test straddling sections selecting the best overlap/container, skipping sections without page ranges, and absence when TOC is empty or root is nil
- Verify sections outside TOC page ranges are unmapped and empty-title wrapper nodes do not appear in heading paths
- Ensure BuildHeadingPaths results are defensively copied so callers cannot mutate internal state across calls

coderabbitai · 2026-06-17T22:34:21Z

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 58299aea-0a1f-4dec-9798-cdaf59bb7101

📥 Commits

Reviewing files that changed from the base of the PR and between c011453 and b027956.

📒 Files selected for processing (1)

pkg/tree/heading_path.go

📝 Walkthrough

Adds pkg/tree/heading_path.go implementing BuildHeadingPaths, which maps each SectionID in a parsed section tree to a canonical heading-path slice derived from a TOC forest. The TOC is flattened depth-first with resolved inclusive end pages; each section selects the best overlapping TOC entry by an explicit containment → depth → overlap → span → start-page precedence. A 10-case test suite is added in pkg/tree/heading_path_test.go.

Changes

BuildHeadingPaths implementation and tests

Layer / File(s)	Summary
Public entry point and `tocEntry` struct `pkg/tree/heading_path.go`	`tocEntry` struct with span method is defined; `BuildHeadingPaths` walks the section tree, skips sections without valid page ranges, invokes flattening, and selects the best TOC match per section.
TOC flattening, end-page resolution, and title normalisation `pkg/tree/heading_path.go`	`flattenTOC`/`flattenTOCAt` perform depth-first traversal accumulating heading paths and skipping empty titles; `resolveEndPage` infers inclusive end pages from next sibling, parent, or `documentMaxPage`; `normaliseTitle` trims whitespace.
Best-match selection and range utilities `pkg/tree/heading_path.go`	`bestHeadingPath` filters flat entries by non-zero overlap and picks the winner via `lessSpecific`, which encodes containment > depth > overlap amount > span tightness > start-page tie-breaking; `contains` and `overlapPages` implement the page-range math.
Test fixtures and all test cases `pkg/tree/heading_path_test.go`	`sec`/`financialTOC` helpers plus ten tests covering deepest-containing wins, open-ended last-child bounding, top-level-only, straddling best-overlap, zero page-range skipping, nil/empty TOC degradation, nil root safety, outside-range absence, empty-title skipping, and defensive copy guarantee.
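The flattening step summarised in the table can be sketched as a depth-first walk that accumulates title paths. This is a simplified illustration: `tocNode`, `flatEntry`, and `flatten` are hypothetical names, page spans are left out for brevity, and the choice to emit no entry for an empty-title wrapper while still descending into its children is an assumption based on the "empty-title wrapper skipped" test description.

```go
package main

import (
	"fmt"
	"strings"
)

// tocNode is a hypothetical TOC node (page fields omitted for brevity).
type tocNode struct {
	Title    string
	Children []tocNode
}

// flatEntry is a hypothetical flattened record: the heading path and its depth.
type flatEntry struct {
	Path  []string
	Depth int
}

// flatten walks the TOC forest depth-first, appending one entry per titled
// node. Empty-title wrappers contribute nothing to the path, but their
// children are still visited.
func flatten(nodes []tocNode, prefix []string, depth int, out []flatEntry) []flatEntry {
	for _, n := range nodes {
		path := prefix
		if t := strings.TrimSpace(n.Title); t != "" {
			// Copy before appending so sibling paths never share a backing array.
			path = append(append([]string(nil), prefix...), t)
			out = append(out, flatEntry{Path: path, Depth: depth})
		}
		out = flatten(n.Children, path, depth+1, out)
	}
	return out
}

func main() {
	toc := []tocNode{
		{Title: "", Children: []tocNode{ // wrapper with no title
			{Title: "Item 8", Children: []tocNode{
				{Title: "  Balance Sheet "},
				{Title: "Notes"},
			}},
		}},
	}
	for _, e := range flatten(toc, nil, 0, nil) {
		fmt.Println(e.Depth, e.Path)
	}
}
```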

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 A rabbit hopped through pages of text,
Mapping each section to headings annexed.
The TOC flattened, the ranges aligned,
Containment checked, the deepest assigned.
With defensive copies and nil roots tamed,
Every edge case tested and named!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main change: introducing a function to reconcile section and TOC trees into a heading-path map, directly addressing the PR objective.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

sourcery-ai

Hey - I've left some high level feedback:

In tocEntry.span, consider replacing the hard-coded 1 << 30 sentinel with a named constant (or math.MaxInt) to make the intent clearer and avoid surprising behaviour on different architectures.

hallelx2 · 2026-06-17T22:57:53Z

Thanks @sourcery-ai — fixed in b027956: replaced the 1 << 30 sentinel in tocEntry.span with math.MaxInt (explicit + width-independent).
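For readers without the diff at hand, the shape of that change can be sketched like this. The field names and the definition of "malformed" (end page before start page) are assumptions; only the use of `math.MaxInt` as the sentinel in `tocEntry.span` comes from the review thread.

```go
package main

import (
	"fmt"
	"math"
)

// tocEntry is a hypothetical stand-in with an inclusive page range.
type tocEntry struct {
	Start, End int
}

// span returns the inclusive page count. A malformed range yields
// math.MaxInt, so it loses any "tighter span wins" comparison without
// relying on a magic number or a particular int width.
func (e tocEntry) span() int {
	if e.End < e.Start {
		return math.MaxInt
	}
	return e.End - e.Start + 1
}

func main() {
	fmt.Println(tocEntry{Start: 45, End: 47}.span())                // well-formed range
	fmt.Println(tocEntry{Start: 50, End: 40}.span() == math.MaxInt) // malformed range
}
```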

sourcery-ai Bot reviewed Jun 17, 2026

hallelx2 mentioned this pull request Jun 17, 2026

api: emit canonical heading path on treewalk citations (HAL-70) #40

Closed

tree: use math.MaxInt for the malformed-span sentinel (review)

b027956

Replaces the 1<<30 magic sentinel in tocEntry.span with math.MaxInt so the intent is explicit and it doesn't depend on int width. Per Sourcery review on #39.

hallelx2 merged commit d5881ec into main Jun 17, 2026
8 checks passed

hallelx2 deleted the halleluyaholudele/hal-109-ingest-unify-the-parsed-section-tree-and-the-llm-toc-tree branch June 17, 2026 22:59

hallelx2 mentioned this pull request Jun 17, 2026

api: emit canonical heading path on treewalk citations (HAL-70) #41

Merged

hallelx2 merged 2 commits into main from halleluyaholudele/hal-109-ingest-unify-the-parsed-section-tree-and-the-llm-toc-tree
