tree: reconcile section tree + LLM TOC into a canonical heading-path map (HAL-109) #39

Merged
hallelx2 merged 2 commits into main from halleluyaholudele/hal-109-ingest-unify-the-parsed-section-tree-and-the-llm-toc-tree on Jun 17, 2026

@hallelx2 hallelx2 commented Jun 17, 2026

Owner

What

Reconciles the two structures ingestion builds — the parser's Section tree (content, summaries; the IDs every citation resolves to) and the LLM-built TOC tree (documents.toc_tree, the logical outline with clean headings + page anchors) — into one canonical lookup: section ID → logical heading path.

The two were never reconciled, so the map a citation resolves against (parser chunk titles) and the map that holds the real headings (the TOC) diverge. That divergence is the root cause of the bench's path_correct@1 = 0% (see HAL-70): citations carry parser-tree IDs whose titles are not the "Item 8" → "Balance Sheet" vocabulary the gold anchors expect.

How

tree.BuildHeadingPaths(root, toc) map[SectionID][]string matches each section to the TOC by page-range containment:

  1. containment: a TOC node that fully contains the section beats one that only overlaps it;
  2. depth: among containing nodes the deeper (more specific) heading wins, so a section under Item 8 → Balance Sheet gets the full path [Item 8, Balance Sheet] rather than stopping at Item 8;
  3. overlap → span → start page: deterministic tie-breaks, applied in that order (larger overlap, then tighter span, then start page).
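
The precedence above can be sketched as a comparator over flattened TOC candidates. This is a minimal illustration, not the PR's code: the candidate type, the better and bestMatch names, the use of path length as depth, and the choice that an earlier start page wins the final tie are all assumptions.

```go
package main

import "fmt"

// candidate is a hypothetical flattened TOC node (not the PR's tocEntry).
type candidate struct {
	Path       []string // heading path from the TOC root down to this node
	Start, End int      // inclusive page range
}

// overlapPages counts the pages shared by [aStart,aEnd] and [bStart,bEnd].
func overlapPages(aStart, aEnd, bStart, bEnd int) int {
	lo, hi := aStart, aEnd
	if bStart > lo {
		lo = bStart
	}
	if bEnd < hi {
		hi = bEnd
	}
	if hi < lo {
		return 0
	}
	return hi - lo + 1
}

// better reports whether a is a better match than b for the section
// spanning [start,end]: containment, then depth, then overlap, span, start.
func better(a, b candidate, start, end int) bool {
	aIn := a.Start <= start && end <= a.End
	bIn := b.Start <= start && end <= b.End
	if aIn != bIn {
		return aIn // full containment beats partial overlap
	}
	if len(a.Path) != len(b.Path) {
		return len(a.Path) > len(b.Path) // assumption: path length stands in for depth
	}
	ao := overlapPages(a.Start, a.End, start, end)
	bo := overlapPages(b.Start, b.End, start, end)
	if ao != bo {
		return ao > bo // larger overlap wins
	}
	if aSpan, bSpan := a.End-a.Start, b.End-b.Start; aSpan != bSpan {
		return aSpan < bSpan // tighter span wins
	}
	return a.Start < b.Start // assumption: earlier start page wins
}

// bestMatch returns the winning heading path, or false when nothing overlaps.
func bestMatch(cands []candidate, start, end int) ([]string, bool) {
	var best *candidate
	for i := range cands {
		c := &cands[i]
		if overlapPages(c.Start, c.End, start, end) == 0 {
			continue
		}
		if best == nil || better(*c, *best, start, end) {
			best = c
		}
	}
	if best == nil {
		return nil, false
	}
	return best.Path, true
}

func main() {
	toc := []candidate{
		{Path: []string{"Item 8"}, Start: 40, End: 90},
		{Path: []string{"Item 8", "Balance Sheet"}, Start: 45, End: 47},
		{Path: []string{"Item 8", "Income Statement"}, Start: 48, End: 50},
	}
	fmt.Println(bestMatch(toc, 45, 46))   // contained by both Item 8 and Balance Sheet
	fmt.Println(bestMatch(toc, 47, 48))   // straddles two children, contained only by Item 8
	fmt.Println(bestMatch(toc, 100, 101)) // outside every TOC range
}
```

In this sketch a section straddling two child headings falls back to the parent that fully contains it, which is one reading of rule 1; the real tie-break directions should be checked against lessSpecific in pkg/tree/heading_path.go.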

Open-ended TOC nodes (EndPage = 0) resolve their effective end from the next sibling, then the parent's end, then the document's last page — so a trailing "Notes" section can't leak past its parent.
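
A rough sketch of that end-page fallback chain follows. The TOCNode fields shown, the effectiveEnd and resolveAll names, and the convention that a node ends on the page before its next sibling starts are assumptions made for illustration; the PR's resolveEndPage may differ in detail.

```go
package main

import "fmt"

// TOCNode is a hypothetical stand-in for the stored TOC node.
// EndPage == 0 marks an open-ended node, as described above.
type TOCNode struct {
	Title     string
	StartPage int
	EndPage   int
	Children  []TOCNode
}

// effectiveEnd resolves the inclusive end page of siblings[i]. parentEnd is
// the already-resolved end of the parent, or the document's last page when
// the node sits at the top level.
func effectiveEnd(siblings []TOCNode, i, parentEnd int) int {
	n := siblings[i]
	if n.EndPage > 0 {
		return n.EndPage // explicit end page, nothing to infer
	}
	if i+1 < len(siblings) {
		// Assumption: the node ends on the page before the next sibling starts.
		end := siblings[i+1].StartPage - 1
		if end < n.StartPage {
			end = n.StartPage // siblings sharing a page: keep a one-page span
		}
		return end
	}
	// Last child: bounded by the parent (or the document) so it cannot leak.
	return parentEnd
}

// resolveAll walks the forest and records each title's resolved end page.
func resolveAll(nodes []TOCNode, parentEnd int, ends map[string]int) {
	for i, n := range nodes {
		end := effectiveEnd(nodes, i, parentEnd)
		ends[n.Title] = end
		resolveAll(n.Children, end, ends)
	}
}

func main() {
	toc := []TOCNode{
		{Title: "Item 8", StartPage: 40, EndPage: 90, Children: []TOCNode{
			{Title: "Balance Sheet", StartPage: 45},
			{Title: "Notes", StartPage: 51}, // trailing open-ended child
		}},
		{Title: "Item 9", StartPage: 91}, // open-ended last top-level node
	}
	ends := map[string]int{}
	resolveAll(toc, 120, ends) // 120 is the assumed last page of the document
	for _, t := range []string{"Item 8", "Balance Sheet", "Notes", "Item 9"} {
		fmt.Printf("%s ends on page %d\n", t, ends[t])
	}
}
```

Passing the parent's resolved end down the recursion is what keeps the trailing "Notes" node inside Item 8 instead of running to the end of the document.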

Degradation

Pure, additive function — no schema change, no migration, nothing else wired in this PR. Sections with no page range, and every section when the TOC is empty/nil, are simply absent from the map, so the consumer (HAL-70) falls back to today's behaviour. It can never make a citation worse than now.
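
On the consumer side, the absence-means-fallback contract could look roughly like the sketch below. The titlePath helper and the string-based SectionID are hypothetical; only the shape of the map (section ID to heading path) comes from the description above.

```go
package main

import "fmt"

// SectionID is assumed to be string-like here; the real type lives in pkg/tree.
type SectionID string

// titlePath prefers the reconciled heading path and falls back to the
// parser's own title when the section is absent from the map. Indexing a
// nil map is safe in Go, so an empty or nil result degrades the same way.
func titlePath(paths map[SectionID][]string, id SectionID, parserTitle string) []string {
	if p, ok := paths[id]; ok && len(p) > 0 {
		return p
	}
	return []string{parserTitle}
}

func main() {
	paths := map[SectionID][]string{
		"sec-12": {"Item 8", "Balance Sheet"},
	}
	fmt.Println(titlePath(paths, "sec-12", "chunk 12")) // reconciled path
	fmt.Println(titlePath(paths, "sec-99", "chunk 99")) // absent: parser title
	fmt.Println(titlePath(nil, "sec-12", "chunk 12"))   // nil map: parser title
}
```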

Tests

10 cases: deepest-containing wins, open-ended-last-child bounding, top-level-only sections, straddling sections (best overlap), no-page-range skip, empty/nil TOC degradation, nil root, out-of-range absence, empty-title wrapper skipped, and defensive-copy isolation. go build ./..., go test ./pkg/tree/..., gofmt, go vet all clean.

Scope

This is the reconciliation map slice of the issue — it removes the drift by giving every consumer one canonical heading path to resolve against, which is what unblocks HAL-70. Collapsing the two trees into a single struct (and dropping the second tree entirely) remains a larger follow-up; this lands the unblocking value with the smallest blast radius.

Closes HAL-109

Summary by Sourcery

Add a canonical mapping from parser sections to TOC-based heading paths to reconcile citations with logical document structure.

Enhancements:

  • Introduce BuildHeadingPaths to derive logical heading paths for sections by reconciling the section tree with the LLM-generated TOC using page-range matching.
  • Handle open-ended and wrapper TOC nodes, page-range gaps, and out-of-range sections to ensure robust, non-destructive heading resolution.
  • Ensure returned heading paths are defensively copied to avoid external mutation of internal state.
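
As an illustration of the defensive-copy point, a path slice can be duplicated before it is handed to a caller so that later writes do not reach the shared backing array. The copyPath name is hypothetical and this is only one way to do it.

```go
package main

import "fmt"

// copyPath returns a fresh slice with the same elements, so the caller's
// mutations cannot touch the slice held internally.
func copyPath(p []string) []string {
	out := make([]string, len(p))
	copy(out, p)
	return out
}

func main() {
	internal := []string{"Item 8", "Balance Sheet"}

	returned := copyPath(internal)
	returned[0] = "MUTATED" // caller scribbles on its own copy

	fmt.Println(internal) // internal state is unchanged
	fmt.Println(returned)
}
```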

Tests:

  • Add unit tests covering containment depth precedence, open-ended TOC bounds, top-level-only mappings, straddling sections, non-paginated degradation, empty/nil TOC and root handling, out-of-range sections, empty-title wrappers, and defensive copy behavior.

Summary by CodeRabbit

  • New Features

    • Implemented improved table of contents reconciliation with document sections, providing more accurate heading path mapping with enhanced edge-case handling for document structure variations.
  • Tests

    • Added comprehensive test coverage for heading path resolution, including boundary conditions, overlap scenarios, and defensive copying validation.

Ingestion builds two independent structures — the parser's Section tree
(content, summaries; the IDs citations resolve to) and the LLM-built TOC
tree (the logical outline with clean headings + page anchors). They are
never reconciled, so the map a citation resolves against and the map that
holds the real headings can diverge.

BuildHeadingPaths closes that gap without merging the trees: for every
section it returns the canonical heading path it belongs under, matched
by page-range containment (deepest containing TOC node wins; best overlap
when a section straddles a boundary). Sections with no page range, and
every section when the TOC is empty, are absent so callers fall back to
existing behaviour.

This is the reconciliation map HAL-70 needs to emit a real structural
title_path on citations.

sourcery-ai Bot commented Jun 17, 2026

Reviewer's Guide

Adds a new canonical mapping from parser sections to TOC-derived heading paths based on page-range matching, including TOC flattening, overlap/containment scoring, and comprehensive tests for edge cases and degradation behavior.

Flow diagram for BuildHeadingPaths reconciliation process

flowchart TD
    A[Section root] --> B[BuildHeadingPaths]
    C[TOCNode slice] --> B
    B -->|root is nil or TOC empty| D[Return empty map]
    B -->|valid root and TOC| E[documentMaxPage]
    E --> F[flattenTOC]
    F --> G[tocEntry list]
    G --> H[Walk sections]
    H -->|section has valid page range| I[bestHeadingPath]
    H -->|no page range| H
    I -->|match found| J[Store SectionID -> heading path]
    I -->|no match| H
    J --> H
    H -->|walk complete| K["Return map[SectionID][]string"]

File-Level Changes

Change: Introduce BuildHeadingPaths to reconcile Section tree and TOC tree into a canonical heading-path map using page-range containment and deterministic tie-breaking.
Files: pkg/tree/heading_path.go
Details:
  • Add BuildHeadingPaths to walk the Section tree and map each paginated section ID to the best-matching TOC heading path
  • Compute documentMaxPage to bound open-ended TOC nodes based on both sections and TOC metadata
  • Flatten the hierarchical TOC into tocEntry records with resolved page spans and accumulated title paths, skipping empty-title nodes in paths
  • Resolve effective EndPage for open-ended TOC nodes using next sibling, parent bound, or document max page
  • Select the best heading via bestHeadingPath and lessSpecific using containment, depth, overlap, span, and start-page tiebreakers, and defensively copy returned paths

Change: Add targeted unit tests covering matching rules, TOC edge cases, degradation semantics, and defensive copying guarantees for heading paths.
Files: pkg/tree/heading_path_test.go
Details:
  • Introduce helpers and a representative financialTOC fixture mirroring SEC-filing TOC structure
  • Test deepest-containing heading wins, open-ended last child bounded by parent, and sections under top-level-only TOC nodes
  • Test straddling sections selecting the best overlap/container, skipping sections without page ranges, and absence when TOC is empty or root is nil
  • Verify sections outside TOC page ranges are unmapped and empty-title wrapper nodes do not appear in heading paths
  • Ensure BuildHeadingPaths results are defensively copied so callers cannot mutate internal state across calls

coderabbitai Bot commented Jun 17, 2026

📥 Commits

Reviewing files that changed from the base of the PR and between c011453 and b027956.

📒 Files selected for processing (1)
  • pkg/tree/heading_path.go
📝 Walkthrough

Walkthrough

Adds pkg/tree/heading_path.go implementing BuildHeadingPaths, which maps each SectionID in a parsed section tree to a canonical heading-path slice derived from a TOC forest. The TOC is flattened depth-first with resolved inclusive end pages; each section selects the best overlapping TOC entry by an explicit containment → depth → overlap → span → start-page precedence. A 10-case test suite is added in pkg/tree/heading_path_test.go.

Changes

BuildHeadingPaths implementation and tests

Layer: Public entry point and tocEntry struct
File(s): pkg/tree/heading_path.go
Summary: tocEntry struct with span method is defined; BuildHeadingPaths walks the section tree, skips sections without valid page ranges, invokes flattening, and selects the best TOC match per section.

Layer: TOC flattening, end-page resolution, and title normalisation
File(s): pkg/tree/heading_path.go
Summary: flattenTOC/flattenTOCAt perform depth-first traversal accumulating heading paths and skipping empty titles; resolveEndPage infers inclusive end pages from next sibling, parent, or documentMaxPage; normaliseTitle trims whitespace.

Layer: Best-match selection and range utilities
File(s): pkg/tree/heading_path.go
Summary: bestHeadingPath filters flat entries by non-zero overlap and picks the winner via lessSpecific, which encodes containment > depth > overlap amount > span tightness > start-page tie-breaking; contains and overlapPages implement the page-range math.

Layer: Test fixtures and all test cases
File(s): pkg/tree/heading_path_test.go
Summary: sec/financialTOC helpers plus ten tests covering deepest-containing wins, open-ended last-child bounding, top-level-only, straddling best-overlap, zero page-range skipping, nil/empty TOC degradation, nil root safety, outside-range absence, empty-title skipping, and defensive copy guarantee.
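
The flattening layer summarised above can be pictured with a small depth-first sketch. It is illustrative only: the flatEntry type and flatten function are hypothetical, page ranges are left out, and the choice to emit no entry at all for an empty-title wrapper is an assumption (the source only says such titles are skipped in paths).

```go
package main

import (
	"fmt"
	"strings"
)

// TOCNode is a hypothetical, page-less stand-in for the stored TOC node.
type TOCNode struct {
	Title    string
	Children []TOCNode
}

// flatEntry is one flattened heading with its accumulated title path.
type flatEntry struct {
	Path  []string
	Depth int // tree depth, counted even through empty-title wrappers
}

// flatten walks the forest depth-first, accumulating trimmed titles into a
// path and skipping nodes whose title is empty after trimming.
func flatten(nodes []TOCNode, prefix []string, depth int, out *[]flatEntry) {
	for _, n := range nodes {
		path := prefix
		if t := strings.TrimSpace(n.Title); t != "" {
			// Copy before appending so siblings never share a backing array.
			path = append(append([]string(nil), prefix...), t)
			*out = append(*out, flatEntry{Path: path, Depth: depth})
		}
		flatten(n.Children, path, depth+1, out)
	}
}

func main() {
	toc := []TOCNode{
		{Title: "  ", Children: []TOCNode{ // empty-title wrapper
			{Title: "Item 8", Children: []TOCNode{
				{Title: " Balance Sheet "},
			}},
		}},
	}
	var entries []flatEntry
	flatten(toc, nil, 0, &entries)
	for _, e := range entries {
		fmt.Println(e.Depth, strings.Join(e.Path, " → "))
	}
}
```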

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

@sourcery-ai sourcery-ai Bot left a comment

Hey - I've left some high level feedback:

  • In tocEntry.span, consider replacing the hard-coded 1 << 30 sentinel with a named constant (or math.MaxInt) to make the intent clearer and avoid surprising behaviour on different architectures.

Replaces the 1<<30 magic sentinel in tocEntry.span with math.MaxInt so the
intent is explicit and it doesn't depend on int width. Per Sourcery review
on #39.
@hallelx2

Owner Author

Thanks @sourcery-ai — fixed in b027956: replaced the 1 << 30 sentinel in tocEntry.span with math.MaxInt (explicit + width-independent).
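
For readers without the diff open, the change amounts to something like the sketch below. The entry type and the exact condition under which the sentinel is returned are assumptions; only the swap from 1 << 30 to math.MaxInt is taken from the thread.

```go
package main

import (
	"fmt"
	"math"
)

// entry is a hypothetical stand-in for tocEntry with an inclusive page range.
type entry struct {
	Start, End int
}

// span returns the number of pages covered. An entry without a usable range
// returns math.MaxInt (available since Go 1.17) so it always compares as the
// loosest span, whatever the platform's int width.
func (e entry) span() int {
	if e.Start <= 0 || e.End < e.Start {
		return math.MaxInt
	}
	return e.End - e.Start + 1
}

func main() {
	fmt.Println(entry{Start: 45, End: 47}.span())
	fmt.Println(entry{}.span() == math.MaxInt)
}
```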

@hallelx2 hallelx2 merged commit d5881ec into main Jun 17, 2026
8 checks passed
@hallelx2 hallelx2 deleted the halleluyaholudele/hal-109-ingest-unify-the-parsed-section-tree-and-the-llm-toc-tree branch June 17, 2026 22:59