fix(mtree): resolve phantom mismatches and heal stale leaves in table…#135
Conversation
|
Warning Review limit reached
Next review available in: 54 minutes Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available. How can I continue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews. How do review limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please refer docs for additional details. Review details⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (8)
📝 WalkthroughWalkthroughAdds a ChangesMerkle tree stale leaf healing
CDC bounded drain transaction safety
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Up to standards ✅🟢 Issues
|
| Category | Results |
|---|---|
| Complexity | 2 medium |
🟢 Metrics 8 complexity · 2 duplication
Metric Results Complexity 8 Duplication 2
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests/integration/mtree_stale_leaf_test.go (1)
58-58: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winLog a warning instead of silently discarding
MtreeTeardown()errors.Based on learnings, cleanup callbacks in this test suite should log unexpected failures with
t.Logf("Warning: ...")rather than silently discard them;MtreeTeardown()is not simple idempotent DDL likeDROP TABLE IF EXISTS, so a failure here could leave a dangling publication/slot that affects later tests.♻️ Proposed fix
- t.Cleanup(func() { _ = task.MtreeTeardown() }) + t.Cleanup(func() { + if err := task.MtreeTeardown(); err != nil { + t.Logf("Warning: MtreeTeardown failed: %v", err) + } + })🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/mtree_stale_leaf_test.go` at line 58, The cleanup in mtree_stale_leaf_test should not silently ignore MtreeTeardown() failures; update the t.Cleanup callback to capture the returned error from task.MtreeTeardown() and log a warning with t.Logf("Warning: ...") when it fails. Keep the change localized to the mtree_stale_leaf_test cleanup around MtreeTeardown so unexpected teardown problems are surfaced during test runs instead of being discarded.Source: Learnings
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/integration/mtree_stale_leaf_test.go`:
- Line 58: The cleanup in mtree_stale_leaf_test should not silently ignore
MtreeTeardown() failures; update the t.Cleanup callback to capture the returned
error from task.MtreeTeardown() and log a warning with t.Logf("Warning: ...")
when it fails. Keep the change localized to the mtree_stale_leaf_test cleanup
around MtreeTeardown so unexpected teardown problems are surfaced during test
runs instead of being discarded.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: debef0db-5972-407d-aa3d-bdc88f2ca88e
📒 Files selected for processing (6)
db/queries/queries.godb/queries/templates.godocs/commands/mtree/mtree-table-diff.mdinternal/consistency/mtree/merkle.gointernal/infra/cdc/listen.gotests/integration/mtree_stale_leaf_test.go
…-diff After an insert replicated cleanly to a subscriber, mtree table-diff logged "Trees differ ... Found 1 mismatched blocks" yet concluded TABLES MATCH, and mtree update never cleared the stale leaf -- only a full mtree build did. Reproduced on a 2-node spock cluster; two drain defects combined to cause it: - The bounded drain's target was pg_current_wal_flush_lsn(). Spock applies transactions with asynchronous commit, so a row could be visible to readers while its commit record was still beyond the flush pointer: the drain stopped "caught up" without it and the subscriber's tree went stale behind data the user could already SELECT. The drain target (drainTargetLSN) is now captured by nudging a trivial committing transaction and reading the WAL insert position in the same statement: every commit visible at call time sits at or below the target, and the nudge's own commit record lands just above it, so the WAL writer flushes past the target within one wal_writer_delay in every synchronous_commit configuration -- the target is reachable by the walsender by construction. - All three stop points (data-path, keepalive prompt-stop, idle-timeout) could fire between a transaction's data messages and its commit, discarding the buffered changes while checkpointing at or past the commit -- skipping the transaction permanently, which is why only a rebuild recovered. A bounded drain now never stops while a transaction buffer is open (inTx). DiffMtree also degrades honestly and self-heals: when every mismatched block between a node pair holds no row differences, they are reported as stale tree hashes rather than left contradicting the TABLES MATCH verdict, and exactly those leaves are rehashed from live data (mark dirty -> rehash -> rebuild parents -> clear flags) so the phantom cannot recur on subsequent diffs. Healing recomputes honest hashes, so it can never mask real divergence; if roots still differ afterwards the diff warns that leaf ranges have diverged and recommends a rebuild. A zero diff count only proves staleness when the comparison actually completed, so pairs that lost comparison work items to errors are skipped with a re-run hint, --until runs never heal (their comparison deliberately excludes rows past the cutoff), and blocks dirtied concurrently by CDC keep their dirty state and counters for the next update. Refreshes are recorded per pair in the summary's stale_blocks_refreshed so reports distinguish a clean diff from a healed one. Integration-tested: a deterministic stale-leaf test asserts the diff reports zero row differences, heals the corrupted leaf back to its build-time hash, surfaces the refresh in the summary, clears its dirty flags, and runs clean on the next pass; the end-to-end replicated-insert scenario no longer reproduces the phantom in either the append or in-range variant. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
73523f1 to
11836af
Compare
|
@CodeRabbit review |
✅ Action performedReview finished.
|
…-diff
After an insert replicated cleanly to a subscriber, mtree table-diff logged "Trees differ ... Found 1 mismatched blocks" yet concluded TABLES MATCH, and mtree update never cleared the stale leaf -- only a full mtree build did. Reproduced on a 2-node spock cluster; two drain defects combined to cause it:
The bounded drain's target was pg_current_wal_flush_lsn(). Spock applies transactions with asynchronous commit, so a row could be visible to readers while its commit record was still beyond the flush pointer: the drain stopped "caught up" without it and the subscriber's tree went stale behind data the user could already SELECT. The target is now pg_current_wal_insert_lsn(), which covers every visible commit; the walsender reaches it within one WAL-writer flush cycle.
All three stop points (data-path, keepalive prompt-stop, idle-timeout) could fire between a transaction's data messages and its commit, discarding the buffered changes while checkpointing at or past the commit -- skipping the transaction permanently, which is why only a rebuild recovered. A bounded drain now never stops while a transaction buffer is open (inTx).
DiffMtree also degrades honestly and self-heals: mismatched blocks whose row comparison finds no differences are reported as stale tree hashes rather than left contradicting the TABLES MATCH verdict, and those leaves are rehashed from live data (mark dirty -> rehash -> rebuild parents -> clear flags) so the phantom cannot recur on subsequent diffs whatever made the tree stale. Healing recomputes honest hashes, so it can never mask real divergence; if roots still differ afterwards the diff warns that leaf ranges have diverged and recommends a rebuild.
Integration-tested: a deterministic stale-leaf test asserts the diff reports zero row differences, heals the tree, and runs clean on the next pass; the end-to-end replicated-insert scenario no longer reproduces the phantom in either the append or in-range variant.