Skip to content

fix: avoid lost-wakeup hang in DeleteFileIndex::get_deletes_for_data_file#2696

Open
dhruvarya-db wants to merge 2 commits into
apache:mainfrom
dhruvarya-db:fix-delete-file-index-lost-wakeup
Open

fix: avoid lost-wakeup hang in DeleteFileIndex::get_deletes_for_data_file#2696
dhruvarya-db wants to merge 2 commits into
apache:mainfrom
dhruvarya-db:fix-delete-file-index-lost-wakeup

Conversation

@dhruvarya-db

@dhruvarya-db dhruvarya-db commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

N/A — concurrency bug found while reviewing #2590.

What changes are included in this PR?

Fixes a lost-wakeup hang in DeleteFileIndex::get_deletes_for_data_file.

The bug

DeleteFileIndex populates asynchronously on a background task that signals completion with tokio::sync::Notify::notify_waiters(). Per the tokio docs, notify_waiters() stores no permit and only wakes Notified futures that have already been created.

get_deletes_for_data_file observes the Populating state, releases the read lock, and only then creates the Notified future. If the populating task flips the state to Populated and calls notify_waiters() in that window, the wakeup is lost and the query awaits forever.

The window is normally only a few instructions wide and is effectively never hit, because the querier usually registers before the populator is even scheduled. This is reachable from the SnapshotValidator commit path (#2590), which drops the channel sender immediately before querying — making the populator finish right as the query starts.

The fix

Create the Notified future (via notified_owned) while still holding the read lock. The populating task cannot call notify_waiters() until it acquires the write lock, which is blocked while the read lock is held — so the notification is guaranteed to happen after the future has been created, and tokio guarantees such notifications are delivered. No re-check needed.

The same latent pattern exists in arrow/delete_filter.rs and arrow/caching_delete_file_loader.rs; this PR fixes only the index site, and I'm happy to follow up on the others.

Are these changes tested?

This production change is intentionally minimal (no test), because a deterministic regression test requires a small test-only seam to make the otherwise nanosecond-wide race window reliable. To keep that out of this PR, the seam + test live in companion PRs on my fork:

I'm happy to fold the regression test into this PR if maintainers prefer.

How the race happens (before this fix)

get_deletes_for_data_file releases the read lock at (B) and only creates its Notified future at (C). The populating task can flip the state to Populated and call notify_waiters() in the window between (B) and (C). Because notify_waiters() stores no permit (unlike notify_one()) and only wakes Notified futures created before the call, a notification fired before the future exists is lost — and the consumer then awaits forever:

 CONSUMER  get_deletes_for_data_file       PRODUCER  populating task
 ─────────────────────────────────         ──────────────────────────
 read lock: state == Populating   (A)
 clone notifier (Arc<Notify>)
 drop read lock                   (B)
        │
        │  ◄═══════ WINDOW ═══════►         write lock: state = Populated
        │                                   drop write lock
        │                                   notify_waiters()  ✗ no waiter registered yet
        │                                           → wakeup lost (no permit stored)
        ▼
 notifier.notified().await        (C)        (task has exited)
        │
        ▼
   ░░░ awaits forever ░░░  ← never woken → commit/query hangs

This is reachable in practice because observing Populating at (A) is the common case: the consumer does only synchronous work between dropping the sender and the read at (A), while the populator must first be woken from its parked collect(), scheduled, and acquire the write lock — so the consumer typically reaches (A) before the state flips.

The fix creates the Notified future at (A), while the read lock is still held. The populator cannot call notify_waiters() until it takes the write lock (blocked while the read lock is held), so the future always precedes any notify_waiters() and the notification is guaranteed to be delivered.

@CTTY CTTY left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this bug and the detailed explaination! The fix LGTM.

Could you add an unit test to cover this behavior?

@dhruvarya-db

Copy link
Copy Markdown
Contributor Author

@CTTY Would this kind of a test be okay? It adds some test-only logic to the main code which is why I was avoiding it:

@dhruvarya-db dhruvarya-db force-pushed the fix-delete-file-index-lost-wakeup branch from e1f603f to 5160de6 Compare June 24, 2026 20:05
@dhruvarya-db dhruvarya-db requested a review from CTTY June 24, 2026 20:26
@dhruvarya-db

Copy link
Copy Markdown
Contributor Author

@CTTY Added a test for the fix

@CTTY CTTY left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @dhruvarya-db , sorry that I didn't notice the test would be like this, and after some thoughts, I don't see an easy way to test this. Let's just proceed without a test

@dhruvarya-db dhruvarya-db force-pushed the fix-delete-file-index-lost-wakeup branch from 5160de6 to e1f603f Compare June 25, 2026 00:44
…file

DeleteFileIndex signals completion of its background populating task with
tokio::sync::Notify::notify_waiters(), which stores no permit and only wakes
Notified futures that already exist. get_deletes_for_data_file observed the
Populating state, released the read lock, and only then created the Notified
future; if the populator flipped the state to Populated and called
notify_waiters() in that window, the wakeup was lost and the query hung forever.

Create the Notified future (via notified_owned) while still holding the read
lock. The populator cannot call notify_waiters() until it takes the write lock,
which is blocked while the read lock is held, so the notification is guaranteed
to occur after the future exists and is delivered.

Signed-off-by: Dhruv Arya <aryadhruv@gmail.com>
@dhruvarya-db dhruvarya-db force-pushed the fix-delete-file-index-lost-wakeup branch from e1f603f to 2840203 Compare June 25, 2026 00:46
@dhruvarya-db dhruvarya-db requested a review from CTTY June 25, 2026 01:04
@dhruvarya-db

Copy link
Copy Markdown
Contributor Author

@CTTY I have removed the test. Could you take another look when you get a chance?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants