fix: avoid lost-wakeup hang in DeleteFileIndex::get_deletes_for_data_file#2696
Open
dhruvarya-db wants to merge 2 commits into
Open
fix: avoid lost-wakeup hang in DeleteFileIndex::get_deletes_for_data_file#2696dhruvarya-db wants to merge 2 commits into
dhruvarya-db wants to merge 2 commits into
Conversation
CTTY
reviewed
Jun 23, 2026
CTTY
left a comment
Collaborator
There was a problem hiding this comment.
Thanks for catching this bug and the detailed explaination! The fix LGTM.
Could you add an unit test to cover this behavior?
Contributor
Author
|
@CTTY Would this kind of a test be okay? It adds some test-only logic to the main code which is why I was avoiding it: |
e1f603f to
5160de6
Compare
Contributor
Author
|
@CTTY Added a test for the fix |
CTTY
reviewed
Jun 25, 2026
CTTY
left a comment
Collaborator
There was a problem hiding this comment.
Hi @dhruvarya-db , sorry that I didn't notice the test would be like this, and after some thoughts, I don't see an easy way to test this. Let's just proceed without a test
5160de6 to
e1f603f
Compare
…file DeleteFileIndex signals completion of its background populating task with tokio::sync::Notify::notify_waiters(), which stores no permit and only wakes Notified futures that already exist. get_deletes_for_data_file observed the Populating state, released the read lock, and only then created the Notified future; if the populator flipped the state to Populated and called notify_waiters() in that window, the wakeup was lost and the query hung forever. Create the Notified future (via notified_owned) while still holding the read lock. The populator cannot call notify_waiters() until it takes the write lock, which is blocked while the read lock is held, so the notification is guaranteed to occur after the future exists and is delivered. Signed-off-by: Dhruv Arya <aryadhruv@gmail.com>
e1f603f to
2840203
Compare
Contributor
Author
|
@CTTY I have removed the test. Could you take another look when you get a chance? |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
N/A — concurrency bug found while reviewing #2590.
What changes are included in this PR?
Fixes a lost-wakeup hang in
DeleteFileIndex::get_deletes_for_data_file.The bug
DeleteFileIndexpopulates asynchronously on a background task that signals completion withtokio::sync::Notify::notify_waiters(). Per the tokio docs,notify_waiters()stores no permit and only wakesNotifiedfutures that have already been created.get_deletes_for_data_fileobserves thePopulatingstate, releases the read lock, and only then creates theNotifiedfuture. If the populating task flips the state toPopulatedand callsnotify_waiters()in that window, the wakeup is lost and the query awaits forever.The window is normally only a few instructions wide and is effectively never hit, because the querier usually registers before the populator is even scheduled. This is reachable from the
SnapshotValidatorcommit path (#2590), which drops the channel sender immediately before querying — making the populator finish right as the query starts.The fix
Create the
Notifiedfuture (vianotified_owned) while still holding the read lock. The populating task cannot callnotify_waiters()until it acquires the write lock, which is blocked while the read lock is held — so the notification is guaranteed to happen after the future has been created, and tokio guarantees such notifications are delivered. No re-check needed.The same latent pattern exists in
arrow/delete_filter.rsandarrow/caching_delete_file_loader.rs; this PR fixes only the index site, and I'm happy to follow up on the others.Are these changes tested?
This production change is intentionally minimal (no test), because a deterministic regression test requires a small test-only seam to make the otherwise nanosecond-wide race window reliable. To keep that out of this PR, the seam + test live in companion PRs on my fork:
I'm happy to fold the regression test into this PR if maintainers prefer.
How the race happens (before this fix)
get_deletes_for_data_filereleases the read lock at (B) and only creates itsNotifiedfuture at (C). The populating task can flip the state toPopulatedand callnotify_waiters()in the window between (B) and (C). Becausenotify_waiters()stores no permit (unlikenotify_one()) and only wakesNotifiedfutures created before the call, a notification fired before the future exists is lost — and the consumer then awaits forever:This is reachable in practice because observing
Populatingat (A) is the common case: the consumer does only synchronous work between dropping the sender and the read at (A), while the populator must first be woken from its parkedcollect(), scheduled, and acquire the write lock — so the consumer typically reaches (A) before the state flips.The fix creates the
Notifiedfuture at (A), while the read lock is still held. The populator cannot callnotify_waiters()until it takes the write lock (blocked while the read lock is held), so the future always precedes anynotify_waiters()and the notification is guaranteed to be delivered.