Fix AIO callback from_api completion lifetime#13151
Conversation
71db5be to
4142430
Compare
The AIOCallback io_complete handler used a member variable, from_api, to determine whether to delete itself. The problem is that in the situation where in the course of processing the function `this` was already deleted, the use of the from_api variable was, by definition, a use after free. If built under ASan, this resulted in a use-after-free assertion. Concretely, we were seeing this in docs during cache stripe initialization after an unclean shutdown. If the on-disk cache directory was dirty, startup recovery scanned the data area, cleared directory entries for the uncertain range, and wrote the repaired directory back out. The temporary AIO callbacks for that recovery wrote live in StripeInitInfo. When the recovery write completion was delivered, StripeSM::handle_recover_write_dir() deleted StripeInitInfo, which destroyed the AIOCallback object whose AIOCallback::io_complete() frame was still returning. At that point, the use of from_api was a use after free. This snapshots the API-owned callback flag before dispatching the completion and uses that local value for the post-callback cleanup. This also adds a focused regression test for completion handlers that release the callback owner before AIOCallback::io_complete() returns. Introduced in apache#13027
4142430 to
e29f654
Compare
There was a problem hiding this comment.
Pull request overview
Fixes a use-after-free in AIOCallback::io_complete() when the completion handler deletes the callback owner (and thus destroys the AIOCallback) before io_complete() finishes returning. This is in the iocore AIO subsystem and impacts both internal cache recovery codepaths and API-originated AIO completions.
Changes:
- Snapshot
from_apiinto a local before dispatching the completion event, avoiding post-callback member access after potential self-destruction. - Add a focused Catch2 unit test that deletes the completion owner inside the completion handler.
- Wire the new unit test into the
src/iocore/aioCMake test targets.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/iocore/aio/AIO.cc | Avoids UAF by copying from_api prior to invoking the completion handler. |
| include/iocore/aio/AIO.h | Adds a clarifying comment describing from_api ownership semantics. |
| src/iocore/aio/CMakeLists.txt | Adds test_AIOCallback unit test target and registers it with Catch2 test runner. |
| src/iocore/aio/unit_tests/test_AIOCallback.cc | New regression test for owner deletion during io_complete() completion dispatch. |
| ink_hrtime sleep_time = 0; | ||
| bool from_api = false; | ||
| SLINK(AIOCallback, alink); /* for AIO_Reqs::aio_temp_list */ | ||
| bool from_api = false; ///< Whether the TS API created this callback. |
There was a problem hiding this comment.
Extremely nitpick, but this doxygen comment feels misplaced. Use a regular comment, but honestly, I think the variable name itself is self documenting.
| auto owner = new AIOCompletionOwner(completed); | ||
| auto callback = &owner->callback; | ||
|
|
||
| CHECK(callback->io_complete(EVENT_NONE, nullptr) == EVENT_DONE); |
There was a problem hiding this comment.
Somewhat interesting thought from Claude, a comment may be useful? Again, nitpick.
The test only catches the regression under ASan/MSan. The owner is destroyed inside
handle_aio_complete, so on broken code the post-dispatch if (from_api) reads freed memory — but since
from_api == false by default, the branch happens to be skipped and the test passes. Without sanitizers,
this test would not have caught the original bug. The CI runs ASan, so this is acceptable, but consider
noting it in a comment so someone running the test locally without ASan understands why a passing result
isn't conclusive. Even better: also assert the from_api == true path (e.g., a second test where the owner
does not delete itself and the callback is dynamically allocated with from_api = true, verifying the
snapshot still triggers the self-delete).
The AIOCallback io_complete handler used a member variable, from_api, to
determine whether to delete itself. The problem is that in the situation
where in the course of processing the function
thiswas alreadydeleted, the use of the from_api variable was, by definition, a use
after free. If built under ASan, this resulted in a use-after-free
assertion.
Concretely, we were seeing this in docs during cache stripe
initialization after an unclean shutdown. If the on-disk cache
directory was dirty, startup recovery scanned the data area, cleared
directory entries for the uncertain range, and wrote the repaired
directory back out. The temporary AIO callbacks for that recovery wrote
live in StripeInitInfo. When the recovery write completion was
delivered, StripeSM::handle_recover_write_dir() deleted StripeInitInfo,
which destroyed the AIOCallback object whose AIOCallback::io_complete()
frame was still returning. At that point, the use of from_api was a use
after free.
This snapshots the API-owned callback flag before dispatching the
completion and uses that local value for the post-callback cleanup.
This also adds a focused regression test for completion handlers that
release the callback owner before AIOCallback::io_complete() returns.
Introduced in #13027