Skip to content

Fix batch reads hanging after digest mismatch retries are ignored#4789

Open
void-ptr974 wants to merge 1 commit intoapache:masterfrom
void-ptr974:fix_batch_read_block
Open

Fix batch reads hanging after digest mismatch retries are ignored#4789
void-ptr974 wants to merge 1 commit intoapache:masterfrom
void-ptr974:fix_batch_read_block

Conversation

@void-ptr974
Copy link
Copy Markdown
Contributor

Motivation

Batch reads could hang or fail after receiving a corrupt response from a bookie. The request was marked complete before digest verification, so although a digest mismatch triggered a retry
on another replica, the retry response could be ignored because the request was already completed.

Changes

  • Verify all entries in a batch response before completing the read request.
  • Keep the request incomplete on digest mismatch so the existing retry path can read the same batch from the next replica.
  • Avoid retaining partially verified entries when one entry in the batch fails digest verification.
  • Clarify the ByteBufList ownership and release paths with comments.

Tests

Added coverage for the batch-read digest mismatch retry path:

  • testDigestMismatchRetriesNextReplicaAndCompletes

    • Replaces the first replica with a real bookie that corrupts read responses.
    • Verifies that a digest mismatch does not complete the batch request prematurely.
    • Confirms the same batch is retried from another replica and completes with the original entry data.
  • testDigestMismatchAfterPartialVerificationDoesNotRetainEntries

    • Reads a two-entry batch where the first entry verifies successfully but the second entry is corrupted.
    • Verifies that no partially verified entries are retained after the digest mismatch.
    • Confirms the whole batch is retried from another replica and both returned entries match the original data.

Also ran the full TestBatchedRead suite to ensure existing batch-read behavior remains unchanged, including normal reads, partial batch responses, missing entries, and failed bookie
scenarios.

  Batch reads marked the request as complete before verifying the
  digest of each returned entry. If one entry in the response failed
  digest verification, the operation attempted to retry on another
  replica, but the request had already been completed, so the retry
  response could be ignored and the batch read could hang or fail.

  Verify all entries in the batch before completing the request. On
  digest mismatch, leave the request incomplete, discard the whole
  response, and retry the same batch on the next replica. Only create
  LedgerEntryImpl instances after the full batch has passed digest
  verification, so no partially verified entries are retained.

  Add tests for retrying after a corrupt batch response and for the
  case where an earlier entry verifies successfully but a later entry
  fails digest verification.
@void-ptr974
Copy link
Copy Markdown
Contributor Author

@hangc0276 @horizonzy PTAL.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant