[BUG-447] Idempotency bug: OOO loop when response is lost but subsequent batches succeed #448

Merged: luoyuxia merged 2 commits into apache:main from fresh-borzoni:fix-ooo-sequence-lost-response on Mar 21, 2026

@fresh-borzoni fresh-borzoni commented Mar 20, 2026

Summary

closes #447

Fix idempotent writer infinite retry on OutOfOrderSequenceException when response is lost

When a batch response is lost (e.g. due to a timeout) but the server has committed the batch, and subsequent higher-sequence batches are then acked, retrying the original batch causes the server to return OutOfOrderSequenceException. The client's !is_next retry heuristic incorrectly treats this as retriable, so it loops until retries are exhausted and then fails a batch that was in fact committed.

Kafka handles this at the protocol level with epoch bumping. We don't need that here, so the fix is at the client level.

Fix: before checking can_retry, if the batch sequence <= last_acked_sequence, complete it as success - a higher-sequence ack guarantees the batch was committed (server enforces sequential ordering).
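
The ordering of the two checks can be sketched in Rust as follows. This is a minimal illustration, not the actual client code: `BucketState`, `Outcome`, `handle_out_of_order_sequence`, and the `retries_left` counter are hypothetical names, and the real `can_retry` logic (including the `!is_next` heuristic) is reduced here to a simple retry budget.

```rust
/// What to do with a batch whose send failed with an out-of-order
/// sequence error. Hypothetical type, for illustration only.
#[derive(Debug, PartialEq)]
enum Outcome {
    /// The batch is known to be committed; complete it as a success.
    CompleteAsSuccess,
    /// Re-enqueue the batch for another attempt.
    Retry,
    /// Give up and fail the batch.
    Fail,
}

/// Simplified per-bucket idempotence state (hypothetical).
struct BucketState {
    /// Highest batch sequence the server has acknowledged so far, if any.
    last_acked_sequence: Option<i32>,
}

/// Decide how to handle an out-of-order sequence error for one batch.
///
/// The "already committed" check runs BEFORE the retry check. Because the
/// server enforces sequential ordering, an ack for a higher sequence implies
/// that every lower sequence was committed as well.
fn handle_out_of_order_sequence(
    state: &BucketState,
    batch_sequence: i32,
    retries_left: u32,
) -> Outcome {
    // Step 1: a batch at or below the last acked sequence was committed;
    // only its response was lost.
    if let Some(last_acked) = state.last_acked_sequence {
        if batch_sequence <= last_acked {
            return Outcome::CompleteAsSuccess;
        }
    }

    // Step 2: otherwise fall back to the retry decision. The real client
    // applies a richer heuristic here; this sketch only checks the budget.
    if retries_left > 0 {
        Outcome::Retry
    } else {
        Outcome::Fail
    }
}
```

For example, with `last_acked_sequence` at 7, a failed batch with sequence 5 is completed as success regardless of the remaining retry budget, while a batch with sequence 8 still goes through the normal retry path.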

Backport of apache/fluss#2827

@fresh-borzoni fresh-borzoni changed the title [BUG-447] Idempotency bug: OOO exception when response is lost but su… [BUG-447] Idempotency bug: OOO loop when response is lost but subsequent batches succeed Mar 20, 2026
@leekeiabstraction leekeiabstraction left a comment

Nice catch and thank you for the PR!

I think this will be challenging to test (manually), is there a way to reproduce this to ensure that the change works? Curious about how this was detected in the first place.

Does the Java side need the same change, given that we mostly base this on the Java implementation?

Nvm: just saw the Java side's PR. TY.

@fresh-borzoni (Contributor, Author) commented

@leekeiabstraction Ty for the review. There is a unit test that checks this scenario, and I also updated the comment.
PTAL 🙏

@luoyuxia luoyuxia left a comment

+1

@luoyuxia luoyuxia merged commit e26702e into apache:main Mar 21, 2026
4 checks passed
Linked issue: Idempotent writer enters infinite retry loop with OOO exception when response is lost but subsequent batches succeed
