watch: Fix the deadlock between in-process client-server due to cancellation storm #21065
zhijun42 wants to merge 1 commit into etcd-io:main from
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: zhijun42. The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Hi @zhijun42. Thanks for your PR. I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with. Once the patch is verified, the new status will be reflected by the. I understand the commands that are listed here.

Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test |
Haven't dug deep, but let me list the things I would look into when reviewing this code:
Codecov Report: ❌ Patch coverage is
Additional details and impacted files
... and 20 files with indirect coverage changes

@@ Coverage Diff @@
## main #21065 +/- ##
==========================================
+ Coverage 68.41% 68.57% +0.15%
==========================================
Files 429 429
Lines 35242 35275 +33
==========================================
+ Hits 24111 24189 +78
+ Misses 9740 9708 -32
+ Partials 1391 1378 -13

Continue to review full report in Codecov by Sentry.
/retest |
1 similar comment
/retest |
@serathius Thanks for the comprehensive list!
In the previous progress issue, there was a race between watch events and control responses, and wrong ordering of revisions could happen. That is fundamentally different from this issue, where a client sending watch requests much faster than the server's processing rate can cause a deadlock. For ease of reasoning, you can ignore the data responses channel in the server and just focus on how the control channel fills up and causes a deadlock in the client-server data exchange.
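As a toy model of that failure shape (hypothetical Go, not etcd code): each side blocks sending into the other's already-full channel, so neither side ever drains its inbox, which is the deadlock. The timeouts below exist only so the demo terminates instead of hanging:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	reqc := make(chan int, 1)  // client -> server control requests
	respc := make(chan int, 1) // server -> client responses
	reqc <- 0                  // both buffers start full
	respc <- 0

	done := make(chan struct{})
	go func() { // "server": tries to send a response before reading requests
		respc <- 1
		close(done)
	}()
	// "client": tries to send a request before reading responses
	select {
	case reqc <- 1:
	case <-time.After(100 * time.Millisecond):
		fmt.Println("client send would block: both sides stuck (deadlock)")
	}
	select {
	case <-done:
	case <-time.After(100 * time.Millisecond):
		fmt.Println("server send also blocked")
	}
}
```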
I assume you meant to say "watch requests" (I'm not dropping responses). You're right; I just realized that dropping create requests is incorrect and would leave the system in a broken state where it can no longer serve create requests. I refactored my changes to have two separate send channels and to always keep the watch create requests. I also added an integration test.
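The two-channel policy described here could be sketched roughly as follows; the `watchRequest`/`sender` types and the `enqueue` helper are hypothetical stand-ins for illustration, not the actual etcd client code:

```go
package main

import "fmt"

// watchRequest is a stand-in for the client's watch request types.
type watchRequest struct{ kind string } // "create", "cancel", or "progress"

type sender struct {
	sendCreatec     chan watchRequest // create requests: must never be dropped
	sendBestEffortc chan watchRequest // cancel/progress: best-effort
}

// enqueue applies the policy: create requests are always queued (blocking if
// the buffer is full), while best-effort requests are dropped when the buffer
// is full so the run loop never blocks on them.
func (s *sender) enqueue(r watchRequest) bool {
	if r.kind == "create" {
		s.sendCreatec <- r
		return true
	}
	select {
	case s.sendBestEffortc <- r:
		return true
	default: // buffer full: drop the best-effort request
		return false
	}
}

func main() {
	s := &sender{
		sendCreatec:     make(chan watchRequest, 4),
		sendBestEffortc: make(chan watchRequest, 2),
	}
	dropped := 0
	for i := 0; i < 5; i++ {
		if !s.enqueue(watchRequest{kind: "progress"}) {
			dropped++
		}
	}
	s.enqueue(watchRequest{kind: "create"})
	fmt.Println("dropped:", dropped, "creates queued:", len(s.sendCreatec))
	// → dropped: 3 creates queued: 1
}
```

The key design point is the `select` with a `default` branch: under pressure, losing a progress or cancel request is tolerable, but losing a create request would leave the caller blocked forever waiting for its created response.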
I'm not sure what kind of robustness test would bring additional value beyond the integration test I just added.
Good point! I did have to spend many days reading through the codebase to understand the watch implementation. Unfortunately, I couldn't find an effective way to refactor it. That being said, I did my best to:
Downsides:
Alternatives:
Is this the only solution?
And I see the test case
… storm

Signed-off-by: Zhijun <dszhijun@gmail.com>

1fe01bf to d82c5e2 (Compare)
/retest |
@zhijun42: The following test failed, say

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Significantly reduced the diffs for this PR. Ran into a new flaky test. I can't reproduce this error, even under extreme conditions (1 CPU, 512MB memory, GOMAXPROCS=1).
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions. |
Solves issues #18879 and #20716
This change refactors the client-side watch stream implementation to make sending watch requests (create / cancel / progress) fully asynchronous and safe from deadlock under the in-process client-server transport.
Instead of sending directly from the main `watchGRPCStream.run` loop, watch requests are now enqueued (in a non-blocking way) into bounded buffered channels `sendCreatec` and `sendBestEffortc`, and handled by a dedicated `sendLoop` goroutine. I renamed the `serveWatchClient` function to `recvLoop`; this recvLoop/sendLoop design now mirrors the server-side implementation.

The reason for having two channels for sending client requests is the following: we have three types of watch requests: create / cancel / progress. A create request has to be delivered to and processed by the server, and the user client simply blocks until it receives the created response. The cancel / progress requests, however, are best-effort. This distinction matters because, when the watch system is under high pressure, we need to keep the create requests in channel `sendCreatec` while dropping the best-effort ones out of channel `sendBestEffortc`.

At any moment, there is only one `sendLoop` goroutine running. When the connection is disrupted, `recvLoop` exits with an error, and the main `watchGRPCStream.run` loop closes the `sendc` queue channel, waits for `sendLoop` to exit, and then spins up a new recvLoop/sendLoop pair, the same as in the server-side implementation.

I added an integration test `TestV3WatchCancellationStorm` to show that the system works fine even when there are many watch cancel and create requests.

Once the previous PR #21064 that reproduces the deadlock issue is reviewed and merged, I'll rebase this one and update that reproduction test as well. I separated the PRs to make code review easier.
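The close-then-restart sequence described above could be sketched like this; `stream`, `sendc`, and the string request type are simplified stand-ins for illustration, not the actual etcd types:

```go
package main

import (
	"fmt"
	"sync"
)

// stream models the restart sequence: on a recv error, the run loop closes
// the send queue, waits for sendLoop to drain and exit, then starts a fresh
// loop on a new channel, so at most one sendLoop runs at any moment.
type stream struct {
	sendc chan string
	wg    sync.WaitGroup
}

func (s *stream) startSendLoop() {
	s.wg.Add(1)
	go func() {
		defer s.wg.Done()
		for req := range s.sendc { // exits when sendc is closed
			_ = req // a real implementation would write to the gRPC stream
		}
	}()
}

func (s *stream) restart() {
	close(s.sendc) // signal sendLoop to stop after draining buffered requests
	s.wg.Wait()    // wait for it to exit before starting a replacement
	s.sendc = make(chan string, 8)
	s.startSendLoop()
}

func main() {
	s := &stream{sendc: make(chan string, 8)}
	s.startSendLoop()
	s.sendc <- "create watch"
	s.restart()                   // simulate recvLoop exiting with an error
	s.sendc <- "re-create watch" // the new sendLoop accepts requests again
	fmt.Println("restarted ok")
}
```

Closing the channel (rather than sending a sentinel value) lets `sendLoop` first drain any buffered requests before exiting, and `wg.Wait()` guarantees the old goroutine is gone before the replacement starts.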