fix(proxy): clear transfer latch after unsafe attempt#24979
Conversation
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
There was a problem hiding this comment.
Pull request overview
This PR fixes a proxy tunnel transfer deadlock where Deliver() could mark a tunnel as inTransfer, but an unsafe async transfer attempt could return early before finishTransfer() cleared the latch—causing subsequent drain polls to skip the tunnel indefinitely and blocking CN draining. It also makes transfer-intent metric updates idempotent and ensures sync transfers respect the same inTransfer latch to prevent overlapping migration attempts.
Changes:
- Make
setTransferIntentidempotent via atomic swap to avoid double-inc/dec of the transfer-intent gauge. - Add explicit helpers to acquire/release the
inTransferlatch for transfer attempts, and ensure unsafe async early-return clears the latch. - Ensure
transferSyncalso acquires/releases the latch and add tests covering unsafe async retryability, sync latch behavior, and metric idempotency.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| pkg/proxy/tunnel.go | Clears inTransfer on unsafe async early-return, adds sync transfer latch acquisition, and makes transfer-intent gauge updates idempotent. |
| pkg/proxy/tunnel_test.go | Adds tests for retry behavior after unsafe transfer, sync latch behavior, and idempotent gauge updates. |
XuPeng-SH
left a comment
There was a problem hiding this comment.
I found one substantive unhappy-path issue.
In pkg/proxy/tunnel.go, the unsafe async transfer path now calls finishTransferAttempt() before setTransferIntent(true). That opens a gap where another Deliver(..., transferByRebalance) can reacquire the latch and overwrite transferType to rebalance. On passive-policy proxies, setTransferIntent(true) then becomes a no-op, so the original scaling attempt loses the intent that should drive the later sync retry.
This means the stuck-drain fix can still be bypassed under scaling+rebalance interleaving. Please preserve/publish the scaling intent while the attempt still owns the latch, or make the intent update use the attempted transfer type instead of rereading mutable t.transferType.
What type of PR is this?
Which issue(s) this PR fixes:
issue #24942
What this PR does / why we need it:
When a proxy scaling transfer cannot start safely, for example because the session is still in a transaction, the async transfer attempt returned before
finishTransferwas registered.Deliverhad already marked the tunnel asinTransfer, so later drain polls skipped the tunnel forever and the CN drain could remain blocked until the connection closed.This PR clears the transfer latch on that unsafe early-return path while preserving
transferIntent, so later retries or intent-driven transfer can still happen. It also makes transfer intent metric updates idempotent and has sync transfer acquire the same transfer latch to avoid overlapping async/sync migration attempts.Tests cover:
inTransferand can be re-enqueued;