fix: check lock table binds during load#24898
Conversation
b502ee7 to
ff15822
Compare
There was a problem hiding this comment.
Pull request overview
This PR addresses a failure mode where long-running LOAD DATA operations may hold stale lock-table bind metadata and only discover a bind change at commit time (after significant work). It introduces a transaction-operator-level bind check (throttled) and invokes it during LOAD execution so bind changes are detected earlier and fail fast with ErrLockTableBindChanged.
Changes:
- Add
TxnOperator.CheckLockTableBinds(ctx)and implement throttled, “sticky” bind-change detection intxnOperator. - Add
LockService.GetLatestLockTableBind(bind)to fetch allocator-latest bind and use it for bind validation. - Invoke bind checks in LOAD execution paths and add tests covering bind-change detection, stickiness, and throttle reset on operator reuse.
Reviewed changes
Copilot reviewed 15 out of 17 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/vm/engine/test/testutil/disttae_engine.go | Extend mock lockservice to satisfy new GetLatestLockTableBind API. |
| pkg/vm/engine/entire_engine_test.go | Extend test operator to satisfy new CheckLockTableBinds API. |
| pkg/txn/storage/memorystorage/storage_txn_client.go | Extend storage txn operator to satisfy new CheckLockTableBinds API. |
| pkg/txn/client/types.go | Extend TxnOperator interface with CheckLockTableBinds. |
| pkg/txn/client/operator.go | Implement throttled bind-check logic and sticky bind-changed behavior in txnOperator. |
| pkg/txn/client/operator_test.go | Add unit tests for bind-change detection, sticky failure, and init/reset of throttling state. |
| pkg/sql/compile/compile_test.go | Update txn operator mocks to expect CheckLockTableBinds. |
| pkg/sql/colexec/external/external.go | Call txn bind check before reading external batches for LOAD. |
| pkg/lockservice/types.go | Extend lockservice interface with GetLatestLockTableBind. |
| pkg/lockservice/service_observability.go | Implement allocator-latest bind fetch via existing GetBind path. |
| pkg/incrservice/store_sql_test.go | Extend test lockservice/txn operator stubs to satisfy new interfaces. |
| pkg/frontend/txn_test.go | Extend frontend test txn operator to satisfy new CheckLockTableBinds API. |
| pkg/frontend/test/txn_mock.go | Regenerate/update mock txn operator with CheckLockTableBinds. |
| pkg/frontend/test/mock_lock/types.go | Regenerate/update mock lockservice with GetLatestLockTableBind. |
| pkg/frontend/mysql_cmd_executor.go | Add bind checks around LOAD LOCAL read/write loop. |
| pkg/cdc/table_change_stream_test.go | Extend noop txn operator to satisfy new CheckLockTableBinds API. |
| pkg/bootstrap/service_test.go | Extend bootstrap test txn operator to satisfy new CheckLockTableBinds API. |
Files not reviewed (2)
- pkg/frontend/test/mock_lock/types.go: Language not supported
- pkg/frontend/test/txn_mock.go: Language not supported
aunjgr
left a comment
There was a problem hiding this comment.
Solid fix. Throttled at 3-min, sticky on first detection, uses allocator-latest bind (bypasses local stale cache). The lockTables snapshot outside the mutex is safe since lock tables are append-only during a txn lifetime. Tests cover changed bind, stale local cache, sticky failure, and throttle reset on reuse.
XuPeng-SH
left a comment
There was a problem hiding this comment.
I did not confirm a blocking logic bug in the implementation itself, but I still think the unhappy-path coverage is not strong enough for this change.
The new bind-check behavior is added to multiple critical LOAD paths, but several of the newly introduced failure/guard branches are still effectively untested:
-
processLoadLocalnow callsCheckLockTableBinds()before the first packet read and again inside the main loop, but the current load-local tests buildExecCtxwithproc == nil, so both new call sites are dead in test coverage. We still need coverage for bind-change failure before the first read and after at least one loop iteration. -
txnOperator.CheckLockTableBinds()adds important unhappy-path state handling that is currently not pinned: RPC error should clear the throttle timestamp so later calls retry; cancelled context should also clear the throttle; a closed txn should fail before any bind lookup; and a clean second call inside the 3-minute window should no-op instead of re-querying. These branches are now part of the core contract for the new throttled check. -
External.Call()now propagates bind-check failures for LOAD, but there is still no direct test that aCheckLockTableBinds()error is surfaced from the external LOAD path.
Since this PR is specifically about failing early on stale bind changes during long-running LOAD, I think these negative paths should be covered before approval.
cdb916d to
caeba74
Compare
XuPeng-SH
left a comment
There was a problem hiding this comment.
I re-checked the latest revision and I do not see a blocking issue now.
The earlier unhappy-path coverage gaps are addressed in the current head: processLoadLocal now has bind-check failure coverage both before the first read and after entering the loop; txnOperator.CheckLockTableBinds now has direct tests for RPC error, canceled context, closed txn, sticky failure, throttle reset, and clean throttled re-check behavior; and the external LOAD path now has a direct bind-check error propagation test.
LGTM.
LeftHandCold
left a comment
There was a problem hiding this comment.
LGTM. I reviewed the latest head around the new LOAD-time bind checks, the txn-level throttled/sticky CheckLockTableBinds implementation, and the added negative-path tests. The previous coverage gaps are addressed, and the PR checks are green.
Merge Queue Status
This pull request spent 49 minutes 48 seconds in the queue, with no time running CI. Waiting for
All conditions
ReasonThe merge conditions cannot be satisfied due to failing checks Failing checks:
HintYou may have to fix your CI before adding the pull request to the queue again. |
Merge Queue Status
This pull request spent 23 seconds in the queue, including 2 seconds running CI. Required conditions to merge
|
What type of PR is this?
Which issue(s) this PR fixes:
issue #24897
What this PR does / why we need it:
A long-running
LOAD DATAcan hold stale lock table bind information while data is being read and written. If the lock-table owner CN shuts down during that window, the transaction may only discover the bind change at commit time and fail after a long run.This change adds a LOAD-time lock table bind check before external batches are read. The check is throttled at the txn operator level, so concurrent LOAD readers in the same transaction share the same interval. It reuses the existing
GetLockTableBindpath instead of adding a new lockservice/proto RPC. Once a transaction detects a bind change, subsequent checks in the same transaction keep returningErrLockTableBindChanged.Tests cover changed bind detection, sticky bind-change failure, and reset of the txn-level throttle state when a txn operator is reused.