MDEV-37974 Avoid bogus deadlock in lock_rec_insert_check_and_lock()#4672
MDEV-37974 Avoid bogus deadlock in lock_rec_insert_check_and_lock()#4672arcivanov wants to merge 1 commit into
Conversation
23599ab to
d9e12c5
Compare
c04ea78 to
df10682
Compare
0171d8d to
a99f1e0
Compare
gkodinov
left a comment
There was a problem hiding this comment.
Thank you for your contribution. This is a preliminary review. Please stand by for the final review.
dr-m
left a comment
There was a problem hiding this comment.
Thank you, this is interesting. But I think that we should be very careful here. We have been bitten by MDEV-27025 in the past. I think that we would need some additional testing in the style of https://mariadb.com/resources/blog/isolation-level-violation-testing-and-debugging-in-mariadb/ because our regular stress testing does not cover consistency or isolation very well.
a8e2ff9 to
4d36da5
Compare
|
I just posted this comment with some notes on how I think this would have to be tested. Unfortunately, I cannot currently spend time on any thorough review that would be needed here. |
dr-m
left a comment
There was a problem hiding this comment.
This needs more testing than is possible within the mtr framework.
|
@arcivanov there's a question for you from Marko here. It's been a while. Do you intend to keep working on this PR? If not, please convert to draft. |
|
@gkodinov I'm working on all my bugs right now including HEAP. |
|
@gkodinov Which Marko's question are you talking about? Per JIRA Marko asked me if I can hook up the whole of MariaDB to Jepsen and I don't see any other questions:
It's a pretty large task, I'll need to figure out if I can do that. |
|
@dr-m So I have the Jepsen cleaned up and modernized MariaDB/jepsen-mariadb#1 My question is what kinds of tests do you want? Jepsen works probabilistically and my current MTR already covers the known scenarios. |
| if (lock_t *c_lock= lock_rec_other_has_conflicting(type_mode, | ||
| g.cell(), id, | ||
| heap_no, trx)) | ||
| { | ||
| trx->mutex_lock(); | ||
| err= lock_rec_enqueue_waiting(c_lock, type_mode, id, block->page.frame, | ||
| heap_no, index, thr, nullptr); | ||
| trx->mutex_unlock(); | ||
| const ulint pred_heap_no= comp | ||
| ? rec_get_heap_no_new(rec) : rec_get_heap_no_old(rec); | ||
|
|
||
| /* MDEV-37974: If the first conflicting lock is WAITING and | ||
| we hold a granted X lock on the successor record, the | ||
| waiting lock is necessarily blocked behind our lock in the | ||
| queue and can never be granted while our lock exists. |
There was a problem hiding this comment.
This logic may also be executed both on primary key indexes (clustered index) and on secondary indexes.
The test case in MDEV-37974 seems to involve both types of indexes; the FOREIGN KEY implies a secondary index on child.parent_id.
My main concern with changes to locking is that a regression such as MDEV-27992 might be introduced. The scenario of that bug involved an UPDATE of a PRIMARY KEY column, which is a rare but possible operation.
I assume that there is an automated tool that produces such reports, because I have seen several reports that use similar notation. However, I don’t know what that tool is. This bug was not found in our own testing. I would hope that Jepsen would be able to catch that bug, but I haven’t tried it myself.
There was a problem hiding this comment.
We will be using the TROC automated tool to test this patch under both the Read Committed (RC) and Repeatable Read (RR) isolation levels.
TROC is the same framework that successfully discovered the transaction anomaly in MDEV-27992.
There was a problem hiding this comment.
I have created a txn-tester https://github.com/arcivanov/txn-tester
which combines TROC, Fucci and APTrans (projects backing followon
papers to TROC) into a single test container and standardizes their
behavior. Feel free to adopt it into MariaDB if you wish (it's under
Apache 2.0 but I'll transfer the copyright/change license if you do so).
I also created a PR into buildbot to take advantage of the txn-tester for automated CI testing: MariaDB/buildbot#975. If you adopt the txn-tester into MariaDB the GHCR location of the image will change, but otherwise it should work.
Finally, I ran the MDEV-37974 against the tool and here are the results and the report based on the results:
MDEV-37974 — Transaction/isolation fuzz testing of the lock_rec_insert_check_and_lock() fix
Summary
The MDEV-37974 branch (commit 4d36da5, "Avoid bogus deadlock in
lock_rec_insert_check_and_lock()", on top of 10.11.17) was built as a debug
server and exercised with three independent transaction/isolation anomaly
oracles — TROC, Fucci, and APTrans — running concurrently against it.
The same tools were then run against a baseline binary built from the
identical tree with only storage/innobase/lock/lock0lock.cc reverted to its
parent, to attribute every finding.
Result: no isolation regression attributable to the change. Every anomaly the
tools reported also appears on the unmodified baseline (and TROC/Fucci reported
more on the baseline). The debug server (UNIV_DEBUG, _GLIBCXX_DEBUG,
SAFE_MUTEX) raised no assertion, signal, or InnoDB invariant failure during
any run. The fix's own deterministic MTR tests pass.
Findings overview (patched vs. baseline)
| Tool | Patched (with fix) | Baseline (no fix) | Verdict |
|---|---|---|---|
| TROC — logged inconsistencies | 0 | 2 | Noise — fewer with the fix |
| Fucci — logged inconsistencies | 1 | 2 | Noise (no INSERT in schedule) |
| APTrans — SERIALIZABLE anomalies | 35 | 29 | Oracle noise |
| APTrans — REPEATABLE READ anomalies | 33 | 21 | Oracle noise |
| APTrans — READ COMMITTED anomalies | 11 | 4 | Oracle noise |
| Server assertion / crash (debug) | none | none | Clean |
The full report (with environment, per-signal triage, MTR results, and reproduction steps) is in the zip.
|
Sorry for the delay on our end, @arcivanov. I've been tasked to find ways to get this contribution moving again. I should have some updates for you by the end of the week latest. In the meanwhile, I'd like to also offer some general statements on pull request processing cadence that should set expectations, as requested by @vuvova:
|
| err= lock_rec_enqueue_waiting(c_lock, type_mode, id, block->page.frame, | ||
| heap_no, index, thr, nullptr); | ||
| trx->mutex_unlock(); | ||
| const ulint pred_heap_no= comp |
There was a problem hiding this comment.
Please check this crash , i can repro this crash only on this PR
{code:sql}
--source include/have_innodb.inc
--source include/have_partition.inc
CREATE TABLE t1 (c1 VARCHAR(3073) UNIQUE,c2 DATE)Engine=InnoDB PARTITION BY KEY (c2) PARTITIONS 2;
INSERT IGNORE INTO t1 VALUES (1,1),(2,2);
ALTER IGNORE TABLE t1 FORCE;
{code}
{noformat:title=MDEV-37974 CS 10.11.17 4d36da565ae4e6d75f6bf21922b80f9e51f0b124 (Debug, Clang 22.1.8-20260622) Build 22/06/2026}
mariadbd: /data/MDEV-37974_dbg/storage/innobase/handler/ha_innodb.cc:15818: void trx_t::reset_and_truncate_undo(const dict_table_t *): Assertion `mod_tables.size() == 0 || mod_tables.begin()->first == table' failed.
{noformat}
{noformat:title=MDEV-37974 CS 10.11.17 4d36da565ae4e6d75f6bf21922b80f9e51f0b124 (Debug, Clang 22.1.8-20260622) Build 22/06/2026}
Core was generated by `/data/MDEV-37974-MD220626-mariadb-10.11.17-linux-x86_64-dbg/bin/mariadbd --defa'.
Program terminated with signal SIGABRT, Aborted.
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>)at ./nptl/pthread_kill.c:44
[Current thread is 1 (LWP 4072395)]
(gdb) bt
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>)at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=<optimized out>)at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6)at ./nptl/pthread_kill.c:89
#3 0x000073129b04527e in __GI_raise (sig=sig@entry=6)at ../sysdeps/posix/raise.c:26
#4 0x000073129b0288ff in __GI_abort () at ./stdlib/abort.c:79
#5 0x000073129b02881b in __assert_fail_base (fmt=0x73129b1d01e8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x5607958c4119 "mod_tables.size() == 0 || mod_tables.begin()->first == table", file=file@entry=0x5607958bbc3a "/data/MDEV-37974_dbg/storage/innobase/handler/ha_innodb.cc", line=line@entry=15818, function=function@entry=0x5607958c40df "void trx_t::reset_and_truncate_undo(const dict_table_t *)") at ./assert/assert.c:96
#6 0x000073129b03b517 in __assert_fail (assertion=0x5607958c4119 "mod_tables.size() == 0 || mod_tables.begin()->first == table", file=0x5607958bbc3a "/data/MDEV-37974_dbg/storage/innobase/handler/ha_innodb.cc", line=15818, function=0x5607958c40df "void trx_t::reset_and_truncate_undo(const dict_table_t *)") at ./assert/assert.c:105
#7 0x00005607950af15d in trx_t::reset_and_truncate_undo (this=0x73129a923b80, table=0x7311a41a1950)at /data/MDEV-37974_dbg/storage/innobase/handler/ha_innodb.cc:15818
#8 0x00005607950967ab in ha_innobase::extra (this=0x7311a406ce58, operation=HA_EXTRA_END_COPY)at /data/MDEV-37974_dbg/storage/innobase/handler/ha_innodb.cc:15927
#9 0x000056079501fe93 in extra_cb (h=0x7311a406ce58, operation=0x731298d689ac)at /data/MDEV-37974_dbg/sql/ha_partition.cc:8998
#10 0x0000560795020356 in ha_partition::loop_partitions_over_map (this=0x7311a406c5b8, part_map=0x7311a406ee20, callback=0x56079501fe70 <extra_cb(handler*, void*)>, param=0x731298d689ac)at /data/MDEV-37974_dbg/sql/ha_partition.cc:9710
#11 0x0000560795016f87 in ha_partition::loop_partitions (this=0x7311a406c5b8, callback=0x56079501fe70 <extra_cb(handler*, void*)>, param=0x731298d689ac)at /data/MDEV-37974_dbg/sql/ha_partition.cc:9658
#12 0x000056079501fdb9 in ha_partition::extra (this=0x7311a406c5b8, operation=HA_EXTRA_END_COPY)at /data/MDEV-37974_dbg/sql/ha_partition.cc:9496
#13 0x0000560794a7dc09 in copy_data_between_tables (thd=0x7311a4000d58, from=0x7311a409db58, to=0x7311a4058e68, ignore=true, order_num=0, order=0x0, copied=0x731298d6a280, deleted=0x731298d6a278, alter_info=0x731298d6d060, alter_ctx=0x731298d6b0f8)at /data/MDEV-37974_dbg/sql/sql_table.cc:12475
#14 0x0000560794a74c0c in mysql_alter_table (thd=0x7311a4000d58, new_db=0x7311a4005908, new_name=0x7311a4005d68, create_info=0x731298d6d1d0, table_list=0x7311a4014758, recreate_info=0x731298d6c8b0, alter_info=0x731298d6d060, order_num=0, order=0x0, ignore=true, if_exists=false)at /data/MDEV-37974_dbg/sql/sql_table.cc:11597
#15 0x0000560794b23add in Sql_cmd_alter_table::execute (this=0x7311a4014e78, thd=0x7311a4000d58) at /data/MDEV-37974_dbg/sql/sql_alter.cc:688
#16 0x0000560794975a00 in mysql_execute_command (thd=0x7311a4000d58, is_called_from_prepared_stmt=false)at /data/MDEV-37974_dbg/sql/sql_parse.cc:6201
#17 0x0000560794965978 in mysql_parse (thd=0x7311a4000d58, rawbuf=0x7311a4014670 "ALTER IGNORE TABLE t1 FORCE", length=27, parser_state=0x731298d6ea20) at /data/MDEV-37974_dbg/sql/sql_parse.cc:8223
#18 0x000056079496322a in dispatch_command (command=COM_QUERY, thd=0x7311a4000d58, packet=0x7311a40a07c9 "", packet_length=27, blocking=true) at /data/MDEV-37974_dbg/sql/sql_parse.cc:1924
#19 0x00005607949663fa in do_command (thd=0x7311a4000d58, blocking=true)at /data/MDEV-37974_dbg/sql/sql_parse.cc:1434
#20 0x0000560794b1987e in do_handle_one_connection (connect=0x5607d5190248, put_in_cache=true) at /data/MDEV-37974_dbg/sql/sql_connect.cc:1475
#21 0x0000560794b19665 in handle_one_connection (arg=0x5607d5180f48)at /data/MDEV-37974_dbg/sql/sql_connect.cc:1387
#22 0x000073129b09caa4 in start_thread (arg=<optimized out>)at ./nptl/pthread_create.c:447
#23 0x000073129b129c6c in clone3 ()at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
{noformat}
There was a problem hiding this comment.
This does not reproduce on my end after the rebase (7bcbcb65ab6), was fixed likely by MDEV-38928.
| return DB_SUCCESS; | ||
|
|
||
| DBUG_LOG("ib_lock", | ||
| "insert_check trx " << ib::hex(trx->id) |
There was a problem hiding this comment.
Another crash, which I only see on this PR, was not found in the latest ES and CS versions
Crashing SQL
CREATE TEMPORARY TABLE t IGNORE AS SELECT 1 UNION SELECT 2;
{noformat:title=MDEV-37974 CS 10.11.17 4d36da565ae4e6d75f6bf21922b80f9e51f0b124 (Debug, Clang 22.1.8-20260622) Build 22/06/2026}
mariadbd: /data/MDEV-37974_dbg/storage/innobase/handler/ha_innodb.cc:15817: void trx_t::reset_and_truncate_undo(const dict_table_t *): Assertion `undo_no <= 1' failed.
{noformat}
{noformat:title=MDEV-37974 CS 10.11.17 4d36da565ae4e6d75f6bf21922b80f9e51f0b124 (Debug, Clang 22.1.8-20260622) Build 22/06/2026}
Core was generated by `/data/MDEV-37974-MD220626-mariadb-10.11.17-linux-x86_64-dbg/bin/mariadbd --no-d'.
Program terminated with signal SIGABRT, Aborted.
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>)at ./nptl/pthread_kill.c:44
[Current thread is 1 (LWP 4162121)]
(gdb) bt
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>)at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=<optimized out>)at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6)at ./nptl/pthread_kill.c:89
#3 0x000077762ca4527e in __GI_raise (sig=sig@entry=6)at ../sysdeps/posix/raise.c:26
#4 0x000077762ca288ff in __GI_abort () at ./stdlib/abort.c:79
#5 0x000077762ca2881b in __assert_fail_base (fmt=0x77762cbd01e8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x6081487040d2 "undo_no <= 1", file=file@entry=0x6081486fbc3a "/data/MDEV-37974_dbg/storage/innobase/handler/ha_innodb.cc", line=line@entry=15817, f
unction=function@entry=0x6081487040df "void trx_t::reset_and_truncate_undo(const dict_table_t *)") at ./assert/assert.c:96
#6 0x000077762ca3b517 in __assert_fail (assertion=0x6081487040d2 "undo_no <= 1", file=0x6081486fbc3a "/data/MDEV-37974_dbg/storage/innobase/handler/ha_innodb.cc", line=15817, function=0x6081487040df "void trx_t::reset_and_truncate_undo(const dict_table_t *)") at ./assert/assert.c:10
5
#7 0x0000608147eef056 in trx_t::reset_and_truncate_undo (this=0x77762c4dbb80, table=0x7775400abd50)at /data/MDEV-37974_dbg/storage/innobase/handler/ha_innodb.cc:15817
#8 0x0000608147ed67ab in ha_innobase::extra (this=0x7775400ae758, operation=HA_EXTRA_END_COPY)at /data/MDEV-37974_dbg/storage/innobase/handler/ha_innodb.cc:15927
#9 0x0000608147761a58 in select_insert::prepare_eof (this=0x7775400154c0)at /data/MDEV-37974_dbg/sql/sql_insert.cc:4545
#10 0x00006081477645fc in select_create::send_eof (this=0x7775400154c0)at /data/MDEV-37974_dbg/sql/sql_insert.cc:5364
#11 0x000060814782af93 in do_select (join=0x777540016d58, procedure=0x0)at /data/MDEV-37974_dbg/sql/sql_select.cc:22654
#12 0x000060814782a254 in JOIN::exec_inner (this=0x777540016d58)at /data/MDEV-37974_dbg/sql/sql_select.cc:5023
#13 0x0000608147829613 in JOIN::exec (this=0x777540016d58)at /data/MDEV-37974_dbg/sql/sql_select.cc:4807
#14 0x0000608147808b6d in mysql_select (thd=0x777540000d58, tables=0x777540005078, fields=@0x777540005830: {<base_list> = {<Sql_alloc> = {<No data fields>}, first = 0x777540016870, last = 0x777540016870, elements = 1}, <No data fields>}, conds=0x0, og_num=0, order=0x0, group=0x0, hav
ing=0x0, proc_param=0x0, select_options=2199023255552, result=0x7775400154c0, unit=0x777540005028, select_lex=0x777540014fc0) at /data/MDEV-37974_dbg/sql/sql_select.cc:5285
#15 0x00006081478d7d52 in st_select_lex_unit::exec (this=0x777540005028)at /data/MDEV-37974_dbg/sql/sql_union.cc:2409
#16 0x00006081478d49a7 in mysql_union (thd=0x777540000d58, lex=0x777540004f50, result=0x7775400154c0, unit=0x777540005028, setup_tables_done_option=0)at /data/MDEV-37974_dbg/sql/sql_union.cc:42
#17 0x0000608147808587 in handle_select (thd=0x777540000d58, lex=0x777540004f50, result=0x7775400154c0, setup_tables_done_option=0)at /data/MDEV-37974_dbg/sql/sql_select.cc:591
#18 0x00006081478c01d8 in Sql_cmd_create_table_like::execute (this=0x777540013538, thd=0x777540000d58)at /data/MDEV-37974_dbg/sql/sql_table.cc:13104
#19 0x00006081477b5a00 in mysql_execute_command (thd=0x777540000d58, is_called_from_prepared_stmt=false)at /data/MDEV-37974_dbg/sql/sql_parse.cc:6201
#20 0x00006081477a5978 in mysql_parse (thd=0x777540000d58, rawbuf=0x777540013460 "CREATE TEMPORARY TABLE t IGNORE AS SELECT 1 UNION SELECT 2", length=58, parser_state=0x77762c097a20)at /data/MDEV-37974_dbg/sql/sql_parse.cc:8223
#21 0x00006081477a322a in dispatch_command (command=COM_QUERY, thd=0x777540000d58, packet=0x77754000aee9 "CREATE TEMPORARY TABLE t IGNORE AS SELECT 1 UNION SELECT 2", packet_length=58, blocking=true)at /data/MDEV-37974_dbg/sql/sql_parse.cc:1924
#22 0x00006081477a63fa in do_command (thd=0x777540000d58, blocking=true)at /data/MDEV-37974_dbg/sql/sql_parse.cc:1434
#23 0x000060814795987e in do_handle_one_connection (connect=0x608157900408, put_in_cache=true) at /data/MDEV-37974_dbg/sql/sql_connect.cc:1475
#24 0x0000608147959665 in handle_one_connection (arg=0x60815791ff68)at /data/MDEV-37974_dbg/sql/sql_connect.cc:1387
#25 0x000077762ca9caa4 in start_thread (arg=<optimized out>)at ./nptl/pthread_create.c:447
#26 0x000077762cb29c6c in clone3 ()at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
{noformat}
There was a problem hiding this comment.
This does not reproduce on my end after the rebase (7bcbcb65ab6), was fixed likely by MDEV-38928.
|
@arcivanov could you please rebase your changes and force push? I tested and the failures reported by @mariadb-SaahilAlam did not reproduce on a rebased version of your commit on top of |
When a transaction (TX1) holds a granted lock on a record and another transaction (TX2) is waiting for that same record, an `INSERT` by TX1 into the gap before that record would incorrectly enter `lock_wait()` on TX2's waiting lock, creating a false deadlock cycle. In `lock_rec_insert_check_and_lock()`, after `lock_rec_other_has_conflicting()` returns a conflicting lock, check whether: 1. the conflicting lock is **WAITING**, 2. we hold a **granted** X lock on the successor record, 3. TX2 has **not** already scanned through the gap from the predecessor side (predecessor check — see below), and 4. no other transaction holds a **GRANTED** lock that conflicts with our `INSERT_INTENTION`. Only skip the lock wait when all four conditions are met. **Predecessor check (condition 3)**: To prevent phantom reads, we must verify that TX2 has not already traversed the gap between the predecessor and successor records. If TX2 had scanned through this gap (e.g., via a range scan), TX2 would hold a granted lock on the predecessor, and allowing TX1 to insert in the gap could introduce a phantom visible to TX2 when it resumes. We check TX2's locks rather than TX1's because TX1 may only hold **implicit** locks on secondary index records. When a `DELETE` scans via the clustered (PRIMARY) index, only implicit locks (represented by the delete-mark on the record, with no explicit `lock_t` struct) exist on the secondary index. The successor's implicit lock gets converted to explicit by `lock_rec_convert_impl_to_expl_for_trx()` when TX2 conflicts with it, but the predecessor's implicit lock remains without an explicit `lock_t`. TX2, on the other hand, always acquires explicit locks during its scan via `sel_set_rec_lock()` → `lock_rec_lock(impl=false, ...)`, so its locks are reliably visible to `lock_rec_has_expl()`. The predecessor check has two cases: - **Same-page** (`pred_heap_no != PAGE_HEAP_NO_INFIMUM`): Check that TX2 has no granted lock on `pred_heap_no` using `lock_rec_has_expl(LOCK_S | LOCK_REC_NOT_GAP, ..., c_lock->trx)`. Both predecessor and successor are on the same page, covered by the same `LockGuard` — zero additional latching cost. - **Cross-page** (`pred_heap_no == PAGE_HEAP_NO_INFIMUM`): The predecessor is the page infimum, meaning the actual predecessor record is on the previous page. Walk TX2's `trx_locks` list to check whether TX2 has any record lock on the previous page (`btr_page_get_prev()`). If not, TX2 entered this page via B-tree descent and never traversed the cross-page gap. If the page is the leftmost in the index (`FIL_NULL`), no previous page exists and `pred_ok` is trivially true. Acquiring `wait_trx->mutex` is safe: `lock_sys.latch` (shared) → `trx->mutex` follows established latch order (precedent: `lock_rec_convert_impl_to_expl_for_trx()`). **Granted-lock scan (condition 4)**: The scan for granted conflicting locks is needed because lock inheritance during purge can create granted `GAP` locks from other transactions that coexist with our `LOCK_ORDINARY` but still block `INSERT_INTENTION`. `lock_rec_other_has_conflicting()` returns only the first conflicting lock; if it is `WAITING` and we skip it, granted conflicting locks later in the queue would be missed without this explicit scan. The bogus deadlock in `lock_delete_updated` and `versioning.update` (MDEV-14829 section) is also eliminated. In `lock_delete_updated`, the `DELETE` no longer deadlocks but the row at the new PK position is missed by the forward scan (pre-existing behavior). In `versioning.update`, the concurrent `UPDATE` on a system-versioned table no longer deadlocks because the historical row `INSERT` correctly skips the waiting lock. New test cases cover `SELECT ... FOR UPDATE` (test 2), `UPDATE` (test 3), cross-page (infimum) predecessor where TX1's `INSERT` lands at the start of a non-first secondary index leaf page, triggering the `trx_locks` walk to verify TX2 has no locks on the previous page — with `INFORMATION_SCHEMA` verification that the locks span different pages (test 4), and a negative test where the predecessor check correctly blocks the optimization when TX2's range scan has already locked the predecessor record (test 5).
|
@iMineLink done |
iMineLink
left a comment
There was a problem hiding this comment.
Thanks for the rebase.
Please see my review comment about the change in lock_delete_updated.result.
I see there's an apparent contradiction between the reintroduction of MDEV-27992 and your recent fuzzing result via txn-tester (thanks for sharing!): I need to inspect those, and study how to use the tools to resurface MDEV-27992.
| UPDATE t SET a = 1; | ||
| COMMIT; | ||
| connection con1; | ||
| ERROR 40001: Deadlock found when trying to get lock; try restarting transaction |
There was a problem hiding this comment.
The commit which introduced this test is:
Commit: 86c1bf11
Author: Vlad Lesin <vlad_lesin@mail.ru>
Date: Mon Mar 7 13:03:53 2022 +0300
MDEV-27992 DELETE fails to delete record after blocking is released
MDEV-27025 allows to insert records before the record on which DELETE is
locked, as a result the DELETE misses those records, what causes serious ACID
violation.
Revert MDEV-27025, MDEV-27550. The test which shows the scenario of ACID
violation is added.
I believe here we are re-introducing the same behavior that was previously reverted as a "serious ACID violation". I tested locally and it can also happen on secondary indexes with this PR's commit. Here happens in the first page of the index, where there's no previous locked record to anchor on. I believe the existing (pre-PR) behavior is user-visible and does not lead to a lost delete, and may be favorable. I don't know if the rejection logic in lock_rec_insert_check_and_lock() can be patched to avoid this while keeping the other benefits.
|
I was able to make MDEV-27992 resurface via TROC, and left the script to do so on the MDEV-37974: The result matrices are below (tested against Debug builds): BASELINE
PATCHED
The difference is only in the |
When a transaction (TX1) holds a granted lock on a record and another
transaction (TX2) is waiting for that same record, an
INSERTby TX1into the gap before that record would incorrectly enter
lock_wait()on TX2's waiting lock, creating a false deadlock cycle.
In
lock_rec_insert_check_and_lock(), afterlock_rec_other_has_conflicting()returns a conflicting lock,check whether:
predecessor side (predecessor check — see below), and
with our
INSERT_INTENTION.Only skip the lock wait when all four conditions are met.
Predecessor check (condition 3): To prevent phantom reads, we
must verify that TX2 has not already traversed the gap between the
predecessor and successor records. If TX2 had scanned through this
gap (e.g., via a range scan), TX2 would hold a granted lock on the
predecessor, and allowing TX1 to insert in the gap could introduce
a phantom visible to TX2 when it resumes.
We check TX2's locks rather than TX1's because TX1 may only hold
implicit locks on secondary index records. When a
DELETEscansvia the clustered (PRIMARY) index, only implicit locks (represented
by the delete-mark on the record, with no explicit
lock_tstruct)exist on the secondary index. The successor's implicit lock gets
converted to explicit by
lock_rec_convert_impl_to_expl_for_trx()when TX2 conflicts with it, but the predecessor's implicit lock
remains without an explicit
lock_t. TX2, on the other hand,always acquires explicit locks during its scan via
sel_set_rec_lock()→lock_rec_lock(impl=false, ...), so itslocks are reliably visible to
lock_rec_has_expl().The predecessor check has two cases:
Same-page (
pred_heap_no != PAGE_HEAP_NO_INFIMUM): Checkthat TX2 has no granted lock on
pred_heap_nousinglock_rec_has_expl(LOCK_S | LOCK_REC_NOT_GAP, ..., c_lock->trx).Both predecessor and successor are on the same page, covered by
the same
LockGuard— zero additional latching cost.Cross-page (
pred_heap_no == PAGE_HEAP_NO_INFIMUM): Thepredecessor is the page infimum, meaning the actual predecessor
record is on the previous page. Walk TX2's
trx_lockslist tocheck whether TX2 has any record lock on the previous page
(
btr_page_get_prev()). If not, TX2 entered this page viaB-tree descent and never traversed the cross-page gap. If the
page is the leftmost in the index (
FIL_NULL), no previous pageexists and
pred_okis trivially true. Acquiringwait_trx->mutexis safe:lock_sys.latch(shared) →trx->mutexfollows established latch order (precedent:lock_rec_convert_impl_to_expl_for_trx()).Granted-lock scan (condition 4): The scan for granted
conflicting locks is needed because lock inheritance during purge
can create granted
GAPlocks from other transactions that coexistwith our
LOCK_ORDINARYbut still blockINSERT_INTENTION.lock_rec_other_has_conflicting()returns only the firstconflicting lock; if it is
WAITINGand we skip it, grantedconflicting locks later in the queue would be missed without this
explicit scan.
The bogus deadlock in
lock_delete_updatedandversioning.update(MDEV-14829 section) is also eliminated. In
lock_delete_updated,the
DELETEno longer deadlocks but the row at the new PK positionis missed by the forward scan (pre-existing behavior). In
versioning.update, the concurrentUPDATEon a system-versionedtable no longer deadlocks because the historical row
INSERTcorrectly skips the waiting lock.
New test cases cover
SELECT ... FOR UPDATE(test 2),UPDATE(test 3), cross-page (infimum) predecessor where TX1's
INSERTlands at the start of a non-first secondary index leaf page,
triggering the
trx_lockswalk to verify TX2 has no locks on theprevious page — with
INFORMATION_SCHEMAverification that thelocks span different pages (test 4), and a negative test where the
predecessor check correctly blocks the optimization when TX2's
range scan has already locked the predecessor record (test 5).