Skip to content

MDEV-37974 Avoid bogus deadlock in lock_rec_insert_check_and_lock()#4672

Open
arcivanov wants to merge 1 commit into
MariaDB:10.11from
arcivanov:MDEV-37974
Open

MDEV-37974 Avoid bogus deadlock in lock_rec_insert_check_and_lock()#4672
arcivanov wants to merge 1 commit into
MariaDB:10.11from
arcivanov:MDEV-37974

Conversation

@arcivanov

@arcivanov arcivanov commented Feb 20, 2026

Copy link
Copy Markdown
Contributor

When a transaction (TX1) holds a granted lock on a record and another
transaction (TX2) is waiting for that same record, an INSERT by TX1
into the gap before that record would incorrectly enter lock_wait()
on TX2's waiting lock, creating a false deadlock cycle.

In lock_rec_insert_check_and_lock(), after
lock_rec_other_has_conflicting() returns a conflicting lock,
check whether:

  1. the conflicting lock is WAITING,
  2. we hold a granted X lock on the successor record,
  3. TX2 has not already scanned through the gap from the
    predecessor side (predecessor check — see below), and
  4. no other transaction holds a GRANTED lock that conflicts
    with our INSERT_INTENTION.

Only skip the lock wait when all four conditions are met.

Predecessor check (condition 3): To prevent phantom reads, we
must verify that TX2 has not already traversed the gap between the
predecessor and successor records. If TX2 had scanned through this
gap (e.g., via a range scan), TX2 would hold a granted lock on the
predecessor, and allowing TX1 to insert in the gap could introduce
a phantom visible to TX2 when it resumes.

We check TX2's locks rather than TX1's because TX1 may only hold
implicit locks on secondary index records. When a DELETE scans
via the clustered (PRIMARY) index, only implicit locks (represented
by the delete-mark on the record, with no explicit lock_t struct)
exist on the secondary index. The successor's implicit lock gets
converted to explicit by lock_rec_convert_impl_to_expl_for_trx()
when TX2 conflicts with it, but the predecessor's implicit lock
remains without an explicit lock_t. TX2, on the other hand,
always acquires explicit locks during its scan via
sel_set_rec_lock()lock_rec_lock(impl=false, ...), so its
locks are reliably visible to lock_rec_has_expl().

The predecessor check has two cases:

  • Same-page (pred_heap_no != PAGE_HEAP_NO_INFIMUM): Check
    that TX2 has no granted lock on pred_heap_no using
    lock_rec_has_expl(LOCK_S | LOCK_REC_NOT_GAP, ..., c_lock->trx).
    Both predecessor and successor are on the same page, covered by
    the same LockGuard — zero additional latching cost.

  • Cross-page (pred_heap_no == PAGE_HEAP_NO_INFIMUM): The
    predecessor is the page infimum, meaning the actual predecessor
    record is on the previous page. Walk TX2's trx_locks list to
    check whether TX2 has any record lock on the previous page
    (btr_page_get_prev()). If not, TX2 entered this page via
    B-tree descent and never traversed the cross-page gap. If the
    page is the leftmost in the index (FIL_NULL), no previous page
    exists and pred_ok is trivially true. Acquiring
    wait_trx->mutex is safe: lock_sys.latch (shared) →
    trx->mutex follows established latch order (precedent:
    lock_rec_convert_impl_to_expl_for_trx()).

Granted-lock scan (condition 4): The scan for granted
conflicting locks is needed because lock inheritance during purge
can create granted GAP locks from other transactions that coexist
with our LOCK_ORDINARY but still block INSERT_INTENTION.
lock_rec_other_has_conflicting() returns only the first
conflicting lock; if it is WAITING and we skip it, granted
conflicting locks later in the queue would be missed without this
explicit scan.

The bogus deadlock in lock_delete_updated and versioning.update
(MDEV-14829 section) is also eliminated. In lock_delete_updated,
the DELETE no longer deadlocks but the row at the new PK position
is missed by the forward scan (pre-existing behavior). In
versioning.update, the concurrent UPDATE on a system-versioned
table no longer deadlocks because the historical row INSERT
correctly skips the waiting lock.

New test cases cover SELECT ... FOR UPDATE (test 2), UPDATE
(test 3), cross-page (infimum) predecessor where TX1's INSERT
lands at the start of a non-first secondary index leaf page,
triggering the trx_locks walk to verify TX2 has no locks on the
previous page — with INFORMATION_SCHEMA verification that the
locks span different pages (test 4), and a negative test where the
predecessor check correctly blocks the optimization when TX2's
range scan has already locked the predecessor record (test 5).

@arcivanov arcivanov force-pushed the MDEV-37974 branch 3 times, most recently from 23599ab to d9e12c5 Compare February 20, 2026 07:25
@arcivanov arcivanov marked this pull request as draft February 20, 2026 08:16
@arcivanov arcivanov force-pushed the MDEV-37974 branch 2 times, most recently from c04ea78 to df10682 Compare February 20, 2026 08:20
@gkodinov gkodinov added the External Contribution All PRs from entities outside of MariaDB Foundation, Corporation, Codership agreements. label Feb 20, 2026
@arcivanov arcivanov force-pushed the MDEV-37974 branch 5 times, most recently from 0171d8d to a99f1e0 Compare February 22, 2026 02:27
@arcivanov arcivanov marked this pull request as ready for review February 22, 2026 05:02
@arcivanov

Copy link
Copy Markdown
Contributor Author

@dr-m

@gkodinov gkodinov left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution. This is a preliminary review. Please stand by for the final review.

@dr-m dr-m left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, this is interesting. But I think that we should be very careful here. We have been bitten by MDEV-27025 in the past. I think that we would need some additional testing in the style of https://mariadb.com/resources/blog/isolation-level-violation-testing-and-debugging-in-mariadb/ because our regular stress testing does not cover consistency or isolation very well.

Comment thread mysql-test/suite/innodb/t/lock_delete_updated.test Outdated
Comment thread mysql-test/suite/innodb/t/mdev_37974.opt Outdated
Comment thread mysql-test/suite/innodb/t/mdev_37974.test
Comment thread storage/innobase/lock/lock0lock.cc
Comment thread storage/innobase/lock/lock0lock.cc Outdated
Comment thread storage/innobase/lock/lock0lock.cc Outdated
@arcivanov arcivanov force-pushed the MDEV-37974 branch 6 times, most recently from a8e2ff9 to 4d36da5 Compare February 28, 2026 23:44
@gkodinov gkodinov requested a review from dr-m March 23, 2026 12:44
@dr-m

dr-m commented Mar 26, 2026

Copy link
Copy Markdown
Contributor

I just posted this comment with some notes on how I think this would have to be tested. Unfortunately, I cannot currently spend time on any thorough review that would be needed here.

@dr-m dr-m left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs more testing than is possible within the mtr framework.

@gkodinov

Copy link
Copy Markdown
Member

@arcivanov there's a question for you from Marko here. It's been a while. Do you intend to keep working on this PR? If not, please convert to draft.

@arcivanov

Copy link
Copy Markdown
Contributor Author

@gkodinov I'm working on all my bugs right now including HEAP.

@arcivanov

Copy link
Copy Markdown
Contributor Author

@gkodinov Which Marko's question are you talking about? Per JIRA Marko asked me if I can hook up the whole of MariaDB to Jepsen and I don't see any other questions:

I think that we should try to test this with the Jepsen tool. I didn’t get involved with it. Only Vladislav Lesin did, and he is no longer working for us. There was an idea to run some Jepsen tests on our CI, but to my understanding, it was never deployed. Related to that and https://mariadb.com/resources/blog/isolation-level-violation-testing-and-debugging-in-mariadb/, MariaDB/buildbot#524 should be a useful starting point.

It would help a lot if you could investigate how to run Jepsen and how to cover this type of scenario in that framework.

It's a pretty large task, I'll need to figure out if I can do that.

@arcivanov

Copy link
Copy Markdown
Contributor Author

@dr-m So I have the Jepsen cleaned up and modernized MariaDB/jepsen-mariadb#1

My question is what kinds of tests do you want? Jepsen works probabilistically and my current MTR already covers the known scenarios.

Comment on lines 5783 to +5793
if (lock_t *c_lock= lock_rec_other_has_conflicting(type_mode,
g.cell(), id,
heap_no, trx))
{
trx->mutex_lock();
err= lock_rec_enqueue_waiting(c_lock, type_mode, id, block->page.frame,
heap_no, index, thr, nullptr);
trx->mutex_unlock();
const ulint pred_heap_no= comp
? rec_get_heap_no_new(rec) : rec_get_heap_no_old(rec);

/* MDEV-37974: If the first conflicting lock is WAITING and
we hold a granted X lock on the successor record, the
waiting lock is necessarily blocked behind our lock in the
queue and can never be granted while our lock exists.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic may also be executed both on primary key indexes (clustered index) and on secondary indexes.

The test case in MDEV-37974 seems to involve both types of indexes; the FOREIGN KEY implies a secondary index on child.parent_id.

My main concern with changes to locking is that a regression such as MDEV-27992 might be introduced. The scenario of that bug involved an UPDATE of a PRIMARY KEY column, which is a rare but possible operation.

I assume that there is an automated tool that produces such reports, because I have seen several reports that use similar notation. However, I don’t know what that tool is. This bug was not found in our own testing. I would hope that Jepsen would be able to catch that bug, but I haven’t tried it myself.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will be using the TROC automated tool to test this patch under both the Read Committed (RC) and Repeatable Read (RR) isolation levels.
TROC is the same framework that successfully discovered the transaction anomaly in MDEV-27992.

@arcivanov arcivanov Jun 21, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have created a txn-tester https://github.com/arcivanov/txn-tester which combines TROC, Fucci and APTrans (projects backing followon papers to TROC) into a single test container and standardizes their behavior. Feel free to adopt it into MariaDB if you wish (it's under Apache 2.0 but I'll transfer the copyright/change license if you do so).
I also created a PR into buildbot to take advantage of the txn-tester for automated CI testing: MariaDB/buildbot#975. If you adopt the txn-tester into MariaDB the GHCR location of the image will change, but otherwise it should work.

Finally, I ran the MDEV-37974 against the tool and here are the results and the report based on the results:

MDEV-37974 — Transaction/isolation fuzz testing of the lock_rec_insert_check_and_lock() fix

Summary

The MDEV-37974 branch (commit 4d36da5, "Avoid bogus deadlock in
lock_rec_insert_check_and_lock()", on top of 10.11.17) was built as a debug
server and exercised with three independent transaction/isolation anomaly
oracles — TROC, Fucci, and APTrans — running concurrently against it.
The same tools were then run against a baseline binary built from the
identical tree with only storage/innobase/lock/lock0lock.cc reverted to its
parent, to attribute every finding.

Result: no isolation regression attributable to the change. Every anomaly the
tools reported also appears on the unmodified baseline (and TROC/Fucci reported
more on the baseline). The debug server (UNIV_DEBUG, _GLIBCXX_DEBUG,
SAFE_MUTEX) raised no assertion, signal, or InnoDB invariant failure during
any run. The fix's own deterministic MTR tests pass.

Findings overview (patched vs. baseline)

Tool Patched (with fix) Baseline (no fix) Verdict
TROC — logged inconsistencies 0 2 Noise — fewer with the fix
Fucci — logged inconsistencies 1 2 Noise (no INSERT in schedule)
APTrans — SERIALIZABLE anomalies 35 29 Oracle noise
APTrans — REPEATABLE READ anomalies 33 21 Oracle noise
APTrans — READ COMMITTED anomalies 11 4 Oracle noise
Server assertion / crash (debug) none none Clean

The full report (with environment, per-signal triage, MTR results, and reproduction steps) is in the zip.

MDEV-37974-txn-tester-output.zip

@gkodinov

Copy link
Copy Markdown
Member

Sorry for the delay on our end, @arcivanov. I've been tasked to find ways to get this contribution moving again. I should have some updates for you by the end of the week latest.

In the meanwhile, I'd like to also offer some general statements on pull request processing cadence that should set expectations, as requested by @vuvova:

According to our development cycle we work on bugs In the following periods 15.03 - 30.04, 15.06 - 30.07, 15.12 - 31.01 and 15.09 - 30.10. So, please, expect to get a review somewhere between these two dates and the goal is to have your PR merged before the second date.

err= lock_rec_enqueue_waiting(c_lock, type_mode, id, block->page.frame,
heap_no, index, thr, nullptr);
trx->mutex_unlock();
const ulint pred_heap_no= comp

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check this crash , i can repro this crash only on this PR

{code:sql}
--source include/have_innodb.inc
--source include/have_partition.inc
CREATE TABLE t1 (c1 VARCHAR(3073) UNIQUE,c2 DATE)Engine=InnoDB PARTITION BY KEY (c2) PARTITIONS 2;
INSERT IGNORE INTO t1 VALUES (1,1),(2,2);
ALTER IGNORE TABLE t1 FORCE;
{code}
{noformat:title=MDEV-37974 CS 10.11.17 4d36da565ae4e6d75f6bf21922b80f9e51f0b124 (Debug, Clang 22.1.8-20260622) Build 22/06/2026}
mariadbd: /data/MDEV-37974_dbg/storage/innobase/handler/ha_innodb.cc:15818: void trx_t::reset_and_truncate_undo(const dict_table_t *): Assertion `mod_tables.size() == 0 || mod_tables.begin()->first == table' failed.
{noformat}


{noformat:title=MDEV-37974 CS 10.11.17 4d36da565ae4e6d75f6bf21922b80f9e51f0b124 (Debug, Clang 22.1.8-20260622) Build 22/06/2026}
Core was generated by `/data/MDEV-37974-MD220626-mariadb-10.11.17-linux-x86_64-dbg/bin/mariadbd --defa'.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>)at ./nptl/pthread_kill.c:44

[Current thread is 1 (LWP 4072395)]
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>)at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=<optimized out>)at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6)at ./nptl/pthread_kill.c:89
#3  0x000073129b04527e in __GI_raise (sig=sig@entry=6)at ../sysdeps/posix/raise.c:26
#4  0x000073129b0288ff in __GI_abort () at ./stdlib/abort.c:79
#5  0x000073129b02881b in __assert_fail_base (fmt=0x73129b1d01e8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x5607958c4119 "mod_tables.size() == 0 || mod_tables.begin()->first == table", file=file@entry=0x5607958bbc3a "/data/MDEV-37974_dbg/storage/innobase/handler/ha_innodb.cc", line=line@entry=15818, function=function@entry=0x5607958c40df "void trx_t::reset_and_truncate_undo(const dict_table_t *)") at ./assert/assert.c:96
#6  0x000073129b03b517 in __assert_fail (assertion=0x5607958c4119 "mod_tables.size() == 0 || mod_tables.begin()->first == table", file=0x5607958bbc3a "/data/MDEV-37974_dbg/storage/innobase/handler/ha_innodb.cc", line=15818, function=0x5607958c40df "void trx_t::reset_and_truncate_undo(const dict_table_t *)") at ./assert/assert.c:105
#7  0x00005607950af15d in trx_t::reset_and_truncate_undo (this=0x73129a923b80, table=0x7311a41a1950)at /data/MDEV-37974_dbg/storage/innobase/handler/ha_innodb.cc:15818
#8  0x00005607950967ab in ha_innobase::extra (this=0x7311a406ce58, operation=HA_EXTRA_END_COPY)at /data/MDEV-37974_dbg/storage/innobase/handler/ha_innodb.cc:15927
#9  0x000056079501fe93 in extra_cb (h=0x7311a406ce58, operation=0x731298d689ac)at /data/MDEV-37974_dbg/sql/ha_partition.cc:8998
#10 0x0000560795020356 in ha_partition::loop_partitions_over_map (this=0x7311a406c5b8, part_map=0x7311a406ee20, callback=0x56079501fe70 <extra_cb(handler*, void*)>, param=0x731298d689ac)at /data/MDEV-37974_dbg/sql/ha_partition.cc:9710
#11 0x0000560795016f87 in ha_partition::loop_partitions (this=0x7311a406c5b8, callback=0x56079501fe70 <extra_cb(handler*, void*)>, param=0x731298d689ac)at /data/MDEV-37974_dbg/sql/ha_partition.cc:9658
#12 0x000056079501fdb9 in ha_partition::extra (this=0x7311a406c5b8, operation=HA_EXTRA_END_COPY)at /data/MDEV-37974_dbg/sql/ha_partition.cc:9496
#13 0x0000560794a7dc09 in copy_data_between_tables (thd=0x7311a4000d58, from=0x7311a409db58, to=0x7311a4058e68, ignore=true, order_num=0, order=0x0, copied=0x731298d6a280, deleted=0x731298d6a278, alter_info=0x731298d6d060, alter_ctx=0x731298d6b0f8)at /data/MDEV-37974_dbg/sql/sql_table.cc:12475
#14 0x0000560794a74c0c in mysql_alter_table (thd=0x7311a4000d58, new_db=0x7311a4005908, new_name=0x7311a4005d68, create_info=0x731298d6d1d0, table_list=0x7311a4014758, recreate_info=0x731298d6c8b0, alter_info=0x731298d6d060, order_num=0, order=0x0, ignore=true, if_exists=false)at /data/MDEV-37974_dbg/sql/sql_table.cc:11597
#15 0x0000560794b23add in Sql_cmd_alter_table::execute (this=0x7311a4014e78, thd=0x7311a4000d58) at /data/MDEV-37974_dbg/sql/sql_alter.cc:688
#16 0x0000560794975a00 in mysql_execute_command (thd=0x7311a4000d58, is_called_from_prepared_stmt=false)at /data/MDEV-37974_dbg/sql/sql_parse.cc:6201
#17 0x0000560794965978 in mysql_parse (thd=0x7311a4000d58, rawbuf=0x7311a4014670 "ALTER IGNORE TABLE t1 FORCE", length=27, parser_state=0x731298d6ea20) at /data/MDEV-37974_dbg/sql/sql_parse.cc:8223
#18 0x000056079496322a in dispatch_command (command=COM_QUERY, thd=0x7311a4000d58, packet=0x7311a40a07c9 "", packet_length=27, blocking=true) at /data/MDEV-37974_dbg/sql/sql_parse.cc:1924
#19 0x00005607949663fa in do_command (thd=0x7311a4000d58, blocking=true)at /data/MDEV-37974_dbg/sql/sql_parse.cc:1434
#20 0x0000560794b1987e in do_handle_one_connection (connect=0x5607d5190248, put_in_cache=true) at /data/MDEV-37974_dbg/sql/sql_connect.cc:1475
#21 0x0000560794b19665 in handle_one_connection (arg=0x5607d5180f48)at /data/MDEV-37974_dbg/sql/sql_connect.cc:1387
#22 0x000073129b09caa4 in start_thread (arg=<optimized out>)at ./nptl/pthread_create.c:447
#23 0x000073129b129c6c in clone3 ()at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
{noformat}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not reproduce on my end after the rebase (7bcbcb65ab6), was fixed likely by MDEV-38928.

return DB_SUCCESS;

DBUG_LOG("ib_lock",
"insert_check trx " << ib::hex(trx->id)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another crash, which I only see on this PR, was not found in the latest ES and CS versions

Crashing SQL
CREATE TEMPORARY TABLE t IGNORE AS SELECT 1 UNION SELECT 2;


{noformat:title=MDEV-37974 CS 10.11.17 4d36da565ae4e6d75f6bf21922b80f9e51f0b124 (Debug, Clang 22.1.8-20260622) Build 22/06/2026}
mariadbd: /data/MDEV-37974_dbg/storage/innobase/handler/ha_innodb.cc:15817: void trx_t::reset_and_truncate_undo(const dict_table_t *): Assertion `undo_no <= 1' failed.
{noformat}

{noformat:title=MDEV-37974 CS 10.11.17 4d36da565ae4e6d75f6bf21922b80f9e51f0b124 (Debug, Clang 22.1.8-20260622) Build 22/06/2026}
Core was generated by `/data/MDEV-37974-MD220626-mariadb-10.11.17-linux-x86_64-dbg/bin/mariadbd --no-d'.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>)at ./nptl/pthread_kill.c:44

[Current thread is 1 (LWP 4162121)]
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>)at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=<optimized out>)at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6)at ./nptl/pthread_kill.c:89
#3  0x000077762ca4527e in __GI_raise (sig=sig@entry=6)at ../sysdeps/posix/raise.c:26
#4  0x000077762ca288ff in __GI_abort () at ./stdlib/abort.c:79
#5  0x000077762ca2881b in __assert_fail_base (fmt=0x77762cbd01e8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x6081487040d2 "undo_no <= 1", file=file@entry=0x6081486fbc3a "/data/MDEV-37974_dbg/storage/innobase/handler/ha_innodb.cc", line=line@entry=15817, f
unction=function@entry=0x6081487040df "void trx_t::reset_and_truncate_undo(const dict_table_t *)") at ./assert/assert.c:96
#6  0x000077762ca3b517 in __assert_fail (assertion=0x6081487040d2 "undo_no <= 1", file=0x6081486fbc3a "/data/MDEV-37974_dbg/storage/innobase/handler/ha_innodb.cc", line=15817, function=0x6081487040df "void trx_t::reset_and_truncate_undo(const dict_table_t *)") at ./assert/assert.c:10
5
#7  0x0000608147eef056 in trx_t::reset_and_truncate_undo (this=0x77762c4dbb80, table=0x7775400abd50)at /data/MDEV-37974_dbg/storage/innobase/handler/ha_innodb.cc:15817
#8  0x0000608147ed67ab in ha_innobase::extra (this=0x7775400ae758, operation=HA_EXTRA_END_COPY)at /data/MDEV-37974_dbg/storage/innobase/handler/ha_innodb.cc:15927
#9  0x0000608147761a58 in select_insert::prepare_eof (this=0x7775400154c0)at /data/MDEV-37974_dbg/sql/sql_insert.cc:4545
#10 0x00006081477645fc in select_create::send_eof (this=0x7775400154c0)at /data/MDEV-37974_dbg/sql/sql_insert.cc:5364
#11 0x000060814782af93 in do_select (join=0x777540016d58, procedure=0x0)at /data/MDEV-37974_dbg/sql/sql_select.cc:22654
#12 0x000060814782a254 in JOIN::exec_inner (this=0x777540016d58)at /data/MDEV-37974_dbg/sql/sql_select.cc:5023
#13 0x0000608147829613 in JOIN::exec (this=0x777540016d58)at /data/MDEV-37974_dbg/sql/sql_select.cc:4807
#14 0x0000608147808b6d in mysql_select (thd=0x777540000d58, tables=0x777540005078, fields=@0x777540005830: {<base_list> = {<Sql_alloc> = {<No data fields>}, first = 0x777540016870, last = 0x777540016870, elements = 1}, <No data fields>}, conds=0x0, og_num=0, order=0x0, group=0x0, hav
ing=0x0, proc_param=0x0, select_options=2199023255552, result=0x7775400154c0, unit=0x777540005028, select_lex=0x777540014fc0) at /data/MDEV-37974_dbg/sql/sql_select.cc:5285
#15 0x00006081478d7d52 in st_select_lex_unit::exec (this=0x777540005028)at /data/MDEV-37974_dbg/sql/sql_union.cc:2409
#16 0x00006081478d49a7 in mysql_union (thd=0x777540000d58, lex=0x777540004f50, result=0x7775400154c0, unit=0x777540005028, setup_tables_done_option=0)at /data/MDEV-37974_dbg/sql/sql_union.cc:42
#17 0x0000608147808587 in handle_select (thd=0x777540000d58, lex=0x777540004f50, result=0x7775400154c0, setup_tables_done_option=0)at /data/MDEV-37974_dbg/sql/sql_select.cc:591
#18 0x00006081478c01d8 in Sql_cmd_create_table_like::execute (this=0x777540013538, thd=0x777540000d58)at /data/MDEV-37974_dbg/sql/sql_table.cc:13104
#19 0x00006081477b5a00 in mysql_execute_command (thd=0x777540000d58, is_called_from_prepared_stmt=false)at /data/MDEV-37974_dbg/sql/sql_parse.cc:6201
#20 0x00006081477a5978 in mysql_parse (thd=0x777540000d58, rawbuf=0x777540013460 "CREATE TEMPORARY TABLE t IGNORE AS SELECT 1 UNION SELECT 2", length=58, parser_state=0x77762c097a20)at /data/MDEV-37974_dbg/sql/sql_parse.cc:8223
#21 0x00006081477a322a in dispatch_command (command=COM_QUERY, thd=0x777540000d58, packet=0x77754000aee9 "CREATE TEMPORARY TABLE t IGNORE AS SELECT 1 UNION SELECT 2", packet_length=58, blocking=true)at /data/MDEV-37974_dbg/sql/sql_parse.cc:1924
#22 0x00006081477a63fa in do_command (thd=0x777540000d58, blocking=true)at /data/MDEV-37974_dbg/sql/sql_parse.cc:1434
#23 0x000060814795987e in do_handle_one_connection (connect=0x608157900408, put_in_cache=true) at /data/MDEV-37974_dbg/sql/sql_connect.cc:1475
#24 0x0000608147959665 in handle_one_connection (arg=0x60815791ff68)at /data/MDEV-37974_dbg/sql/sql_connect.cc:1387
#25 0x000077762ca9caa4 in start_thread (arg=<optimized out>)at ./nptl/pthread_create.c:447
#26 0x000077762cb29c6c in clone3 ()at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
{noformat}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not reproduce on my end after the rebase (7bcbcb65ab6), was fixed likely by MDEV-38928.

@iMineLink

Copy link
Copy Markdown
Contributor

@arcivanov could you please rebase your changes and force push? I tested and the failures reported by @mariadb-SaahilAlam did not reproduce on a rebased version of your commit on top of 84a859a6366, today's 10.11. It seems the crashes were fixed by https://jira.mariadb.org/browse/MDEV-38928.

When a transaction (TX1) holds a granted lock on a record and another
transaction (TX2) is waiting for that same record, an `INSERT` by TX1
into the gap before that record would incorrectly enter `lock_wait()`
on TX2's waiting lock, creating a false deadlock cycle.

In `lock_rec_insert_check_and_lock()`, after
`lock_rec_other_has_conflicting()` returns a conflicting lock,
check whether:

1. the conflicting lock is **WAITING**,
2. we hold a **granted** X lock on the successor record,
3. TX2 has **not** already scanned through the gap from the
   predecessor side (predecessor check — see below), and
4. no other transaction holds a **GRANTED** lock that conflicts
   with our `INSERT_INTENTION`.

Only skip the lock wait when all four conditions are met.

**Predecessor check (condition 3)**: To prevent phantom reads, we
must verify that TX2 has not already traversed the gap between the
predecessor and successor records. If TX2 had scanned through this
gap (e.g., via a range scan), TX2 would hold a granted lock on the
predecessor, and allowing TX1 to insert in the gap could introduce
a phantom visible to TX2 when it resumes.

We check TX2's locks rather than TX1's because TX1 may only hold
**implicit** locks on secondary index records. When a `DELETE` scans
via the clustered (PRIMARY) index, only implicit locks (represented
by the delete-mark on the record, with no explicit `lock_t` struct)
exist on the secondary index. The successor's implicit lock gets
converted to explicit by `lock_rec_convert_impl_to_expl_for_trx()`
when TX2 conflicts with it, but the predecessor's implicit lock
remains without an explicit `lock_t`. TX2, on the other hand,
always acquires explicit locks during its scan via
`sel_set_rec_lock()` → `lock_rec_lock(impl=false, ...)`, so its
locks are reliably visible to `lock_rec_has_expl()`.

The predecessor check has two cases:

- **Same-page** (`pred_heap_no != PAGE_HEAP_NO_INFIMUM`): Check
  that TX2 has no granted lock on `pred_heap_no` using
  `lock_rec_has_expl(LOCK_S | LOCK_REC_NOT_GAP, ..., c_lock->trx)`.
  Both predecessor and successor are on the same page, covered by
  the same `LockGuard` — zero additional latching cost.

- **Cross-page** (`pred_heap_no == PAGE_HEAP_NO_INFIMUM`): The
  predecessor is the page infimum, meaning the actual predecessor
  record is on the previous page. Walk TX2's `trx_locks` list to
  check whether TX2 has any record lock on the previous page
  (`btr_page_get_prev()`). If not, TX2 entered this page via
  B-tree descent and never traversed the cross-page gap. If the
  page is the leftmost in the index (`FIL_NULL`), no previous page
  exists and `pred_ok` is trivially true. Acquiring
  `wait_trx->mutex` is safe: `lock_sys.latch` (shared) →
  `trx->mutex` follows established latch order (precedent:
  `lock_rec_convert_impl_to_expl_for_trx()`).

**Granted-lock scan (condition 4)**: The scan for granted
conflicting locks is needed because lock inheritance during purge
can create granted `GAP` locks from other transactions that coexist
with our `LOCK_ORDINARY` but still block `INSERT_INTENTION`.
`lock_rec_other_has_conflicting()` returns only the first
conflicting lock; if it is `WAITING` and we skip it, granted
conflicting locks later in the queue would be missed without this
explicit scan.

The bogus deadlock in `lock_delete_updated` and `versioning.update`
(MDEV-14829 section) is also eliminated. In `lock_delete_updated`,
the `DELETE` no longer deadlocks but the row at the new PK position
is missed by the forward scan (pre-existing behavior). In
`versioning.update`, the concurrent `UPDATE` on a system-versioned
table no longer deadlocks because the historical row `INSERT`
correctly skips the waiting lock.

New test cases cover `SELECT ... FOR UPDATE` (test 2), `UPDATE`
(test 3), cross-page (infimum) predecessor where TX1's `INSERT`
lands at the start of a non-first secondary index leaf page,
triggering the `trx_locks` walk to verify TX2 has no locks on the
previous page — with `INFORMATION_SCHEMA` verification that the
locks span different pages (test 4), and a negative test where the
predecessor check correctly blocks the optimization when TX2's
range scan has already locked the predecessor record (test 5).
@arcivanov

Copy link
Copy Markdown
Contributor Author

@iMineLink done

@iMineLink iMineLink left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the rebase.
Please see my review comment about the change in lock_delete_updated.result.
I see there's an apparent contradiction between the reintroduction of MDEV-27992 and your recent fuzzing result via txn-tester (thanks for sharing!): I need to inspect those, and study how to use the tools to resurface MDEV-27992.

UPDATE t SET a = 1;
COMMIT;
connection con1;
ERROR 40001: Deadlock found when trying to get lock; try restarting transaction

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commit which introduced this test is:

Commit: 86c1bf11
Author: Vlad Lesin <vlad_lesin@mail.ru>
Date: Mon Mar 7 13:03:53 2022 +0300

MDEV-27992 DELETE fails to delete record after blocking is released

MDEV-27025 allows to insert records before the record on which DELETE is
locked, as a result the DELETE misses those records, what causes serious ACID
violation.
Revert MDEV-27025, MDEV-27550. The test which shows the scenario of ACID
violation is added.

I believe here we are re-introducing the same behavior that was previously reverted as a "serious ACID violation". I tested locally and it can also happen on secondary indexes with this PR's commit. Here happens in the first page of the index, where there's no previous locked record to anchor on. I believe the existing (pre-PR) behavior is user-visible and does not lead to a lost delete, and may be favorable. I don't know if the rejection logic in lock_rec_insert_check_and_lock() can be patched to avoid this while keeping the other benefits.

@iMineLink

Copy link
Copy Markdown
Contributor

I was able to make MDEV-27992 resurface via TROC, and left the script to do so on the MDEV-37974: run-troc-mdev27992.sh.
Turns out TROC already contained a test case for MDEV-27992 at READ COMMITTED (RC) isolation level (cases/mariadb27992.txt).
I added 3 more tests: one similar but at REPEATABLE READ (RR) isolation level, and 2 for similar issue on secondary index both at RC and RR isolation levels.

The result matrices are below (tested against Debug builds):

BASELINE 84a859a6366:

clustered secondary
RC Inconsistent [3] Inconsistent [10]
RR Undecided (deadlock) Undecided (deadlock)

PATCHED 7bcbcb65ab6:

clustered secondary
RC Inconsistent [3] Inconsistent [10]
RR Inconsistent [1] Inconsistent [10]

The difference is only in the REPEATABLE READ results: baseline raises ER_LOCK_DEADLOCK and this PR removes it, so the DELETE silently keeps the relocated row, on both clustered and secondary indexes. READ COMMITTED is identical on both and pre-existing (no gap locks there, so that guard never applied). So I believe the RR cases are attributable to this PR; RC is not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

External Contribution All PRs from entities outside of MariaDB Foundation, Corporation, Codership agreements.

Development

Successfully merging this pull request may close these issues.

6 participants