Conversation

@janlindstrom

Problem is that FLUSH TABLES FOR EXPORT is a local operation (i.e. it is not replicated by Galera) but it takes an MDL lock. This MDL lock can then conflict with an INSERT from another node, causing the INSERT to be BF aborted. This depends on timing: if we find out quickly enough that the INSERT is waiting on the MDL lock, we do UNLOCK TABLES in time and avoid the BF abort. If not, there will be a BF abort.

Test case is fixed so that the number of BF aborts is no longer queried, as it is not stable. Furthermore, improved error printing and added a warning for when a query is interrupted and there is an error in the wsrep layer.
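As a rough illustration, a minimal sketch of the decision described above; the helper name and enum are hypothetical, not the actual MariaDB code:

// Sketch only: when a BF (brute-force) applier transaction hits an MDL
// conflict with a locally granted lock, the server either lets the
// applier wait or BF-aborts the lock holder. FLUSH TABLES ... FOR EXPORT
// is local (not replicated), so aborting it on behalf of a replicated
// INSERT is wrong; the applier should wait for UNLOCK TABLES instead.
enum class mdl_conflict_action { WAIT, BF_ABORT };

static mdl_conflict_action
resolve_bf_mdl_conflict(bool holder_is_local_ftfe)
{
  return holder_is_local_ftfe ? mdl_conflict_action::WAIT
                              : mdl_conflict_action::BF_ABORT;
}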

@janlindstrom janlindstrom self-assigned this Dec 10, 2025

@hemantdangi-gc hemantdangi-gc left a comment

> Problem is that FLUSH TABLES FOR EXPORT is a local operation (i.e. it is
> not replicated by Galera) but it takes an MDL lock. This MDL lock can then
> conflict with an INSERT from another node, causing the INSERT to be BF
> aborted. This depends on timing: if we find out quickly enough that the
> INSERT is waiting on the MDL lock, we do UNLOCK TABLES in time and avoid
> the BF abort. If not, there will be a BF abort.
>
> Test case is fixed so that the number of BF aborts is no longer queried,
> as it is not stable.

The above two are optimizations, but they are not related to the error. The error is for the query 'SET SESSION wsrep_sync_wait = 0', not for the INSERT:
mysqltest: At line 19: query 'SET SESSION wsrep_sync_wait = 0' failed: ER_QUERY_INTERRUPTED (1317): Query execution was interrupted

The additional error printing and warning will give more information about the issue, but I don't think these changes resolve it.

@hemantdangi-gc hemantdangi-gc left a comment

This looks better. I have one doubt regarding the requesting thread, whether that will also create a problem; otherwise looks good.

This will also need server review, as you have changes to the MDL_ticket class.

/* These cases should already be handled above */
DBUG_ASSERT(granted_sql_command != SQLCOM_FLUSH &&
            granted_sql_command != SQLCOM_LOCK_TABLES &&
            request_thd->lex->sql_command != SQLCOM_DROP_TABLE);

Could a similar issue happen with request_thd->lex->sql_command?

@svoj svoj left a comment

We normally determine connection state via THD members. E.g.:

    * thd->current_backup_stage != BACKUP_FINISHED - there's an ongoing BACKUP
    * thd->global_read_lock.is_acquired() - an ongoing FTWRL
    * thd->locked_tables_mode == LTM_LOCK_TABLES - an ongoing FTFE or LOCK TABLES

The MDL subsystem can be used to retrieve some of this information too, but I don't think it is right to treat it as the primary source for THD state.

Also, granted_sql_command == SQLCOM_LOCK_TABLES - is it a dead branch? IIUC it must have mdl_context.has_explicit_locks(), so it should hit the previous branch?
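
For illustration, a minimal sketch of the THD-state checks suggested above, assuming MariaDB's sql/sql_class.h definitions; the helper name is hypothetical:

#include "sql_class.h"  // THD, BACKUP_FINISHED, LTM_LOCK_TABLES (MariaDB tree)

// Hypothetical helper: classify the granted connection via THD state
// rather than thd->lex->sql_command.
static bool holder_in_protected_state(const THD *granted_thd)
{
  return granted_thd->current_backup_stage != BACKUP_FINISHED || // BACKUP STAGE
         granted_thd->global_read_lock.is_acquired()          || // FTWRL
         granted_thd->locked_tables_mode == LTM_LOCK_TABLES;     // FTFE / LOCK TABLES
}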

@janlindstrom
Author

@svoj Thank you for your review. I think I will remove the added field from MDL_ticket and rewrite the Galera MDL-conflict resolution using your suggestions.

@svoj

svoj commented Dec 16, 2025

@janlindstrom thinking more on this, I think a more reliable way to determine what you need would be to check the ticket namespace and type. Though using what I suggested above is most probably alright; I'd leave it up to you to decide which way you want to go.

    * key->mdl_namespace() == MDL_key::BACKUP && (ticket->get_type() <= MDL_BACKUP_WAIT_COMMIT || ticket->get_type() == MDL_BACKUP_BLOCK_DDL) - this is BACKUP STAGE

    * key->mdl_namespace() == MDL_key::BACKUP && (ticket->get_type() == MDL_BACKUP_FTWRL1 || ticket->get_type() == MDL_BACKUP_FTWRL2) - this is FTWRL

    * key->mdl_namespace() == MDL_key::TABLE && (ticket->get_type() == MDL_SHARED_READ_ONLY || ticket->get_type() == MDL_SHARED_NO_READ_WRITE) && ticket->get_ctx()->locked_tables_mode == LTM_LOCK_TABLES - this is FTFE or LOCK TABLES

The above means a conflict exactly with BACKUP/FTWRL/FTFE/LT locks. However, IIRC, such connections may hold extra locks (like user-level locks). If you want to detect all conflicts for such connections, the suggestion in my previous comment is the way to go.
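
For concreteness, a sketch of that namespace/type classification, assuming the MDL types from MariaDB's sql/mdl.h; the helper names are hypothetical, and the locked_tables_mode check is taken from the holder THD here rather than the MDL context:

#include "mdl.h"        // MDL_key, MDL_ticket, MDL_BACKUP_* (MariaDB tree)
#include "sql_class.h"  // THD, LTM_LOCK_TABLES

// Hypothetical helpers mirroring the three conditions above.
static bool is_backup_stage_lock(const MDL_key *key, const MDL_ticket *ticket)
{
  return key->mdl_namespace() == MDL_key::BACKUP &&
         (ticket->get_type() <= MDL_BACKUP_WAIT_COMMIT ||
          ticket->get_type() == MDL_BACKUP_BLOCK_DDL);
}

static bool is_ftwrl_lock(const MDL_key *key, const MDL_ticket *ticket)
{
  return key->mdl_namespace() == MDL_key::BACKUP &&
         (ticket->get_type() == MDL_BACKUP_FTWRL1 ||
          ticket->get_type() == MDL_BACKUP_FTWRL2);
}

static bool is_ftfe_or_lock_tables(const MDL_key *key,
                                   const MDL_ticket *ticket,
                                   const THD *granted_thd)
{
  return key->mdl_namespace() == MDL_key::TABLE &&
         (ticket->get_type() == MDL_SHARED_READ_ONLY ||
          ticket->get_type() == MDL_SHARED_NO_READ_WRITE) &&
         granted_thd->locked_tables_mode == LTM_LOCK_TABLES;
}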

The problem was that the wsrep_handle_mdl_conflict function compared the
thd->lex->sql_command variable for the granted MDL lock.

There are two possible schedules:

    (1) FLUSH TABLES ... FOR EXPORT takes an MDL lock (granted_thd).
        An INSERT from another node is the conflicting operation (request_thd)
        and sees the MDL conflict. Because granted_thd has not executed
        anything else, thd->lex->sql_command == SQLCOM_FLUSH, and this case
        was correctly handled in wsrep_handle_mdl_conflict, i.e. the INSERT
        waits.

    (2) FLUSH TABLES ... FOR EXPORT takes an MDL lock (granted_thd).
        SET SESSION wsrep_sync_wait=0; (granted_thd)
        An INSERT from another node is the conflicting operation (request_thd).

        However, thd->lex->sql_command is not stored in the taken MDL lock.
        As granted_thd is now executing SET, thd->lex->sql_command !=
        SQLCOM_FLUSH, so the BF INSERT aborts it, which means the FTFE is
        also killed and the MDL lock released. This is incorrect, as the FTFE
        has already written a file to the filesystem and cannot really be
        killed (see the sketch below).
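
A tiny sketch of the buggy check behind schedule (2); the function is illustrative, not the actual pre-fix code:

#include "sql_class.h"  // THD, SQLCOM_FLUSH (MariaDB tree)

// Wrong: thd->lex->sql_command reflects the statement the holder is
// executing *now* (after the SET it is SQLCOM_SET_OPTION), not the FLUSH
// that originally took the MDL lock, so the FTFE holder is no longer
// recognized and gets BF-aborted.
static bool holder_looks_like_ftfe(const THD *granted_thd)
{
  return granted_thd->lex->sql_command == SQLCOM_FLUSH;
}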

In this fix, wsrep_handle_mdl_conflict is refactored not to use
thd->lex->sql_command for its decisions. Instead, the connection state is
determined via THD members (see the sketch after this list), e.g.:

    * wsrep_thd_is_toi() || wsrep_thd_is_applying() - ongoing TOI or applier
    * wsrep_thd_is_BF() - thread is brute force
    * wsrep_thd_is_SR() - thread is a streaming-replication thread
    * thd->current_backup_stage != BACKUP_FINISHED - there's an ongoing BACKUP
    * thd->global_read_lock.is_acquired() - ongoing FTWRL
    * thd->locked_tables_mode == LTM_LOCK_TABLES - ongoing FTFE or LOCK TABLES
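
A hedged sketch of how these checks might compose; the helper name, the wait/abort semantics, and the wsrep_thd_is_BF() signature are assumptions rather than the actual patch:

#include "sql_class.h"   // THD (MariaDB tree)
#include "wsrep_thd.h"   // wsrep_thd_is_toi() and friends, assuming tree layout

// Hypothetical helper: should the BF/applier requester wait for the
// granted MDL lock instead of BF-aborting its holder?
static bool wsrep_requester_must_wait(THD *granted_thd)
{
  /* Holder is itself replication-related (TOI, applier, BF, SR);
     those conflicts are resolved by wsrep itself. */
  if (wsrep_thd_is_toi(granted_thd)       ||
      wsrep_thd_is_applying(granted_thd)  ||
      wsrep_thd_is_BF(granted_thd, false) ||
      wsrep_thd_is_SR(granted_thd))
    return false;

  /* Local holder in a state that must not be interrupted:
     BACKUP STAGE, FTWRL, or FTFE / LOCK TABLES. */
  return granted_thd->current_backup_stage != BACKUP_FINISHED ||
         granted_thd->global_read_lock.is_acquired()          ||
         granted_thd->locked_tables_mode == LTM_LOCK_TABLES;
}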