-
Notifications
You must be signed in to change notification settings - Fork 0
MDEV-38218 : Galera test failure on galera_bf_abort_flush_for_export #552
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: 10.11
Are you sure you want to change the base?
Conversation
hemantdangi-gc
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Problem is that FLUSH TABLES FOR EXPORT is a local operation (i.e
it is not replicated by Galera) but it takes MDL-lock. This
MDL-lock then can conflict with INSERT from other node causing
INSERT to be BF aborted. This depends on timing, if we
have enough time to find that INSERT is waiting MDL-lock
we do UNLOCK TABLES fast enough and avoid BF abort. If not
there will be BF-abort.
Test case is fixed so that no query about number of BF aborts
is counted as it is not stable.
The above two are optimization but not related to error. The error is for query 'SET SESSION wsrep_sync_wait = 0' and not for INSERT :
mysqltest: At line 19: query 'SET SESSION wsrep_sync_wait = 0' failed: ER_QUERY_INTERRUPTED (1317): Query execution was interrupted
The additional error priniting and warning will give more information for issue, but don't think thses changes resolves issue.
f1cdd9b to
8a8f690
Compare
hemantdangi-gc
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks better I have one doubt regarding requested thread whther that will be also create problem or not otherwise looks good.
This will also need server review as you have changes to MDL_ticket class.
sql/wsrep_mysqld.cc
Outdated
| /* These cases should be already being handled above */ | ||
| DBUG_ASSERT(granted_sql_command != SQLCOM_FLUSH && | ||
| granted_sql_command != SQLCOM_LOCK_TABLES && | ||
| request_thd->lex->sql_command != SQLCOM_DROP_TABLE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar issue can happen with request_thd->lex->sql_command?
svoj
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We normally determine connection state via THD members. E.g.:
thd->current_backup_stage != BACKUP_FINISHED - there's ongoing BACKUP
thd->global_read_lock.is_acquired() - ongoing FTWRL
thd->locked_tables_mode == LTM_LOCK_TABLES - ongoing FTFE or LOCK TABLES
MDL subsystem can be used to retrieve some of this information too, but I don't think it is right to treat it as a primary source for THD state.
Also granted_sql_command == SQLCOM_LOCK_TABLES is it a dead branch? IIUC it must have mdl_context.has_explicit_locks(), so it should hit previous branch?
|
@svoj Thank you for your review, I think I will remove the added field from MDL ticket and rewrite Galera MDL-conflict resolution using your suggestions. |
|
@janlindstrom thinking more on this, I think a more reliable way to determine your needs would be checking ticket namespace and type. Though using what I suggested above is most probably alright, I'd leave it up to you to decide which way you want to go.
The above means conflict exactly with BACKUP/FTWRL/FTFE/LT locks. However, IIRC, such connections may hold extra locks (like user level locks). If you want to detect all conflicts for such connections, suggestion in my previous comment is the way to go. |
8a8f690 to
4df11e7
Compare
Problem was in wsrep_handle_mdl_conflict function was comparing
thd->lex->sql_command variable for granted MDL-lock.
There is two possible schedules:
(1) FLUSH TABLES ... FOR EXPORT that will take MDL-lock (granted_thd).
INSERT from other node is conflicting operation (request_thd)
and sees MDL-conflict. Because granted_thd has not executed anything
else thd->lex->sql_command == SQLCOM_FLUSH and this case was
correctly handled in wsrep_handle_mdl_conflict i.e. INSERT needs
to wait.
(2) FLUSH TABLES ... FOR EXPORT that will take MDL-lock (granted_thd).
SET SESSION wsrep_sync_wait=0; (granted_thd)
INSERT from other node is conflicting operation (request_thd)
However, thd->lex->sql_command is not stored to taken MDL-lock. Now
as granted_thd is executing SET thd->lex->sql_command != SQLCOM_FLUSH
and INSERT that is BF will abort it and that means also FTFE is
killed and MDL-lock relesed. This is incorrect as FTFE has written
file on filesystem and it can't be really killed.
In this fix wsrep_handle_mdl_conflict is refactored not to use
thd->lex->sql_command as a variable used for decisions. Instead
connection state can be determined also via THD members. E.g.:
* wsrep_thd_is_toi() || wsrep_thd_is_applying - ongoing TOI or applier
* wsrep_thd_is_BF - thread is brute force
* wsrep_thd_is_SR - thread is streaming replication thread
* thd->current_backup_stage != BACKUP_FINISHED - there's ongoing BACKUP
* thd->global_read_lock.is_acquired() - ongoing FTWRL
* thd->locked_tables_mode == LTM_LOCK_TABLES - ongoing FTFE or LOCK TABLES
81d7989 to
2cf2e5e
Compare
Problem is that FLUSH TABLES FOR EXPORT is a local operation (i.e it is not replicated by Galera) but it takes MDL-lock. This MDL-lock then can conflict with INSERT from other node causing INSERT to be BF aborted. This depends on timing, if we have enough time to find that INSERT is waiting MDL-lock we do UNLOCK TABLES fast enough and avoid BF abort. If not there will be BF-abort.
Test case is fixed so that no query about number of BF aborts is counted as it is not stable. Furthermore, improved error printing and added warning when query is interrupted and there is error in wsrep layer.