[Bug]: DROP ACCOUNT can fail with commitUnsafe after statement_cu lock escalation #24073

@gouhongshen

Description

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Branch Name

3.0-dev

Commit ID

e6d0a67

Other Environment Information

- Hardware parameters: cloud QA deployment
- OS type: issue reproduced from a mysql client on macOS; server-side logs came from a MatrixOne QA cluster
- Others: internal Grafana/Loki logs are available and were used for the analysis below

Actual Behavior

DROP ACCOUNT can hang for about 15 minutes and then fail with context deadline exceeded / internal error: commitUnsafe.

Observed SQL:

select now();
drop account `79307612_ae09_4fc0_81c3_e3f8b13a9927`;
select now();

Observed timestamps / error:

2026-04-01 11:25:19.104741

ERROR 1105 (HY000): context deadline exceeded
internal error: commitUnsafe

2026-04-01 11:40:21.038359

So the statement spent about 15 minutes before failing.

Expected Behavior

DROP ACCOUNT should complete successfully or fail quickly with a direct and actionable error. Cleaning account-owned internal metric rows should not make the whole account-drop transaction hang and then time out at commit.

Steps to Reproduce

1. Prepare an account that has accumulated a noticeable amount of statement CU metric data and is still subject to concurrent metric writes.
2. From the sys tenant, execute:

   select now();
   drop account `{account_name}`;
   select now();

3. Observe that the statement can block for minutes and then fail with:

   ERROR 1105 (HY000): context deadline exceeded
   internal error: commitUnsafe

Additional information

Key evidence from logs

  1. The second failing attempt started at 2026-04-01 11:25:19. The last visible dropAccount ... sql: log entries for it were:

    delete from mo_catalog.`mo_data_key` where account_id = 255000001;
    delete from mo_catalog.`statement_cu` where account_id = 255000001;
    
  2. The same background transaction then logged:

    2026-04-01 11:26:44.895452 +0000 lockservice/lock_table_local.go:111
    failed to lock on local
    table: 272566
    row count: 4111
    opts: Exclusive-Row-Wait
    error: row level lock is too large that need upgrade to table level lock
    
  3. Earlier in the same transaction, disttae/txn_table.go:743 had already identified table 272566 as statement_cu.

  4. The same background transaction later logged:

    2026-04-01 11:33:51.786435 +0000 frontend/back_status_stmt.go:41
    time of Exec.Run : 8m30.771138389s
    
  5. It finally ended with commit-phase timeout logs:

    2026-04-01 11:40:20.975546 +0000 client/operator.go:1204
    txn send requests failed
    error: context deadline exceeded
    
    2026-04-01 11:40:20.975622 +0000 client/operator.go:1428
    txn wait committed log applied failed in rc mode
    error: context deadline exceeded
    

Code path / root cause analysis

  • pkg/frontend/authenticate.go:3787-3791 opens a background transaction for doDropAccount() and only commits it at finishTxn().

  • pkg/frontend/authenticate.go:3999-4007 iterates cluster tables and executes:

    delete from mo_catalog.`{cluster_table}` where account_id = {account_id};
  • In this reproduction, one of those cluster tables is statement_cu. The failing lock log shows that the delete tried to lock 4111 rows, which was too many for row-level locking and required an upgrade to a table-level lock. That table-level lock was not acquired.

  • sql_statement_cu is a continuously written metric table (pkg/util/metric/mometric/metric_collector.go:316-346) and is marked with [mo_no_del_hint] in schema metadata (pkg/bootstrap/versions/v2_0_0/cluster_upgrade_list.go:264-274), so it is a particularly hot place to do synchronous account cleanup.

  • pkg/frontend/txn.go:503-506 commits with CommitOrRollbackTimeout, and pkg/txn/client/operator.go:1414-1428 waits for committed logtail apply in RC mode. After the long lock wait stretched the background transaction, the final commit hit that timeout and surfaced as context deadline exceeded / commitUnsafe.

Conclusion

The root cause is that DROP ACCOUNT performs hot cluster-table cleanup (statement_cu) inside the same large background transaction as the rest of account deletion. The delete on statement_cu tried to lock 4111 rows, needed a table-level lock, and could not acquire it under concurrent activity. The resulting lock wait stretched the transaction long enough that the final commit timed out.

Possible fix directions

  1. Special-case statement_cu / similar hot metric tables so they are not cleaned synchronously inside the critical DROP ACCOUNT transaction.
  2. If synchronous cleanup is required, acquire an appropriate table-level lock up front or chunk the delete more carefully.
  3. Reduce the scope of the single DROP ACCOUNT transaction so hot cluster-table cleanup does not hold the entire account-drop path hostage.

Labels

kind/bug