[feature](restore) support concurrent backup/restore#61710

Open
Ryan19929 wants to merge 1 commit into apache:master from Ryan19929:backup_restore_concurrency

@Ryan19929
Contributor

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:
Doris currently only allows one backup/restore job per database at a time, which becomes a bottleneck in CCR scenarios where dozens of tables need concurrent synchronization.

This PR implements table-level backup/restore concurrency control (gated by enable_table_level_backup_concurrency, default false):

  1. Dual-queue scheduling: Splits the single job queue into a Running queue (active jobs) and a History queue (finished jobs). New jobs enter the PENDING state; the scheduler promotes a PENDING job by adding its ID to allowedJobIds once the scheduling rules below permit it to run. Jobs that stay queued longer than backup_pending_job_timeout_ms are auto-cancelled.

  2. OOM protection: Extends max_backup_tablets_per_job to RestoreJob and adds max_concurrent_snapshot_tasks_total, a global cap on snapshot tasks across all concurrent jobs.

  3. CANCEL label filter: Supports CANCEL BACKUP/RESTORE WHERE LABEL = 'xxx' and WHERE LABEL LIKE 'xxx%' to cancel specific jobs. Without a WHERE clause, the statement cancels all unfinished jobs of that type in the database (a behavior change in concurrency mode only; legacy mode is unchanged).

  4. Observability: Adds QueuePos and BlockReason columns to SHOW BACKUP/RESTORE.

  5. CCR compatibility: Derives tableRefs from BackupJobInfo when RPC requests omit table_refs, so that table-level concurrency control still sees the correct tables.
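
To make the dual-queue idea in item 1 concrete, here is a minimal Python model of it. It is not the PR's Java code: the `Job` and `JobQueues` classes and their method names are hypothetical, and it assumes that PENDING jobs sit in the Running queue alongside active ones and that a timed-out job is moved to the History queue as cancelled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Job:
    job_id: int
    label: str
    created_ms: int
    state: str = "PENDING"  # every new job starts as PENDING


@dataclass
class JobQueues:
    pending_timeout_ms: int  # plays the role of backup_pending_job_timeout_ms
    running: Dict[int, Job] = field(default_factory=dict)  # unfinished jobs
    history: List[Job] = field(default_factory=list)       # finished jobs
    allowed_job_ids: Set[int] = field(default_factory=set)

    def submit(self, job: Job) -> None:
        # Assumption: PENDING jobs are kept in the Running queue.
        self.running[job.job_id] = job

    def promote(self, job_id: int) -> None:
        # The scheduler lets a job execute by adding its ID to the allowed set.
        self.allowed_job_ids.add(job_id)
        self.running[job_id].state = "RUNNING"

    def expire_pending(self, now_ms: int) -> List[int]:
        # Cancel jobs that have been PENDING for longer than the timeout.
        expired = [
            j for j in self.running.values()
            if j.state == "PENDING"
            and now_ms - j.created_ms > self.pending_timeout_ms
        ]
        for job in expired:
            job.state = "CANCELLED"
            del self.running[job.job_id]
            self.history.append(job)
        return [j.job_id for j in expired]
```

Calling `expire_pending` from the periodic scheduler tick keeps the timeout check in one place; promoted jobs are never expired because their state is no longer PENDING.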

Scheduling Rules:

| Rule | Condition | Behavior |
| --- | --- | --- |
| Concurrency limit | runningBackups + runningRestores >= max_backup_restore_concurrent_num_per_db | Block (PENDING) |
| Backup/Restore mutual exclusion | Backup cannot activate while any restore is running, and vice versa | Block (PENDING) |
| Full-db backup exclusivity | Full-db backup already pending/running → new backup submitted | Reject |
| Full-db backup waiting | Full-db backup submitted while table-level backups are running | Block (PENDING) |
| Full-db restore exclusivity | Full-db restore running → other restores submitted | Block (PENDING) |
| Restore table conflict | Two restores targeting the same table | Block (PENDING) |

Design Note: Full-database backup submissions are hard-rejected when one already exists, because the backup snapshot is reusable — any subsequent backup would be redundant. Full-database restore submissions are queued (PENDING) instead, because each restore may target a different snapshot and has independent business value — they just cannot run concurrently.
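
The scheduling rules above can be read as a single decision function. The following Python sketch is one possible reading, not the PR's implementation: `Job`, `Decision`, and `decide` are hypothetical names, the order in which the rules are checked is an assumption, and so is the choice to treat a full-database restore as conflicting with any running table-level restore.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, List, Optional, Tuple


class Decision(Enum):
    RUN = "RUN"
    PENDING = "PENDING"
    REJECT = "REJECT"


@dataclass(frozen=True)
class Job:
    kind: str                                # "backup" or "restore"
    tables: Optional[FrozenSet[str]] = None  # None models a full-database job
    running: bool = False                    # False models a PENDING job


def decide(new: Job, existing: List[Job], max_concurrent: int) -> Tuple[Decision, str]:
    running = [j for j in existing if j.running]

    # Full-db backup exclusivity: the only rule that rejects outright.
    if new.kind == "backup" and any(
        j.kind == "backup" and j.tables is None for j in existing
    ):
        return Decision.REJECT, "full-db backup already pending or running"

    # Concurrency limit: running backups plus running restores.
    if len(running) >= max_concurrent:
        return Decision.PENDING, "concurrency limit reached"

    # Backup/restore mutual exclusion.
    if any(j.kind != new.kind for j in running):
        return Decision.PENDING, "opposite job type is running"

    if new.kind == "backup":
        # Full-db backup waits for running table-level backups.
        if new.tables is None and any(j.tables is not None for j in running):
            return Decision.PENDING, "table-level backups are running"
    else:
        for j in running:
            if j.tables is None:
                return Decision.PENDING, "full-db restore is running"
            # Assumption: a new full-db restore overlaps every table.
            if new.tables is None or new.tables & j.tables:
                return Decision.PENDING, "restore table conflict"

    return Decision.RUN, ""
```

The returned string is the kind of value a BlockReason column could display, which is why the function returns a reason alongside the decision.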

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
      Concurrency mode only: CANCEL BACKUP/RESTORE without a WHERE clause cancels all unfinished jobs of that type in the database, instead of only the single running job. This applies only when enable_table_level_backup_concurrency = true; legacy mode behavior is unchanged.
  • Does this need documentation?

    • No.
    • Yes.
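
As an illustration of the CANCEL behavior change listed above, the sketch below models how unfinished jobs could be selected in concurrency mode. It is a simplified Python model with hypothetical function names, not Doris code: it handles only the `%` and `_` wildcards of LIKE, ignores escape characters, and assumes the input already contains only unfinished jobs of the requested type.

```python
import re
from typing import List, Optional


def like_to_regex(pattern: str) -> re.Pattern:
    # Translate SQL LIKE wildcards: % matches any run, _ matches one character.
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def select_jobs_to_cancel(
    unfinished_labels: List[str],
    equals: Optional[str] = None,
    like: Optional[str] = None,
) -> List[str]:
    if equals is not None:          # WHERE LABEL = '...'
        return [label for label in unfinished_labels if label == equals]
    if like is not None:            # WHERE LABEL LIKE '...'
        rx = like_to_regex(like)
        return [label for label in unfinished_labels if rx.fullmatch(label)]
    return list(unfinished_labels)  # no WHERE clause: every unfinished job
```

For example, with labels `ccr_t1`, `ccr_t2`, and `nightly`, the pattern `ccr_%` selects the first two, while omitting both filters selects all three.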

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@Ryan19929
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 26665 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 11956c43f03ea889a8b1376852161926fd45c02e, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17587	4504	4296	4296
q2	q3	10643	775	524	524
q4	4679	361	249	249
q5	7683	1213	1000	1000
q6	185	176	150	150
q7	802	843	706	706
q8	10096	1481	1399	1399
q9	5130	4719	4660	4660
q10	6320	1903	1618	1618
q11	466	257	242	242
q12	760	581	465	465
q13	18037	2687	1949	1949
q14	223	234	217	217
q15	q16	736	749	668	668
q17	734	855	477	477
q18	6134	5387	5179	5179
q19	1366	985	606	606
q20	563	510	373	373
q21	4512	1884	1596	1596
q22	423	314	291	291
Total cold run time: 97079 ms
Total hot run time: 26665 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4796	4679	4651	4651
q2	q3	3933	4347	3790	3790
q4	900	1213	850	850
q5	4083	4387	4347	4347
q6	200	181	148	148
q7	1821	1706	1615	1615
q8	2493	2739	2550	2550
q9	7418	7435	7441	7435
q10	3752	4005	3611	3611
q11	515	441	413	413
q12	523	594	467	467
q13	2489	2980	2068	2068
q14	292	316	282	282
q15	q16	726	799	713	713
q17	1404	1420	1367	1367
q18	7367	6953	6610	6610
q19	992	970	957	957
q20	2045	2164	1984	1984
q21	4082	3645	3377	3377
q22	445	451	378	378
Total cold run time: 50276 ms
Total hot run time: 47613 ms

@doris-robot

TPC-DS: Total hot run time: 169216 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 11956c43f03ea889a8b1376852161926fd45c02e, data reload: false

query5	4328	642	511	511
query6	335	236	206	206
query7	4206	468	260	260
query8	330	241	228	228
query9	8693	2765	2726	2726
query10	514	389	339	339
query11	6993	5085	4865	4865
query12	180	131	124	124
query13	1277	460	331	331
query14	5742	3738	3529	3529
query14_1	2908	2863	2796	2796
query15	202	191	175	175
query16	972	443	453	443
query17	901	722	594	594
query18	2421	433	338	338
query19	230	204	181	181
query20	139	121	125	121
query21	217	133	109	109
query22	13342	14133	15084	14133
query23	16714	16421	16086	16086
query23_1	16018	15619	15662	15619
query24	7147	1620	1225	1225
query24_1	1230	1191	1253	1191
query25	541	469	404	404
query26	1236	263	151	151
query27	2772	483	294	294
query28	4513	1871	1867	1867
query29	864	577	477	477
query30	296	229	189	189
query31	990	958	875	875
query32	84	70	67	67
query33	506	335	286	286
query34	885	878	522	522
query35	666	677	599	599
query36	1068	1104	997	997
query37	140	98	86	86
query38	2945	2954	2876	2876
query39	862	821	810	810
query39_1	792	794	800	794
query40	232	156	140	140
query41	62	59	59	59
query42	258	256	254	254
query43	250	251	223	223
query44	
query45	197	193	182	182
query46	896	995	633	633
query47	2086	2150	2012	2012
query48	316	316	230	230
query49	630	468	392	392
query50	707	285	215	215
query51	4115	4117	4057	4057
query52	262	266	262	262
query53	296	342	286	286
query54	310	279	272	272
query55	97	91	86	86
query56	330	329	320	320
query57	1965	1874	1696	1696
query58	282	281	273	273
query59	2792	2960	2751	2751
query60	338	336	335	335
query61	156	150	154	150
query62	641	598	540	540
query63	313	283	280	280
query64	5127	1411	1115	1115
query65	
query66	1496	485	373	373
query67	24260	24362	24207	24207
query68	
query69	435	321	298	298
query70	1000	1001	955	955
query71	350	318	311	311
query72	2991	2883	2538	2538
query73	554	561	325	325
query74	9639	9571	9377	9377
query75	2866	2742	2465	2465
query76	2285	1030	702	702
query77	367	423	308	308
query78	10919	11175	10420	10420
query79	1300	779	571	571
query80	1297	626	544	544
query81	547	261	226	226
query82	1024	153	121	121
query83	333	269	250	250
query84	305	124	103	103
query85	928	502	469	469
query86	413	312	295	295
query87	3168	3161	2985	2985
query88	3577	2699	2673	2673
query89	445	381	352	352
query90	2017	191	185	185
query91	170	164	141	141
query92	76	75	71	71
query93	1034	883	503	503
query94	648	286	300	286
query95	593	334	387	334
query96	649	534	224	224
query97	2430	2503	2402	2402
query98	235	219	222	219
query99	1010	997	903	903
Total cold run time: 250448 ms
Total hot run time: 169216 ms
