@JNSimba
Member

@JNSimba JNSimba commented Jan 14, 2026

What problem does this PR solve?

Related PR: #58898
After a job is created for the first time and starts from the initial offset,
the task for the first split is scheduled. If FE restarts while that task is in the running or failed state,
the split needs to be restored from the metadata again.

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@JNSimba
Member Author

JNSimba commented Jan 14, 2026

run buildall

@Thearas
Contributor

Thearas commented Jan 14, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@JNSimba JNSimba requested a review from Copilot January 14, 2026 11:21
Copilot AI left a comment

Pull request overview

This PR fixes issues in the StreamingJob functionality related to split task scheduling and FE restart scenarios. Specifically, it addresses the problem where after a job's first split task is scheduled from the initial offset, if FE restarts while the task is running or failed, the split needs to be properly restored from metadata.

Changes:

  • Enhanced logging to include complete split information
  • Improved error handling for split deserialization failures
  • Added logic to handle offset provider replay when FE restarts with an empty persist state but existing metadata
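
The restart case in the last bullet can be illustrated with a minimal sketch. It is not the code from this PR: the class name `OffsetReplaySketch`, the method `replayRemainingSplits`, and the use of plain strings as splits are all hypothetical, and the real `JdbcSourceOffsetProvider` may structure its replay differently. The sketch only shows the decision being described: use the persisted state when it exists, otherwise fall back to the splits recorded in metadata.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; none of these names come from the Doris code base.
class OffsetReplaySketch {

    /**
     * Decides which splits still need scheduling after an FE restart.
     *
     * @param persistedSplits remaining splits saved with the job, or null if
     *                        nothing was persisted before the restart
     * @param metaSplits      splits recorded in the job's metadata
     */
    static List<String> replayRemainingSplits(List<String> persistedSplits,
                                              List<String> metaSplits) {
        if (persistedSplits != null) {
            // Normal replay: the persisted state is authoritative.
            return new ArrayList<>(persistedSplits);
        }
        if (metaSplits != null && !metaSplits.isEmpty()) {
            // Restart while the first task was running or failed:
            // nothing was persisted yet, so rebuild from metadata.
            return new ArrayList<>(metaSplits);
        }
        // Brand-new job: nothing to restore.
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        List<String> meta = Arrays.asList("split-0", "split-1");
        System.out.println(replayRemainingSplits(null, meta));
        System.out.println(replayRemainingSplits(Arrays.asList("split-1"), meta));
    }
}
```

Returning a copy in each branch keeps the caller from mutating the metadata list by accident; whether the real provider needs that isolation is an assumption.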

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| MySqlSourceReader.java | Changed logging to output full split information instead of just split ID |
| StreamingJobUtils.java | Added exception handling for split deserialization and changed exception type to JobException |
| JdbcSourceOffsetProvider.java | Restructured replay logic to handle FE restart scenarios where offsetProviderPersist is null but metadata exists |
| StreamingInsertJob.java | Simplified condition for replaying offset provider to always attempt replay when provider exists |
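
The deserialization change listed for StreamingJobUtils.java follows a common pattern: catch the low-level parsing failure and rethrow it as a job-level checked exception that keeps the original cause. The sketch below is only an illustration under assumptions. The nested `JobException` is a local stand-in, not the Doris class of the same name, and the `"splitId:startOffset"` text format and `parseStartOffset` helper are invented for the example.

```java
// Hypothetical sketch; the split format and helper names are assumptions.
class SplitDeserializeSketch {

    /** Local stand-in for a job-level checked exception. */
    static class JobException extends Exception {
        JobException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Parses the start offset from an assumed "splitId:startOffset" string.
     * Any runtime parsing failure is wrapped so callers handle one type.
     */
    static long parseStartOffset(String raw) throws JobException {
        try {
            int sep = raw.lastIndexOf(':');
            return Long.parseLong(raw.substring(sep + 1));
        } catch (RuntimeException e) {
            // Keep the cause so the log shows why deserialization failed.
            throw new JobException("Failed to deserialize split: " + raw, e);
        }
    }

    public static void main(String[] args) {
        try {
            System.out.println(parseStartOffset("split-0:42"));
            parseStartOffset("split-0:oops");
        } catch (JobException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Including the raw split text in the message mirrors the PR's move toward logging complete split information rather than only the split ID.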

@JNSimba
Member Author

JNSimba commented Jan 14, 2026

run buildall

@doris-robot

TPC-H: Total hot run time: 31978 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit e88c2e51608a8f70c2b412016b3ba2445a638dc7, data reload: false

------ Round 1 ----------------------------------
q1	17632	4338	4071	4071
q2	2024	358	266	266
q3	10123	1294	715	715
q4	10204	780	301	301
q5	7539	2110	1803	1803
q6	192	166	138	138
q7	953	793	670	670
q8	9274	1420	1139	1139
q9	4836	4640	4672	4640
q10	6789	1803	1376	1376
q11	524	301	281	281
q12	669	720	564	564
q13	17815	3812	3100	3100
q14	285	316	269	269
q15	577	506	510	506
q16	695	680	633	633
q17	648	774	567	567
q18	6725	6376	6832	6376
q19	1396	1084	714	714
q20	444	385	255	255
q21	3196	2710	2533	2533
q22	1101	1112	1061	1061
Total cold run time: 103641 ms
Total hot run time: 31978 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4338	4241	4386	4241
q2	316	412	320	320
q3	2220	2902	2426	2426
q4	1422	1862	1483	1483
q5	4732	4349	4330	4330
q6	215	172	129	129
q7	2007	1880	1794	1794
q8	2583	2426	2456	2426
q9	7063	7233	7372	7233
q10	2555	2661	2309	2309
q11	540	478	455	455
q12	636	672	571	571
q13	3389	3816	3124	3124
q14	269	293	261	261
q15	524	493	482	482
q16	623	667	622	622
q17	1157	1331	1368	1331
q18	7517	7305	7207	7207
q19	847	803	807	803
q20	1909	1958	1798	1798
q21	4475	4235	4027	4027
q22	1069	1016	980	980
Total cold run time: 50406 ms
Total hot run time: 48352 ms

@doris-robot

TPC-DS: Total hot run time: 173145 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit e88c2e51608a8f70c2b412016b3ba2445a638dc7, data reload: false

query5	4416	649	468	468
query6	333	233	218	218
query7	4222	466	276	276
query8	346	260	239	239
query9	8724	2864	2889	2864
query10	526	381	324	324
query11	15137	15087	14854	14854
query12	180	118	116	116
query13	1271	510	403	403
query14	6427	3041	2779	2779
query14_1	2724	2769	2684	2684
query15	199	196	176	176
query16	975	475	462	462
query17	1113	690	586	586
query18	2618	439	341	341
query19	231	221	195	195
query20	121	115	114	114
query21	212	138	120	120
query22	3764	3882	3901	3882
query23	16065	15645	15329	15329
query23_1	15457	15494	15402	15402
query24	7136	1550	1180	1180
query24_1	1173	1158	1170	1158
query25	569	476	427	427
query26	1236	273	159	159
query27	2752	452	274	274
query28	4561	2128	2123	2123
query29	805	544	453	453
query30	315	238	215	215
query31	826	599	559	559
query32	82	79	75	75
query33	546	393	303	303
query34	892	881	520	520
query35	738	763	681	681
query36	864	879	856	856
query37	131	102	88	88
query38	2712	2659	2686	2659
query39	780	736	726	726
query39_1	713	721	722	721
query40	216	135	118	118
query41	66	62	68	62
query42	104	100	98	98
query43	429	468	424	424
query44	1353	737	733	733
query45	184	183	175	175
query46	833	922	573	573
query47	1434	1556	1338	1338
query48	312	338	236	236
query49	609	414	348	348
query50	639	282	207	207
query51	3791	3881	3733	3733
query52	97	108	94	94
query53	289	321	268	268
query54	290	285	291	285
query55	83	79	75	75
query56	302	311	316	311
query57	1012	1053	894	894
query58	269	266	250	250
query59	2115	2028	1955	1955
query60	330	338	314	314
query61	156	157	155	155
query62	383	343	305	305
query63	295	266	268	266
query64	4884	1286	968	968
query65	3808	3705	3737	3705
query66	1398	462	316	316
query67	15148	14979	15771	14979
query68	4911	1020	712	712
query69	509	366	331	331
query70	1012	843	940	843
query71	349	312	283	283
query72	5692	3334	3372	3334
query73	766	719	315	315
query74	8903	8778	8604	8604
query75	2840	2805	2458	2458
query76	3459	1072	664	664
query77	522	378	308	308
query78	9712	9826	9210	9210
query79	1604	893	576	576
query80	670	577	485	485
query81	512	263	231	231
query82	214	139	107	107
query83	255	256	248	248
query84	256	126	103	103
query85	910	520	462	462
query86	373	294	283	283
query87	2903	2894	2717	2717
query88	3465	2581	2571	2571
query89	384	352	325	325
query90	2024	167	166	166
query91	172	163	141	141
query92	82	71	69	69
query93	980	925	530	530
query94	563	327	320	320
query95	601	337	313	313
query96	649	504	226	226
query97	2320	2416	2304	2304
query98	226	206	198	198
query99	594	584	524	524
Total cold run time: 249946 ms
Total hot run time: 173145 ms

@doris-robot

ClickBench: Total hot run time: 26.75 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit e88c2e51608a8f70c2b412016b3ba2445a638dc7, data reload: false

query1	0.05	0.05	0.05
query2	0.10	0.04	0.04
query3	0.26	0.09	0.08
query4	1.60	0.11	0.11
query5	0.27	0.25	0.25
query6	1.16	0.68	0.64
query7	0.03	0.03	0.02
query8	0.05	0.04	0.03
query9	0.57	0.51	0.51
query10	0.56	0.53	0.54
query11	0.15	0.09	0.10
query12	0.15	0.11	0.11
query13	0.60	0.58	0.60
query14	0.96	0.94	0.94
query15	0.79	0.77	0.79
query16	0.39	0.38	0.38
query17	1.06	1.03	1.07
query18	0.22	0.21	0.21
query19	1.95	1.78	1.82
query20	0.02	0.03	0.01
query21	15.46	0.27	0.14
query22	5.34	0.04	0.04
query23	15.90	0.28	0.09
query24	1.01	0.60	0.35
query25	0.09	0.06	0.06
query26	0.15	0.13	0.13
query27	0.06	0.06	0.05
query28	3.48	1.08	0.89
query29	12.54	3.86	3.14
query30	0.28	0.14	0.12
query31	2.80	0.63	0.39
query32	3.24	0.56	0.47
query33	3.03	3.07	3.05
query34	16.05	5.06	4.41
query35	4.47	4.51	4.44
query36	0.66	0.50	0.48
query37	0.11	0.07	0.06
query38	0.07	0.04	0.04
query39	0.05	0.03	0.03
query40	0.17	0.13	0.13
query41	0.09	0.03	0.03
query42	0.04	0.02	0.03
query43	0.05	0.04	0.03
Total cold run time: 96.08 s
Total hot run time: 26.75 s

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 0.00% (0/32) 🎉
Increment coverage report
Complete coverage report

@JNSimba
Member Author

JNSimba commented Jan 14, 2026

run cloud_p0

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/32) 🎉
Increment coverage report
Complete coverage report

Contributor

@smallhibiscus smallhibiscus left a comment

LGTM

@github-actions
Contributor

PR approved by anyone and no changes requested.

@dataroaring
Contributor

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

Contributor

@sollhui sollhui left a comment

LGTM

@github-actions github-actions bot added the approved label Jan 15, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@JNSimba JNSimba merged commit de15d81 into apache:master Jan 15, 2026
32 of 34 checks passed
github-actions bot pushed a commit that referenced this pull request Jan 15, 2026
… remainsplit relay problem (#59883)

### What problem does this PR solve?
 
Related PR:  #58898
After a job is created for the first time and starts from the initial
offset, the task for the first split is scheduled. If FE restarts while
that task is in the running or failed state, the split needs to be
restored from the metadata again.
yiguolei pushed a commit that referenced this pull request Jan 15, 2026
…d fe restart remainsplit relay problem #59883 (#59902)

Cherry-picked from #59883

Co-authored-by: wudi <wudi@selectdb.com>

Labels

approved, dev/4.0.3-merged, reviewed

9 participants