[fix](cloud) Fix cloud warm up balance slow scheduling #58962
Merged — gavinchou merged 2 commits into apache:master on Dec 24, 2025
Conversation
Collaborator (Author) commented: run buildall
Force-pushed from ad2b9c6 to 06299fa (Compare)
BE UT Coverage Report: Increment line coverage / Increment coverage report
Collaborator (Author) commented: run buildall
Force-pushed from bb9e649 to 88ab2db (Compare)
Collaborator (Author) commented: run buildall
TPC-H: Total hot run time: 35190 ms
TPC-DS: Total hot run time: 178254 ms
ClickBench: Total hot run time: 27.52 s
BE UT Coverage Report: Increment line coverage / Increment coverage report
Contributor commented: BE Regression && UT Coverage Report: Increment line coverage / Increment coverage report
Contributor commented: FE Regression Coverage Report: Increment line coverage
gavinchou approved these changes on Dec 23, 2025
Contributor commented: PR approved by at least one committer and no changes requested.
Contributor commented: PR approved by anyone and no changes requested.
deardeng added a commit to deardeng/incubator-doris that referenced this pull request on Dec 24, 2025
deardeng added a commit to deardeng/incubator-doris that referenced this pull request on Dec 24, 2025
yiguolei pushed a commit that referenced this pull request on Dec 25, 2025
Contributor commented: branch-3.1: #59336
What problem does this PR solve?
Currently, when performing tablet warm-up balancing in the cloud, warm-up tasks are executed sequentially, one at a time, which leads to a series of problems:
1. When scaling out a compute group by adding BE nodes, with a large number of tables (millions of tablets), actual tests showed that scaling from 1 BE node to 10 BE nodes took more than 6 hours to reach a balanced state. Each warm-up task RPC takes about 30 ms, so at this scale the per-tablet RPC latency alone adds up to several hours. This means that even if a new node can handle the load, scaling out a new node in the cloud can still take up to 6 hours in the worst case.
2. Due to the same logic, decommissioning a BE is also relatively slow.
Fixes:
1. Batch and pipeline warm-up tasks. Each batch can contain multiple warm-up tasks with the same source and destination BE (each task migrates one tablet); a sketch of this batching is given after this list.
2. Move warm-up task completion handling to a separate thread, so that the scheduling logic does not interfere with the logic that updates tablet-to-BE mappings.
3. Asynchronously fetch file cache metadata in the warm_up_cache_async logic, and add some bvars.
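As an illustration of fixes 1 and 2 only (not the actual Doris FE balancer code), here is a minimal Java sketch of grouping warm-up tasks by (source BE, destination BE), dispatching them as pipelined batches, and handling completions on a dedicated thread. The class and method names (`WarmUpBatchScheduler`, `WarmUpTask`, `sendWarmUpBatch`, `onBatchFinished`), the batch size, and the thread-pool sizes are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: batch warm-up tasks by (source BE, destination BE) and
// dispatch each batch asynchronously instead of issuing one RPC per tablet.
public class WarmUpBatchScheduler {
    // One warm-up task = warm up the cache of one tablet from srcBe onto destBe.
    record WarmUpTask(long tabletId, long srcBe, long destBe) {}

    record Route(long srcBe, long destBe) {}

    private static final int MAX_BATCH_SIZE = 200; // hypothetical cap per RPC

    private final ExecutorService rpcPool = Executors.newFixedThreadPool(8);            // pipelined senders
    private final ExecutorService finishPool = Executors.newSingleThreadExecutor();     // separate finish thread

    public void schedule(List<WarmUpTask> pendingTasks) {
        // 1. Group tasks that share the same source and destination BE.
        Map<Route, List<WarmUpTask>> byRoute = new HashMap<>();
        for (WarmUpTask t : pendingTasks) {
            byRoute.computeIfAbsent(new Route(t.srcBe(), t.destBe()), k -> new ArrayList<>()).add(t);
        }
        // 2. Split each group into bounded batches and send them in parallel, so many
        //    tablets share one RPC round trip instead of paying ~30 ms each.
        for (Map.Entry<Route, List<WarmUpTask>> e : byRoute.entrySet()) {
            Route route = e.getKey();
            List<WarmUpTask> tasks = e.getValue();
            for (int i = 0; i < tasks.size(); i += MAX_BATCH_SIZE) {
                List<WarmUpTask> batch =
                        new ArrayList<>(tasks.subList(i, Math.min(i + MAX_BATCH_SIZE, tasks.size())));
                rpcPool.submit(() -> {
                    sendWarmUpBatch(route, batch); // hypothetical RPC wrapper
                    // 3. Hand completion off to a dedicated thread so that updating the
                    //    tablet placement state never blocks the dispatch loop.
                    finishPool.submit(() -> onBatchFinished(batch));
                });
            }
        }
    }

    private void sendWarmUpBatch(Route route, List<WarmUpTask> batch) {
        // Placeholder for the actual warm-up RPC to the destination BE.
        System.out.printf("warm up %d tablets from be %d to be %d%n",
                batch.size(), route.srcBe(), route.destBe());
    }

    private void onBatchFinished(List<WarmUpTask> batch) {
        // Placeholder: mark tablets as warmed up / update tablet-to-BE mapping.
    }
}
```

In this sketch the bounded batch size keeps a single RPC from carrying too many tablets, and the single-threaded finish executor serializes mapping updates without stalling batch dispatch; the real implementation's batching and threading details may differ.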
Post-fix testing showed that, in a scenario with 10 databases, 10,000 tables, 100,000 partitions, and 1 million tablets, scaling the compute group from 3 BE nodes to 10 reached a balanced state within 10 minutes.
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merges this PR)