[fix](cloud) Fix cloud warm up balance slow scheduling (#58962) #59336
Closed
deardeng wants to merge 1 commit into apache:branch-3.1 from
Conversation
Contributor
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
Collaborator
Author
run buildall
Collaborator
Author
run feut
Currently, when performing tablet warm-up balancing in the cloud, warm-up tasks are executed sequentially, one at a time, which leads to a series of problems:
1. When scaling up a compute group by adding BE nodes, with a large number of tables (millions of tablets), actual tests showed that scaling from 1 BE node to 10 BE nodes took more than 6 hours to reach a balanced state, since each warm-up task RPC took about 30 ms. This means that even if a new node can handle the load, scaling out a new node in the cloud can still take up to 6 hours in the worst case.
2. For the same reason, decommissioning a BE node is also relatively slow.
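The 6-hour figure is consistent with a back-of-the-envelope estimate. This is only a rough sketch: the ~30 ms per-RPC cost comes from the description above, while the assumption that balancing from 1 to 10 BEs moves roughly 9/10 of the tablets is mine, not the PR's.

```java
// Rough estimate of fully sequential warm-up time (assumptions noted above).
public class SequentialWarmUpEstimate {
    public static void main(String[] args) {
        long tablets = 1_000_000;            // tablet count from the test scenario
        long moved = tablets * 9 / 10;       // assumed fraction leaving the original BE
        double rpcMs = 30.0;                 // observed per-task RPC latency
        double hours = moved * rpcMs / 1000.0 / 3600.0;
        System.out.printf("%.1f hours%n", hours); // prints "7.5 hours"
    }
}
```

A result on the order of 7.5 hours matches the "more than 6 hours" observed in testing, which is why batching the RPCs is the natural fix.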
Fixes:
1. Batch and pipeline warm-up tasks. Each batch can contain multiple warm-up tasks with the same source and destination (each task represents migrating one tablet).
2. Separate the warm-up task finish thread to prevent scheduling logic from affecting the logic that modifies tablet-to-tablet mappings.
3. Asynchronously fetch file cache meta in the warm_up_cache_async logic and add some bvars.
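As a rough illustration of the batching in fix 1, tasks that share a (source BE, destination BE) pair can be grouped so one RPC carries many tablets. All class and method names here are hypothetical, not the actual Doris code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: WarmUpTask/batchBySrcDst are illustrative names only.
public class WarmUpBatcher {
    // One task = migrate the file cache of one tablet from srcBe to dstBe.
    record WarmUpTask(long tabletId, long srcBe, long dstBe) {}

    /** Group tasks that share (srcBe, dstBe) so each RPC carries a whole batch. */
    static Map<String, List<WarmUpTask>> batchBySrcDst(List<WarmUpTask> tasks) {
        Map<String, List<WarmUpTask>> batches = new HashMap<>();
        for (WarmUpTask t : tasks) {
            String key = t.srcBe() + "->" + t.dstBe();
            batches.computeIfAbsent(key, k -> new ArrayList<>()).add(t);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<WarmUpTask> tasks = List.of(
                new WarmUpTask(1, 101, 201),
                new WarmUpTask(2, 101, 201),
                new WarmUpTask(3, 102, 201));
        Map<String, List<WarmUpTask>> batches = batchBySrcDst(tasks);
        // Tablets 1 and 2 share the 101->201 pair, so they travel in one batch.
        System.out.println(batches.size());                 // 2
        System.out.println(batches.get("101->201").size()); // 2
    }
}
```

With ~30 ms per RPC, sending N tablets per batch cuts the RPC count (and the wall-clock time of a fully sequential scheduler) by roughly a factor of N.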
Post-fix testing showed that in a scenario with 10 databases, 10,000 tables, 100,000 partitions, and 1 million tablets, scaling the BE nodes from 3 to 10 reached a balanced state within 10 minutes.
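Fix 2 above, moving finish handling onto a dedicated thread so that applying results never stalls the scheduling loop, could be sketched roughly as follows. The names are illustrative, not Doris's actual classes.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a dedicated thread drains finished warm-up tasks so
// updating the tablet mappings never blocks the scheduler.
public class WarmUpFinishWorker {
    private final BlockingQueue<Long> finishedTablets = new LinkedBlockingQueue<>();
    final AtomicInteger applied = new AtomicInteger();

    // Called from RPC callbacks: just enqueue, O(1), never blocks scheduling.
    void reportFinished(long tabletId) {
        finishedTablets.add(tabletId);
    }

    // Runs on its own thread; only this thread mutates the mappings.
    Thread start() {
        Thread t = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    long tabletId = finishedTablets.take();
                    applied.incrementAndGet(); // stand-in for the mapping update
                }
            } catch (InterruptedException ignored) {
                // shut down quietly
            }
        }, "warm-up-finish");
        t.setDaemon(true);
        t.start();
        return t;
    }
}
```

The key property is that `reportFinished` is non-blocking for callers, while all mapping mutations stay on the single "warm-up-finish" thread, avoiding lock contention with the scheduler.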
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)