Optimize scheduler for chunk prefill #7466

liyonghua0910 wants to merge 1 commit into PaddlePaddle:release/2.6
Conversation
PaddlePaddle-bot left a comment:

Thanks for your contribution!

🤖 AI Code Review
2026-04-17 17:29 CST

## 📋 Review Summary

PR overview: Optimizes the scheduling strategy for the chunk prefill scenario. The block-allocation check for chunk prefill requests in the running queue now uses a more conservative threshold (`_get_can_schedule_prefill_threshold_block`), and the preemption logic for chunk prefill is removed; when resources are insufficient, the request is skipped directly and scheduling of the WAITING queue is blocked.
Changed files: engine/sched/resource_manager_v1.py
Impact tag: Scheduler

### 📝 PR Convention Check

The PR title is missing the required Tag label, and the Motivation / Modifications descriptions are empty; please fill them in.

Suggested title (copy-paste ready):

- [Scheduler][Optimization] Optimize scheduler for chunk prefill

Description template (copy-paste ready):

## Motivation
Optimize the scheduling strategy for the chunk prefill scenario: use a more conservative block-threshold check and remove the preemption mechanism for chunk prefill, avoiding unnecessary preemption overhead and improving overall scheduling efficiency.

## Modifications
1. Add a `chunk_prefill_in_running_not_satisfied` flag marking the state where a chunk prefill in the running queue cannot be satisfied.
2. Change the block-allocation check for chunk prefill from `num_new_block` to the conservative threshold returned by `_get_can_schedule_prefill_threshold_block`.
3. Remove the `_trigger_preempt` call in the chunk prefill scenario; break directly when resources are insufficient.
4. Block scheduling of the WAITING queue when chunk prefill cannot be satisfied.

### Issues

| Level | File | Summary |
|------|------|------|
| ❓ Question | resource_manager_v1.py:929 | `break` skips all subsequent requests in the running queue (including decode); confirm whether this affects decode latency |
| 🟡 Suggestion | resource_manager_v1.py:920 | The threshold check differs from the actual allocation count; add a comment explaining the design intent |

### Overall Assessment

The direction of this scheduling optimization is sound: it avoids unnecessary preemption overhead in the chunk prefill scenario. However, with preemption removed, decode requests queued behind a chunk prefill in the running queue are also skipped, which may increase decode latency under high load; the author should confirm the impact of this behavior change in real workloads.
```python
            # Prepare prefill task
            scheduled_reqs.append(self._prepare_prefill_task(request, num_new_tokens))
        else:  # Not enough blocks to allocate
            chunk_prefill_in_running_not_satisfied = True
```
❓ Question: `break` skips scheduling of all subsequent requests in the running queue (including decode requests)

When a chunk prefill lacked resources, the old code first tried `_trigger_preempt`; on successful preemption it could continue scheduling subsequent requests, and it only broke when preemption also failed.
Breaking directly now means that if a chunk prefill request in the running queue fails the `can_schedule_block_num_threshold` check (which is more conservative than the actually required `num_new_block`), the decode requests queued behind it are also skipped for that scheduling round.
Under high concurrency with tight resources, could this increase decode request latency? Is there online benchmark data validating the impact of this behavior change?
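The behavior change questioned above can be sketched as follows. This is an illustrative stand-in for the scheduling loop, not the actual `resource_manager_v1.py` API; the `Request` class and `schedule_running` function are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    is_chunk_prefill: bool


def schedule_running(running_queue, can_allocate):
    """New behavior: on a chunk-prefill block shortage, break immediately
    (no preemption attempt), so every later request in the running queue
    is skipped for this scheduling step."""
    scheduled = []
    for req in running_queue:
        if req.is_chunk_prefill and not can_allocate(req):
            break  # decode requests after this point are skipped this round
        scheduled.append(req)
    return scheduled
```

For example, with a queue `[decode, chunk_prefill, decode]` and a failing allocation check, only the first decode request is scheduled; the trailing decode request is skipped even though it needs no new blocks, which is exactly the latency concern raised above.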
```python
            req_index += 1
            continue
        num_new_block = self.get_new_block_nums(request, num_new_tokens)
        can_schedule_block_num_threshold = self._get_can_schedule_prefill_threshold_block(num_new_block)
```
🟡 Suggestion: add a comment explaining the relationship between `can_schedule_block_num_threshold` and `num_new_block`

The scheduling check here uses the threshold returned by `_get_can_schedule_prefill_threshold_block` (which may be larger than `num_new_block`, since it additionally reserves output-block space for running requests), while the actual allocation at line 924 uses `num_new_block`.
This "judge with a conservative threshold, allocate the actual amount" design is correct (it guarantees reserved decode space), but the logic is not obvious; a comment stating the intent would help future maintenance. For example:

```python
# Use a conservative threshold that reserves output blocks for running decode requests,
# but only allocate the actual needed blocks for this chunk prefill.
can_schedule_block_num_threshold = self._get_can_schedule_prefill_threshold_block(num_new_block)
```
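A minimal sketch of that "conservative check, actual allocation" pattern. The helper name, the reserve of one output block per running decode request, and the parameter names are all assumptions for illustration, not the real `_get_can_schedule_prefill_threshold_block` implementation.

```python
def can_schedule_prefill(num_new_block: int, free_blocks: int,
                         num_running_decode: int,
                         reserve_per_decode: int = 1) -> bool:
    """Admission check for a chunk prefill: the threshold is the blocks it
    actually needs plus output blocks reserved for decode requests already
    running. If the check passes, only num_new_block blocks are allocated;
    the reserved remainder stays free for the running decodes."""
    threshold = num_new_block + num_running_decode * reserve_per_decode
    return free_blocks >= threshold
```

Under these assumptions, a prefill needing 4 blocks with 3 running decodes is admitted only when at least 7 blocks are free, even though it will consume just 4; the gap is what keeps decode requests from starving for KV cache blocks.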
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@           Coverage Diff            @@
##          release/2.6   #7466   +/- ##
=============================================
  Coverage            ?   73.71%
=============================================
  Files               ?      376
  Lines               ?    52987
  Branches            ?     8275
=============================================
  Hits                ?    39057
  Misses              ?    11189
  Partials            ?     2741
```
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- Choose at least one tag for the PR title from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- If the PR targets a `release` branch, make sure it has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.