[Cherry-Pick][BugFix] Fix real token exceeding max_batched_tokens limit(#7438)#7439
Conversation
Thanks for your contribution!
Pull request overview
This PR fixes/adjusts the max_num_batched_tokens budget calculation in ResourceManagerV1.schedule() so that the batch token budget is reduced appropriately under speculative decoding, avoiding a negative budget that would break scheduling.
Changes:
- Deduct the speculative-token reservation from max_num_batched_tokens at the start of scheduling.
- Add a "temporary" guard against a negative token_budget (the current approach is risky; see the comments).
| if self.config.speculative_config is not None |
| else 1 |
| ) |
| # temperatory solution to avoid negative token_budget |

Typo in the code comment: temperatory should be temporary.

| # temperatory solution to avoid negative token_budget |
| # temporary solution to avoid negative token_budget |
| token_budget = ( |
| self.config.scheduler_config.max_num_batched_tokens |

The PR title/description has not been filled in per the template: the current title lacks at least one bracket tag (e.g. [BugFix]), and a cherry-pick should be prefixed with [Cherry-Pick] and reference the original PR number at the end (as the template states). The Motivation/Modifications/Usage/Accuracy Tests sections of the description are also mostly empty; please describe why the change is needed, what was changed, and how it was verified.
| token_budget = ( |
| self.config.scheduler_config.max_num_batched_tokens |
| - self.config.scheduler_config.max_num_seqs |
| * (self.config.speculative_config.num_speculative_tokens + 1) |
| if self.config.speculative_config is not None |
| else 1 |
| ) |
Due to conditional-expression precedence, when speculative_config is None this assigns token_budget = 1 rather than the intended max_num_batched_tokens (the current code is equivalent to A - B if cond else 1). Consider splitting it into explicit if branches: first set token_budget to max_num_batched_tokens, and only subtract the reservation when speculative decoding is enabled. The explicit form is also easier to read.
| token_budget = ( |
| self.config.scheduler_config.max_num_batched_tokens |
| - self.config.scheduler_config.max_num_seqs |
| * (self.config.speculative_config.num_speculative_tokens + 1) |
| if self.config.speculative_config is not None |
| else 1 |
| ) |
| token_budget = self.config.scheduler_config.max_num_batched_tokens |
| if self.config.speculative_config is not None: |
| token_budget -= self.config.scheduler_config.max_num_seqs * ( |
| self.config.speculative_config.num_speculative_tokens + 1 |
| ) |
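The precedence pitfall is easy to reproduce in isolation. A minimal sketch with hypothetical stand-in values (plain variables instead of the actual config objects):

```python
# Hypothetical stand-ins for the scheduler config values.
max_num_batched_tokens = 8192
reserved = 1024            # max_num_seqs * (num_speculative_tokens + 1)
speculative_config = None  # speculative decoding disabled

# Parsed as: (max_num_batched_tokens - reserved) if cond else 1,
# so with speculative decoding off the whole expression yields 1.
token_budget = (
    max_num_batched_tokens - reserved
    if speculative_config is not None
    else 1
)
assert token_budget == 1  # not the intended 8192

# Explicit branches avoid the misread:
token_budget = max_num_batched_tokens
if speculative_config is not None:
    token_budget -= reserved
assert token_budget == 8192
```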
| # temperatory solution to avoid negative token_budget |
| token_budget = max(token_budget, min(self.config.scheduler_config.max_num_batched_tokens, 512)) |
token_budget = max(token_budget, min(max_num_batched_tokens, 512)) forcibly raises the budget when the computed token_budget is small (e.g. a computed 100 becomes 512), which can break the speculative-token reservation made above and let a single batch process more tokens than intended (risking OOM or scheduling jitter). Consider clamping only the lower bound to be non-negative (e.g. max(token_budget, 0)), and when token_budget < 0 log a warning or adjust max_num_seqs / the speculative parameters, rather than raising the budget to a fixed value.
| # temperatory solution to avoid negative token_budget |
| token_budget = max(token_budget, min(self.config.scheduler_config.max_num_batched_tokens, 512)) |
| if token_budget < 0: |
| llm_logger.warning( |
| "Computed negative token_budget=%s. " |
| "Please check scheduler/speculative settings: " |
| "max_num_batched_tokens=%s, max_num_seqs=%s, " |
| "num_speculative_tokens=%s. Clamping token_budget to 0.", |
| token_budget, |
| self.config.scheduler_config.max_num_batched_tokens, |
| self.config.scheduler_config.max_num_seqs, |
| self.config.speculative_config.num_speculative_tokens, |
| ) |
| token_budget = max(token_budget, 0) |
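The inflation described above can be seen directly with hypothetical numbers (the budget values below are illustrative, not from the PR):

```python
max_num_batched_tokens = 8192  # hypothetical config value
token_budget = 100             # small budget left after the speculative reservation

# Current guard: raises the budget back up to 512,
# past what the reservation allowed.
inflated = max(token_budget, min(max_num_batched_tokens, 512))
assert inflated == 512

# Lower-bound-only clamp: never increases a valid (non-negative) budget.
clamped = max(token_budget, 0)
assert clamped == 100
```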
| preempted_reqs: list[Request] = [] |
| error_reqs: list[tuple[str, str]] = [] |
| token_budget = self.config.scheduler_config.max_num_batched_tokens |
| token_budget = ( |
Suggestion: just assert here so that a negative budget raises an error directly.
PaddlePaddle-bot
left a comment
🤖 AI Code Review |
2026-04-16 21:29 CST
📋 Review summary
PR overview: fixes the issue where, under speculative decoding, the actual token count can exceed the max_batched_tokens limit.
Scope of change: engine/sched/resource_manager_v1.py - scheduler token-budget calculation logic
Impact tags: Scheduler, Speculative Decoding
📝 PR template check
The Motivation and Modifications sections of the PR description are empty; please add the motivation and concrete changes so the history remains traceable.
Suggested description (copy-paste ready):
Motivation: under speculative decoding, each decode sequence actually consumes num_speculative_tokens + 1 tokens, but the scheduler's token_budget did not account for this cost, so the actual token count of a single batch could exceed the max_num_batched_tokens limit.
Modifications: when computing token_budget in schedule(), reserve max_num_seqs * tokens_per_seq of the budget for decode requests, and add a lower bound to keep the budget from going negative.
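For illustration, with hypothetical settings the unclamped reservation can drive the budget negative, which is what the lower bound guards against (all values below are made up, not from the PR):

```python
# Hypothetical configuration: small batch budget, speculative decoding enabled.
max_num_batched_tokens = 1024
max_num_seqs = 256
num_speculative_tokens = 4
tokens_per_seq = num_speculative_tokens + 1  # each decode step may consume 5 tokens

token_budget = max_num_batched_tokens - max_num_seqs * tokens_per_seq
assert token_budget == -256  # 1024 - 256 * 5
```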
Issues
| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | resource_manager_v1.py:778 | Reserving budget with max_num_seqs is too conservative and may significantly reduce prefill throughput |
| 🟡 Suggestion | resource_manager_v1.py:781 | Magic number 512 is unexplained |
Overall assessment
The fix goes in the right direction and addresses the inaccurate token budget under speculative decoding. However, the current implementation reserves budget based on max_num_seqs (the theoretical maximum) rather than the actual number of running decode requests, which can reduce prefill throughput unnecessarily; consider optimizing.
| ) |
| token_budget = ( |
| self.config.scheduler_config.max_num_batched_tokens |
| - self.config.scheduler_config.max_num_seqs * tokens_per_seq |
🟡 Suggestion: reserving decode budget with max_num_seqs is too conservative
The current formula max_num_batched_tokens - max_num_seqs * tokens_per_seq reserves budget for the maximum possible number of sequences. When the actual number of running decode requests is far below max_num_seqs (e.g. max_num_seqs=256 but only 10 decode sequences), a large amount of prefill budget is wasted, hurting throughput.
In addition, line 841 later subtracts 1 from token_budget for each running decode request, which double-counts against the reservation.
Consider reserving based on the actual number of running decode requests:
num_running_decode = sum(
    1 for r in self.running
    if r.num_computed_tokens >= r.need_prefill_tokens
)
token_budget = (
    self.config.scheduler_config.max_num_batched_tokens
    - num_running_decode * tokens_per_seq
)
This still prevents exceeding the limit without over-compressing the prefill budget.
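With the (hypothetical) numbers from the comment, the two reservation strategies differ substantially:

```python
max_num_batched_tokens = 8192  # hypothetical config values
tokens_per_seq = 4             # num_speculative_tokens + 1
max_num_seqs = 256
num_running_decode = 10        # decode requests actually running this step

# Reserve for the theoretical maximum number of sequences (current PR).
conservative = max_num_batched_tokens - max_num_seqs * tokens_per_seq
# Reserve only for decode requests that are actually running (suggested).
by_actual = max_num_batched_tokens - num_running_decode * tokens_per_seq

assert conservative == 7168  # 1024 tokens reserved away from prefill
assert by_actual == 8152     # only 40 tokens reserved, leaving more for prefill
```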
| - self.config.scheduler_config.max_num_seqs * tokens_per_seq |
| ) |
| # temperatory solution to avoid negative token_budget |
| token_budget = max(token_budget, min(self.config.scheduler_config.max_num_batched_tokens, 512)) |
🟡 Suggestion: magic number 512 and a typo
- 512 is unexplained: when max_num_batched_tokens is much larger than 512 this floor may be too small, and when it is smaller than 512 (theoretically possible) the floor degenerates to max_num_batched_tokens. Add a comment explaining why 512 was chosen, or extract it as a configurable constant.
- Spelling: temperatory → temporary.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## release/2.6 #7439 +/- ##
==============================================
Coverage ? 73.24%
==============================================
Files ? 376
Lines ? 52988
Branches ? 8276
==============================================
Hits ? 38810
Misses ? 11453
Partials ? 2725
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- Add at least one tag in the PR title, chosen from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- If the PR is submitted to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.