[BugFix] Fix real token exceeding max_batched_tokens limit #7438
freeliuzc merged 3 commits into PaddlePaddle:develop from
Conversation
Thanks for your contribution!
```python
token_budget = (
    self.config.scheduler_config.max_num_batched_tokens
    - self.config.scheduler_config.max_num_seqs
    * (self.config.speculative_config.num_speculative_tokens + 1)
    if self.config.speculative_config is not None
    else 1
)
```
This ternary expression makes `token_budget` become 1 (instead of `max_num_batched_tokens`) in the non-speculative case, and the current check only tests `speculative_config is not None` before accessing `num_speculative_tokens`; when `speculative_config` exists but `method=None` (exactly the configuration used in the repo's tests), this raises an `AttributeError`. Suggest rewriting as an explicit if/else: only adjust the budget by `num_speculative_tokens` when `speculative_config.method` is enabled, otherwise keep `max_num_batched_tokens`; also make sure the result cannot go negative (e.g. clamp to >= 0).
Suggested change:

```diff
-token_budget = (
-    self.config.scheduler_config.max_num_batched_tokens
-    - self.config.scheduler_config.max_num_seqs
-    * (self.config.speculative_config.num_speculative_tokens + 1)
-    if self.config.speculative_config is not None
-    else 1
-)
+speculative_config = self.config.speculative_config
+token_budget = self.config.scheduler_config.max_num_batched_tokens
+if (
+    speculative_config is not None
+    and getattr(speculative_config, "method", None) is not None
+):
+    token_budget = max(
+        0,
+        self.config.scheduler_config.max_num_batched_tokens
+        - self.config.scheduler_config.max_num_seqs
+        * (speculative_config.num_speculative_tokens + 1),
+    )
```
```python
token_budget = (
    self.config.scheduler_config.max_num_batched_tokens
    - self.config.scheduler_config.max_num_seqs
    * (self.config.speculative_config.num_speculative_tokens + 1)
    if self.config.speculative_config is not None
    else 1
)
```
This logic affects the per-batch token budget in `schedule()` (especially with speculative decoding enabled). The current unit tests cover the regular schedule flow, but not the `token_budget` computation and its upper bound when `speculative_config.method` is enabled and `num_speculative_tokens > 0`. Suggest adding a unit test to guard against regressions (e.g. construct a manager with `speculative_method='mtp'` and verify that `schedule` neither exceeds the limit nor raises because of a bad budget calculation).
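One way to sketch such a regression test. The `compute_token_budget` helper below mirrors the suggested budget logic, and the `SimpleNamespace` fixtures are assumptions for illustration, not the repo's actual test utilities or manager API:

```python
from types import SimpleNamespace


def compute_token_budget(scheduler_config, speculative_config):
    # Mirror of the suggested budget logic (sketch, not the repo's code).
    token_budget = scheduler_config.max_num_batched_tokens
    if (
        speculative_config is not None
        and getattr(speculative_config, "method", None) is not None
    ):
        token_budget = max(
            0,
            scheduler_config.max_num_batched_tokens
            - scheduler_config.max_num_seqs
            * (speculative_config.num_speculative_tokens + 1),
        )
    return token_budget


def test_budget_with_mtp_enabled():
    sched = SimpleNamespace(max_num_batched_tokens=2048, max_num_seqs=256)
    spec = SimpleNamespace(method="mtp", num_speculative_tokens=3)
    assert compute_token_budget(sched, spec) == 2048 - 256 * 4  # 1024


def test_budget_with_method_none():
    sched = SimpleNamespace(max_num_batched_tokens=2048, max_num_seqs=256)
    # Placeholder config without num_speculative_tokens, like the repo fixture.
    spec = SimpleNamespace(method=None)
    assert compute_token_budget(sched, spec) == 2048
```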
Suggestion: simply add an assert so that `budget < 0` raises an error.
```python
token_budget = (
    self.config.scheduler_config.max_num_batched_tokens
    - self.config.scheduler_config.max_num_seqs * tokens_per_seq
)
# temperatory solution to avoid negative token_budget
token_budget = max(token_budget, min(self.config.scheduler_config.max_num_batched_tokens, 512))
```
This changes the `token_budget` computation (including `tokens_per_seq` for the speculative case and the handling of negative values / the lower bound), but the current tests contain no assertions covering the budget computation and scheduling limit with speculative decoding on or off. Suggested additional tests:

- with `speculative_config.method=None`, the code must not access or depend on `num_speculative_tokens`, and `schedule` must not raise;
- with `speculative_config.method != None` and `num_speculative_tokens > 0`, the real token count of a single scheduling step must not exceed `max_num_batched_tokens` (verify by constructing multiple running decode requests and counting actual step tokens, or by checking the budget deduction behavior).
```python
tokens_per_seq = (
    (self.config.speculative_config.num_speculative_tokens + 1)
    if self.config.speculative_config is not None
```
Using `self.config.speculative_config is not None` as the only guard before accessing `self.config.speculative_config.num_speculative_tokens` has two concrete problems:

- when speculative decoding is disabled, a `SpeculativeConfig` / placeholder object usually still exists (with `method=None`), so `tokens_per_seq` is wrongly set to `num_speculative_tokens + 1` in the non-speculative case;
- in the repo's existing unit tests (e.g. `speculative_cfg = SimpleNamespace(method=None)` in `tests/v1/test_resource_manager_v1.py`) the object does not even have a `num_speculative_tokens` field, which triggers an `AttributeError` here.

Suggest reading `num_speculative_tokens` only when speculative decoding is actually enabled (`method` is non-empty or `enabled_speculative_decoding()` returns True), and using `getattr(speculative_config, "num_speculative_tokens", 0)` or similar to avoid crashing on a missing field; when disabled, set `tokens_per_seq = 1`.
Suggested change:

```diff
-tokens_per_seq = (
-    (self.config.speculative_config.num_speculative_tokens + 1)
-    if self.config.speculative_config is not None
+speculative_config = self.config.speculative_config
+enabled_speculative_decoding = False
+if speculative_config is not None:
+    enabled_fn = getattr(speculative_config, "enabled_speculative_decoding", None)
+    if callable(enabled_fn):
+        enabled_speculative_decoding = enabled_fn()
+    else:
+        enabled_speculative_decoding = getattr(speculative_config, "method", None) is not None
+tokens_per_seq = (
+    getattr(speculative_config, "num_speculative_tokens", 0) + 1
+    if enabled_speculative_decoding
```
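A minimal repro of the `AttributeError` described above, using a placeholder config shaped like the `SimpleNamespace(method=None)` fixture from the repo tests:

```python
from types import SimpleNamespace

# Placeholder config: method is None and num_speculative_tokens
# does not exist at all (as in tests/v1/test_resource_manager_v1.py).
speculative_cfg = SimpleNamespace(method=None)

# Direct attribute access crashes:
try:
    _ = speculative_cfg.num_speculative_tokens + 1
except AttributeError:
    print("AttributeError on num_speculative_tokens")

# Gating on method plus getattr with a default stays safe:
enabled = getattr(speculative_cfg, "method", None) is not None
tokens_per_seq = (
    getattr(speculative_cfg, "num_speculative_tokens", 0) + 1 if enabled else 1
)
print(tokens_per_seq)  # 1
```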
```python
# temperatory solution to avoid negative token_budget
token_budget = max(token_budget, min(self.config.scheduler_config.max_num_batched_tokens, 512))
```
`token_budget = max(token_budget, min(max_num_batched_tokens, 512))` forcibly lifts the budget to at least 512 (or to `max_num_batched_tokens` when that is below 512). This can allow the scheduled prefill tokens to exceed the safe margin computed above, reintroducing the risk of exceeding `max_num_batched_tokens` / OOM; 512 is also an unexplained magic number. A safer approach is to clamp `token_budget` to its valid range (e.g. `token_budget = max(0, token_budget)`, and make sure it never exceeds `max_num_batched_tokens`), and, if needed, guarantee that decode requests are not starved via a separate "reserved decode budget" mechanism rather than by raising the `token_budget` floor.
Suggested change:

```diff
-# temperatory solution to avoid negative token_budget
-token_budget = max(token_budget, min(self.config.scheduler_config.max_num_batched_tokens, 512))
+max_num_batched_tokens = self.config.scheduler_config.max_num_batched_tokens
+# Clamp token_budget to the valid range to avoid negative values
+# without reintroducing scheduling beyond the configured batch limit.
+token_budget = max(0, min(token_budget, max_num_batched_tokens))
```
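A numeric sketch of the risk (all numbers are made up): with the 512 floor, the decode consumption plus the granted prefill budget can overshoot the configured limit, while the plain clamp cannot.

```python
max_num_batched_tokens = 2048
max_num_seqs = 512
tokens_per_seq = 4  # num_speculative_tokens = 3, plus 1

raw_budget = max_num_batched_tokens - max_num_seqs * tokens_per_seq  # 0

# Current code: the 512 floor lifts the budget above the safe margin.
floored = max(raw_budget, min(max_num_batched_tokens, 512))  # 512

# Suggested clamp: stays within [0, max_num_batched_tokens].
clamped = max(0, min(raw_budget, max_num_batched_tokens))  # 0

# Worst case: if all sequences decode, decode tokens plus the floored
# prefill budget exceed the limit, while the clamped budget does not.
print(max_num_seqs * tokens_per_seq + floored)  # 2560 > 2048
print(max_num_seqs * tokens_per_seq + clamped)  # 2048
```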
```python
    self.config.scheduler_config.max_num_batched_tokens
    - self.config.scheduler_config.max_num_seqs * tokens_per_seq
)
# temperatory solution to avoid negative token_budget
```
The comment misspells "temperatory"; it should be "temporary".
Suggested change:

```diff
-# temperatory solution to avoid negative token_budget
+# temporary solution to avoid negative token_budget
```
PaddlePaddle-bot left a comment
🤖 AI Code Review
2026-04-16 21:29 CST
📋 Review Summary
PR overview: fixes the issue where, with speculative decoding enabled, the real per-step token count keeps exceeding max_batched_tokens, causing GPU memory to be allocated continuously until OOM.
Scope of change: scheduler token-budget calculation
Impact tags: Scheduler, Speculative Decoding
📝 PR Compliance Check
The Modifications and Usage or Command sections of the PR description are empty; please add a description of the change and the commands to reproduce/verify it.
Issues

| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | resource_manager_v1.py:778 | Pre-deduction is overly conservative; suggest deducting tokens_per_seq precisely per request |
| 🟡 Suggestion | resource_manager_v1.py:781 | Hard-coded magic number 512; may not fully prevent OOM in edge cases |
Overall Evaluation
The direction of the fix is correct: reserve room in the token budget for the speculative tokens of decoding requests. However, the current "pre-deduct all max_num_seqs at once" approach is overly conservative and imprecise; consider a more accurate per-request deduction scheme.
```python
)
token_budget = (
    self.config.scheduler_config.max_num_batched_tokens
    - self.config.scheduler_config.max_num_seqs * tokens_per_seq
```
🟡 Suggestion: Pre-deducting `max_num_seqs * tokens_per_seq` assumes all `max_num_seqs` sequences are decoding at the same time; at runtime the number of decoding requests is usually far smaller than `max_num_seqs`, so the prefill budget gets over-compressed and throughput drops.

In addition, the `token_budget -= 1` at line 841 below was not updated to `token_budget -= tokens_per_seq`, which means each decoding request is charged `tokens_per_seq` (pre-deduction) plus 1 (loop deduction): a double deduction that makes the budget accounting inconsistent.

A more precise scheme: drop the pre-deduction and change `token_budget -= 1` at line 841 to `token_budget -= tokens_per_seq`. This reflects each decoding request's real token consumption, fixing the OOM without wasting prefill budget:

```python
# No pre-deduction; keep the original budget
token_budget = self.config.scheduler_config.max_num_batched_tokens
# line 841: deduct by the actual number of speculative tokens
token_budget -= tokens_per_seq  # replaces the original token_budget -= 1
```

```python
    - self.config.scheduler_config.max_num_seqs * tokens_per_seq
)
# temperatory solution to avoid negative token_budget
token_budget = max(token_budget, min(self.config.scheduler_config.max_num_batched_tokens, 512))
```
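The per-request deduction idea suggested by the review can be sketched as a toy model. The numbers and the `schedule_step` function are hypothetical, not the repo's actual scheduler:

```python
def schedule_step(max_num_batched_tokens, num_decode_reqs, tokens_per_seq, prefill_tokens):
    # Toy model: start from the full budget and deduct tokens_per_seq
    # per decoding request, instead of pre-deducting
    # max_num_seqs * tokens_per_seq up front.
    token_budget = max_num_batched_tokens
    scheduled = 0
    for _ in range(num_decode_reqs):
        if token_budget < tokens_per_seq:
            break
        token_budget -= tokens_per_seq  # instead of token_budget -= 1
        scheduled += tokens_per_seq
    # Remaining budget goes to prefill.
    scheduled += min(token_budget, prefill_tokens)
    return scheduled

# 100 running decodes at 4 tokens each plus a large prefill backlog
# never exceed the 2048-token cap.
assert schedule_step(2048, 100, 4, 10_000) == 2048
```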
🟡 Suggestion: two issues:

1. Magic number 512 lacks justification: the lower bound `min(max_num_batched_tokens, 512)` has no comment explaining why 512 was chosen. In edge cases where `max_num_batched_tokens` is small (e.g. 2048) and `max_num_seqs * tokens_per_seq` is large, this floor can still be too high, letting the actual total tokens (decoding + prefill) exceed `max_num_batched_tokens`, so the OOM is not fully fixed.
2. Spelling: `temperatory` should be `temporary`.
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@ Coverage Diff @@
##           develop    #7438   +/-   ##
==========================================
  Coverage         ?   73.34%
==========================================
  Files            ?      398
  Lines            ?    54945
  Branches         ?     8607
==========================================
  Hits             ?    40299
  Misses           ?    11952
  Partials         ?     2694
```

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
Motivation
The actual number of tokens per inference step keeps exceeding max_batched_tokens, so Paddle keeps allocating new GPU memory until it OOMs.
Modifications
Usage or Command
Accuracy Tests
Checklist
- Add at least a tag in the PR title, e.g. [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- For a fix targeting the `release` branch, make sure the PR has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.