
[Cherry-Pick][BugFix] Fix real token exceeding max_batched_tokens limit (#7438) #7439

Merged
freeliuzc merged 3 commits into PaddlePaddle:release/2.6 from freeliuzc:cherry-pick-2.6-fix-max-num-batched-tokens on Apr 17, 2026

Conversation

freeliuzc (Collaborator) commented on Apr 16, 2026

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code; run pre-commit before committing.
  • Add unit tests. If there are no unit tests, please explain why in this PR.
  • Provide accuracy results.
  • If the current PR is submitted to the release branch, make sure it has already been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings April 16, 2026 12:16
paddle-bot bot commented on Apr 16, 2026

Thanks for your contribution!

freeliuzc changed the title from "Cherry pick 2.6 fix max num batched tokens" to "[Cherry-Pick][BugFix] Fix real token exceeding max_batched_tokens limit(#7438)" on Apr 16, 2026
Copilot AI (Contributor) left a comment

Pull request overview

This PR fixes/adjusts the max_num_batched_tokens budget calculation in ResourceManagerV1.schedule() so that the batch token budget is correctly reduced in the speculative decoding scenario, avoiding a negative budget that would break scheduling.

Changes:

  • At the scheduling entry point, deduct the speculative-token reservation from max_num_batched_tokens.
  • Add a "temporary" guard against a negative token_budget (the current implementation is risky; see comments).

if self.config.speculative_config is not None
else 1
)
# temperatory solution to avoid negative token_budget
Copilot AI commented on Apr 16, 2026

The comment misspells temporary as temperatory; suggest fixing the spelling.

Suggested change
# temperatory solution to avoid negative token_budget
# temporary solution to avoid negative token_budget

Comment on lines +771 to +772
token_budget = (
self.config.scheduler_config.max_num_batched_tokens
Copilot AI commented on Apr 16, 2026

The PR title/description does not follow the template: the title needs at least one bracketed tag (e.g. [BugFix]), and a cherry-pick must start with [Cherry-Pick] and append the original PR number at the end (as the template explains). The Motivation/Modifications/Usage/Accuracy Tests sections of the description are also essentially empty; please add why the change is needed, what was changed, and how it was verified.

Comment on lines +771 to +777
token_budget = (
self.config.scheduler_config.max_num_batched_tokens
- self.config.scheduler_config.max_num_seqs
* (self.config.speculative_config.num_speculative_tokens + 1)
if self.config.speculative_config is not None
else 1
)
Copilot AI commented on Apr 16, 2026

Because of conditional-expression precedence here, when speculative_config is None, token_budget is assigned 1 rather than the intended max_num_batched_tokens (the current form is equivalent to A - B if cond else 1). Suggest splitting this into an explicit if branch: first set the budget to max_num_batched_tokens, then subtract the reservation only when speculative decoding is enabled; this is also much easier to read correctly.

Suggested change
token_budget = (
    self.config.scheduler_config.max_num_batched_tokens
    - self.config.scheduler_config.max_num_seqs
    * (self.config.speculative_config.num_speculative_tokens + 1)
    if self.config.speculative_config is not None
    else 1
)

token_budget = self.config.scheduler_config.max_num_batched_tokens
if self.config.speculative_config is not None:
    token_budget -= self.config.scheduler_config.max_num_seqs * (
        self.config.speculative_config.num_speculative_tokens + 1
    )

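The precedence behavior described in this review can be reproduced in isolation. A minimal sketch, using hypothetical stand-in values in place of the real config objects:

```python
# Hypothetical stand-ins for the real config values (illustration only).
max_num_batched_tokens = 2048
max_num_seqs = 4
num_speculative_tokens = 1
speculative_config = None  # speculative decoding disabled

# Buggy form: the entire subtraction is the "if" branch of the
# conditional expression, so with speculative decoding disabled the
# budget collapses to the else-value 1 instead of the full budget.
buggy = (
    max_num_batched_tokens
    - max_num_seqs * (num_speculative_tokens + 1)
    if speculative_config is not None
    else 1
)

# Suggested explicit form: start from the full budget and subtract the
# speculative reservation only when speculative decoding is enabled.
fixed = max_num_batched_tokens
if speculative_config is not None:
    fixed -= max_num_seqs * (num_speculative_tokens + 1)

print(buggy, fixed)  # 1 2048
```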
freeliuzc (Collaborator, Author) replied:

Fixed.

Comment on lines +778 to +779
# temperatory solution to avoid negative token_budget
token_budget = max(token_budget, min(self.config.scheduler_config.max_num_batched_tokens, 512))
Copilot AI commented on Apr 16, 2026

token_budget = max(token_budget, min(max_num_batched_tokens, 512)) forcibly raises the budget when the computed token_budget is small (e.g. a computed 100 becomes 512), which can break the speculative-token reservation established above and let a single batch process more tokens than intended (risking OOM or scheduling jitter). Suggest clamping only the lower bound to non-negative (e.g. max(token_budget, 0)) and, when token_budget < 0, logging a warning or adjusting max_num_seqs / speculative parameters, rather than raising the budget to a fixed value.

Suggested change
# temperatory solution to avoid negative token_budget
token_budget = max(token_budget, min(self.config.scheduler_config.max_num_batched_tokens, 512))
if token_budget < 0:
    llm_logger.warning(
        "Computed negative token_budget=%s. "
        "Please check scheduler/speculative settings: "
        "max_num_batched_tokens=%s, max_num_seqs=%s, "
        "num_speculative_tokens=%s. Clamping token_budget to 0.",
        token_budget,
        self.config.scheduler_config.max_num_batched_tokens,
        self.config.scheduler_config.max_num_seqs,
        self.config.speculative_config.num_speculative_tokens,
    )
token_budget = max(token_budget, 0)

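The inflation effect this comment warns about is easy to demonstrate with concrete numbers (hypothetical values, not taken from the PR):

```python
# Hypothetical values for illustration only.
max_num_batched_tokens = 2048
computed_budget = 100  # small budget left after the speculative reservation

# Merged "temporary" clamp: lifts any budget below 512 up to
# min(max_num_batched_tokens, 512), overriding the reservation.
clamped = max(computed_budget, min(max_num_batched_tokens, 512))

# Reviewer-suggested alternative: clamp only the lower bound to zero,
# which never inflates a valid small budget.
safe = max(computed_budget, 0)

print(clamped, safe)  # 512 100
```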
preempted_reqs: list[Request] = []
error_reqs: list[tuple[str, str]] = []
token_budget = self.config.scheduler_config.max_num_batched_tokens
token_budget = (
A collaborator commented:

Suggest simply using an assert so that a negative budget raises an error.

PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot left a comment

🤖 AI Code Review | 2026-04-16 21:29 CST

📋 Review Summary

PR overview: fixes the issue where the actual token count exceeds the max_batched_tokens limit in the speculative decoding scenario
Scope: engine/sched/resource_manager_v1.py - scheduler token budget calculation logic
Impact tags: Scheduler, Speculative Decoding

📝 PR Convention Check

The Motivation and Modifications sections of the PR description are empty; please add the motivation for the change and a concrete description of the modifications, for later traceability.

Suggested description (copy-paste ready):

Motivation: In the speculative decoding scenario, each decode sequence actually consumes num_speculative_tokens + 1 tokens, but the scheduler's token_budget did not account for this overhead, so the actual token count of a single batch could exceed the max_num_batched_tokens limit.

Modifications: When computing token_budget in schedule(), reserve a budget of max_num_seqs * tokens_per_seq for decode requests, and set a lower bound to prevent the budget from going negative.
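As a quick sanity check of the reservation arithmetic described above, a minimal sketch with illustrative numbers (not taken from the PR):

```python
# Illustrative configuration values only (not from the PR).
max_num_batched_tokens = 8192
max_num_seqs = 256
num_speculative_tokens = 1

# Each decode sequence may consume the draft token(s) plus one verified
# token per step, so reserve tokens_per_seq tokens per sequence.
tokens_per_seq = num_speculative_tokens + 1

token_budget = max_num_batched_tokens - max_num_seqs * tokens_per_seq
print(token_budget)  # 8192 - 256 * 2 = 7680 tokens left for prefill
```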

Issues

Level          File                        Summary
🟡 Suggestion  resource_manager_v1.py:778  Reserving budget with max_num_seqs is overly conservative and may significantly reduce prefill throughput
🟡 Suggestion  resource_manager_v1.py:781  The magic number 512 lacks an explanation

Overall Assessment

The direction of the fix is correct and resolves the inaccurate token budget in the speculative decoding scenario. However, the current implementation reserves budget using max_num_seqs (the theoretical maximum) rather than the actual number of running decode requests, which may unnecessarily reduce prefill throughput; optimization is recommended.

)
token_budget = (
self.config.scheduler_config.max_num_batched_tokens
- self.config.scheduler_config.max_num_seqs * tokens_per_seq

🟡 Suggestion: Reserving the decode budget with max_num_seqs is overly conservative

The current formula max_num_batched_tokens - max_num_seqs * tokens_per_seq reserves budget for the maximum possible number of sequences. When far fewer decode requests are actually running than max_num_seqs (e.g. max_num_seqs=256 but only 10 decode sequences), a large amount of prefill budget is wasted, hurting throughput.

In addition, token_budget -= 1 at line 841 later deducts an extra 1 for each running decode request, double-counting against the reservation.

Suggest reserving based on the actual number of running decode requests:

num_running_decode = sum(
    1 for r in self.running
    if r.num_computed_tokens >= r.need_prefill_tokens
)
token_budget = (
    self.config.scheduler_config.max_num_batched_tokens
    - num_running_decode * tokens_per_seq
)

This prevents exceeding the limit without over-squeezing the prefill budget.

- self.config.scheduler_config.max_num_seqs * tokens_per_seq
)
# temperatory solution to avoid negative token_budget
token_budget = max(token_budget, min(self.config.scheduler_config.max_num_batched_tokens, 512))

🟡 Suggestion: Magic number 512 and a spelling error

  1. The 512 lacks an explanation: when max_num_batched_tokens is far greater than 512, this floor may be too small; when it is below 512 (theoretically possible), the floor degenerates to max_num_batched_tokens itself. Suggest adding a comment explaining why 512 was chosen, or extracting it as a configurable constant.

  2. Spelling: temperatory → temporary

codecov-commenter commented

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (release/2.6@72ce56b). Learn more about missing BASE report.

Additional details and impacted files
@@              Coverage Diff               @@
##             release/2.6    #7439   +/-   ##
==============================================
  Coverage               ?   73.24%           
==============================================
  Files                  ?      376           
  Lines                  ?    52988           
  Branches               ?     8276           
==============================================
  Hits                   ?    38810           
  Misses                 ?    11453           
  Partials               ?     2725           
Flag Coverage Δ
GPU 73.24% <100.00%> (?)


@freeliuzc freeliuzc merged commit 185708b into PaddlePaddle:release/2.6 Apr 17, 2026
35 of 38 checks passed