
[Cherry-Pick][BugFix] Fix real token exceeding max_batched_tokens limit (#7438) #7439

Merged
freeliuzc merged 3 commits into PaddlePaddle:release/2.6 from freeliuzc:cherry-pick-2.6-fix-max-num-batched-tokens on Apr 17, 2026

Conversation

freeliuzc (Collaborator) commented on Apr 16, 2026

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code; run pre-commit before committing.
  • Add unit tests. If there are no unit tests, please explain why in this PR.
  • Provide accuracy results.
  • If the current PR is submitted to the release branch, make sure it has already been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings April 16, 2026 12:16
paddle-bot bot commented on Apr 16, 2026

Thanks for your contribution!

freeliuzc changed the title from "Cherry pick 2.6 fix max num batched tokens" to "[Cherry-Pick][BugFix] Fix real token exceeding max_batched_tokens limit(#7438)" on Apr 16, 2026
Copilot AI (Contributor) left a comment

Pull request overview

This PR fixes/adjusts the max_num_batched_tokens budget calculation in ResourceManagerV1.schedule() so that the batch token budget is correctly reduced in the speculative decoding scenario, avoiding a negative budget that would break scheduling.

Changes:

  • At the scheduling entry point, deduct the speculative-token reservation from max_num_batched_tokens.
  • Add a "temporary" guard against a negative token_budget (the current implementation is risky; see comments).

if self.config.speculative_config is not None
else 1
)
# temperatory solution to avoid negative token_budget
Copilot AI commented on Apr 16, 2026

The comment misspells temporary as temperatory; suggest fixing the spelling.

Suggested change
# temperatory solution to avoid negative token_budget
# temporary solution to avoid negative token_budget

Comment on lines +771 to +772
token_budget = (
self.config.scheduler_config.max_num_batched_tokens
Copilot AI commented on Apr 16, 2026

The PR title/description does not follow the template: the title needs at least one bracketed tag (e.g. [BugFix]), and a cherry-pick must start with [Cherry-Pick] and append the original PR number at the end (as the template explains). The Motivation/Modifications/Usage/Accuracy Tests sections of the description are also essentially empty; please add why the change is needed, what was changed, and how it was verified.

Comment on lines +771 to +777
token_budget = (
self.config.scheduler_config.max_num_batched_tokens
- self.config.scheduler_config.max_num_seqs
* (self.config.speculative_config.num_speculative_tokens + 1)
if self.config.speculative_config is not None
else 1
)
Copilot AI commented on Apr 16, 2026

Because of conditional-expression precedence here, when speculative_config is None, token_budget is assigned 1 rather than the intended max_num_batched_tokens (the current form is equivalent to A - B if cond else 1). Suggest splitting this into an explicit if branch: first set the budget to max_num_batched_tokens, then subtract the reservation only when speculative decoding is enabled; this is also much easier to read correctly.

Suggested change
token_budget = (
    self.config.scheduler_config.max_num_batched_tokens
    - self.config.scheduler_config.max_num_seqs
    * (self.config.speculative_config.num_speculative_tokens + 1)
    if self.config.speculative_config is not None
    else 1
)

token_budget = self.config.scheduler_config.max_num_batched_tokens
if self.config.speculative_config is not None:
    token_budget -= self.config.scheduler_config.max_num_seqs * (
        self.config.speculative_config.num_speculative_tokens + 1
    )

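The precedence behavior described in this review can be reproduced in isolation. A minimal sketch, using hypothetical stand-in values in place of the real config objects:

```python
# Hypothetical stand-ins for the real config values (illustration only).
max_num_batched_tokens = 2048
max_num_seqs = 4
num_speculative_tokens = 1
speculative_config = None  # speculative decoding disabled

# Buggy form: the entire subtraction is the "if" branch of the
# conditional expression, so with speculative decoding disabled the
# budget collapses to the else-value 1 instead of the full budget.
buggy = (
    max_num_batched_tokens
    - max_num_seqs * (num_speculative_tokens + 1)
    if speculative_config is not None
    else 1
)

# Suggested explicit form: start from the full budget and subtract the
# speculative reservation only when speculative decoding is enabled.
fixed = max_num_batched_tokens
if speculative_config is not None:
    fixed -= max_num_seqs * (num_speculative_tokens + 1)

print(buggy, fixed)  # 1 2048
```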
freeliuzc (Collaborator, Author) replied:

Fixed.

Comment on lines +778 to +779
# temperatory solution to avoid negative token_budget
token_budget = max(token_budget, min(self.config.scheduler_config.max_num_batched_tokens, 512))
Copilot AI commented on Apr 16, 2026

token_budget = max(token_budget, min(max_num_batched_tokens, 512)) forcibly raises the budget when the computed token_budget is small (e.g. a computed 100 becomes 512), which can break the speculative-token reservation established above and let a single batch process more tokens than intended (risking OOM or scheduling jitter). Suggest clamping only the lower bound to non-negative (e.g. max(token_budget, 0)) and, when token_budget < 0, logging a warning or adjusting max_num_seqs / speculative parameters, rather than raising the budget to a fixed value.

Suggested change
# temperatory solution to avoid negative token_budget
token_budget = max(token_budget, min(self.config.scheduler_config.max_num_batched_tokens, 512))
if token_budget < 0:
    llm_logger.warning(
        "Computed negative token_budget=%s. "
        "Please check scheduler/speculative settings: "
        "max_num_batched_tokens=%s, max_num_seqs=%s, "
        "num_speculative_tokens=%s. Clamping token_budget to 0.",
        token_budget,
        self.config.scheduler_config.max_num_batched_tokens,
        self.config.scheduler_config.max_num_seqs,
        self.config.speculative_config.num_speculative_tokens,
    )
token_budget = max(token_budget, 0)

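The inflation effect this comment warns about is easy to demonstrate with concrete numbers (hypothetical values, not taken from the PR):

```python
# Hypothetical values for illustration only.
max_num_batched_tokens = 2048
computed_budget = 100  # small budget left after the speculative reservation

# Merged "temporary" clamp: lifts any budget below 512 up to
# min(max_num_batched_tokens, 512), overriding the reservation.
clamped = max(computed_budget, min(max_num_batched_tokens, 512))

# Reviewer-suggested alternative: clamp only the lower bound to zero,
# which never inflates a valid small budget.
safe = max(computed_budget, 0)

print(clamped, safe)  # 512 100
```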
preempted_reqs: list[Request] = []
error_reqs: list[tuple[str, str]] = []
token_budget = self.config.scheduler_config.max_num_batched_tokens
token_budget = (
A collaborator commented:

Suggest simply using an assert so that a negative budget raises an error.

PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot left a comment

🤖 AI Code Review | 2026-04-16 21:29 CST

📋 Review Summary

PR overview: fixes the issue where the actual token count exceeds the max_batched_tokens limit in the speculative decoding scenario
Scope: engine/sched/resource_manager_v1.py - scheduler token budget calculation logic
Impact tags: Scheduler, Speculative Decoding

📝 PR Convention Check

The Motivation and Modifications sections of the PR description are empty; please add the motivation for the change and a concrete description of the modifications, for later traceability.

Suggested description (copy-paste ready):

Motivation: In the speculative decoding scenario, each decode sequence actually consumes num_speculative_tokens + 1 tokens, but the scheduler's token_budget did not account for this overhead, so the actual token count of a single batch could exceed the max_num_batched_tokens limit.

Modifications: When computing token_budget in schedule(), reserve a budget of max_num_seqs * tokens_per_seq for decode requests, and set a lower bound to prevent the budget from going negative.
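As a quick sanity check of the reservation arithmetic described above, a minimal sketch with illustrative numbers (not taken from the PR):

```python
# Illustrative configuration values only (not from the PR).
max_num_batched_tokens = 8192
max_num_seqs = 256
num_speculative_tokens = 1

# Each decode sequence may consume the draft token(s) plus one verified
# token per step, so reserve tokens_per_seq tokens per sequence.
tokens_per_seq = num_speculative_tokens + 1

token_budget = max_num_batched_tokens - max_num_seqs * tokens_per_seq
print(token_budget)  # 8192 - 256 * 2 = 7680 tokens left for prefill
```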

Issues

Level          File                        Summary
🟡 Suggestion  resource_manager_v1.py:778  Reserving budget with max_num_seqs is overly conservative and may significantly reduce prefill throughput
🟡 Suggestion  resource_manager_v1.py:781  The magic number 512 lacks an explanation

Overall Assessment

The direction of the fix is correct and resolves the inaccurate token budget in the speculative decoding scenario. However, the current implementation reserves budget using max_num_seqs (the theoretical maximum) rather than the actual number of running decode requests, which may unnecessarily reduce prefill throughput; optimization is recommended.

)
token_budget = (
self.config.scheduler_config.max_num_batched_tokens
- self.config.scheduler_config.max_num_seqs * tokens_per_seq

🟡 Suggestion: Reserving the decode budget with max_num_seqs is overly conservative

The current formula max_num_batched_tokens - max_num_seqs * tokens_per_seq reserves budget for the maximum possible number of sequences. When far fewer decode requests are actually running than max_num_seqs (e.g. max_num_seqs=256 but only 10 decode sequences), a large amount of prefill budget is wasted, hurting throughput.

In addition, token_budget -= 1 at line 841 later deducts an extra 1 for each running decode request, double-counting against the reservation.

Suggest reserving based on the actual number of running decode requests:

num_running_decode = sum(
    1 for r in self.running
    if r.num_computed_tokens >= r.need_prefill_tokens
)
token_budget = (
    self.config.scheduler_config.max_num_batched_tokens
    - num_running_decode * tokens_per_seq
)

This prevents exceeding the limit without over-squeezing the prefill budget.

- self.config.scheduler_config.max_num_seqs * tokens_per_seq
)
# temperatory solution to avoid negative token_budget
token_budget = max(token_budget, min(self.config.scheduler_config.max_num_batched_tokens, 512))

🟡 Suggestion: Magic number 512 and a spelling error

  1. The 512 lacks an explanation: when max_num_batched_tokens is far greater than 512, this floor may be too small; when it is below 512 (theoretically possible), the floor degenerates to max_num_batched_tokens itself. Suggest adding a comment explaining why 512 was chosen, or extracting it as a configurable constant.

  2. Spelling: temperatory → temporary

codecov-commenter commented

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (release/2.6@72ce56b). Learn more about missing BASE report.

Additional details and impacted files
@@              Coverage Diff               @@
##             release/2.6    #7439   +/-   ##
==============================================
  Coverage               ?   73.24%           
==============================================
  Files                  ?      376           
  Lines                  ?    52988           
  Branches               ?     8276           
==============================================
  Hits                   ?    38810           
  Misses                 ?    11453           
  Partials               ?     2725           
Flag Coverage Δ
GPU 73.24% <100.00%> (?)


@freeliuzc freeliuzc merged commit 185708b into PaddlePaddle:release/2.6 Apr 17, 2026
35 of 38 checks passed