
[BugFix] Fix real token exceeding max_batched_tokens limit#7438

Merged
freeliuzc merged 3 commits into PaddlePaddle:develop from freeliuzc:fix_max_num_batched_tokens_dev
Apr 17, 2026

Conversation

@freeliuzc (Collaborator) commented Apr 16, 2026

Motivation

The actual per-step token count during inference keeps exceeding max_batched_tokens, so Paddle keeps allocating new GPU memory until it hits OOM.
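To make the failure mode concrete, here is a back-of-the-envelope sketch (the numbers are illustrative assumptions, not taken from this deployment): with speculative decoding, each decode request can contribute up to num_speculative_tokens + 1 tokens per step, so a full decode batch alone can push a step well past the configured limit.

```python
# Illustrative numbers (assumptions, not from this PR).
max_num_batched_tokens = 2048
max_num_seqs = 128
num_speculative_tokens = 3

# Each decode request verifies its draft tokens plus one target token per step.
tokens_per_seq = num_speculative_tokens + 1
worst_case_decode_tokens = max_num_seqs * tokens_per_seq
print(worst_case_decode_tokens)  # 512 tokens from decode alone in one step

# If the scheduler also grants prefill the full max_num_batched_tokens budget,
# the real step token count exceeds the limit and buffers must grow each step.
print(worst_case_decode_tokens + max_num_batched_tokens)  # 2560 > 2048
```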


💡 If this PR is a cherry-pick, the PR title must follow the required format: add the [Cherry-Pick] label at the very beginning and append the original PR ID at the end, e.g. [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings April 16, 2026 11:48
@paddle-bot

paddle-bot bot commented Apr 16, 2026

Thanks for your contribution!

Contributor

Copilot AI left a comment


Pull request overview

This PR fixes the inaccurate computation of the "real token" budget against max_num_batched_tokens during scheduling with speculative decoding enabled or disabled, which could let the limit be exceeded.

Changes:

  • In ResourceManagerV1.schedule(), adjust the token_budget computation to reserve budget for the extra tokens produced by speculative decoding.
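A minimal sketch of the reserved-budget idea. The config attribute names mirror those in the diff discussed in this PR, but the SimpleNamespace wiring, the compute_token_budget helper, and the method check are illustrative assumptions, not the exact committed code:

```python
from types import SimpleNamespace

def compute_token_budget(config):
    """Reserve worst-case decode tokens so prefill cannot push the real
    step token count past max_num_batched_tokens (illustrative sketch)."""
    sched = config.scheduler_config
    spec = config.speculative_config
    tokens_per_seq = 1
    if spec is not None and getattr(spec, "method", None) is not None:
        tokens_per_seq = getattr(spec, "num_speculative_tokens", 0) + 1
    return max(0, sched.max_num_batched_tokens - sched.max_num_seqs * tokens_per_seq)

config = SimpleNamespace(
    scheduler_config=SimpleNamespace(max_num_batched_tokens=2048, max_num_seqs=128),
    speculative_config=SimpleNamespace(method="mtp", num_speculative_tokens=3),
)
print(compute_token_budget(config))  # 2048 - 128 * 4 = 1536
```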

Comment on lines +771 to +777
token_budget = (
    self.config.scheduler_config.max_num_batched_tokens
    - self.config.scheduler_config.max_num_seqs
    * (self.config.speculative_config.num_speculative_tokens + 1)
    if self.config.speculative_config is not None
    else 1
)

Copilot AI Apr 16, 2026


This ternary expression makes token_budget become 1 in the non-speculative case (instead of max_num_batched_tokens). Also, the current check only tests speculative_config != None before accessing num_speculative_tokens; when speculative_config exists but method=None (the configuration used in the repository's tests), this raises an AttributeError. Suggest an explicit if/else: adjust the budget based on num_speculative_tokens only when speculative_config.method is enabled, otherwise keep max_num_batched_tokens; also make sure the result cannot go negative (e.g. clamp to >= 0).

Suggested change
token_budget = (
    self.config.scheduler_config.max_num_batched_tokens
    - self.config.scheduler_config.max_num_seqs
    * (self.config.speculative_config.num_speculative_tokens + 1)
    if self.config.speculative_config is not None
    else 1
)
speculative_config = self.config.speculative_config
token_budget = self.config.scheduler_config.max_num_batched_tokens
if (
    speculative_config is not None
    and getattr(speculative_config, "method", None) is not None
):
    token_budget = max(
        0,
        self.config.scheduler_config.max_num_batched_tokens
        - self.config.scheduler_config.max_num_seqs
        * (speculative_config.num_speculative_tokens + 1),
    )

Collaborator


I think this review needs to be revised.

Comment on lines +771 to +777
token_budget = (
    self.config.scheduler_config.max_num_batched_tokens
    - self.config.scheduler_config.max_num_seqs
    * (self.config.speculative_config.num_speculative_tokens + 1)
    if self.config.speculative_config is not None
    else 1
)

Copilot AI Apr 16, 2026


This logic affects schedule()'s batch token budget (especially when speculative decoding is enabled). The current unit tests cover the regular schedule flow but not the token_budget computation and upper-bound constraint when speculative_config.method is enabled and num_speculative_tokens > 0. Suggest adding unit tests against regressions (e.g. construct a manager with speculative_method='mtp' and verify that schedule neither exceeds the limit nor raises due to a wrong budget computation).
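A sketch of what such a regression test could look like. The budget_for helper below is a stand-in (the real logic lives inside ResourceManagerV1.schedule(), whose construction is not shown here), so treat this as the shape of the test rather than a drop-in file:

```python
from types import SimpleNamespace

def budget_for(spec_cfg, max_num_batched_tokens=2048, max_num_seqs=128):
    # Stand-in for the budget computation inside ResourceManagerV1.schedule().
    tokens_per_seq = 1
    if spec_cfg is not None and getattr(spec_cfg, "method", None) is not None:
        tokens_per_seq = getattr(spec_cfg, "num_speculative_tokens", 0) + 1
    return max(0, max_num_batched_tokens - max_num_seqs * tokens_per_seq)

def test_placeholder_config_does_not_crash():
    # A method=None placeholder (as in the repo's tests) must not raise AttributeError.
    assert budget_for(SimpleNamespace(method=None)) == 2048 - 128

def test_speculative_budget_stays_within_limit():
    cfg = SimpleNamespace(method="mtp", num_speculative_tokens=3)
    budget = budget_for(cfg)
    assert budget == 2048 - 128 * 4
    # Reserved decode tokens plus the remaining budget never exceed the limit.
    assert budget + 128 * 4 <= 2048

test_placeholder_config_does_not_crash()
test_speculative_budget_stays_within_limit()
print("ok")
```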

Copilot generated this review using guidance from repository custom instructions.

@gongshaotian (Collaborator)

Suggestion: simply assert so that budget < 0 raises an error.

Copilot AI review requested due to automatic review settings April 16, 2026 13:20
@freeliuzc freeliuzc requested review from Copilot and removed request for Copilot April 16, 2026 13:23
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.

Comment on lines +776 to +781
token_budget = (
    self.config.scheduler_config.max_num_batched_tokens
    - self.config.scheduler_config.max_num_seqs * tokens_per_seq
)
# temperatory solution to avoid negative token_budget
token_budget = max(token_budget, min(self.config.scheduler_config.max_num_batched_tokens, 512))

Copilot AI Apr 16, 2026


The token_budget computation was adjusted here (including tokens_per_seq for the speculative case and the handling of negative values / the lower bound), but the current test cases contain no assertions covering the token_budget computation and the scheduling upper bound with speculative decoding on or off. Suggest adding unit tests:

  • when speculative_config.method=None, schedule should neither access nor depend on num_speculative_tokens, and should not raise;
  • when speculative_config.method != None and num_speculative_tokens > 0, the real token count of a single scheduling step must not exceed max_num_batched_tokens (verify by constructing multiple running decode requests and counting actual step tokens, or by checking the budget deduction behavior).

Comment on lines +771 to +773
tokens_per_seq = (
    (self.config.speculative_config.num_speculative_tokens + 1)
    if self.config.speculative_config is not None

Copilot AI Apr 16, 2026


Checking self.config.speculative_config is not None and then directly accessing self.config.speculative_config.num_speculative_tokens has two definite problems:

  1. When speculative decoding is disabled, the SpeculativeConfig / placeholder object usually still exists (with method=None), so the non-speculative case wrongly sets tokens_per_seq to num_speculative_tokens + 1.
  2. In the repository's existing tests (e.g. tests/v1/test_resource_manager_v1.py, where speculative_cfg = SimpleNamespace(method=None)), that object does not even contain a num_speculative_tokens field, so this line raises an AttributeError.

Suggest reading num_speculative_tokens only when speculative decoding is actually enabled (method is non-empty, or enabled_speculative_decoding() is True), using getattr(speculative_config, "num_speculative_tokens", 0) or similar to avoid crashing on a missing field; when disabled, set tokens_per_seq = 1.
Suggested change
tokens_per_seq = (
    (self.config.speculative_config.num_speculative_tokens + 1)
    if self.config.speculative_config is not None
speculative_config = self.config.speculative_config
enabled_speculative_decoding = False
if speculative_config is not None:
    enabled_fn = getattr(speculative_config, "enabled_speculative_decoding", None)
    if callable(enabled_fn):
        enabled_speculative_decoding = enabled_fn()
    else:
        enabled_speculative_decoding = getattr(speculative_config, "method", None) is not None
tokens_per_seq = (
    getattr(speculative_config, "num_speculative_tokens", 0) + 1
    if enabled_speculative_decoding

Comment on lines +780 to +781
# temperatory solution to avoid negative token_budget
token_budget = max(token_budget, min(self.config.scheduler_config.max_num_batched_tokens, 512))

Copilot AI Apr 16, 2026


token_budget = max(token_budget, min(max_num_batched_tokens, 512)) forcibly raises the budget to at least 512 (or to max_num_batched_tokens when that is below 512). This can allow the scheduled prefill token count to exceed the safety margin computed above, reintroducing the risk of exceeding max_num_batched_tokens and OOM; 512 is also an unexplained magic number. A safer approach is to clamp token_budget to the valid range (e.g. token_budget = max(0, token_budget), ensuring it never exceeds max_num_batched_tokens), and if needed keep decode from being starved via a separate "reserved decode budget" mechanism rather than by raising the token_budget floor.

Suggested change
# temperatory solution to avoid negative token_budget
token_budget = max(token_budget, min(self.config.scheduler_config.max_num_batched_tokens, 512))
max_num_batched_tokens = self.config.scheduler_config.max_num_batched_tokens
# Clamp token_budget to the valid range to avoid negative values
# without reintroducing scheduling beyond the configured batch limit.
token_budget = max(0, min(token_budget, max_num_batched_tokens))

self.config.scheduler_config.max_num_batched_tokens
- self.config.scheduler_config.max_num_seqs * tokens_per_seq
)
# temperatory solution to avoid negative token_budget

Copilot AI Apr 16, 2026


Typo in the comment: temperatory should be temporary.

Suggested change
# temperatory solution to avoid negative token_budget
# temporary solution to avoid negative token_budget


@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 AI Code Review | 2026-04-16 21:29 CST

📋 Review Summary

PR overview: fixes the issue where, under speculative decoding, the actual step token count keeps exceeding max_batched_tokens, causing continuous GPU memory allocation until OOM.
Scope of change: scheduler token-budget computation
Affected tags: Scheduler, Speculative Decoding

📝 PR Convention Check

The Modifications and Usage or Command sections of the PR description are empty; please add a change description and reproduction/verification commands.

Issues

Level | File | Summary
🟡 Suggestion | resource_manager_v1.py:778 | The up-front deduction is overly conservative; consider precise per-request deduction of tokens_per_seq instead.
🟡 Suggestion | resource_manager_v1.py:781 | Hard-coded magic number 512; boundary cases may still not be fully protected from OOM.

Overall Assessment

The direction of the fix is correct: reserve room in the token budget for the speculative tokens of decoding requests. However, deducting for all max_num_seqs sequences at once is overly conservative and imprecise; a more accurate per-request deduction scheme is worth considering.

)
token_budget = (
    self.config.scheduler_config.max_num_batched_tokens
    - self.config.scheduler_config.max_num_seqs * tokens_per_seq


🟡 Suggestion: Pre-deducting max_num_seqs * tokens_per_seq assumes that all max_num_seqs sequences are decoding at the same time. In practice the number of decoding requests is usually far below max_num_seqs, so the prefill budget is over-compressed and throughput drops.

In addition, token_budget -= 1 at line 841 below was not updated to token_budget -= tokens_per_seq, so each decoding request is effectively charged tokens_per_seq (pre-deduction) plus 1 (loop deduction); this double deduction makes the budget accounting inconsistent.

Suggest a more precise scheme: skip the pre-deduction and change line 841's token_budget -= 1 to token_budget -= tokens_per_seq, which accurately reflects each decoding request's actual token consumption, fixing the OOM without wasting prefill budget:

# no pre-deduction; keep the original budget
token_budget = self.config.scheduler_config.max_num_batched_tokens

# line 841: deduct by the actual speculative tokens
token_budget -= tokens_per_seq  # replaces the original token_budget -= 1

- self.config.scheduler_config.max_num_seqs * tokens_per_seq
)
# temperatory solution to avoid negative token_budget
token_budget = max(token_budget, min(self.config.scheduler_config.max_num_batched_tokens, 512))


🟡 Suggestion: Two issues:

  1. The magic number 512 lacks justification: no comment explains why 512 was chosen for the lower bound min(max_num_batched_tokens, 512). In boundary cases where max_num_batched_tokens is small (e.g. 2048) and max_num_seqs * tokens_per_seq is large, this floor may still be too high, letting the actual total tokens (decoding + prefill) exceed max_num_batched_tokens, so the OOM is not fully fixed.

  2. Typo: temperatory → temporary.

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@d2d633b). Learn more about missing BASE report.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7438   +/-   ##
==========================================
  Coverage           ?   73.34%           
==========================================
  Files              ?      398           
  Lines              ?    54945           
  Branches           ?     8607           
==========================================
  Hits               ?    40299           
  Misses             ?    11952           
  Partials           ?     2694           
Flag Coverage Δ
GPU 73.34% <100.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

freeliuzc added a commit that referenced this pull request Apr 17, 2026
#7438) (#7439)

* fix max_num_batched_tokens error compute

* add temperatory solution

* fix bug
freeliuzc added a commit that referenced this pull request Apr 17, 2026
#7438) (#7440)

* fix max_num_batched_tokens error compute

* add temperatory solution

* fix bug
@freeliuzc freeliuzc merged commit 43685a9 into PaddlePaddle:develop Apr 17, 2026
54 of 58 checks passed


6 participants