
[BugFix] Fix real token exceeding max_batched_tokens limit#7438

Merged
freeliuzc merged 3 commits into PaddlePaddle:develop from freeliuzc:fix_max_num_batched_tokens_dev
Apr 17, 2026

Conversation

@freeliuzc (Collaborator) commented Apr 16, 2026

Motivation

The actual per-step token count during inference keeps exceeding max_batched_tokens, so Paddle keeps allocating new GPU memory until it hits OOM.
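To make the failure mode concrete, here is a back-of-the-envelope sketch (the numbers are illustrative assumptions, not taken from this deployment): with speculative decoding, each decode request can contribute up to num_speculative_tokens + 1 tokens per step, so a full decode batch alone can push a step well past the configured limit.

```python
# Illustrative numbers (assumptions, not from this PR).
max_num_batched_tokens = 2048
max_num_seqs = 128
num_speculative_tokens = 3

# Each decode request verifies its draft tokens plus one target token per step.
tokens_per_seq = num_speculative_tokens + 1
worst_case_decode_tokens = max_num_seqs * tokens_per_seq
print(worst_case_decode_tokens)  # 512 tokens from decode alone in one step

# If the scheduler also grants prefill the full max_num_batched_tokens budget,
# the real step token count exceeds the limit and buffers must grow each step.
print(worst_case_decode_tokens + max_num_batched_tokens)  # 2560 > 2048
```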


💡 If this PR is a cherry-pick, the PR title must follow the required format: add the [Cherry-Pick] label at the very beginning and append the original PR ID at the end, e.g. [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings April 16, 2026 11:48
@paddle-bot

paddle-bot bot commented Apr 16, 2026

Thanks for your contribution!

Contributor

Copilot AI left a comment


Pull request overview

This PR fixes the inaccurate computation of the "real token" budget against max_num_batched_tokens during scheduling with speculative decoding enabled or disabled, which could let the limit be exceeded.

Changes:

  • In ResourceManagerV1.schedule(), adjust the token_budget computation to reserve budget for the extra tokens produced by speculative decoding.
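A minimal sketch of the reserved-budget idea. The config attribute names mirror those in the diff discussed in this PR, but the SimpleNamespace wiring, the compute_token_budget helper, and the method check are illustrative assumptions, not the exact committed code:

```python
from types import SimpleNamespace

def compute_token_budget(config):
    """Reserve worst-case decode tokens so prefill cannot push the real
    step token count past max_num_batched_tokens (illustrative sketch)."""
    sched = config.scheduler_config
    spec = config.speculative_config
    tokens_per_seq = 1
    if spec is not None and getattr(spec, "method", None) is not None:
        tokens_per_seq = getattr(spec, "num_speculative_tokens", 0) + 1
    return max(0, sched.max_num_batched_tokens - sched.max_num_seqs * tokens_per_seq)

config = SimpleNamespace(
    scheduler_config=SimpleNamespace(max_num_batched_tokens=2048, max_num_seqs=128),
    speculative_config=SimpleNamespace(method="mtp", num_speculative_tokens=3),
)
print(compute_token_budget(config))  # 2048 - 128 * 4 = 1536
```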

Comment on lines +771 to +777
token_budget = (
    self.config.scheduler_config.max_num_batched_tokens
    - self.config.scheduler_config.max_num_seqs
    * (self.config.speculative_config.num_speculative_tokens + 1)
    if self.config.speculative_config is not None
    else 1
)

Copilot AI Apr 16, 2026


This ternary expression makes token_budget become 1 in the non-speculative case (instead of max_num_batched_tokens). Also, the current check only tests speculative_config != None before accessing num_speculative_tokens; when speculative_config exists but method=None (the configuration used in the repository's tests), this raises an AttributeError. Suggest an explicit if/else: adjust the budget based on num_speculative_tokens only when speculative_config.method is enabled, otherwise keep max_num_batched_tokens; also make sure the result cannot go negative (e.g. clamp to >= 0).

Suggested change
token_budget = (
    self.config.scheduler_config.max_num_batched_tokens
    - self.config.scheduler_config.max_num_seqs
    * (self.config.speculative_config.num_speculative_tokens + 1)
    if self.config.speculative_config is not None
    else 1
)
speculative_config = self.config.speculative_config
token_budget = self.config.scheduler_config.max_num_batched_tokens
if (
    speculative_config is not None
    and getattr(speculative_config, "method", None) is not None
):
    token_budget = max(
        0,
        self.config.scheduler_config.max_num_batched_tokens
        - self.config.scheduler_config.max_num_seqs
        * (speculative_config.num_speculative_tokens + 1),
    )

Collaborator


I think this review needs to be revised.

Comment on lines +771 to +777
token_budget = (
    self.config.scheduler_config.max_num_batched_tokens
    - self.config.scheduler_config.max_num_seqs
    * (self.config.speculative_config.num_speculative_tokens + 1)
    if self.config.speculative_config is not None
    else 1
)

Copilot AI Apr 16, 2026


This logic affects schedule()'s batch token budget (especially when speculative decoding is enabled). The current unit tests cover the regular schedule flow but not the token_budget computation and upper-bound constraint when speculative_config.method is enabled and num_speculative_tokens > 0. Suggest adding unit tests against regressions (e.g. construct a manager with speculative_method='mtp' and verify that schedule neither exceeds the limit nor raises due to a wrong budget computation).
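A sketch of what such a regression test could look like. The budget_for helper below is a stand-in (the real logic lives inside ResourceManagerV1.schedule(), whose construction is not shown here), so treat this as the shape of the test rather than a drop-in file:

```python
from types import SimpleNamespace

def budget_for(spec_cfg, max_num_batched_tokens=2048, max_num_seqs=128):
    # Stand-in for the budget computation inside ResourceManagerV1.schedule().
    tokens_per_seq = 1
    if spec_cfg is not None and getattr(spec_cfg, "method", None) is not None:
        tokens_per_seq = getattr(spec_cfg, "num_speculative_tokens", 0) + 1
    return max(0, max_num_batched_tokens - max_num_seqs * tokens_per_seq)

def test_placeholder_config_does_not_crash():
    # A method=None placeholder (as in the repo's tests) must not raise AttributeError.
    assert budget_for(SimpleNamespace(method=None)) == 2048 - 128

def test_speculative_budget_stays_within_limit():
    cfg = SimpleNamespace(method="mtp", num_speculative_tokens=3)
    budget = budget_for(cfg)
    assert budget == 2048 - 128 * 4
    # Reserved decode tokens plus the remaining budget never exceed the limit.
    assert budget + 128 * 4 <= 2048

test_placeholder_config_does_not_crash()
test_speculative_budget_stays_within_limit()
print("ok")
```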

Copilot generated this review using guidance from repository custom instructions.

@gongshaotian (Collaborator)

Suggestion: simply assert so that budget < 0 raises an error.

Copilot AI review requested due to automatic review settings April 16, 2026 13:20
@freeliuzc freeliuzc requested review from Copilot and removed request for Copilot April 16, 2026 13:23
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.

Comment on lines +776 to +781
token_budget = (
    self.config.scheduler_config.max_num_batched_tokens
    - self.config.scheduler_config.max_num_seqs * tokens_per_seq
)
# temperatory solution to avoid negative token_budget
token_budget = max(token_budget, min(self.config.scheduler_config.max_num_batched_tokens, 512))

Copilot AI Apr 16, 2026


The token_budget computation was adjusted here (including tokens_per_seq for the speculative case and the handling of negative values / the lower bound), but the current test cases contain no assertions covering the token_budget computation and the scheduling upper bound with speculative decoding on or off. Suggest adding unit tests:

  • when speculative_config.method=None, schedule should neither access nor depend on num_speculative_tokens, and should not raise;
  • when speculative_config.method != None and num_speculative_tokens > 0, the real token count of a single scheduling step must not exceed max_num_batched_tokens (verify by constructing multiple running decode requests and counting actual step tokens, or by checking the budget deduction behavior).

Comment on lines +771 to +773
tokens_per_seq = (
    (self.config.speculative_config.num_speculative_tokens + 1)
    if self.config.speculative_config is not None

Copilot AI Apr 16, 2026


Checking self.config.speculative_config is not None and then directly accessing self.config.speculative_config.num_speculative_tokens has two definite problems:

  1. When speculative decoding is disabled, the SpeculativeConfig / placeholder object usually still exists (with method=None), so the non-speculative case wrongly sets tokens_per_seq to num_speculative_tokens + 1.
  2. In the repository's existing tests (e.g. tests/v1/test_resource_manager_v1.py, where speculative_cfg = SimpleNamespace(method=None)), that object does not even contain a num_speculative_tokens field, so this line raises an AttributeError.

Suggest reading num_speculative_tokens only when speculative decoding is actually enabled (method is non-empty, or enabled_speculative_decoding() is True), using getattr(speculative_config, "num_speculative_tokens", 0) or similar to avoid crashing on a missing field; when disabled, set tokens_per_seq = 1.
Suggested change
tokens_per_seq = (
    (self.config.speculative_config.num_speculative_tokens + 1)
    if self.config.speculative_config is not None
speculative_config = self.config.speculative_config
enabled_speculative_decoding = False
if speculative_config is not None:
    enabled_fn = getattr(speculative_config, "enabled_speculative_decoding", None)
    if callable(enabled_fn):
        enabled_speculative_decoding = enabled_fn()
    else:
        enabled_speculative_decoding = getattr(speculative_config, "method", None) is not None
tokens_per_seq = (
    getattr(speculative_config, "num_speculative_tokens", 0) + 1
    if enabled_speculative_decoding

Comment on lines +780 to +781
# temperatory solution to avoid negative token_budget
token_budget = max(token_budget, min(self.config.scheduler_config.max_num_batched_tokens, 512))

Copilot AI Apr 16, 2026


token_budget = max(token_budget, min(max_num_batched_tokens, 512)) forcibly raises the budget to at least 512 (or to max_num_batched_tokens when that is below 512). This can allow the scheduled prefill token count to exceed the safety margin computed above, reintroducing the risk of exceeding max_num_batched_tokens and OOM; 512 is also an unexplained magic number. A safer approach is to clamp token_budget to the valid range (e.g. token_budget = max(0, token_budget), ensuring it never exceeds max_num_batched_tokens), and if needed keep decode from being starved via a separate "reserved decode budget" mechanism rather than by raising the token_budget floor.

Suggested change
# temperatory solution to avoid negative token_budget
token_budget = max(token_budget, min(self.config.scheduler_config.max_num_batched_tokens, 512))
max_num_batched_tokens = self.config.scheduler_config.max_num_batched_tokens
# Clamp token_budget to the valid range to avoid negative values
# without reintroducing scheduling beyond the configured batch limit.
token_budget = max(0, min(token_budget, max_num_batched_tokens))

self.config.scheduler_config.max_num_batched_tokens
- self.config.scheduler_config.max_num_seqs * tokens_per_seq
)
# temperatory solution to avoid negative token_budget

Copilot AI Apr 16, 2026


Typo in the comment: temperatory should be temporary.

Suggested change
# temperatory solution to avoid negative token_budget
# temporary solution to avoid negative token_budget


@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 AI Code Review | 2026-04-16 21:29 CST

📋 Review Summary

PR overview: fixes the issue where, under speculative decoding, the actual step token count keeps exceeding max_batched_tokens, causing continuous GPU memory allocation until OOM.
Scope of change: scheduler token-budget computation
Affected tags: Scheduler, Speculative Decoding

📝 PR Convention Check

The Modifications and Usage or Command sections of the PR description are empty; please add a change description and reproduction/verification commands.

Issues

Level | File | Summary
🟡 Suggestion | resource_manager_v1.py:778 | The up-front deduction is overly conservative; consider precise per-request deduction of tokens_per_seq instead.
🟡 Suggestion | resource_manager_v1.py:781 | Hard-coded magic number 512; boundary cases may still not be fully protected from OOM.

Overall Assessment

The direction of the fix is correct: reserve room in the token budget for the speculative tokens of decoding requests. However, deducting for all max_num_seqs sequences at once is overly conservative and imprecise; a more accurate per-request deduction scheme is worth considering.

)
token_budget = (
    self.config.scheduler_config.max_num_batched_tokens
    - self.config.scheduler_config.max_num_seqs * tokens_per_seq


🟡 Suggestion: Pre-deducting max_num_seqs * tokens_per_seq assumes that all max_num_seqs sequences are decoding at the same time. In practice the number of decoding requests is usually far below max_num_seqs, so the prefill budget is over-compressed and throughput drops.

In addition, token_budget -= 1 at line 841 below was not updated to token_budget -= tokens_per_seq, so each decoding request is effectively charged tokens_per_seq (pre-deduction) plus 1 (loop deduction); this double deduction makes the budget accounting inconsistent.

Suggest a more precise scheme: skip the pre-deduction and change line 841's token_budget -= 1 to token_budget -= tokens_per_seq, which accurately reflects each decoding request's actual token consumption, fixing the OOM without wasting prefill budget:

# no pre-deduction; keep the original budget
token_budget = self.config.scheduler_config.max_num_batched_tokens

# line 841: deduct by the actual speculative tokens
token_budget -= tokens_per_seq  # replaces the original token_budget -= 1

- self.config.scheduler_config.max_num_seqs * tokens_per_seq
)
# temperatory solution to avoid negative token_budget
token_budget = max(token_budget, min(self.config.scheduler_config.max_num_batched_tokens, 512))


🟡 Suggestion: Two issues:

  1. The magic number 512 lacks justification: no comment explains why 512 was chosen for the lower bound min(max_num_batched_tokens, 512). In boundary cases where max_num_batched_tokens is small (e.g. 2048) and max_num_seqs * tokens_per_seq is large, this floor may still be too high, letting the actual total tokens (decoding + prefill) exceed max_num_batched_tokens, so the OOM is not fully fixed.

  2. Typo: temperatory → temporary.

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@d2d633b). Learn more about missing BASE report.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7438   +/-   ##
==========================================
  Coverage           ?   73.34%           
==========================================
  Files              ?      398           
  Lines              ?    54945           
  Branches           ?     8607           
==========================================
  Hits               ?    40299           
  Misses             ?    11952           
  Partials           ?     2694           
Flag Coverage Δ
GPU 73.34% <100.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

freeliuzc added a commit that referenced this pull request Apr 17, 2026
#7438) (#7439)

* fix max_num_batched_tokens error compute

* add temperatory solution

* fix bug
freeliuzc added a commit that referenced this pull request Apr 17, 2026
#7438) (#7440)

* fix max_num_batched_tokens error compute

* add temperatory solution

* fix bug
@freeliuzc freeliuzc merged commit 43685a9 into PaddlePaddle:develop Apr 17, 2026
54 of 58 checks passed


6 participants