[Speculative Decoding][BugFix] Fix apply repeat times penalty kernel and change spec default verify strategy#7467
Conversation
|
Thanks for your contribution! |
There was a problem hiding this comment.
Pull request overview
该 PR 聚焦于投机解码(Speculative Decoding)链路中的两个点:修复 repeat-times penalty 相关 CUDA kernel 在重构后可能出现的越界访问,并将 speculative decoding 的默认验证策略调整为更偏向 target_match 的行为。
Changes:
- 修复 speculate_get_token_penalty_multi_scores 中 repeat times 统计的循环边界,避免 prompt 偏移后访问越界。
- 更新对应的 Python reference 测试逻辑,使其按
cur_len限制遍历范围。 - 将
SpeculativeConfig.verify_strategy默认值从topp调整为target_match。
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/operators/test_speculate_get_token_penalty_multi_scores.py | 调整 reference 侧 repeat times 统计范围以匹配新的 cur_len 上界假设(但当前仍未覆盖 prompt 偏移场景)。 |
| fastdeploy/config.py | 修改 speculative decoding 的默认 verify strategy 为 target_match。 |
| custom_ops/gpu_ops/speculate_decoding/speculate_get_token_penalty_multi_scores.cu | 修复 CUDA kernel 中 repeat times 统计的 loop 上界,降低越界风险并减少无效遍历。 |
| "benchmark_mode": False, | ||
| "enf_gen_phase_tag": False, | ||
| "enable_draft_logprob": False, | ||
| "verify_strategy": "topp", | ||
| "verify_strategy": "target_match", | ||
| "accept_policy": "normal", |
There was a problem hiding this comment.
这里把 speculative decoding 的默认 verify_strategy 从 "topp" 改成了 "target_match",属于用户可见行为变更;目前文档仍写的是 topp (default),会造成配置说明与实际默认值不一致。建议同步更新 docs/features/speculative_decoding.md 与 docs/zh/features/speculative_decoding.md 中关于默认策略的描述,或在文档中明确默认值已调整。
| const int64_t* pre_ids_now = token_ids_all + bi * length_id + prompt_lens[bi]; | ||
| int* repeat_times_now = repeat_times + token_idx * length; | ||
| for (int i = tid; i < cur_len[bi]; i += blockDim.x) { | ||
| int64_t id = pre_ids_now[i]; | ||
| if (id < 0) break; | ||
| atomicAdd(&repeat_times_now[id], 1); |
There was a problem hiding this comment.
update_repeat_times 的循环上界从 length_id 改为 cur_len[bi] 解决了 prompt_lens 偏移后可能越界的问题,但当前单测 tests/operators/test_speculate_get_token_penalty_multi_scores.py 里 prompt_lens 全为 0,无法覆盖这次修复的关键路径。建议补充一个 prompt_lens 非 0(且接近 length_id 边界)的用例,确保不会再出现 bi*length_id+prompt_lens+cur_len 越界访问。
| token_ids_all_now = token_ids_all[bi] | ||
| repeat_times_now = repeat_times[token_idx] | ||
|
|
||
| for i in range(length_id): | ||
| for i in range(cur_len[bi]): | ||
| id = token_ids_all_now[i] | ||
| if id < 0: |
There was a problem hiding this comment.
测试里的 reference 实现 update_repeat_times 目前完全不考虑 prompt_lens(而 CUDA kernel 会用 prompt_lens 做偏移)。这会导致一旦补充 prompt_lens 非 0 的用例时,reference 结果与 kernel 行为不一致,从而掩盖/误报问题。建议让 reference 也按 prompt_lens 偏移读取 token_ids_all,再结合 cur_len 统计重复次数。
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## develop #7467 +/- ##
==========================================
Coverage ? 74.17%
==========================================
Files ? 398
Lines ? 54987
Branches ? 8616
==========================================
Hits ? 40786
Misses ? 11468
Partials ? 2733
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
10262a7
PaddlePaddle-bot
left a comment
There was a problem hiding this comment.
🤖 AI Code Review |
2026-04-17 22:19 CST
📋 Review 摘要
PR 概述:修复投机解码 repeat penalty kernel 的越界访问 bug,并将默认验证策略从 topp 改为 target_match。
变更范围:custom_ops/gpu_ops/speculate_decoding/、fastdeploy/config.py、测试文件
影响面 Tag:Speculative Decoding OP FDConfig
📝 PR 规范检查
Accuracy Tests 章节为空。此 PR 修复了 CUDA kernel 的越界访问 bug,建议补充精度对比测试结果以证明修复正确性。
问题
| 级别 | 文件 | 概述 |
|---|---|---|
| 🟡 建议 | tests/layers/test_speculative_sampler.py |
缺少 target_match 验证策略的测试用例 |
总体评价
核心 bug 修复正确且重要:update_repeat_times kernel 中 pre_ids_now 已通过 prompt_lens[bi] 做了偏移,之前以 length_id(整行 buffer 长度)为循环上界会导致读取超出当前 batch item 行边界的内存,改为 cur_len[bi] 后仅遍历实际有效的 generated token,修复了越界问题。Python 参考实现和单元测试已同步修改。默认策略变更合理,测试也做了相应适配。
|
|
||
| # Use ngram method for speculative decoding | ||
| fd_config = _create_fd_config(max_model_len, method="ngram") | ||
| fd_config = _create_fd_config(max_model_len, method="ngram", verify_strategy="topp") |
There was a problem hiding this comment.
🟡 建议 缺少 target_match 验证策略的测试覆盖
当前所有 speculative sampler 测试用例都显式指定了 verify_strategy="topp",但本 PR 已将默认策略改为 target_match。建议至少新增一个使用 verify_strategy="target_match" 的测试用例(或直接使用默认值不传参),以确保新默认策略路径也有测试覆盖。
Motivation
Modifications
Usage or Command
None
Accuracy Tests
Checklist
[FDConfig],[APIServer],[Engine],[Scheduler],[PD Disaggregation],[Executor],[Graph Optimization],[Speculative Decoding],[RL],[Models],[Quantization],[Loader],[OP],[KVCache],[DataProcessor],[BugFix],[Docs],[CI],[Optimization],[Feature],[Benchmark],[Others],[XPU],[HPU],[GCU],[DCU],[Iluvatar],[Metax]]pre-commitbefore commit.releasebranch, make sure the PR has been submitted to thedevelopbranch, then cherry-pick it to thereleasebranch with the[Cherry-Pick]PR tag.