[Cherry-Pick][Speculative Decoding] Support mtp super ultra overlap in pd-split mode with insert_task overlap(#7323)#7441
Conversation
…de with insert_task overlap (PaddlePaddle#7323) * support mtp overlap in pd-split mode with insert_task overlap
Merged commit b440150 into PaddlePaddle:release/online/20260415
Pull request overview
This PR (cherry-picked from #7323) mainly adjusts how inputs are written when inserting tasks for speculative decoding in PD-split mode (especially the SUFFIX branch) to achieve deeper overlap (many writes switched to async_set_value), and adds the corresponding unit tests. It also improves async_expert_loader compatibility with cuda-python 13.x.
Changes:
- In `GPUModelRunner.insert_tasks_v1` and `MTPProposer.insert_tasks_v1`, switch many `share_inputs`/`model_inputs` writes to `async_set_value(..., blocking=False)` to reduce synchronization overhead and support overlap.
- In the PD-split decode + `SpecMethod.SUFFIX` branch, adjust how draft tokens are written and how `seq_lens_this_time_buffer` is set, and add a unit test covering this branch.
- Add an import-path fallback in `async_expert_loader` for cuda-python 13.x (cuda-bindings).
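The write pattern the changes above rely on can be sketched with a toy numpy stand-in. This is an assumed model of `async_set_value`'s behavior, not the actual FastDeploy implementation: the real helper enqueues a non-blocking host-to-device copy when `blocking=False`; the shape-alignment step shown here is the part the review comments below focus on.

```python
import numpy as np

def async_set_value(tgt, value, blocking=False):
    # Toy stand-in (assumed behavior, not FastDeploy's code): align the
    # source to the target slice's shape, then assign. The real helper
    # would enqueue a non-blocking device copy when blocking=False.
    src = np.asarray(value)
    if tuple(src.shape) != tuple(tgt.shape):
        src = src.reshape(tuple(tgt.shape))
    tgt[...] = src
    return tgt

# Writing one request's slot in a shared input buffer:
share_inputs = {"top_k_list": np.zeros(4, dtype=np.int64)}
async_set_value(share_inputs["top_k_list"][1:2], [40], blocking=False)
```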
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tests/worker/test_gpu_model_runner.py | Adds a unit test for `insert_tasks_v1` behavior in the PD-split decode + SUFFIX branch, using a synchronous mock in place of `async_set_value` |
| fastdeploy/worker/gpu_model_runner.py | Switches many `insert_tasks_v1` writes to `async_set_value`; adjusts `draft_tokens`/`seq_lens` writes in the SUFFIX branch; adds whole-block write logic for `stop_seqs` |
| fastdeploy/spec_decode/mtp.py | Switches MTP `insert_tasks_v1` writes to `async_set_value` and guards the `input_ids_cpu` write under `hybrid_mode` |
| fastdeploy/model_executor/xpu_pre_and_post_process.py | Adds an XPU-side `async_set_value` implementation, used by the XPU branch in mtp.py |
| fastdeploy/model_executor/pre_and_post_process.py | Makes `async_set_value` available across platforms (no longer CUDA-only) and fixes the list/ndarray type check |
| fastdeploy/eplb/async_expert_loader.py | Adds a cudart import fallback for cuda-python 13.x (`cuda.bindings.runtime`) and updates the missing-dependency hint |
    _cuda_ver = getattr(_cuda_pkg, "__version__", None)
    if _cuda_ver is None:
        # cuda-python >= 13.x 无顶层 __version__,通过 cuda-bindings 子包判断
The comment added here (about the cuda-python 13.x version check) is in Chinese, but the repository convention is English code comments. Please translate it to English so that non-Chinese readers can understand and maintain it.
Suggested change:

    - # cuda-python >= 13.x 无顶层 __version__,通过 cuda-bindings 子包判断
    + # cuda-python >= 13.x does not expose a top-level __version__; detect it via the cuda-bindings package
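The compatibility change can be sketched as a generic "first importable module" fallback. `import_first` is a hypothetical helper, not the PR's actual code; the `cuda.bindings.runtime` / `cuda.cudart` pair mirrors the cudart import paths discussed above, and the runnable example below falls back to a stdlib module since cuda-python may not be installed.

```python
import importlib

def import_first(*names):
    # Hypothetical helper: return the first module in `names` that imports,
    # mirroring the cudart fallback between cuda.bindings.runtime
    # (cuda-python >= 13.x) and the older cuda.cudart layout.
    last_error = None
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError as exc:
            last_error = exc
    raise ImportError(f"none of {names} could be imported") from last_error

# In the PR's setting this would be roughly:
#   cudart = import_first("cuda.bindings.runtime", "cuda.cudart")
json_mod = import_first("definitely_missing_module_xyz", "json")
```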
    self.share_inputs["top_p_normalized_logprobs"][idx : idx + 1] = request.get(
        "top_p_normalized_logprobs", False
    )
    self.share_inputs["top_k_list"][idx] = request.get("top_k", 0)
    async_set_value(self.share_inputs["eos_token_id"][:], request.eos_token_ids)
Here `async_set_value` is passed a 1-D list for `eos_token_id`, but the `eos_token_id` tensor is usually shaped `[eos_tokens_lens, 1]` (2-D). On the CUDA path, `async_set_value` takes the `custom_numpy_to_tensor` fast path and returns early, possibly bypassing the reshape, which can lead to a shape mismatch or a misaligned copy. Consider reshaping `eos_token_ids` to `(eos_tokens_lens, 1)` before the write, or making `async_set_value` align shapes on the fast path as well.
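A minimal numpy sketch of the fix suggested above, assuming the buffer shape described in the comment (`[eos_tokens_lens, 1]`); the token ids are illustrative values, not taken from the PR:

```python
import numpy as np

eos_token_ids = [2, 102]  # illustrative values, not from the PR
eos_buf = np.zeros((len(eos_token_ids), 1), dtype=np.int64)  # [eos_tokens_lens, 1]

# Reshape the flat list to the 2-D target before the (fast-path) copy:
src = np.asarray(eos_token_ids, dtype=np.int64).reshape(eos_buf.shape)
np.copyto(eos_buf, src)
```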
    # 每条 stop sequence pad 到 stop_seqs_max_len,凑齐空行后整块写入
    # 避免对第 3 维做部分切片(非连续内存)导致 async_set_value stride 错位
    stop_token_ids = request.get("stop_token_ids")
    max_len = self.model_config.stop_seqs_max_len
    padded = [seq + [-1] * (max_len - len(seq)) for seq in stop_token_ids]
    padded.extend([[-1] * max_len] * (self.model_config.max_stop_seqs_num - stop_seqs_num))
The `padded` value written here is a 2-D list (roughly `[max_stop_seqs_num, max_len]`), but the target slice is 3-D `[1, max_stop_seqs_num, max_len]`. If the CUDA fast path in `async_set_value` calls `custom_numpy_to_tensor` directly and skips the reshape, this can cause a shape mismatch or a wrong data layout. Consider wrapping `padded` in a batch dimension (e.g. `[padded]`), explicitly reshaping it to the target shape, or fixing the `async_set_value` fast path to reshape first.
Suggested change:

    - # 每条 stop sequence pad 到 stop_seqs_max_len,凑齐空行后整块写入
    - # 避免对第 3 维做部分切片(非连续内存)导致 async_set_value stride 错位
    + # Pad each stop sequence to stop_seqs_max_len, then write the full block.
    + # Keep the input shape aligned with the 3D target slice to avoid
    + # CUDA fast-path shape/stride mismatches in async_set_value.
      stop_token_ids = request.get("stop_token_ids")
      max_len = self.model_config.stop_seqs_max_len
      padded = [seq + [-1] * (max_len - len(seq)) for seq in stop_token_ids]
      padded.extend([[-1] * max_len] * (self.model_config.max_stop_seqs_num - stop_seqs_num))
    + padded = np.asarray(padded, dtype="int64").reshape(1, self.model_config.max_stop_seqs_num, max_len)
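The padding logic in the suggestion above can be exercised standalone with numpy. The config values here are illustrative stand-ins for `model_config.stop_seqs_max_len` and `model_config.max_stop_seqs_num`:

```python
import numpy as np

max_len = 4            # stands in for model_config.stop_seqs_max_len
max_stop_seqs_num = 3  # stands in for model_config.max_stop_seqs_num
stop_token_ids = [[11, 12], [13]]
stop_seqs_num = len(stop_token_ids)

# Pad each stop sequence to max_len, fill the remaining rows with -1,
# then reshape to the 3-D target slice shape [1, max_stop_seqs_num, max_len]:
padded = [seq + [-1] * (max_len - len(seq)) for seq in stop_token_ids]
padded.extend([[-1] * max_len] * (max_stop_seqs_num - stop_seqs_num))
block = np.asarray(padded, dtype=np.int64).reshape(1, max_stop_seqs_num, max_len)
```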
    # 每条 stop sequence pad 到 stop_seqs_max_len,凑齐空行后整块写入
    # 避免对第 3 维做部分切片(非连续内存)导致 async_set_value stride 错位
    stop_token_ids = request.get("stop_token_ids")
    max_len = self.model_config.stop_seqs_max_len
    padded = [seq + [-1] * (max_len - len(seq)) for seq in stop_token_ids]
    padded.extend([[-1] * max_len] * (self.model_config.max_stop_seqs_num - stop_seqs_num))
    async_set_value(self.share_inputs["stop_seqs"][idx : idx + 1, :, :], padded)
The code comments added here are in Chinese, but the repository convention is English code comments, to avoid hindering cross-team collaboration and future maintenance. Please translate them to English and keep the terminology (stop sequence / pad / stride) consistent.
    if current_platform.is_cuda():
        if str(src.dtype) != dtype_str:
            srt_tensor = paddle.empty(tgt.shape, dtype=str(src.dtype))
            src = custom_numpy_to_tensor(src, srt_tensor)
        else:
            return custom_numpy_to_tensor(src, tgt)
On CUDA with a matching dtype, `async_set_value` directly returns `custom_numpy_to_tensor(src, tgt)`, which bypasses the subsequent shape/reshape logic. For the common pattern where `tgt` is 2-D/3-D and `src` is 1-D/2-D (e.g. `eos_token_id`, `stop_seqs`), this can cause a shape mismatch or a misaligned copy. Consider ensuring `src` is reshaped to `tgt.shape` before entering `custom_numpy_to_tensor`, or removing the early return so everything goes through the reshape + `copy_` path.
Suggested change:

    - if current_platform.is_cuda():
    -     if str(src.dtype) != dtype_str:
    -         srt_tensor = paddle.empty(tgt.shape, dtype=str(src.dtype))
    -         src = custom_numpy_to_tensor(src, srt_tensor)
    -     else:
    -         return custom_numpy_to_tensor(src, tgt)
    + target_shape = tuple(tgt.shape)
    + if src.shape != target_shape:
    +     src = np.reshape(src, target_shape)
    + if current_platform.is_cuda():
    +     tensor_dtype = str(src.dtype) if str(src.dtype) != dtype_str else tgt.dtype
    +     src_tensor = paddle.empty(target_shape, dtype=tensor_dtype)
    +     src = custom_numpy_to_tensor(src, src_tensor)
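The risk described in this comment can be reproduced with plain numpy: a flat, memcpy-style write that ignores the view's strides lands data in cells outside the slice, while a shape-aligned strided assignment does not. Buffer sizes here are illustrative, and the "raw fast-path copy" is simulated on the host rather than via an actual device memcpy:

```python
import numpy as np

# A non-contiguous view: slicing the last dimension, as with stop_seqs.
buf = np.full((2, 3), -1, dtype=np.int64)
view = buf[:, 0:2]
src = np.array([1, 2, 3, 4], dtype=np.int64)  # payload meant for the view

# Simulate a raw fast-path copy from the view's base pointer: it ignores
# strides, so element 3 leaks into the column excluded from the view.
buf.reshape(-1)[: src.size] = src
# buf is now [[1, 2, 3], [4, -1, -1]]; the view reads back [[1, 2], [4, -1]]

# The safe path: reshape the source to the view's shape first.
buf2 = np.full((2, 3), -1, dtype=np.int64)
buf2[:, 0:2] = src.reshape(2, 2)
# buf2 is [[1, 2, -1], [3, 4, -1]]; the slice holds [[1, 2], [3, 4]]
```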
…de with insert_task overlap (#7323)
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- Use one of the PR title tags: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- For a `release` branch, make sure the PR has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.