[Cherry-Pick][RL] MoE BF16 EP support paddle batched_gemm (#7337) (#7339)

ckl117 merged 3 commits into PaddlePaddle:release/2.6
Conversation

Thanks for your contribution!
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@ Coverage Diff @@
## release/2.6 #7339 +/- ##
==============================================
  Coverage        ?   73.84%
==============================================
  Files           ?      376
  Lines           ?    52964
  Branches        ?     8269
==============================================
  Hits            ?    39112
  Misses          ?    11116
  Partials        ?     2736
```

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
fastdeploy-bot left a comment

🤖 AI Code Review
2026-04-11 19:35 CST

📋 Review Summary
PR overview: adds `paddle.batched_gemm` support to the MoE BF16 EP path, replacing the previous `compute_ffn` call, and selects between the `fused_swiglu_scale` and `swiglu` activation functions based on the `FD_MOE_PROB_IN_ADVANCE` environment variable.
Scope of changes: model_executor/layers/moe/fused_moe_cutlass_backend.py, tests/layers/test_fused_moe_cutlass_backend.py
Impact tags: [RL] [OP]
📝 PR Convention Check
Title: format is correct and includes the [RL] and [Cherry-Pick] tags.
Description: the Motivation and Modifications sections are both empty; the change is not explained.
Suggested description (copy-paste ready):
## Motivation
On the MoE EP prefill path, when `FD_USE_PHI_MOE_PERMUTE=True` and w16a16 quantization is in use, replace the previous `compute_ffn` call with `paddle.batched_gemm` to improve performance. Also support the `FD_MOE_PROB_IN_ADVANCE` environment variable to control whether probability weighting is applied before the activation function.
## Modifications
1. In the `moe_permute`/`moe_unpermute` path of the `apply_ep_prefill` method:
   - Replace the `self.compute_ffn` call with two calls to `paddle.incubate.nn.functional.batched_gemm`
   - Select the activation based on the `FD_MOE_PROB_IN_ADVANCE` environment variable:
     - True: use `paddlefleet_ops.fused_swiglu_scale` (fused activation and scaling)
     - False: use `paddle.incubate.nn.functional.swiglu` (activation only)
   - Adjust the `using_weighted_combine` argument logic of `moe_unpermute`
2. Test file:
   - Add an `align` helper function for aligning token counts
   - Adjust the weight shape definitions in `RealMoELayer` to fit batched_gemm
   - Add support for `expert_alignment` in the stub dispatch
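The two-batched-GEMM-plus-SwiGLU flow described above can be sketched with a NumPy stand-in. This is an illustration only: the real code uses `paddle.incubate.nn.functional.batched_gemm` and `paddlefleet_ops.fused_swiglu_scale`, and the shapes, function names, and the exact point where `FD_MOE_PROB_IN_ADVANCE` applies the probability weighting are assumptions based on this review summary.

```python
# Hypothetical NumPy sketch of the new MoE FFN path: two batched GEMMs
# with a SwiGLU activation in between. `prob_in_advance` mirrors the
# FD_MOE_PROB_IN_ADVANCE switch; when True, the probability weighting is
# applied with the activation (what fused_swiglu_scale fuses), otherwise
# it is left to the weighted combine in moe_unpermute.
import numpy as np


def swiglu(x):
    # SwiGLU splits the last dim in half and computes silu(a) * b.
    a, b = np.split(x, 2, axis=-1)
    return (a / (1.0 + np.exp(-a))) * b


def moe_ffn_batched(tokens, w1, w2, probs, prob_in_advance):
    """tokens: [E, T, H]; w1: [E, H, 2*I]; w2: [E, I, H]; probs: [E, T, 1]."""
    h = np.matmul(tokens, w1)   # first batched GEMM (up/gate projection)
    h = swiglu(h)               # activation
    if prob_in_advance:
        h = h * probs           # weighting fused into the activation step
    return np.matmul(h, w2)     # second batched GEMM (down projection)
```

With `probs` all ones, both branches produce the same output, which is a cheap sanity check for the switch.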
| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | fused_moe_cutlass_backend.py:176 | paddlefleet_ops may be None; accessing an attribute on it directly raises AttributeError |
Overall assessment
The overall approach is sound: batched_gemm replaces the previous FFN compute path. However, there is a runtime-error risk: when paddlefleet.ops is unavailable, calling paddlefleet_ops.fused_swiglu_scale directly raises an exception, so an availability check is needed.
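The `align` helper that the review summary says was added to the test file can be sketched as follows. The name matches the summary, but the body is an assumption: rounding a token count up to the next multiple of `expert_alignment` is the usual way to satisfy a kernel's per-expert alignment requirement.

```python
# Hypothetical sketch of the test-side `align` helper: round a token
# count up to the next multiple of expert_alignment so every expert's
# slice meets the kernel's alignment requirement. Behavior is assumed
# from the review summary, not copied from the actual test code.
def align(num_tokens: int, expert_alignment: int) -> int:
    if expert_alignment <= 1:
        return num_tokens
    return ((num_tokens + expert_alignment - 1) // expert_alignment) * expert_alignment
```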
```python
            recv_num_tokens_per_expert_list,
        )
        if fastdeploy.envs.FD_MOE_PROB_IN_ADVANCE:
            out = paddlefleet_ops.fused_swiglu_scale(out, dst_weights)
```
🔴 Bug: when paddlefleet.ops is unavailable, this line raises AttributeError. paddlefleet_ops is imported via try_import, which returns None on failure, so accessing an attribute on it directly is a runtime error.
Suggested fix:

```python
if fastdeploy.envs.FD_MOE_PROB_IN_ADVANCE:
    if paddlefleet_ops is not None and hasattr(paddlefleet_ops, "fused_swiglu_scale"):
        out = paddlefleet_ops.fused_swiglu_scale(out, dst_weights)
    else:
        # NOTE: this fallback skips the dst_weights scaling; the weighting
        # must then be applied downstream (e.g. via moe_unpermute's
        # using_weighted_combine path) to keep results correct.
        out = paddle.incubate.nn.functional.swiglu(out)
else:
    out = paddle.incubate.nn.functional.swiglu(out)
```
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
Available tags: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]. Run pre-commit before commit. For a release branch, make sure the PR has first been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.