
[RL] Adapt async IPC #7472

Open
tianhaodongbd wants to merge 1 commit into PaddlePaddle:develop from tianhaodongbd:async_ipc

Conversation

@tianhaodongbd
Contributor

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code and run pre-commit before committing.
  • Add unit tests. If no unit tests are added, please explain the reason in this PR.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot bot commented Apr 17, 2026

Thanks for your contribution!


@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 AI Code Review | 2026-04-17 19:40 CST

📋 Review Summary

PR overview: adapts the RL module to an async IPC weight-transfer scheme, adding a ZMQ-based receive_all_buckets function and an _update_ipc_async method to replace the previous synchronous IPC strategy.
Files changed: fastdeploy/rl/dynamic_weight_manager.py, fastdeploy/model_executor/model_loader/default_loader_v1.py
Impact tags: RL, Loader

📝 PR Convention Check

The Motivation, Modifications, and Usage or Command sections of the PR description are all empty; please describe the purpose of the change and how to use it.

Description template (copy-paste ready):

## Motivation
Adapt to the async IPC weight-transfer scheme, replacing the previous synchronous IPC strategy to improve weight-loading performance in RL scenarios.

## Modifications
1. Add a `receive_all_buckets` function that asynchronously receives trainer-side weights via ZMQ + CUDA IPC
2. Add an `_update_ipc_async` method to replace the previous `_update_ipc`
3. Move the `process_final_after_loading` call from `default_loader_v1.py` into `_update_ipc_async`
4. Temporarily disable parameter verification in `finalize_update`
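The actual ZMQ + CUDA IPC transfer in item 1 is GPU-specific, but the control flow of a bucket-receive loop can be sketched with stdlib stand-ins. Everything below is a hypothetical illustration: `END_OF_TRANSFER`, the `(name, payload)` message layout, and the queue standing in for a ZMQ socket are assumptions, not the PR's actual protocol.

```python
import pickle
import queue

END_OF_TRANSFER = b"__done__"  # hypothetical sentinel marking the last bucket

def receive_all_buckets_sim(sock):
    """Drain (name, payload) bucket messages until the sentinel arrives.

    `sock` is any object with a blocking .get() -- a stand-in for a ZMQ
    socket's recv(); the real code would rebuild CUDA IPC tensors here
    instead of storing the raw payload.
    """
    state = {}
    while True:
        msg = sock.get()
        if msg == END_OF_TRANSFER:
            break
        name, payload = pickle.loads(msg)
        state[name] = payload
    return state

# Simulate the trainer side pushing two buckets, then the sentinel.
q = queue.Queue()
q.put(pickle.dumps(("layer.0.weight", [1.0, 2.0])))
q.put(pickle.dumps(("layer.0.bias", [0.5])))
q.put(END_OF_TRANSFER)
print(receive_all_buckets_sim(q))
# {'layer.0.weight': [1.0, 2.0], 'layer.0.bias': [0.5]}
```

The key design point is the explicit end-of-transfer sentinel: the receiver blocks until the trainer signals that every bucket has been sent, so partial state dicts are never handed to the loader.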

Issues

| Level | File | Summary |
| --- | --- | --- |
| 🔴 Bug | dynamic_weight_manager.py:108 | Leftover debug print; dumping full tensor contents will flood production logs and hurt performance |
| 🔴 Bug | dynamic_weight_manager.py:278 | The "ipc" strategy in reload_model_weights still maps to the old _update_ipc, inconsistent with this change |
| 🟡 Suggestion | dynamic_weight_manager.py:527 | _verify_parameters("update") is commented out, removing an important weight-integrity check |
| 🟡 Suggestion | dynamic_weight_manager.py:91 | The target_device parameter is unused; buffer_meta[6] is always read from an environment variable |

Overall Assessment

The core async IPC implementation is clear, but the leftover debug code and the inconsistent reload_model_weights path must be fixed before merging.

target_dtype = _DTYPE_MAP[dtype_key]
param = param_bytes.view(target_dtype).reshape(shape)
# Clone to own memory so we can release the IPC flat buffer
print(">>>>>>>>>> ori param_key", name, param, param._md5sum())

🔴 Bug Leftover debug print statement; it dumps each parameter's full tensor contents and md5 to stdout.

In production a model may have hundreds of parameters, which leads to:

  1. Explosive log growth (tensor contents can be very large)
  2. Extra performance overhead, since _md5sum() must traverse the whole tensor to compute a checksum

Recommendation: remove it, or replace it with logger.debug printing only name/shape/dtype:

logger.debug(f"Received param: {name}, shape={param.shape}, dtype={param.dtype}")

"ipc_snapshot": self._update_ipc_snapshot,
"ipc": self._update_ipc,
# "ipc": self._update_ipc,
"ipc": self._update_ipc_async,

🔴 Bug The "ipc" strategy in update_parameters now maps to _update_ipc_async, but in the reload_model_weights method (lines 327-332) "ipc" still maps to the old _update_ipc.

When a weight reload is triggered through the reload_model_weights path, it follows the old synchronous IPC logic and never calls process_final_after_loading, producing inconsistent behavior.

Suggested fix: update the strategy mapping in reload_model_weights to match:

def reload_model_weights(self):
    if not self.first_load:
        ...
        strategy_handlers = {
            "ipc_snapshot": self._update_ipc_snapshot,
            "ipc": self._update_ipc_async,
        }

def finalize_update(self, pid: int = 0):
    """Finalize update process with verification."""
    self._verify_parameters("update")
    # self._verify_parameters("update")

🟡 Suggestion _verify_parameters("update") is commented out, removing the post-load integrity check (which verifies that every parameter has been properly initialized).

If the check fails because _update_ipc_async uses model.load_weights rather than _update_model_from_state, leaving self.state_dict out of sync, consider:

  1. Syncing self.state_dict after _update_ipc_async finishes
  2. Or adding a comment explaining why the check is disabled and when it will be restored

Simply removing the check may mask incomplete weight loads.
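Option 1 above can be sketched as follows. This is a minimal illustration with hypothetical names (`WeightManagerSketch`, `update_ipc_async_sim`); the real class is DynamicWeightManager and its bookkeeping may differ.

```python
class WeightManagerSketch:
    """Sketch of the suggested fix: mirror async-loaded weights into
    self.state_dict so the integrity check can stay enabled."""

    def __init__(self, expected_params):
        # Bookkeeping dict: None marks a parameter not yet loaded.
        self.state_dict = {name: None for name in expected_params}

    def update_ipc_async_sim(self, received):
        # After the equivalent of model.load_weights(...), sync the
        # bookkeeping dict instead of leaving it stale.
        self.state_dict.update(received)

    def verify_parameters(self, op):
        # Fails loudly if any expected parameter was never received.
        missing = [k for k, v in self.state_dict.items() if v is None]
        if missing:
            raise RuntimeError(f"{op}: uninitialized parameters: {missing}")

mgr = WeightManagerSketch(["w", "b"])
mgr.update_ipc_async_sim({"w": [1.0], "b": [0.0]})
mgr.verify_parameters("update")  # passes once state_dict is synced
```

With the sync in place, finalize_update can keep calling the verification instead of commenting it out.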


# --- Rebuild flat uint8 buffer from CUDA IPC handle ---
buffer_meta[0] = buffer_meta[0].encode("latin-1")
buffer_meta[6] = int(os.getenv("FLAGS_selected_gpus", "0"))

🟡 Suggestion The function signature accepts a target_device parameter, but buffer_meta[6] is always read from the FLAGS_selected_gpus environment variable, so target_device is never used.

Suggestions:

  • If target_device will be used later, replace the hard-coded read:
    buffer_meta[6] = target_device if target_device is not None else int(os.getenv("FLAGS_selected_gpus", "0"))
  • If the parameter is not needed, remove it from the signature to avoid confusion.
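The first suggestion amounts to an explicit-argument-over-environment fallback. A minimal sketch, with a hypothetical helper name:

```python
import os

def resolve_ipc_device(target_device=None):
    """Prefer the explicit target_device argument; fall back to the
    FLAGS_selected_gpus environment variable (default: GPU 0)."""
    if target_device is not None:
        return int(target_device)
    return int(os.getenv("FLAGS_selected_gpus", "0"))

print(resolve_ipc_device(3))  # 3
os.environ.pop("FLAGS_selected_gpus", None)
print(resolve_ipc_device())   # 0
```

Pushing the fallback into one helper keeps the precedence rule in a single place, so callers of receive_all_buckets cannot silently disagree about which device the IPC buffer is mapped onto.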

@codecov-commenter

Codecov Report

❌ Patch coverage is 15.00000% with 51 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@a729e0f). Learn more about missing BASE report.

Files with missing lines Patch % Lines
fastdeploy/rl/dynamic_weight_manager.py 15.00% 51 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7472   +/-   ##
==========================================
  Coverage           ?   73.77%           
==========================================
  Files              ?      398           
  Lines              ?    55034           
  Branches           ?     8615           
==========================================
  Hits               ?    40601           
  Misses             ?    11708           
  Partials           ?     2725           
Flag Coverage Δ
GPU 73.77% <15.00%> (?)

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.