
[RL] Adapt async IPC #7472

Open
tianhaodongbd wants to merge 1 commit into PaddlePaddle:develop from tianhaodongbd:async_ipc

Conversation

@tianhaodongbd
Contributor

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code and run pre-commit before committing.
  • Add unit tests. If no unit tests are added, please explain the reason in this PR.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot bot commented Apr 17, 2026

Thanks for your contribution!


@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 AI Code Review | 2026-04-17 19:40 CST

📋 Review Summary

PR overview: adapts the RL module to an async IPC weight-transfer scheme, adding a ZMQ-based receive_all_buckets function and an _update_ipc_async method to replace the previous synchronous IPC strategy.
Files changed: fastdeploy/rl/dynamic_weight_manager.py, fastdeploy/model_executor/model_loader/default_loader_v1.py
Impact tags: RL, Loader

📝 PR Convention Check

The Motivation, Modifications, and Usage or Command sections of the PR description are all empty; please describe the purpose of the change and how to use it.

Description template (copy-paste ready):

## Motivation
Adapt to the async IPC weight-transfer scheme, replacing the previous synchronous IPC strategy to improve weight-loading performance in RL scenarios.

## Modifications
1. Add a `receive_all_buckets` function that asynchronously receives trainer-side weights via ZMQ + CUDA IPC
2. Add an `_update_ipc_async` method to replace the previous `_update_ipc`
3. Move the `process_final_after_loading` call from `default_loader_v1.py` into `_update_ipc_async`
4. Temporarily disable parameter verification in `finalize_update`
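The actual ZMQ + CUDA IPC transfer in item 1 is GPU-specific, but the control flow of a bucket-receive loop can be sketched with stdlib stand-ins. Everything below is a hypothetical illustration: `END_OF_TRANSFER`, the `(name, payload)` message layout, and the queue standing in for a ZMQ socket are assumptions, not the PR's actual protocol.

```python
import pickle
import queue

END_OF_TRANSFER = b"__done__"  # hypothetical sentinel marking the last bucket

def receive_all_buckets_sim(sock):
    """Drain (name, payload) bucket messages until the sentinel arrives.

    `sock` is any object with a blocking .get() -- a stand-in for a ZMQ
    socket's recv(); the real code would rebuild CUDA IPC tensors here
    instead of storing the raw payload.
    """
    state = {}
    while True:
        msg = sock.get()
        if msg == END_OF_TRANSFER:
            break
        name, payload = pickle.loads(msg)
        state[name] = payload
    return state

# Simulate the trainer side pushing two buckets, then the sentinel.
q = queue.Queue()
q.put(pickle.dumps(("layer.0.weight", [1.0, 2.0])))
q.put(pickle.dumps(("layer.0.bias", [0.5])))
q.put(END_OF_TRANSFER)
print(receive_all_buckets_sim(q))
# {'layer.0.weight': [1.0, 2.0], 'layer.0.bias': [0.5]}
```

The key design point is the explicit end-of-transfer sentinel: the receiver blocks until the trainer signals that every bucket has been sent, so partial state dicts are never handed to the loader.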

Issues

| Level | File | Summary |
| --- | --- | --- |
| 🔴 Bug | dynamic_weight_manager.py:108 | Leftover debug print; dumping full tensor contents will flood production logs and hurt performance |
| 🔴 Bug | dynamic_weight_manager.py:278 | The "ipc" strategy in reload_model_weights still maps to the old _update_ipc, inconsistent with this change |
| 🟡 Suggestion | dynamic_weight_manager.py:527 | _verify_parameters("update") is commented out, removing an important weight-integrity check |
| 🟡 Suggestion | dynamic_weight_manager.py:91 | The target_device parameter is unused; buffer_meta[6] is always read from an environment variable |

Overall Assessment

The core async IPC implementation is clear, but the leftover debug code and the inconsistent reload_model_weights path must be fixed before merging.

target_dtype = _DTYPE_MAP[dtype_key]
param = param_bytes.view(target_dtype).reshape(shape)
# Clone to own memory so we can release the IPC flat buffer
print(">>>>>>>>>> ori param_key", name, param, param._md5sum())

🔴 Bug Leftover debug print statement; it dumps each parameter's full tensor contents and md5 to stdout.

In production a model may have hundreds of parameters, which leads to:

  1. Explosive log growth (tensor contents can be very large)
  2. Extra performance overhead, since _md5sum() must traverse the whole tensor to compute a checksum

Recommendation: remove it, or replace it with logger.debug printing only name/shape/dtype:

logger.debug(f"Received param: {name}, shape={param.shape}, dtype={param.dtype}")

"ipc_snapshot": self._update_ipc_snapshot,
"ipc": self._update_ipc,
# "ipc": self._update_ipc,
"ipc": self._update_ipc_async,

🔴 Bug The "ipc" strategy in update_parameters now maps to _update_ipc_async, but in the reload_model_weights method (lines 327-332) "ipc" still maps to the old _update_ipc.

When a weight reload is triggered through the reload_model_weights path, it follows the old synchronous IPC logic and never calls process_final_after_loading, producing inconsistent behavior.

Suggested fix: update the strategy mapping in reload_model_weights to match:

def reload_model_weights(self):
    if not self.first_load:
        ...
        strategy_handlers = {
            "ipc_snapshot": self._update_ipc_snapshot,
            "ipc": self._update_ipc_async,
        }

def finalize_update(self, pid: int = 0):
    """Finalize update process with verification."""
    self._verify_parameters("update")
    # self._verify_parameters("update")

🟡 Suggestion _verify_parameters("update") is commented out, removing the post-load integrity check (which verifies that every parameter has been properly initialized).

If the check fails because _update_ipc_async uses model.load_weights rather than _update_model_from_state, leaving self.state_dict out of sync, consider:

  1. Syncing self.state_dict after _update_ipc_async finishes
  2. Or adding a comment explaining why the check is disabled and when it will be restored

Simply removing the check may mask incomplete weight loads.
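Option 1 above can be sketched as follows. This is a minimal illustration with hypothetical names (`WeightManagerSketch`, `update_ipc_async_sim`); the real class is DynamicWeightManager and its bookkeeping may differ.

```python
class WeightManagerSketch:
    """Sketch of the suggested fix: mirror async-loaded weights into
    self.state_dict so the integrity check can stay enabled."""

    def __init__(self, expected_params):
        # Bookkeeping dict: None marks a parameter not yet loaded.
        self.state_dict = {name: None for name in expected_params}

    def update_ipc_async_sim(self, received):
        # After the equivalent of model.load_weights(...), sync the
        # bookkeeping dict instead of leaving it stale.
        self.state_dict.update(received)

    def verify_parameters(self, op):
        # Fails loudly if any expected parameter was never received.
        missing = [k for k, v in self.state_dict.items() if v is None]
        if missing:
            raise RuntimeError(f"{op}: uninitialized parameters: {missing}")

mgr = WeightManagerSketch(["w", "b"])
mgr.update_ipc_async_sim({"w": [1.0], "b": [0.0]})
mgr.verify_parameters("update")  # passes once state_dict is synced
```

With the sync in place, finalize_update can keep calling the verification instead of commenting it out.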


# --- Rebuild flat uint8 buffer from CUDA IPC handle ---
buffer_meta[0] = buffer_meta[0].encode("latin-1")
buffer_meta[6] = int(os.getenv("FLAGS_selected_gpus", "0"))

🟡 Suggestion The function signature accepts a target_device parameter, but buffer_meta[6] is always read from the FLAGS_selected_gpus environment variable, so target_device is never used.

Suggestions:

  • If target_device will be used later, replace the hard-coded read:
    buffer_meta[6] = target_device if target_device is not None else int(os.getenv("FLAGS_selected_gpus", "0"))
  • If the parameter is not needed, remove it from the signature to avoid confusion.
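The first suggestion amounts to an explicit-argument-over-environment fallback. A minimal sketch, with a hypothetical helper name:

```python
import os

def resolve_ipc_device(target_device=None):
    """Prefer the explicit target_device argument; fall back to the
    FLAGS_selected_gpus environment variable (default: GPU 0)."""
    if target_device is not None:
        return int(target_device)
    return int(os.getenv("FLAGS_selected_gpus", "0"))

print(resolve_ipc_device(3))  # 3
os.environ.pop("FLAGS_selected_gpus", None)
print(resolve_ipc_device())   # 0
```

Pushing the fallback into one helper keeps the precedence rule in a single place, so callers of receive_all_buckets cannot silently disagree about which device the IPC buffer is mapped onto.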

@codecov-commenter

Codecov Report

❌ Patch coverage is 15.00000% with 51 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@a729e0f). Learn more about missing BASE report.

Files with missing lines Patch % Lines
fastdeploy/rl/dynamic_weight_manager.py 15.00% 51 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7472   +/-   ##
==========================================
  Coverage           ?   73.77%           
==========================================
  Files              ?      398           
  Lines              ?    55034           
  Branches           ?     8615           
==========================================
  Hits               ?    40601           
  Misses             ?    11708           
  Partials           ?     2725           
Flag Coverage Δ
GPU 73.77% <15.00%> (?)

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.