[RL] Support cpu tensor broadcast #7833
Thanks for your contribution!
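CI report generated from the following code (refreshed every 30 minutes):
1 Task overview
All Required tasks passed ✅; the PR can be merged. 2 optional tasks failed, which does not block merging.
2 Task status summary
2.1 Required tasks: 3/3 passed
2.2 Optional tasks: 16/18 passed
3 Failure details (required only)
No required tasks failed.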
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-15 18:46:19
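📋 Review summary
PR overview: Changes the worker process's weight-synchronization signal broadcast from broadcast_object_list to a CPU-tensor broadcast (gloo backend), so that the signal can be synchronized across multiple ranks in CPU scenarios.
Scope of change: fastdeploy/worker/worker_process.py (the PaddleDisWorkerProc event loop)
Impact tags: [RL] [BugFix]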
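As a rough illustration of the approach in the PR overview, the sketch below broadcasts a small integer status through a one-element CPU tensor. The helper name `broadcast_signal` and its `world_size` parameter are hypothetical and not taken from the PR. The review states that the PR builds the tensor with `paddle.full`; this sketch swaps in `paddle.to_tensor` with an explicit `paddle.CPUPlace()` so the CPU placement is visible. The multi-rank branch assumes an already initialized Paddle distributed environment and a gloo group supplied by the caller.

```python
def broadcast_signal(value, world_size, src=0, group=None):
    """Broadcast a small integer status from `src` to all ranks via a CPU tensor.

    Hypothetical helper for illustration; not the PR's actual method.
    """
    if world_size <= 1:
        # Single-rank runs need no communication (mirrors `if self.ranks > 1`).
        return int(value)

    # Imported lazily so the single-rank path works without Paddle installed.
    import paddle
    import paddle.distributed as dist

    # Build the one-element int32 tensor explicitly on the CPU.
    signal = paddle.to_tensor([int(value)], dtype="int32", place=paddle.CPUPlace())
    # `group` is assumed to be a gloo group created once during initialization.
    dist.broadcast(signal, src=src, group=group)
    # Every rank now holds the value that `src` supplied.
    return int(signal.numpy()[0])
```

With a single rank the function returns its input unchanged, so the communication path is only exercised when `world_size > 1`.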
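Issues
| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | worker_process.py:471 | dist.new_group() is called on every iteration of the event loop's main path, causing performance degradation, resource leaks, and potential deadlock |
| 🔴 Bug | worker_process.py:536 | dist.new_group() is called once per second inside the inner polling loop; same problem as above |
| 📝 PR conventions | - | Every section of the PR description is empty and no Checklist item is checked |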
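📝 PR convention check
The PR title carries the official tag [RL], so its format is compliant. However, the change lives in fastdeploy/worker/worker_process.py (the Worker layer) rather than fastdeploy/rl/, so [BugFix] or [Optimization] would reflect the actual impact more accurately. All sections of the PR description (Motivation / Modifications / Usage or Command / Accuracy Tests) are empty and no Checklist item is checked; these need to be filled in.
Suggested title (can be copied directly):
[BugFix] Fix dist.new_group() called repeatedly in event loop for cpu tensor broadcast
Suggested PR description (can be copied directly; it must reproduce the full structure of the checklist §D2 template):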
## Motivation
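In multi-rank scenarios (`self.ranks > 1`), the original `broadcast_object_list` (which uses the default NCCL group) cannot support CPU tensor broadcast. To support CPU-only communication scenarios such as RL training, the weight signal broadcast needs to switch to the gloo backend.
## Modifications
- `fastdeploy/worker/worker_process.py`:
  - `_broadcast_model_weights_signal`: changes the broadcast implementation from `broadcast_object_list` to a CPU int32 tensor built with `paddle.full` plus `paddle.distributed.broadcast` (gloo group), and reads the result via `.numpy()[0]`.
  - `event_loop_normal`: creates the gloo-backend communication group for the broadcast call above.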
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [ ] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
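- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Overall assessment
The approach of replacing broadcast_object_list with a CPU tensor plus the gloo backend is sound, but dist.new_group() was mistakenly placed on the event loop's hot path, so a new communication group is created on every iteration. This carries serious risks of performance degradation, resource leaks, and deadlock, and must be fixed before the PR can be merged.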
self.model_weights_signal[0] = int(self.model_weights_status.value[0])
if self.ranks > 1:
    self.model_weights_signal[0] = self._broadcast_model_weights_signal(src=0, group=None)
    group = dist.new_group(list(range(self.ranks)), backend="gloo")
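🔴 Bug: dist.new_group() is called on every iteration of the event loop's main path
new_group() is a collective operation: every rank must call it together, and each call creates a new process group object. Calling it repeatedly on the hot path leads to:
- Severe performance degradation: every iteration has to establish a new communication group across nodes, which is very expensive
- Memory/resource leaks: the created group objects have no cleanup logic and keep accumulating
- Potential deadlock: if the ranks' scheduling drifts even slightly (one rank in the main loop, another in the inner loop), their new_group calls can never line up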
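The deadlock risk in the last bullet can be sketched without any distributed runtime. The simulation below is an assumption-laden stand-in: threads play the role of ranks, and `threading.Barrier.wait` plays the role of a collective call such as new_group(), which only completes once every participant has arrived. The function name `run_ranks` and the timeout are hypothetical; a real collective would typically hang rather than time out.

```python
import threading


def run_ranks(calls_per_rank, timeout=0.5):
    """Simulate ranks that each issue a given number of collective calls.

    Returns True if every rank completed all of its calls, False if any
    rank was left waiting for peers that never arrived.
    """
    barrier = threading.Barrier(len(calls_per_rank))
    finished = [True] * len(calls_per_rank)

    def rank_main(rank, n_calls):
        try:
            for _ in range(n_calls):
                # Stand-in for a collective: blocks until all ranks arrive.
                barrier.wait(timeout=timeout)
        except threading.BrokenBarrierError:
            # A peer never made the matching call; a real job would hang here.
            finished[rank] = False

    threads = [
        threading.Thread(target=rank_main, args=(rank, n))
        for rank, n in enumerate(calls_per_rank)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return all(finished)
```

When both simulated ranks make the same number of calls the run completes; when one rank makes an extra call (as could happen if it takes one more loop iteration than its peer), that call never finds a partner.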
Suggested fix:
Move creation of the gloo group to the initialization stage (__init__, or before the loop is first entered) and reuse it as an instance variable:
# In __init__ or an initialization method (created only once)
if self.ranks > 1:
    self._gloo_broadcast_group = dist.new_group(
        list(range(self.ranks)), backend="gloo"
    )
# Reused inside event_loop_normal
self.model_weights_signal[0] = self._broadcast_model_weights_signal(
    src=0, group=self._gloo_broadcast_group
)

while self.model_weights_signal[0] != ModelWeightsStatus.UPDATING:
    self.model_weights_signal[0] = self.model_weights_status.value[0]
    if self.ranks > 1:
        group = dist.new_group(list(range(self.ranks)), backend="gloo")
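🔴 Bug (same as above): dist.new_group() is called once per second inside the inner polling loop (while ... != UPDATING)
This loop repeats via time.sleep(1) and creates a new gloo process group on each pass, carrying the same resource-leak and deadlock risks as the outer loop.
Fix as above: create self._gloo_broadcast_group once during initialization and reference the instance variable here.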
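The difference between the two patterns can be made concrete by counting group creations. The sketch below is illustrative only: `FakeDist` is a hypothetical stand-in that merely counts calls to `new_group`, and `Worker` with its two loop methods is a simplified model, not the real PaddleDisWorkerProc.

```python
class FakeDist:
    """Hypothetical stand-in for a distributed module; only counts groups."""

    def __init__(self):
        self.groups_created = 0

    def new_group(self, ranks, backend=None):
        self.groups_created += 1
        return ("group", tuple(ranks), backend)


class Worker:
    """Simplified model of a worker that broadcasts a signal in a loop."""

    def __init__(self, ranks, dist):
        self.ranks = ranks
        self._dist = dist
        self._gloo_broadcast_group = None
        if self.ranks > 1:
            # Created exactly once, during initialization.
            self._gloo_broadcast_group = dist.new_group(
                list(range(self.ranks)), backend="gloo"
            )

    def loop_reusing_group(self, iterations):
        """Pattern recommended by the review: reuse the cached group."""
        for _ in range(iterations):
            group = self._gloo_broadcast_group
            assert group is not None or self.ranks <= 1

    def loop_creating_group(self, iterations):
        """Pattern flagged by the review: a new group on every iteration."""
        for _ in range(iterations):
            if self.ranks > 1:
                self._dist.new_group(list(range(self.ranks)), backend="gloo")


dist = FakeDist()
worker = Worker(ranks=4, dist=dist)
worker.loop_reusing_group(100)
created_when_reusing = dist.groups_created
worker.loop_creating_group(100)
created_when_recreating = dist.groups_created - created_when_reusing
```

In this model the cached variant creates one group regardless of iteration count, while the per-iteration variant creates one group per pass, which is the accumulation the review warns about.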
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #7833 +/- ##
==========================================
Coverage ? 63.27%
==========================================
Files ? 462
Lines ? 64279
Branches ? 9851
==========================================
Hits ? 40672
Misses ? 20842
Partials ? 2765
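CI report generated from the following code (refreshed every 30 minutes):
1 Task overview
2 Task status summary
2.1 Required tasks: 7/10 passed
2.2 Optional tasks: 28/31 passed
3 Failure details (required only)
Run Stable Tests / stable_tests: test failure (confidence: low)
Root cause details: Key logs: Suggested fix:
Fix summary: confirm that the gloo backend is available, or add an environment compatibility check
Related changes:
Extracted partial CE model tasks to run in CI. / run_ce_cases: test failure (confidence: low)
Root cause details: Key logs: Suggested fix:
Fix summary: move the new_group() call out of the while loop to avoid creating groups repeatedly
Related changes: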
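CI report generated from the following code (refreshed every 30 minutes):
1 Task overview
All Required tasks passed ✅; merging is recommended (3 Optional tasks failed, which does not block merging).
2 Task status summary
2.1 Required tasks: 4/4 passed
2.2 Optional tasks: 17/20 passed
3 Failure details (required only)
No required tasks failed.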