[RL] Support cpu tensor broadcast#7833

Open
Sunny-bot1 wants to merge 2 commits into PaddlePaddle:develop from Sunny-bot1:broadcast_cpu

@Sunny-bot1
Collaborator

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot Bot commented May 15, 2026

Thanks for your contribution!

@PaddlePaddle-bot

PaddlePaddle-bot commented May 15, 2026

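🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-16 23:01:06

This CI report is generated from the following code (refreshed every 30 minutes):


1 Task overview

All Required tasks have passed ✅, so the PR can be merged. 2 optional tasks failed, which does not block merging.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
|---|---|---|---|---|---|---|
| 21(0) | 21 | 19 | 2 | 0 | 0 | 0 |

2 Task status summary

2.1 Required tasks: 3/3 passed

Required tasks block merging; failures should be handled first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
|---|---|---|---|---|---|---|
| ✅ | The other 3 required tasks passed | - | - | - | - | - |

2.2 Optional tasks: 16/18 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
|---|---|---|---|---|
| ❌ | Run iluvatar Tests / run_iluvatar_cases | 10m36s | Job | - |
| ❌ | Check PR Template | 10s | Job | - |
| ✅ | The other 16 optional tasks passed | - | - | - |

3 Failure details (required only)

No required tasks failed.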


@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-15 18:46:19

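📋 Review summary

PR overview: Switches the worker process's weight-sync signal broadcast from broadcast_object_list to a CPU-tensor broadcast (gloo backend), to support multi-rank signal synchronization in CPU scenarios.
Change scope: fastdeploy/worker/worker_process.py (PaddleDisWorkerProc event loop)
Impact tags: [RL] [BugFix]

Issues

| Level | File | Summary |
|---|---|---|
| 🔴 Bug | worker_process.py:471 | dist.new_group() is called on every iteration of the event loop's main path, causing performance degradation, resource leaks, and potential deadlock |
| 🔴 Bug | worker_process.py:536 | dist.new_group() is called once per second in the inner polling loop; same problem as above |
| 📝 PR convention | - | Every section of the PR description is empty and no Checklist item is ticked |

📝 PR convention check

The PR title contains the official tag [RL], so its format is compliant. However, the change is in fastdeploy/worker/worker_process.py (the Worker layer) rather than fastdeploy/rl/, so switching to [BugFix][Optimization] would reflect the actual impact more accurately. All sections of the PR description (Motivation / Modifications / Usage or Command / Accuracy Tests) are empty and no Checklist item is ticked; these need to be filled in.

Suggested title (can be copied directly):

  • [BugFix] Fix dist.new_group() called repeatedly in event loop for cpu tensor broadcast

Suggested PR description (can be copied directly; it must reproduce the full structure of the checklist §D2 template):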

## Motivation
In multi-rank scenarios (`self.ranks > 1`), the existing `broadcast_object_list` call (default NCCL group) cannot broadcast CPU tensors. To support CPU-only communication scenarios such as RL training, the weight-signal broadcast needs to use the gloo backend instead.

## Modifications
- `fastdeploy/worker/worker_process.py`
  - `_broadcast_model_weights_signal`: changes the broadcast implementation from `broadcast_object_list` to a CPU int32 tensor built with `paddle.full` plus `paddle.distributed.broadcast` (gloo group), and reads the result via `.numpy()[0]`.
  - `event_loop_normal`: creates a gloo-backend communication group for the broadcast call above.

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [ ] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

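Overall assessment

Using a CPU tensor plus the gloo backend in place of broadcast_object_list is the right approach. However, dist.new_group() has been mistakenly placed on the hot path of the event loop, so a new communication group is created on every iteration. This carries serious risks of performance degradation, resource leaks, and deadlock, and must be fixed before the PR can be merged.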

self.model_weights_signal[0] = int(self.model_weights_status.value[0])
if self.ranks > 1:
    self.model_weights_signal[0] = self._broadcast_model_weights_signal(src=0, group=None)
    group = dist.new_group(list(range(self.ranks)), backend="gloo")

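🔴 Bug: dist.new_group() is called on every iteration of the event loop's main path

new_group() is a collective operation: all ranks must call it at the same time, and each call creates a new process group object. Calling it repeatedly on the hot path leads to:

  1. Severe performance degradation: every iteration has to set up a new communication group across nodes, which is very expensive
  2. Memory/resource leaks: the created group objects have no cleanup logic and keep accumulating
  3. Potential deadlock: if the ranks' scheduling drifts even slightly (one in the main loop, another in the inner loop), their new_group calls can never line up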
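
The deadlock risk in item 3 can be illustrated without any distributed runtime. The following is only a stand-in sketch: it models a collective call with `threading.Barrier` from the Python standard library rather than with Paddle's real `new_group`, and the function name `run_ranks` is hypothetical. Each simulated rank makes a fixed number of "collective" calls; when the counts differ, the rank that makes the extra call has nobody to meet and times out.

```python
import threading


def run_ranks(calls_per_rank, timeout=0.5):
    """Simulate ranks that each issue a given number of collective calls.

    A threading.Barrier stands in for a collective operation: every call
    blocks until all ranks have arrived. Returns one "ok" or "stuck" entry
    per rank. This is an illustration, not Paddle's actual behaviour.
    """
    world_size = len(calls_per_rank)
    barrier = threading.Barrier(world_size)
    results = ["ok"] * world_size

    def worker(rank, n_calls):
        for _ in range(n_calls):
            try:
                # Blocks until every rank reaches the same call.
                barrier.wait(timeout=timeout)
            except threading.BrokenBarrierError:
                # The partner never arrived; a real collective would hang here.
                results[rank] = "stuck"
                return

    threads = [
        threading.Thread(target=worker, args=(rank, n_calls))
        for rank, n_calls in enumerate(calls_per_rank)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


aligned = run_ranks([3, 3])     # both ranks make the same calls
misaligned = run_ranks([3, 2])  # rank 0 makes one call more than rank 1
```

The timeout exists only so the simulation terminates. A real collective without a timeout would simply block, which is why the number and order of collective calls has to match on every rank.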

Suggested fix:
Move the creation of the gloo group to the initialization phase (__init__, or before first entering the loop) and reuse it as an instance variable:

# In __init__ or an initialization method (created only once)
if self.ranks > 1:
    self._gloo_broadcast_group = dist.new_group(
        list(range(self.ranks)), backend="gloo"
    )

# Reuse it in event_loop_normal
self.model_weights_signal[0] = self._broadcast_model_weights_signal(
    src=0, group=self._gloo_broadcast_group
)
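
One way to check the create-once property in a unit test, without a distributed environment, is to inject the group factory and count how often it is called. The sketch below is an assumption-laden illustration: `BroadcastGroupHolder` and `fake_new_group` are hypothetical names that do not exist in FastDeploy, and the stub only mimics the call shape `new_group(ranks, backend=...)` used in the snippet above.

```python
class BroadcastGroupHolder:
    """Create a communication group once and return the same object afterwards."""

    def __init__(self, ranks, new_group_fn):
        self._ranks = ranks
        # In the real worker this would be a factory such as dist.new_group;
        # it is injected here so the logic can be tested with a stub.
        self._new_group_fn = new_group_fn
        self._group = None

    def get(self):
        # Single-rank runs need no group; multi-rank runs create it exactly once.
        if self._group is None and self._ranks > 1:
            self._group = self._new_group_fn(
                list(range(self._ranks)), backend="gloo"
            )
        return self._group


calls = []


def fake_new_group(ranks, backend):
    """Stub factory (hypothetical) that records each call it receives."""
    calls.append((tuple(ranks), backend))
    return object()


holder = BroadcastGroupHolder(ranks=4, new_group_fn=fake_new_group)
# Stand-in for many event-loop iterations asking for the group.
groups = [holder.get() for _ in range(1000)]
```

After 1000 simulated iterations the stub has been called once and every iteration received the same object. Lazy creation still has to happen at the same point on every rank, which is why creating the group during initialization is the safer choice.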

while self.model_weights_signal[0] != ModelWeightsStatus.UPDATING:
    self.model_weights_signal[0] = self.model_weights_status.value[0]
    if self.ranks > 1:
        group = dist.new_group(list(range(self.ranks)), backend="gloo")

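🔴 Bug, same as above: dist.new_group() is called once per second in the inner polling loop (while ... != UPDATING)

This loop runs repeatedly with time.sleep(1) and creates a new gloo process group on every pass, so it carries the same resource-leak and deadlock risks as the outer loop.

The fix is the same: create self._gloo_broadcast_group once during initialization and reference the instance variable directly here.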

@codecov-commenter

codecov-commenter commented May 15, 2026

Codecov Report

❌ Patch coverage is 0% with 7 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@9139986). Learn more about missing BASE report.

Files with missing lines Patch % Lines
fastdeploy/worker/worker_process.py 0.00% 7 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7833   +/-   ##
==========================================
  Coverage           ?   63.27%           
==========================================
  Files              ?      462           
  Lines              ?    64279           
  Branches           ?     9851           
==========================================
  Hits               ?    40672           
  Misses             ?    20842           
  Partials           ?     2765           
Flag Coverage Δ
GPU 72.38% <0.00%> (?)
XPU 7.12% <0.00%> (?)


@PaddlePaddle-bot

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-15 20:00:38

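This CI report is generated from the following code (refreshed every 30 minutes):


1 Task overview

⚠️ 2 Required tasks have failed and 1 Required task is still running; please handle these first.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
|---|---|---|---|---|---|---|
| 41(0) | 41 | 35 | 4 | 1 | 1 | 0 |

2 Task status summary

2.1 Required tasks: 7/10 passed

Required tasks block merging; failures should be handled first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
|---|---|---|---|---|---|---|
| ❌ | Run Stable Tests / stable_tests | 2m8s | PR issue: worker broadcast switched to CPU tensor + gloo, suspected to affect the stable tests | Check whether the gloo backend is available, or confirm the test environment supports it | Job | - |
| ❌ | Extracted partial CE model tasks to run in CI. / run_ce_cases | 23m15s | PR issue: a gloo group is created repeatedly inside the event_loop loop, suspected to cause the CE test failure | Move gloo group creation outside the loop to avoid repeated creation | Job | - |
| ⏳ | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | - | Running | - | Job | - |
| ✅ | The other 7 required tasks passed | - | - | - | - | - |

2.2 Optional tasks: 28/31 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
|---|---|---|---|---|
| ❌ | Run iluvatar Tests / run_iluvatar_cases | 10m36s | Job | - |
| ❌ | Check PR Template | 10s | Job | - |
| ⏸️ | CI_HPU | - | - | - |
| ✅ | The other 28 optional tasks passed | - | - | - |

3 Failure details (required only)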

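Run Stable Tests / stable_tests: test failure (confidence: low)

  • Status: ❌ Failed
  • Error type: test failure
  • Confidence: low
  • Root cause summary: the PR changes the worker broadcast to CPU tensor + gloo group, which is suspected to affect the stable tests
  • Analyzer: ci_analyze_unittest_fastdeploy (log retrieval failed; analysis is based on the PR diff)

Root cause details:
This PR modifies the _broadcast_model_weights_signal method in fastdeploy/worker/worker_process.py, replacing the original paddle.distributed.broadcast_object_list(group=None) with a CPU tensor + paddle.distributed.broadcast + an explicit gloo group. Because the actual log could not be obtained (the log download failed), the specific failing test case cannot be confirmed. stable_tests failed after running for only 2m8s, which suggests the error occurred during initialization or in an early test phase.

Key logs:

(Log retrieval failed; no error information could be extracted)
Failed step: Run FastDeploy Stable Tests

Suggested fixes:

  1. Confirm that the gloo backend is available in the CI environment; check whether the gloo group newly added for _broadcast_model_weights_signal in worker_process.py is compatible with the test environment
  2. If gloo is unavailable, consider adding environment detection or a fallback mechanism before creating the group

Fix summary: confirm the gloo backend is available, or add an environment compatibility check

Related change: fastdeploy/worker/worker_process.py L312-L323 (_broadcast_model_weights_signal)
Link: View log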
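
The "environment detection or fallback" idea in suggestion 2 could look roughly like the sketch below. It is a guess at one possible shape, not something taken from the PR: `create_signal_group` is a hypothetical helper, the group factory is injected so the logic runs without Paddle installed, and catching a broad `Exception` is an assumption because the report does not say how a missing gloo backend surfaces.

```python
import logging

logger = logging.getLogger(__name__)


def create_signal_group(ranks, new_group_fn, backend="gloo"):
    """Try once to create a CPU-capable group; return None if that fails.

    `new_group_fn` stands in for a factory such as dist.new_group.
    A None result tells the caller to use its previous broadcast path.
    """
    if ranks <= 1:
        return None  # single rank: nothing to broadcast
    try:
        return new_group_fn(list(range(ranks)), backend=backend)
    except Exception as exc:  # assumption: backend problems raise an exception
        logger.warning("could not create %s group, falling back: %s", backend, exc)
        return None


def working_factory(ranks, backend):
    """Stub factory (hypothetical) for an environment where the backend works."""
    return ("group", tuple(ranks), backend)


def broken_factory(ranks, backend):
    """Stub factory (hypothetical) for an environment without the backend."""
    raise RuntimeError("backend unavailable")


group_ok = create_signal_group(2, working_factory)
group_fallback = create_signal_group(2, broken_factory)
```

A fallback like this is only safe if every rank reaches the same decision. If one rank falls back while another uses the gloo group, their collective calls no longer match.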

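Extracted partial CE model tasks to run in CI. / run_ce_cases: test failure (confidence: low)

  • Status: ❌ Failed
  • Error type: test failure
  • Confidence: low
  • Root cause summary: dist.new_group() is called repeatedly inside the event_loop loop, which is suspected to cause the CE test failure
  • Analyzer: ci_analyze_unittest_fastdeploy (log retrieval failed; analysis is based on the PR diff)

Root cause details:
Inside the body of the while loop in event_loop_normal (around L533), the PR adds the statement group = dist.new_group(list(range(self.ranks)), backend="gloo"). This statement creates a new process group on every loop iteration, which is incorrect: new_group should be called only once. Otherwise it may exhaust process-group resources or cause distributed coordination errors, which in turn may explain why the CE tests failed after running for about 23 minutes.

Key logs:

(Log retrieval failed; no error information could be extracted)
Failed step: Run CI unittest

Suggested fixes:

  1. Move the group = dist.new_group(...) call inside the while loop near worker_process.py L533 to outside the while loop, so the group is created only once
  2. Follow the correct pattern at L468 in the same file (the group is created outside the loop and then passed in)

Fix summary: move the new_group() call out of the while loop to avoid repeated creation

Related change: fastdeploy/worker/worker_process.py L529-L540 (event_loop_normal while loop)
Link: View log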

@PaddlePaddle-bot

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-16 12:37:37

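This CI report is generated from the following code (refreshed every 30 minutes):


1 Task overview

All Required tasks have passed ✅; merging is recommended (3 Optional tasks failed, which does not block merging).

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
|---|---|---|---|---|---|---|
| 24(0) | 24 | 21 | 3 | 0 | 0 | 0 |

2 Task status summary

2.1 Required tasks: 4/4 passed

Required tasks block merging; failures should be handled first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
|---|---|---|---|---|---|---|
| ✅ | The other 4 required tasks passed | - | - | - | - | - |

2.2 Optional tasks: 17/20 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
|---|---|---|---|---|
| ❌ | Run iluvatar Tests / run_iluvatar_cases | 10m36s | Job | - |
| ❌ | Check PR Template | 10s | Job | - |
| ❌ | CI_HPU | 1h6m | Job | - |
| ✅ | The other 17 optional tasks passed | - | - | - |

3 Failure details (required only)

No required tasks failed.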
