[Feature]report PD info to IM #8082

Open
ChowMingSing wants to merge 1 commit into PaddlePaddle:develop from ChowMingSing:feature-im-report-v2

@ChowMingSing ChowMingSing commented Jun 26, 2026

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-06-26 15:08:18

📋 Review Summary

PR overview: Adds IM-facing endpoints for querying FastDeploy/PD registration info, a ready health check, and the /fastdeploy/server/info reporting endpoint
Scope: fastdeploy/entrypoints/openai/api_server.py
Affected tags: [APIServer] [PD Disaggregation]

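Issues

| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/entrypoints/openai/api_server.py:960 | dp_rank is a string compared with an integer, so is_master is never set to 1 |
| 🔴 Bug | fastdeploy/entrypoints/openai/api_server.py:985 | In async LLM mode llm_engine has no .engine, so the new info endpoint returns 500 |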

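📝 PR Convention Check

The title contains an official tag, but every section of the PR description is still template placeholder or empty content. Consider replacing it with the full description below.

Suggested title (can be copied as-is):

  • [APIServer] Report PD info to IM
Suggested PR description (can be copied as-is)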
## Motivation
Report FastDeploy PD disaggregation/register information to IM, including server identity, role, resource information, connected decode nodes, and readiness status.

## Modifications
- Add `/register_info` for decode node registration metadata.
- Add `/v2/health/ready` for IM readiness checks backed by existing `/health`.
- Add `/fastdeploy/server/info` to report API server/PD fields, resource ranges, master flag, and connected decode node list.
- Start a background decode-node poller that reads `D_IP_LIST`/`DECODE_PORTS` and collects `/register_info` from decode nodes.

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

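Overall assessment

The direction and scope of the new endpoints are clear, but the current implementation produces wrong results for master detection and makes the endpoint return 500 under async LLM deployment. Fix the two functional issues above and add endpoint-level tests before merging.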

with open(fed_member_file, 'r') as f:
    fed_member_list = f.read().strip().split(',')
if fed_member_list.index(os.getenv("HOST_IP", "None")) == 0 and \
        dp_rank == 0:

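🔴 Bug dp_rank has already been converted to a string above, so comparing it with the integer 0 here is always False.

When FED_MEMBER_FILE is configured, the current HOST_IP is the first entry in the member list, and the DP rank is 0, is_master still stays 0 and the IM side cannot identify the master node.

Suggested fix: keep an integer rank for the logic check, and convert it to a string only when building pod_name or writing the response.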

dp_rank = cfg.parallel_config.local_data_parallel_id
# use str(dp_rank) where pod_name is built
if fed_member_list.index(os.getenv("HOST_IP", "None")) == 0 and dp_rank == 0:
    is_master = 1
cfg_dict["dp_rank"] = str(dp_rank)
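
To make the failure mode concrete, here is a minimal standalone sketch, not the PR's code. The helper name `is_master_node`, the sample IPs, and the decision to return 0 when the host is missing from the member list are assumptions for illustration. Note that `list.index` raises `ValueError` for a missing value, which the sketch handles explicitly.

```python
def is_master_node(fed_member_list, host_ip, dp_rank):
    """Return 1 when host_ip is the first member and the DP rank is 0, else 0.

    Hypothetical helper: dp_rank may arrive as int or str, so it is
    normalized to int before the comparison.
    """
    try:
        position = fed_member_list.index(host_ip)
    except ValueError:
        # Assumption: a host missing from the member list is not the master.
        return 0
    return 1 if position == 0 and int(dp_rank) == 0 else 0


members = "10.0.0.1,10.0.0.2".split(",")
dp_rank_str = str(0)  # mirrors the earlier str() conversion described above

# A str never compares equal to an int in Python, so this check cannot pass.
buggy = 1 if members.index("10.0.0.1") == 0 and dp_rank_str == 0 else 0
fixed = is_master_node(members, "10.0.0.1", dp_rank_str)
print(buggy, fixed)
```

Normalizing with `int()` at the comparison site is one option; keeping the rank as an integer throughout, as suggested above, avoids the conversion altogether.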

cfg_dict["is_stopping"] = "running"
cfg_dict["is_master"] = is_master
cfg_dict["container_host_ip"] = os.getenv("HOST_IP", "None")
cfg_dict["free_block_num"] = llm_engine.engine.resource_manager.available_block_num()

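🔴 Bug Accessing llm_engine.engine.resource_manager directly here makes the new endpoint return 500 when FD_ENABLE_ASYNC_LLM=1.

In async mode, load_engine() sets the global llm_engine to an AsyncLLM. The EngineServiceClient that AsyncLLM inherits from only creates EngineService in a subprocess, so the main-process object has no .engine attribute. The existing lifecycle code in this file already uses not isinstance(llm_engine, AsyncLLM) to distinguish the synchronous engine path.

Suggested fix: for AsyncLLM, obtain free_block_num through a separate cross-process status query/control API, or return an explicit unavailable value in async mode. Do not read llm_engine.engine.resource_manager directly in the API server main process.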
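
The "explicit unavailable value" option could look roughly like the sketch below. It uses `SimpleNamespace` stand-ins rather than the real engine classes, and the helper name `get_free_block_num` and the `-1` sentinel are assumptions, not part of FastDeploy.

```python
from types import SimpleNamespace

# Assumed sentinel meaning "not readable from this process".
UNAVAILABLE_BLOCK_NUM = -1


def get_free_block_num(llm_engine):
    """Read the free block count if the engine exposes it in-process.

    Falls back to the sentinel when the object has no .engine (as described
    for the async case) instead of raising AttributeError.
    """
    engine = getattr(llm_engine, "engine", None)
    resource_manager = getattr(engine, "resource_manager", None)
    if resource_manager is None:
        return UNAVAILABLE_BLOCK_NUM
    return resource_manager.available_block_num()


# Stand-in for the synchronous engine path (hypothetical values).
sync_engine = SimpleNamespace(
    engine=SimpleNamespace(
        resource_manager=SimpleNamespace(available_block_num=lambda: 128)
    )
)
# Stand-in for the async client object, which has no .engine attribute.
async_engine = SimpleNamespace()

print(get_free_block_num(sync_engine), get_free_block_num(async_engine))
```

A sentinel keeps the endpoint responding, but the IM side then has to treat it as "unknown" rather than "zero free blocks"; a cross-process query would give a real number at the cost of extra plumbing.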

codecov-commenter commented Jun 26, 2026

Codecov Report

❌ Patch coverage is 8.63309% with 127 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@f4eda5a). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| fastdeploy/entrypoints/openai/api_server.py | 8.63% | 127 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #8082   +/-   ##
==========================================
  Coverage           ?   67.39%           
==========================================
  Files              ?      475           
  Lines              ?    67048           
  Branches           ?    10335           
==========================================
  Hits               ?    45187           
  Misses             ?    18990           
  Partials           ?     2871           
| Flag | Coverage Δ |
|---|---|
| GPU | 77.37% <8.63%> (?) |
| XPU | 6.94% <0.00%> (?) |


PaddlePaddle-bot commented Jun 27, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-06-30 21:48:52 UTC+08:00

CI report generated from the following code (refreshed every 30 minutes):
PR commit: a931d80 | Merge base: f4eda5a (branch: develop)


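1 Required jobs: 7/10 passed

| Total runs (reruns) | Total jobs | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
|---|---|---|---|---|---|---|
| 41(0) | 41 | 35 | 6 | 0 | 0 | 0 |

| Job | Error type | Confidence | Log |
|---|---|---|---|
| Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | PR issue | | Job |
| Pre Commit | PR issue | | Job |
| Approval | Needs approval | | Job |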

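2 Failure details

🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: PR issue (confidence: high)

Analyzer: generic analysis (fallback)
Failed case: coverage threshold check

| Case | Error summary |
|---|---|
| fastdeploy/entrypoints/openai/api_server.py | diff coverage is only 8%, below the 80% threshold |

Key log: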

Failure. Coverage is below 80%.
fastdeploy/entrypoints/openai/api_server.py (8.6%): Missing lines 131-138,143-161,165-167,363,826-828,...,993-994
Total:   139 lines
Missing: 127 lines
Coverage: 8%
TEST_EXIT_CODE: 0
COVERAGE_EXIT_CODE: 9
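  • Root cause summary: the new endpoints and polling logic lack unit test coverage
    The PR adds _fetch_decode_node_register_info, _poll_decode_nodes, and launch_decode_node_poller, starts the polling thread in lifespan(), and adds three endpoints: /register_info, /v2/health/ready, and /fastdeploy/server/info. All existing unit tests pass, but most of the new branches are never hit by a test. Diff coverage reaches only 12 of 139 lines, which fails the 80% coverage gate.

Suggested fixes:

  1. Add unit tests for the new endpoints and helpers in tests/entrypoints/openai/test_api_server.py or an adjacent test file, covering these paths: engine not loaded, normal register_info, IM health success and failure, the server info env branches, and decode-node polling updates.
  2. For launch_decode_node_poller() / _poll_decode_nodes(), mock threading.Thread, requests.get, time.sleep, and the environment variables so the test can exit instead of entering a real infinite loop.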
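
One common way to make such a loop testable is to have the mocked `time.sleep` raise, which ends the loop after exactly one cycle. The sketch below assumes a simplified poller with an injected `fetch` callable and a plain dict cache; the PR's actual `_poll_decode_nodes` signature is not shown in this thread, so every name here is hypothetical.

```python
import time
from unittest import mock


class StopPolling(Exception):
    """Raised by the mocked sleep to break out of the infinite loop."""


def poll_decode_nodes(urls, fetch, cache, interval=5.0):
    """Hypothetical poller: refresh cache from every URL, then sleep, forever."""
    while True:
        for url in urls:
            try:
                cache[url] = fetch(url)
            except Exception:
                # Drop nodes that fail to answer so stale info is not reported.
                cache.pop(url, None)
        time.sleep(interval)


def run_one_poll_cycle(urls, responses):
    """Run exactly one loop iteration with canned responses and no real sleep."""
    def fake_fetch(url):
        result = responses[url]
        if isinstance(result, Exception):
            raise result
        return result

    cache = {}
    with mock.patch("time.sleep", side_effect=StopPolling) as fake_sleep:
        try:
            poll_decode_nodes(urls, fake_fetch, cache)
        except StopPolling:
            pass
    return cache, fake_sleep.call_count


cache, sleep_calls = run_one_poll_cycle(
    ["http://d1/register_info", "http://d2/register_info"],
    {
        "http://d1/register_info": {"role": "decode"},
        "http://d2/register_info": ConnectionError("down"),
    },
)
print(cache, sleep_calls)
```

If the real helper calls `requests.get` directly instead of taking a callable, the same idea applies with `mock.patch` pointed at the name as imported in `api_server.py`.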

Related changes: fastdeploy/entrypoints/openai/api_server.py:129, fastdeploy/entrypoints/openai/api_server.py:141, fastdeploy/entrypoints/openai/api_server.py:164, fastdeploy/entrypoints/openai/api_server.py:363, fastdeploy/entrypoints/openai/api_server.py:820, fastdeploy/entrypoints/openai/api_server.py:869, fastdeploy/entrypoints/openai/api_server.py:883

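🔴 Pre Commit: PR issue (confidence: high)

Analyzer: generic analysis (fallback)
Failed case: pre-commit format check

| Case | Error summary |
|---|---|
| fastdeploy/entrypoints/openai/api_server.py | both black and isort modified this file |

Key log: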

black....................................................................Failed
- hook id: black
- files were modified by this hook
reformatted fastdeploy/entrypoints/openai/api_server.py
isort....................................................................Failed
- hook id: isort
- files were modified by this hook
  • Root cause summary: the PR file is not formatted with black/isort
    The log shows that black reformatted fastdeploy/entrypoints/openai/api_server.py and that isort also corrected the same file. The changes include the grouping of the requests import and the formatting of new code blocks such as the pod_name concatenation, the fed_member_file read, and the connected_decode_list list comprehension.

Suggested fixes:

  1. Run pre-commit run --files fastdeploy/entrypoints/openai/api_server.py locally and commit the formatting changes the tools produce.

Related changes: fastdeploy/entrypoints/openai/api_server.py:26, fastdeploy/entrypoints/openai/api_server.py:836, fastdeploy/entrypoints/openai/api_server.py:908, fastdeploy/entrypoints/openai/api_server.py:955, fastdeploy/entrypoints/openai/api_server.py:987

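🔴 Approval: needs approval (confidence: high)

Analyzer: built-in approval status
Failed case: none

| Case | Error summary |
|---|---|
| Approval | This job requires manual approval; CI continues only after the approval is completed |

Key log: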

Process completed with exit code 6.
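  • Root cause summary: the workflow is waiting for manual approval
    This job is the built-in approval gate of the Approval workflow. It is not a compilation or unit test failure.

Suggested fixes:

  1. Obtain manual approval, then re-trigger the subsequent CI.

Related changes: none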
