Skip to content

[bugfix] free blocks even if AS write failed#7807

Open
zccjjj wants to merge 3 commits into
PaddlePaddle:developfrom
zccjjj:ASbugfix
Open

[bugfix] free blocks even if AS write failed#7807
zccjjj wants to merge 3 commits into
PaddlePaddle:developfrom
zccjjj:ASbugfix

Conversation

@zccjjj
Copy link
Copy Markdown
Contributor

@zccjjj zccjjj commented May 13, 2026

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

💡 如若此PR是Cherry Pick,PR标题需遵循格式,在最开始加上[Cherry-Pick]标签,以及最后面加上原PR ID,例如[Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot
Copy link
Copy Markdown

paddle-bot Bot commented May 13, 2026

Thanks for your contribution!

PaddlePaddle-bot

This comment was marked as outdated.

@PaddlePaddle-bot
Copy link
Copy Markdown

PaddlePaddle-bot commented May 13, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-14 15:02:13

CI报告基于以下代码生成(30分钟更新一次):


1 任务总览

⏳ Required 任务仍在进行中(1 个运行中,5 个等待中),暂无 required 失败。

总执行(rerun次数) 总任务 ✅ 通过 ❌ 失败 ⏳ 运行中 ⏸️ 等待中 跳过
32(0) 32 21 1 4 6 0

2 任务状态汇总

2.1 Required任务 : 2/8 通过

必选任务阻塞合并,失败需优先处理。

状态 任务 耗时 根因 修复建议 日志 重跑
Run Four Cards Tests / run_4_cards_tests - 运行中 - Job -
⏸️ Extracted partial CE model tasks / run_ce_cases - 等待中 - - -
⏸️ Run Base Tests / base_tests - 等待中 - - -
⏸️ Run FastDeploy LogProb Tests / run_tests_logprob - 等待中 - - -
⏸️ Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage - 等待中 - - -
⏸️ Run Stable Tests / stable_tests - 等待中 - - -
其余 2 个必选任务通过 - - - - -

2.2 可选任务 — 19/24 通过

可选任务不阻塞合并,失败仅供参考。

状态 任务 耗时 日志 重跑
Check PR Template 15s Job -
其余 19 个可选任务通过 - - -

3 失败详情(仅 required)

无 required 失败任务。

PaddlePaddle-bot

This comment was marked as outdated.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 13, 2026

Codecov Report

❌ Patch coverage is 87.50000% with 2 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@c2df4c6). Learn more about missing BASE report.

Files with missing lines Patch % Lines
fastdeploy/engine/sched/resource_manager_v1.py 87.50% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7807   +/-   ##
==========================================
  Coverage           ?   63.67%           
==========================================
  Files              ?      461           
  Lines              ?    64092           
  Branches           ?     9806           
==========================================
  Hits               ?    40812           
  Misses             ?    20478           
  Partials           ?     2802           
Flag Coverage Δ
GPU 72.88% <87.50%> (?)
XPU 7.14% <0.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown

@PaddlePaddle-bot PaddlePaddle-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Paddle-CI-Agent | pr_review | 2026-05-14 14:58:28

📋 Review 摘要

PR 概述:修复 AS(辅助存储)写入失败时未释放 KV Cache Block 的内存泄漏 bug
变更范围fastdeploy/engine/sched/resource_manager_v1.py(Scheduler / 资源管理器)
影响面 Tag[Scheduler]

📝 PR 规范检查

存在两处规范问题:① 标题 Tag 大小写不规范([bugfix][BugFix]);② PR 描述各 section 内容为空,仅保留了模板占位符。

标题建议(可直接复制):

  • [BugFix] free blocks even if AS write failed

PR 描述建议(可直接复制,必须复刻 checklist §D2 模板的完整结构):

## Motivation
AS(辅助存储)写入失败(`write_cache_to_storage` / `write_cache_to_storage_decode` 抛出异常)时,原代码中后续的 `_free_blocks` 调用会被跳过,导致 KV Cache Block 无法释放,造成内存泄漏。

## Modifications
- `fastdeploy/engine/sched/resource_manager_v1.py`:在 `_trigger_preempt`(decode/non-decode 两条路径)和 `finish_requests` 方法中,对 `write_cache_to_storage_decode` / `write_cache_to_storage` 的调用添加 `try-except` 异常捕获,确保即使写入失败也能正常执行后续的 `_free_blocks`,消除 block 泄漏。

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

问题

级别 文件 概述
🟡 建议 resource_manager_v1.py 文件名含 v1,是否存在 v2 或其他资源管理器实现需同步修复相同 bug?(参考 A4 多实现同步原则)
❓ 疑问 resource_manager_v1.py:442 llm_logger.warning 级别是否足够?存储写入失败可能是严重运维事件,建议升级为 error
❓ 疑问 resource_manager_v1.py:1730 finish_requests 中日志消息格式与 _trigger_preempt 不一致,建议统一

总体评价

修复逻辑正确,通过 try-except 保证了 _free_blocks 在 AS 写入失败时仍能执行,有效消除了 block 泄漏风险。建议补充 PR 描述内容,并确认是否有同名 v2 实现需要同步修复。

try:
self.cache_manager.write_cache_to_storage_decode(preempted_req)
except Exception as e:
llm_logger.warning(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

日志级别使用 warning,但存储写入失败可能是比较严重的运维事件(缓存数据丢失)。建议改为 error 级别,便于生产环境告警和问题排查。同样适用于第 454 行和第 1730 行。

# P instance / Mixed instance uses standard write method (relies on Radix Tree)
self.cache_manager.write_cache_to_storage(req)
except Exception as e:
llm_logger.warning(f"write_cache_to_storage failed req_id={req.request_id}: {e}")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

日志消息格式与上方 _trigger_preempt 不一致(此处 write_cache_to_storage failed req_id=...,上方为 Failed to write cache to storage for preempted request ...),建议统一格式便于日志检索。

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants