[bugfix] free blocks even if AS write failed#7807
Conversation
|
Thanks for your contribution! |
CI报告基于以下代码生成(30分钟更新一次): 1 任务总览⏳ Required 任务仍在进行中(1 个运行中,5 个等待中),暂无 required 失败。
2 任务状态汇总2.1 Required任务 : 2/8 通过
2.2 可选任务 — 19/24 通过
3 失败详情(仅 required)无 required 失败任务。 |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #7807 +/- ##
==========================================
Coverage ? 63.67%
==========================================
Files ? 461
Lines ? 64092
Branches ? 9806
==========================================
Hits ? 40812
Misses ? 20478
Partials ? 2802
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
PaddlePaddle-bot
left a comment
There was a problem hiding this comment.
🤖 Paddle-CI-Agent | pr_review |
2026-05-14 14:58:28
📋 Review 摘要
PR 概述:修复 AS(辅助存储)写入失败时未释放 KV Cache Block 的内存泄漏 bug
变更范围:fastdeploy/engine/sched/resource_manager_v1.py(Scheduler / 资源管理器)
影响面 Tag:[Scheduler]
📝 PR 规范检查
存在两处规范问题:① 标题 Tag 大小写不规范([bugfix] → [BugFix]);② PR 描述各 section 内容为空,仅保留了模板占位符。
标题建议(可直接复制):
[BugFix] free blocks even if AS write failed
PR 描述建议(可直接复制,必须复刻 checklist §D2 模板的完整结构):
## Motivation
AS(辅助存储)写入失败(`write_cache_to_storage` / `write_cache_to_storage_decode` 抛出异常)时,原代码中后续的 `_free_blocks` 调用会被跳过,导致 KV Cache Block 无法释放,造成内存泄漏。
## Modifications
- `fastdeploy/engine/sched/resource_manager_v1.py`:在 `_trigger_preempt`(decode/non-decode 两条路径)和 `finish_requests` 方法中,对 `write_cache_to_storage_decode` / `write_cache_to_storage` 的调用添加 `try-except` 异常捕获,确保即使写入失败也能正常执行后续的 `_free_blocks`,消除 block 泄漏。
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.问题
| 级别 | 文件 | 概述 |
|---|---|---|
| 🟡 建议 | resource_manager_v1.py |
文件名含 v1,是否存在 v2 或其他资源管理器实现需同步修复相同 bug?(参考 A4 多实现同步原则) |
| ❓ 疑问 | resource_manager_v1.py:442 |
llm_logger.warning 级别是否足够?存储写入失败可能是严重运维事件,建议升级为 error |
| ❓ 疑问 | resource_manager_v1.py:1730 |
finish_requests 中日志消息格式与 _trigger_preempt 不一致,建议统一 |
总体评价
修复逻辑正确,通过 try-except 保证了 _free_blocks 在 AS 写入失败时仍能执行,有效消除了 block 泄漏风险。建议补充 PR 描述内容,并确认是否有同名 v2 实现需要同步修复。
| try: | ||
| self.cache_manager.write_cache_to_storage_decode(preempted_req) | ||
| except Exception as e: | ||
| llm_logger.warning( |
There was a problem hiding this comment.
日志级别使用 warning,但存储写入失败可能是比较严重的运维事件(缓存数据丢失)。建议改为 error 级别,便于生产环境告警和问题排查。同样适用于第 454 行和第 1730 行。
| # P instance / Mixed instance uses standard write method (relies on Radix Tree) | ||
| self.cache_manager.write_cache_to_storage(req) | ||
| except Exception as e: | ||
| llm_logger.warning(f"write_cache_to_storage failed req_id={req.request_id}: {e}") |
There was a problem hiding this comment.
日志消息格式与上方 _trigger_preempt 不一致(此处 write_cache_to_storage failed req_id=...,上方为 Failed to write cache to storage for preempted request ...),建议统一格式便于日志检索。
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
[FDConfig],[APIServer],[Engine],[Scheduler],[PD Disaggregation],[Executor],[Graph Optimization],[Speculative Decoding],[RL],[Models],[Quantization],[Loader],[OP],[KVCache],[DataProcessor],[BugFix],[Docs],[CI],[Optimization],[Feature],[Benchmark],[Others],[XPU],[HPU],[GCU],[DCU],[Iluvatar],[Metax]]pre-commitbefore commit.releasebranch, make sure the PR has been submitted to thedevelopbranch, then cherry-pick it to thereleasebranch with the[Cherry-Pick]PR tag.