[Feature] GPU Model Runner V1 #7810
Conversation
Thanks for your contribution!
The CI report is generated from the code below (updated every 30 minutes):
1 Task Overview: all required tasks passed; CI has no merge-blocking issues; approval recommended. (1 optional task failed; this does not block merging.)
2 Task Status Summary
2.1 Required tasks: 0/0 passed
2.2 Optional tasks: 1/2 passed
3 Failure Details (required only): no failed required tasks.
Codecov Report
❌ Patch coverage is …
Additional details and impacted files:
@@ Coverage Diff @@
## develop #7810 +/- ##
==========================================
Coverage ? 15.80%
==========================================
Files ? 474
Lines ? 65574
Branches ? 9963
==========================================
Hits ? 10366
Misses ? 54722
Partials ? 486
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-14 12:23:41
📋 Review Summary
PR overview: adds GPU Model Runner V1 (GPUModelRunnerV1), a new-generation GPU inference execution path, including a new KV cache write operator, a FlashInfer attention backend, and a sampling module.
Scope of changes: fastdeploy/worker/gpu/, custom_ops/gpu_ops/cache_kv/, fastdeploy/model_executor/layers/attention/, fastdeploy/config.py, fastdeploy/envs.py
Impact tags: [Feature] [OP] [FDConfig]
📝 PR Convention Check
All required sections of the PR description (Motivation / Modifications / Usage or Command / Accuracy Tests) are empty, leaving only the template comments, and no checklist items are checked. Please complete the description according to the template.
Suggested title (copy-paste ready):
[Feature] Add GPU Model Runner V1
Suggested PR description (copy-paste ready; it must reproduce the full structure of the checklist §D2 template):
## Motivation
Introduce GPU Model Runner V1 (GPUModelRunnerV1) as the new-generation execution path for GPU inference. The new path implements finer-grained batch management, KV cache writes, and sampling logic in a standalone `fastdeploy/worker/gpu/` submodule, and introduces a FlashInfer attention backend for efficient prefill/decode attention computation.
## Modifications
- Add the `fastdeploy/worker/gpu/` submodule with the core components `model_runner.py`, `input_batch.py`, `block_table.py`, `buffer_utils.py`, `forward_meta.py`, `request_state.py`, `async_output.py`, and `gather_tokens_kernel.py`
- Add `fastdeploy/worker/gpu/sampler/`: the sampling module `sampler.py`, `sampler_state.py`, `post_process.py`
- Add `fastdeploy/model_executor/layers/attention/flashinfer_backend.py`: a FlashInfer attention backend with separate prefill/decode planning and execution (plan/run)
- Add `custom_ops/gpu_ops/cache_kv/reshape_and_cache_flash.cu`: a CuTe-based KV cache write operator with two paths, direct copy (non-FP8) and dynamic FP8 E4M3 quantization
- Add `custom_ops/gpu_ops/macros.h`: a unified `FD_CUDA_CHECK` macro replacing the `CUDA_CHECK` macros scattered across files
- `custom_ops/gpu_ops/cpp_extensions.cc`: add the `get_cuda_view_from_cpu_tensor` and `reshape_and_cache_flash` Python bindings; remove the `CudaError` class
- `fastdeploy/config.py` / `fastdeploy/envs.py`: add the `max_bad_words_num` and `bad_words_max_len` environment-variable configuration
- `fastdeploy/worker/gpu_worker.py`, `worker/worker_process.py`: integrate the GPUModelRunnerV1 entry point, toggled by the `FD_ENABLE_GPU_MRV1` environment variable
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues
| Severity | File | Summary |
|---|---|---|
| 🟡 Suggestion | fastdeploy/envs.py:58 | `FD_MAX_BDA_WORDS_NUM` is a typo; should be `FD_MAX_BAD_WORDS_NUM` |
| 🟡 Suggestion | fastdeploy/envs.py:60 | `FD_BDA_WORDS_MAX_LEN` is a typo; should be `FD_BAD_WORDS_MAX_LEN` |
| 🟡 Suggestion | custom_ops/gpu_ops/cache_kv/reshape_and_cache_flash.cu | The FP8 kernel implicitly assumes an NHD input layout and lacks layout validation |
| 🟡 Suggestion | tests/operators/ | The new custom op `reshape_and_cache_flash` lacks unit tests (violates checklist A3) |
Overall Assessment
The PR's overall architecture is clear, the modular design of GPUModelRunnerV1 is sound, and unifying the FD_CUDA_CHECK macro is a welcome code-quality improvement. The main items to address are the spelling typos in the bad-words environment-variable names, the FP8 kernel's layout assumption, and the missing unit tests.
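For context on why the typo is silent: envs.py registers each variable as a `lambda` over `os.getenv` with a default, so a misspelled key simply falls back to its default instead of failing. A minimal standalone reproduction of the failure mode (the registry below mimics the pattern for illustration; it is not FastDeploy's actual module):

```python
import os

# Mimics the envs.py registry pattern: name -> lazy getter with a default.
ENVS = {
    # Misspelled key, as currently in the PR ("BDA" instead of "BAD").
    "FD_MAX_BDA_WORDS_NUM": lambda: int(os.getenv("FD_MAX_BDA_WORDS_NUM", "16")),
}

# A user sets the *correctly spelled* variable...
os.environ["FD_MAX_BAD_WORDS_NUM"] = "32"

# ...but the getter still returns the default, because it reads the
# misspelled name. No error is raised anywhere.
assert ENVS["FD_MAX_BDA_WORDS_NUM"]() == 16
```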
# Maximum length of stop sequences.
"FD_STOP_SEQS_MAX_LEN": lambda: int(os.getenv("FD_STOP_SEQS_MAX_LEN", "8")),
# Maximum number of bad words.
"FD_MAX_BDA_WORDS_NUM": lambda: int(os.getenv("FD_MAX_BDA_WORDS_NUM", "16")),
This comment was marked as outdated.
# Maximum number of bad words.
"FD_MAX_BDA_WORDS_NUM": lambda: int(os.getenv("FD_MAX_BDA_WORDS_NUM", "16")),
# Maximum length of bad words.
"FD_BDA_WORDS_MAX_LEN": lambda: int(os.getenv("FD_BDA_WORDS_MAX_LEN", "8")),
This comment was marked as outdated.
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-15 18:14:58
📋 Review Summary
PR overview: introduces GPU Model Runner V1 (MRV1): a new modular inference package fastdeploy/worker/gpu/, a new CuTe-based reshape_and_cache_flash KV cache write operator (supporting FP16/BF16 and dynamic FP8 E4M3 quantization), a FlashInfer attention backend, and a unified CUDA error-checking macro FD_CUDA_CHECK.
Scope of changes: custom_ops/gpu_ops/, fastdeploy/worker/gpu/, fastdeploy/config.py, fastdeploy/envs.py, fastdeploy/model_executor/layers/attention/
Impact tags: [Feature] [OP] [KVCache] [FDConfig]
Suggested split plan:
- PR 1: CUDA macro unification refactor — `macros.h`, `helper.h`, `get_block_shape_and_split_kv_block.cu`, `multiquery_decoder_attention_impl.cuh`, `wi4a16_weight_quantize.cu`, `cpp_extensions.cc` (macro-related parts)
- PR 2: new `reshape_and_cache_flash` custom op — `cache_kv/reshape_and_cache_flash.cu`, `setup_ops.py`, `cpp_extensions.cc` (operator registration)
- PR 3: GPU Model Runner V1 core — all new files under `fastdeploy/worker/gpu/` (11 files)
- PR 4: FDConfig / envs extension + worker adaptation — `config.py`, `envs.py`, `gpu_worker.py`, `worker_process.py`, `input_batch.py`
- PR 5: FlashInfer attention backend — `flashinfer_backend.py`, `pre_and_post_process.py`
Issues
| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/envs.py:58 | Environment-variable name typo: `FD_MAX_BDA_WORDS_NUM` should be `FD_MAX_BAD_WORDS_NUM`; users cannot configure bad words through the correctly named variable |
| 🔴 Bug | custom_ops/gpu_ops/cache_kv/reshape_and_cache_flash.cu | On GPUs with SM < 800, the FP8 kernel silently writes zeros into the KV cache, corrupting data without any error |
| 🟡 Suggestion | custom_ops/setup_ops.py:360 | The new custom op `reshape_and_cache_flash` has no unit tests under tests/operators/ (checklist A3) |
| 🟡 Suggestion | — | Large PR; consider splitting by feature (see the split plan above) |
On the `reshape_and_cache_flash.cu` FP8 silent zero-write: in `reshape_and_cache_flash_cute_fp8_kernel`, when `__CUDA_ARCH__ < 800`, both `gK_dst(tid)` and `gV_dst(tid)` are written with the constant 0 instead of raising an error. If the FP8 path is invoked on a GPU with SM < 800, the entire KV cache is silently corrupted. Suggest adding an SM check at the host-side `LaunchFP8Kernel` entry point:
int device; int major;
cudaGetDevice(&device);
cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device);
PD_CHECK(major >= 8, "FP8 KV cache requires SM80+, got SM", major * 10);
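For reviewers unfamiliar with the FP8 path, here is a rough numpy emulation of what dynamic FP8 E4M3 quantization of a KV tile amounts to. This sketch assumes one scale per tile and emulates only the range rescaling, not the precision loss from E4M3's 3-bit mantissa; the kernel's actual scale granularity should be confirmed in the .cu file:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in float8 E4M3

def dynamic_fp8_e4m3_quantize(x: np.ndarray):
    """Emulate dynamic-range FP8 E4M3 quantization of one KV tile.

    Returns the rescaled values (still float32 here, clipped to the E4M3
    range) and the scale needed to dequantize them.
    """
    scale = max(np.abs(x).max() / E4M3_MAX, 1e-12)  # "dynamic": derived from the data
    q = np.clip(x / scale, -E4M3_MAX, E4M3_MAX)
    return q, scale

x = np.random.default_rng(0).standard_normal((16, 128)).astype(np.float32)
q, scale = dynamic_fp8_e4m3_quantize(x)
# Dequantize and check the round-trip error stays small (only float32
# rounding here, since mantissa truncation is not emulated).
assert np.allclose(q * scale, x, atol=1e-3)
```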
📝 PR Convention Check
All required sections of the PR body (## Motivation, ## Modifications, ## Usage or Command, ## Accuracy Tests) are empty, with only the template comment placeholders left and no actual content. The [Feature] tag in the title is compliant and correctly formatted.
Suggested PR description (copy-paste ready):
## Motivation
Introduce GPU Model Runner V1 (MRV1): alongside the existing `gpu_model_runner.py`, create a new `fastdeploy/worker/gpu/` package that provides a more modular, maintainable GPU inference path. Also add a CuTe-based `reshape_and_cache_flash` KV cache write operator (supporting FP16/BF16 and dynamic FP8 E4M3 quantization) and a FlashInfer attention backend, and unify the CUDA error-checking macros into `FD_CUDA_CHECK`.
## Modifications
- `custom_ops/gpu_ops/macros.h`: add a unified `FD_CUDA_CHECK` macro and remove the `CUDA_CHECK` / `CHECK` definitions scattered across files
- `custom_ops/gpu_ops/cache_kv/reshape_and_cache_flash.cu`: add a CuTe-based KV cache write operator supporting NHD/HND layouts and dynamic FP8 E4M3 quantization
- `custom_ops/setup_ops.py`: add `reshape_and_cache_flash.cu` to the build list
- `custom_ops/gpu_ops/cpp_extensions.cc`: register the new operator; add the `copy_array_to_tensor`, `get_cuda_view_from_cpu_tensor`, and `numpy_view_of_cpu_tensor` Python bindings
- `fastdeploy/worker/gpu/` (new package): contains the GPU Model Runner V1 core components `model_runner.py`, `input_batch.py`, `block_table.py`, `forward_meta.py`, `request_state.py`, `sampler/`, etc.
- `fastdeploy/model_executor/layers/attention/flashinfer_backend.py`: add a FlashInfer paged attention backend
- `fastdeploy/config.py` / `fastdeploy/envs.py`: add the `max_bad_words_num` / `bad_words_max_len` configuration fields
- `fastdeploy/worker/gpu_worker.py` / `worker_process.py`: adapt to the MRV1 path
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Overall Assessment
The PR implements the core framework of GPU Model Runner V1; the code structure is clear and the CuTe kernel implementation is of high quality. However, two P0 bugs must be fixed before merging: the environment-variable typo (BDA → BAD) and the FP8 path's silent zero-write. Adding unit tests and completing the PR description are also recommended for future maintainability.
# Maximum length of stop sequences.
"FD_STOP_SEQS_MAX_LEN": lambda: int(os.getenv("FD_STOP_SEQS_MAX_LEN", "8")),
# Maximum number of bad words.
"FD_MAX_BDA_WORDS_NUM": lambda: int(os.getenv("FD_MAX_BDA_WORDS_NUM", "16")),
🔴 Bug Environment-variable name typo: `FD_MAX_BDA_WORDS_NUM` should be `FD_MAX_BAD_WORDS_NUM` (BDA → BAD).
Likewise, `FD_BDA_WORDS_MAX_LEN` on the adjacent line should be `FD_BAD_WORDS_MAX_LEN`.
The field names in config.py already correctly use `max_bad_words_num` / `bad_words_max_len`, but the corresponding env var names are misspelled, so a user setting `FD_MAX_BAD_WORDS_NUM=32` has no effect at all (the default of 16 is always used).
Suggested fix:
# Maximum number of bad words.
"FD_MAX_BAD_WORDS_NUM": lambda: int(os.getenv("FD_MAX_BAD_WORDS_NUM", "16")),
# Maximum length of bad words.
"FD_BAD_WORDS_MAX_LEN": lambda: int(os.getenv("FD_BAD_WORDS_MAX_LEN", "8")),同步更新 config.py 中的引用:
self.max_bad_words_num = envs.FD_MAX_BAD_WORDS_NUM
self.bad_words_max_len = envs.FD_BAD_WORDS_MAX_LEN
]

# cache_kv
sources += ["gpu_ops/cache_kv/reshape_and_cache_flash.cu"]
🟡 Suggestion The new custom op reshape_and_cache_flash has no unit tests under tests/operators/ (checklist A3).
Suggest adding tests under tests/operators/ covering the following scenarios (a reference oracle is sketched after this list):
- NHD layout (head_stride == head_dim) and HND layout (head_stride > head_dim)
- both template paths: head_dim = 64 and head_dim = 128
- the kv_cache_dtype = 'auto' (non-FP8) path
- the kv_cache_dtype = 'fp8_e4m3' (dynamic quantization) path
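As a starting point, here is a numpy oracle for the non-FP8 scatter semantics that such a test could compare the op's output against. The shapes and the flat-slot convention below are assumptions modeled on common reshape_and_cache implementations, not confirmed against this PR's kernel:

```python
import numpy as np

def reference_reshape_and_cache(key, value, key_cache, value_cache, slot_mapping):
    """Numpy oracle: scatter per-token K/V into a paged cache.

    Assumed shapes (NHD layout -- an assumption, confirm against the op):
      key / value:             [num_tokens, num_heads, head_dim]
      key_cache / value_cache: [num_blocks, block_size, num_heads, head_dim]
      slot_mapping:            [num_tokens], flat slot = block_idx * block_size + offset
    """
    block_size = key_cache.shape[1]
    for i, slot in enumerate(slot_mapping):
        b, off = divmod(int(slot), block_size)
        key_cache[b, off] = key[i]
        value_cache[b, off] = value[i]

# Self-check: each token lands in the block/offset its slot encodes.
rng = np.random.default_rng(0)
num_tokens, num_heads, head_dim, num_blocks, block_size = 5, 4, 64, 8, 16
key = rng.standard_normal((num_tokens, num_heads, head_dim), dtype=np.float32)
value = rng.standard_normal((num_tokens, num_heads, head_dim), dtype=np.float32)
key_cache = np.zeros((num_blocks, block_size, num_heads, head_dim), np.float32)
value_cache = np.zeros_like(key_cache)
slots = rng.choice(num_blocks * block_size, size=num_tokens, replace=False)
reference_reshape_and_cache(key, value, key_cache, value_cache, slots)
b, off = divmod(int(slots[0]), block_size)
assert np.array_equal(key_cache[b, off], key[0])
```

A real test would run the custom op on the same inputs and assert its cache matches this oracle (exactly for the 'auto' path, within quantization tolerance for 'fp8_e4m3').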
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- [ ] Format your code, run `pre-commit` before commit.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.