[Feature] GPU Model Runner V1 #7810
Conversation
Thanks for your contribution!
The CI report is generated from the code below (updated every 30 minutes):
1 Task Overview: all required tasks passed; CI has no merge-blocking issues; approval recommended. (1 optional task failed; this does not block merging.)
2 Task Status Summary
2.1 Required tasks: 0/0 passed
2.2 Optional tasks: 1/2 passed
3 Failure Details (required only): no failed required tasks.
Codecov Report
❌ Patch coverage is …
Additional details and impacted files:
@@ Coverage Diff @@
## develop #7810 +/- ##
==========================================
Coverage ? 15.80%
==========================================
Files ? 474
Lines ? 65574
Branches ? 9963
==========================================
Hits ? 10366
Misses ? 54722
Partials ? 486
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-14 12:23:41
📋 Review Summary
PR overview: adds GPU Model Runner V1 (GPUModelRunnerV1), a new-generation GPU inference execution path, including a new KV cache write operator, a FlashInfer attention backend, and a sampling module.
Scope of changes: fastdeploy/worker/gpu/, custom_ops/gpu_ops/cache_kv/, fastdeploy/model_executor/layers/attention/, fastdeploy/config.py, fastdeploy/envs.py
Impact tags: [Feature] [OP] [FDConfig]
📝 PR Convention Check
All required sections of the PR description (Motivation / Modifications / Usage or Command / Accuracy Tests) are empty, leaving only the template comments, and no checklist items are checked. Please complete the description according to the template.
Suggested title (copy-paste ready):
[Feature] Add GPU Model Runner V1
Suggested PR description (copy-paste ready; it must reproduce the full structure of the checklist §D2 template):
## Motivation
Introduce GPU Model Runner V1 (GPUModelRunnerV1) as the new-generation execution path for GPU inference. The new path implements finer-grained batch management, KV cache writes, and sampling logic in a standalone `fastdeploy/worker/gpu/` submodule, and introduces a FlashInfer attention backend for efficient prefill/decode attention computation.
## Modifications
- Add the `fastdeploy/worker/gpu/` submodule with the core components `model_runner.py`, `input_batch.py`, `block_table.py`, `buffer_utils.py`, `forward_meta.py`, `request_state.py`, `async_output.py`, and `gather_tokens_kernel.py`
- Add `fastdeploy/worker/gpu/sampler/`: the sampling module `sampler.py`, `sampler_state.py`, `post_process.py`
- Add `fastdeploy/model_executor/layers/attention/flashinfer_backend.py`: a FlashInfer attention backend with separate prefill/decode planning and execution (plan/run)
- Add `custom_ops/gpu_ops/cache_kv/reshape_and_cache_flash.cu`: a CuTe-based KV cache write operator with two paths, direct copy (non-FP8) and dynamic FP8 E4M3 quantization
- Add `custom_ops/gpu_ops/macros.h`: a unified `FD_CUDA_CHECK` macro replacing the `CUDA_CHECK` macros scattered across files
- `custom_ops/gpu_ops/cpp_extensions.cc`: add the `get_cuda_view_from_cpu_tensor` and `reshape_and_cache_flash` Python bindings; remove the `CudaError` class
- `fastdeploy/config.py` / `fastdeploy/envs.py`: add the `max_bad_words_num` and `bad_words_max_len` environment-variable configuration
- `fastdeploy/worker/gpu_worker.py`, `worker/worker_process.py`: integrate the GPUModelRunnerV1 entry point, toggled by the `FD_ENABLE_GPU_MRV1` environment variable
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues
| Severity | File | Summary |
|---|---|---|
| 🟡 Suggestion | fastdeploy/envs.py:58 | `FD_MAX_BDA_WORDS_NUM` is a typo; should be `FD_MAX_BAD_WORDS_NUM` |
| 🟡 Suggestion | fastdeploy/envs.py:60 | `FD_BDA_WORDS_MAX_LEN` is a typo; should be `FD_BAD_WORDS_MAX_LEN` |
| 🟡 Suggestion | custom_ops/gpu_ops/cache_kv/reshape_and_cache_flash.cu | The FP8 kernel implicitly assumes an NHD input layout and lacks layout validation |
| 🟡 Suggestion | tests/operators/ | The new custom op `reshape_and_cache_flash` lacks unit tests (violates checklist A3) |
Overall Assessment
The PR's overall architecture is clear, the modular design of GPUModelRunnerV1 is sound, and unifying the FD_CUDA_CHECK macro is a welcome code-quality improvement. The main items to address are the spelling typos in the bad-words environment-variable names, the FP8 kernel's layout assumption, and the missing unit tests.
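For context on why the typo is silent: envs.py registers each variable as a `lambda` over `os.getenv` with a default, so a misspelled key simply falls back to its default instead of failing. A minimal standalone reproduction of the failure mode (the registry below mimics the pattern for illustration; it is not FastDeploy's actual module):

```python
import os

# Mimics the envs.py registry pattern: name -> lazy getter with a default.
ENVS = {
    # Misspelled key, as currently in the PR ("BDA" instead of "BAD").
    "FD_MAX_BDA_WORDS_NUM": lambda: int(os.getenv("FD_MAX_BDA_WORDS_NUM", "16")),
}

# A user sets the *correctly spelled* variable...
os.environ["FD_MAX_BAD_WORDS_NUM"] = "32"

# ...but the getter still returns the default, because it reads the
# misspelled name. No error is raised anywhere.
assert ENVS["FD_MAX_BDA_WORDS_NUM"]() == 16
```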
# Maximum length of stop sequences.
"FD_STOP_SEQS_MAX_LEN": lambda: int(os.getenv("FD_STOP_SEQS_MAX_LEN", "8")),
# Maximum number of bad words.
"FD_MAX_BDA_WORDS_NUM": lambda: int(os.getenv("FD_MAX_BDA_WORDS_NUM", "16")),
This comment was marked as outdated.
# Maximum number of bad words.
"FD_MAX_BDA_WORDS_NUM": lambda: int(os.getenv("FD_MAX_BDA_WORDS_NUM", "16")),
# Maximum length of bad words.
"FD_BDA_WORDS_MAX_LEN": lambda: int(os.getenv("FD_BDA_WORDS_MAX_LEN", "8")),
This comment was marked as outdated.
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-15 18:14:58
📋 Review Summary
PR overview: introduces GPU Model Runner V1 (MRV1): a new modular inference package fastdeploy/worker/gpu/, a new CuTe-based reshape_and_cache_flash KV cache write operator (supporting FP16/BF16 and dynamic FP8 E4M3 quantization), a FlashInfer attention backend, and a unified CUDA error-checking macro FD_CUDA_CHECK.
Scope of changes: custom_ops/gpu_ops/, fastdeploy/worker/gpu/, fastdeploy/config.py, fastdeploy/envs.py, fastdeploy/model_executor/layers/attention/
Impact tags: [Feature] [OP] [KVCache] [FDConfig]
Suggested split plan:
- PR 1: CUDA macro unification refactor — `macros.h`, `helper.h`, `get_block_shape_and_split_kv_block.cu`, `multiquery_decoder_attention_impl.cuh`, `wi4a16_weight_quantize.cu`, `cpp_extensions.cc` (macro-related parts)
- PR 2: new `reshape_and_cache_flash` custom op — `cache_kv/reshape_and_cache_flash.cu`, `setup_ops.py`, `cpp_extensions.cc` (operator registration)
- PR 3: GPU Model Runner V1 core — all new files under `fastdeploy/worker/gpu/` (11 files)
- PR 4: FDConfig / envs extension + worker adaptation — `config.py`, `envs.py`, `gpu_worker.py`, `worker_process.py`, `input_batch.py`
- PR 5: FlashInfer attention backend — `flashinfer_backend.py`, `pre_and_post_process.py`
Issues
| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/envs.py:58 | Environment-variable name typo: `FD_MAX_BDA_WORDS_NUM` should be `FD_MAX_BAD_WORDS_NUM`; users cannot configure bad words through the correctly named variable |
| 🔴 Bug | custom_ops/gpu_ops/cache_kv/reshape_and_cache_flash.cu | On GPUs with SM < 800, the FP8 kernel silently writes zeros into the KV cache, corrupting data without any error |
| 🟡 Suggestion | custom_ops/setup_ops.py:360 | The new custom op `reshape_and_cache_flash` has no unit tests under tests/operators/ (checklist A3) |
| 🟡 Suggestion | — | Large PR; consider splitting by feature (see the split plan above) |
On the `reshape_and_cache_flash.cu` FP8 silent zero-write: in `reshape_and_cache_flash_cute_fp8_kernel`, when `__CUDA_ARCH__ < 800`, both `gK_dst(tid)` and `gV_dst(tid)` are written with the constant 0 instead of raising an error. If the FP8 path is invoked on a GPU with SM < 800, the entire KV cache is silently corrupted. Suggest adding an SM check at the host-side `LaunchFP8Kernel` entry point:
int device; int major;
cudaGetDevice(&device);
cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device);
PD_CHECK(major >= 8, "FP8 KV cache requires SM80+, got SM", major * 10);
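For reviewers unfamiliar with the FP8 path, here is a rough numpy emulation of what dynamic FP8 E4M3 quantization of a KV tile amounts to. This sketch assumes one scale per tile and emulates only the range rescaling, not the precision loss from E4M3's 3-bit mantissa; the kernel's actual scale granularity should be confirmed in the .cu file:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in float8 E4M3

def dynamic_fp8_e4m3_quantize(x: np.ndarray):
    """Emulate dynamic-range FP8 E4M3 quantization of one KV tile.

    Returns the rescaled values (still float32 here, clipped to the E4M3
    range) and the scale needed to dequantize them.
    """
    scale = max(np.abs(x).max() / E4M3_MAX, 1e-12)  # "dynamic": derived from the data
    q = np.clip(x / scale, -E4M3_MAX, E4M3_MAX)
    return q, scale

x = np.random.default_rng(0).standard_normal((16, 128)).astype(np.float32)
q, scale = dynamic_fp8_e4m3_quantize(x)
# Dequantize and check the round-trip error stays small (only float32
# rounding here, since mantissa truncation is not emulated).
assert np.allclose(q * scale, x, atol=1e-3)
```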
📝 PR Convention Check
All required sections of the PR body (## Motivation, ## Modifications, ## Usage or Command, ## Accuracy Tests) are empty, with only the template comment placeholders left and no actual content. The [Feature] tag in the title is compliant and correctly formatted.
Suggested PR description (copy-paste ready):
## Motivation
Introduce GPU Model Runner V1 (MRV1): alongside the existing `gpu_model_runner.py`, create a new `fastdeploy/worker/gpu/` package that provides a more modular, maintainable GPU inference path. Also add a CuTe-based `reshape_and_cache_flash` KV cache write operator (supporting FP16/BF16 and dynamic FP8 E4M3 quantization) and a FlashInfer attention backend, and unify the CUDA error-checking macros into `FD_CUDA_CHECK`.
## Modifications
- `custom_ops/gpu_ops/macros.h`: add a unified `FD_CUDA_CHECK` macro and remove the `CUDA_CHECK` / `CHECK` definitions scattered across files
- `custom_ops/gpu_ops/cache_kv/reshape_and_cache_flash.cu`: add a CuTe-based KV cache write operator supporting NHD/HND layouts and dynamic FP8 E4M3 quantization
- `custom_ops/setup_ops.py`: add `reshape_and_cache_flash.cu` to the build list
- `custom_ops/gpu_ops/cpp_extensions.cc`: register the new operator; add the `copy_array_to_tensor`, `get_cuda_view_from_cpu_tensor`, and `numpy_view_of_cpu_tensor` Python bindings
- `fastdeploy/worker/gpu/` (new package): contains the GPU Model Runner V1 core components `model_runner.py`, `input_batch.py`, `block_table.py`, `forward_meta.py`, `request_state.py`, `sampler/`, etc.
- `fastdeploy/model_executor/layers/attention/flashinfer_backend.py`: add a FlashInfer paged attention backend
- `fastdeploy/config.py` / `fastdeploy/envs.py`: add the `max_bad_words_num` / `bad_words_max_len` configuration fields
- `fastdeploy/worker/gpu_worker.py` / `worker_process.py`: adapt to the MRV1 path
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Overall Assessment
The PR implements the core framework of GPU Model Runner V1; the code structure is clear and the CuTe kernel implementation is of high quality. However, two P0 bugs must be fixed before merging: the environment-variable typo (BDA → BAD) and the FP8 path's silent zero-write. Adding unit tests and completing the PR description are also recommended for future maintainability.
# Maximum length of stop sequences.
"FD_STOP_SEQS_MAX_LEN": lambda: int(os.getenv("FD_STOP_SEQS_MAX_LEN", "8")),
# Maximum number of bad words.
"FD_MAX_BDA_WORDS_NUM": lambda: int(os.getenv("FD_MAX_BDA_WORDS_NUM", "16")),
🔴 Bug Environment-variable name typo: `FD_MAX_BDA_WORDS_NUM` should be `FD_MAX_BAD_WORDS_NUM` (BDA → BAD).
Likewise, `FD_BDA_WORDS_MAX_LEN` on the adjacent line should be `FD_BAD_WORDS_MAX_LEN`.
The field names in config.py already correctly use `max_bad_words_num` / `bad_words_max_len`, but the corresponding env var names are misspelled, so a user setting `FD_MAX_BAD_WORDS_NUM=32` has no effect at all (the default of 16 is always used).
Suggested fix:
# Maximum number of bad words.
"FD_MAX_BAD_WORDS_NUM": lambda: int(os.getenv("FD_MAX_BAD_WORDS_NUM", "16")),
# Maximum length of bad words.
"FD_BAD_WORDS_MAX_LEN": lambda: int(os.getenv("FD_BAD_WORDS_MAX_LEN", "8")),同步更新 config.py 中的引用:
self.max_bad_words_num = envs.FD_MAX_BAD_WORDS_NUM
self.bad_words_max_len = envs.FD_BAD_WORDS_MAX_LEN
]

# cache_kv
sources += ["gpu_ops/cache_kv/reshape_and_cache_flash.cu"]
🟡 Suggestion The new custom op reshape_and_cache_flash has no unit tests under tests/operators/ (checklist A3).
Suggest adding tests under tests/operators/ covering the following scenarios (a reference oracle is sketched after this list):
- NHD layout (head_stride == head_dim) and HND layout (head_stride > head_dim)
- both template paths: head_dim = 64 and head_dim = 128
- the kv_cache_dtype = 'auto' (non-FP8) path
- the kv_cache_dtype = 'fp8_e4m3' (dynamic quantization) path
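As a starting point, here is a numpy oracle for the non-FP8 scatter semantics that such a test could compare the op's output against. The shapes and the flat-slot convention below are assumptions modeled on common reshape_and_cache implementations, not confirmed against this PR's kernel:

```python
import numpy as np

def reference_reshape_and_cache(key, value, key_cache, value_cache, slot_mapping):
    """Numpy oracle: scatter per-token K/V into a paged cache.

    Assumed shapes (NHD layout -- an assumption, confirm against the op):
      key / value:             [num_tokens, num_heads, head_dim]
      key_cache / value_cache: [num_blocks, block_size, num_heads, head_dim]
      slot_mapping:            [num_tokens], flat slot = block_idx * block_size + offset
    """
    block_size = key_cache.shape[1]
    for i, slot in enumerate(slot_mapping):
        b, off = divmod(int(slot), block_size)
        key_cache[b, off] = key[i]
        value_cache[b, off] = value[i]

# Self-check: each token lands in the block/offset its slot encodes.
rng = np.random.default_rng(0)
num_tokens, num_heads, head_dim, num_blocks, block_size = 5, 4, 64, 8, 16
key = rng.standard_normal((num_tokens, num_heads, head_dim), dtype=np.float32)
value = rng.standard_normal((num_tokens, num_heads, head_dim), dtype=np.float32)
key_cache = np.zeros((num_blocks, block_size, num_heads, head_dim), np.float32)
value_cache = np.zeros_like(key_cache)
slots = rng.choice(num_blocks * block_size, size=num_tokens, replace=False)
reference_reshape_and_cache(key, value, key_cache, value_cache, slots)
b, off = divmod(int(slots[0]), block_size)
assert np.array_equal(key_cache[b, off], key[0])
```

A real test would run the custom op on the same inputs and assert its cache matches this oracle (exactly for the 'auto' path, within quantization tolerance for 'fp8_e4m3').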
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- [ ] Format your code, run `pre-commit` before commit.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.