
[Feature] GPU Model Runner V1 #7810

Draft
ming1753 wants to merge 1 commit into PaddlePaddle:develop from ming1753:mrv1

Conversation

@ming1753
Collaborator

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot Bot commented May 13, 2026

Thanks for your contribution!

@ming1753 ming1753 marked this pull request as draft May 13, 2026 16:07
@ming1753 ming1753 changed the title 0513 code backup [Feature] GPU Model Runner V1 May 13, 2026
@PaddlePaddle-bot

PaddlePaddle-bot commented May 13, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-15 18:36:41

CI report generated from the following code (updated every 30 minutes):


1 Task Overview

All required tasks have passed; CI has no merge-blocking issues and the PR is recommended to pass. (1 optional task failed, which does not block merging.)

Total runs (reruns) Total tasks ✅ Passed ❌ Failed ⏳ Running ⏸️ Waiting Skipped
2(0) 2 1 1 0 0 0

2 Task Status Summary

2.1 Required tasks: 0/0 passed

No required tasks (GitHub Branch Protection Rules have no required checks configured, or API permissions are limited).

2.2 Optional tasks — 1/2 passed

Optional tasks do not block merging; failures are for reference only.

Status Task Duration Log Rerun
Trigger Jenkins for PR 12m9s Job -
The remaining 1 optional task passed - - -

3 Failure Details (required only)

No required task failures.

PaddlePaddle-bot

This comment was marked as outdated.

@codecov-commenter

codecov-commenter commented May 13, 2026

Codecov Report

❌ Patch coverage is 0.19973% with 1499 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@a0141b9). Learn more about missing BASE report.

Files with missing lines Patch % Lines
fastdeploy/worker/gpu/model_runner.py 0.00% 478 Missing ⚠️
fastdeploy/worker/gpu/input_batch.py 0.00% 189 Missing ⚠️
...el_executor/layers/attention/flashinfer_backend.py 0.00% 183 Missing ⚠️
fastdeploy/worker/gpu/buffer_utils.py 0.00% 156 Missing ⚠️
fastdeploy/worker/gpu/sampler/post_process.py 0.00% 105 Missing ⚠️
fastdeploy/worker/gpu/gather_tokens_kernel.py 0.00% 82 Missing ⚠️
fastdeploy/worker/gpu/sampler/sampler_state.py 0.00% 73 Missing ⚠️
fastdeploy/worker/gpu/block_table.py 0.00% 67 Missing ⚠️
fastdeploy/worker/gpu/request_state.py 0.00% 58 Missing ⚠️
fastdeploy/worker/gpu/async_output.py 0.00% 44 Missing ⚠️
... and 7 more
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7810   +/-   ##
==========================================
  Coverage           ?   15.80%           
==========================================
  Files              ?      474           
  Lines              ?    65574           
  Branches           ?     9963           
==========================================
  Hits               ?    10366           
  Misses             ?    54722           
  Partials           ?      486           
Flag Coverage Δ
XPU 15.80% <0.19%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.



@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-14 12:23:41

📋 Review Summary

PR overview: Adds GPU Model Runner V1 (GPUModelRunnerV1), a new-generation GPU inference execution path, including a new KV cache write op, a FlashInfer Attention Backend, and a sampling module.

Change scope: fastdeploy/worker/gpu/, custom_ops/gpu_ops/cache_kv/, fastdeploy/model_executor/layers/attention/, fastdeploy/config.py, fastdeploy/envs.py

Impact tags: [Feature] [OP] [FDConfig]


📝 PR Compliance Check

All required sections of the PR description (Motivation / Modifications / Usage or Command / Accuracy Tests) are empty, with only the template comments left in place, and no checklist items are checked. The description needs to be completed per the template.

Suggested title (copy-paste ready):

  • [Feature] Add GPU Model Runner V1

Suggested PR description (copy-paste ready; it must reproduce the full structure of the checklist §D2 template):

## Motivation

Introduce GPU Model Runner V1 (GPUModelRunnerV1) as the new-generation execution path for GPU inference. The new path implements finer-grained batch management, KV cache writes, and sampling logic in a standalone `fastdeploy/worker/gpu/` submodule, and introduces a FlashInfer Attention Backend for efficient Prefill/Decode attention computation.

## Modifications

- Add the `fastdeploy/worker/gpu/` submodule: core components such as `model_runner.py`, `input_batch.py`, `block_table.py`, `buffer_utils.py`, `forward_meta.py`, `request_state.py`, `async_output.py`, and `gather_tokens_kernel.py`
- Add the `fastdeploy/worker/gpu/sampler/` sampling module: `sampler.py`, `sampler_state.py`, `post_process.py`
- Add `fastdeploy/model_executor/layers/attention/flashinfer_backend.py`: a FlashInfer Attention Backend with staged Prefill/Decode planning (plan/run)
- Add `custom_ops/gpu_ops/cache_kv/reshape_and_cache_flash.cu`: a CuTe-based KV cache write op with two paths, direct copy (non-FP8) and dynamic FP8 E4M3 quantization
- Add `custom_ops/gpu_ops/macros.h`: a unified `FD_CUDA_CHECK` macro replacing the `CUDA_CHECK` definitions scattered across files
- `custom_ops/gpu_ops/cpp_extensions.cc`: add `get_cuda_view_from_cpu_tensor` and `reshape_and_cache_flash` Python bindings; remove `CudaError`
- `fastdeploy/config.py` / `fastdeploy/envs.py`: add `max_bad_words_num` and `bad_words_max_len` environment variable configuration
- `fastdeploy/worker/gpu_worker.py` and `worker/worker_process.py`: integrate the GPUModelRunnerV1 entry point, toggled via the `FD_ENABLE_GPU_MRV1` environment variable

## Usage or Command

N/A

## Accuracy Tests

N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues

Level File Summary
🟡 Suggestion fastdeploy/envs.py:58 Typo: FD_MAX_BDA_WORDS_NUM should be FD_MAX_BAD_WORDS_NUM
🟡 Suggestion fastdeploy/envs.py:60 Typo: FD_BDA_WORDS_MAX_LEN should be FD_BAD_WORDS_MAX_LEN
🟡 Suggestion custom_ops/gpu_ops/cache_kv/reshape_and_cache_flash.cu FP8 kernel implicitly assumes an NHD input layout and performs no layout validation
🟡 Suggestion tests/operators/ The new custom op reshape_and_cache_flash lacks unit tests (violates checklist A3)

Overall Assessment

The overall architecture of this PR is clear, the modular design of GPUModelRunnerV1 is sound, and unifying the FD_CUDA_CHECK macro is a good code-quality improvement. The main items to address are the typos in the bad-words environment variable names, the FP8 kernel's layout assumption, and the missing unit tests.

Comment thread fastdeploy/envs.py
# Maximum length of stop sequences.
"FD_STOP_SEQS_MAX_LEN": lambda: int(os.getenv("FD_STOP_SEQS_MAX_LEN", "8")),
# Maximum number of bad words.
"FD_MAX_BDA_WORDS_NUM": lambda: int(os.getenv("FD_MAX_BDA_WORDS_NUM", "16")),

This comment was marked as outdated.

Comment thread fastdeploy/envs.py
# Maximum number of bad words.
"FD_MAX_BDA_WORDS_NUM": lambda: int(os.getenv("FD_MAX_BDA_WORDS_NUM", "16")),
# Maximum length of bad words.
"FD_BDA_WORDS_MAX_LEN": lambda: int(os.getenv("FD_BDA_WORDS_MAX_LEN", "8")),

This comment was marked as outdated.


@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-15 18:14:58

📋 Review Summary

PR overview: Introduces GPU Model Runner V1 (MRV1): a new modular inference package fastdeploy/worker/gpu/, a new CuTe-based reshape_and_cache_flash KV cache write op (supporting FP16/BF16 and dynamic FP8 E4M3 quantization), a FlashInfer Attention Backend, and unification of the CUDA error-check macro as FD_CUDA_CHECK.

Change scope: custom_ops/gpu_ops/, fastdeploy/worker/gpu/, fastdeploy/config.py, fastdeploy/envs.py, fastdeploy/model_executor/layers/attention/

Impact tags: [Feature] [OP] [KVCache] [FDConfig]

⚠️ This PR is large (29 files / 5021 lines); consider splitting it to reduce review difficulty and merge risk.

Suggested Split Plan

  • PR 1: CUDA macro unification refactor — macros.h, helper.h, get_block_shape_and_split_kv_block.cu, multiquery_decoder_attention_impl.cuh, wi4a16_weight_quantize.cu, cpp_extensions.cc (macro-related parts)
  • PR 2: New reshape_and_cache_flash custom op — cache_kv/reshape_and_cache_flash.cu, setup_ops.py, cpp_extensions.cc (op registration)
  • PR 3: GPU Model Runner V1 core — all new files under fastdeploy/worker/gpu/ (11 files)
  • PR 4: FDConfig / Envs extension + worker adaptation — config.py, envs.py, gpu_worker.py, worker_process.py, input_batch.py
  • PR 5: FlashInfer Attention Backend — flashinfer_backend.py, pre_and_post_process.py

Issues

Level File Summary
🔴 Bug fastdeploy/envs.py:58 Environment variable name typo: FD_MAX_BDA_WORDS_NUM should be FD_MAX_BAD_WORDS_NUM; users cannot configure bad words via the correct variable
🔴 Bug custom_ops/gpu_ops/cache_kv/reshape_and_cache_flash.cu FP8 kernel silently writes zeros to the KV cache on SM < 800 GPUs, silently corrupting the data
🟡 Suggestion custom_ops/setup_ops.py:360 The new custom op reshape_and_cache_flash has no unit tests under tests/operators/ (checklist A3)
🟡 Suggestion — This large PR should be split by feature (see the Suggested Split Plan above)

On the silent FP8 zero-write in reshape_and_cache_flash.cu: in reshape_and_cache_flash_cute_fp8_kernel, when __CUDA_ARCH__ < 800, both gK_dst(tid) and gV_dst(tid) are written with the constant 0 rather than raising an error; if the FP8 path is invoked on an SM < 800 GPU, the entire KV cache is silently corrupted. Consider adding an SM check at the host-side LaunchFP8Kernel entry point:

int device = 0;
int major = 0;
cudaGetDevice(&device);
// Query the compute capability major version of the current device.
cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device);
PD_CHECK(major >= 8, "FP8 KV cache requires SM80+, got SM", major * 10);

📝 PR Compliance Check

All required sections of the PR body (## Motivation, ## Modifications, ## Usage or Command, ## Accuracy Tests) are empty, with only the template comment placeholders left and no actual content filled in. The [Feature] tag in the title is valid and correctly formatted.

Suggested PR description (copy-paste ready):

## Motivation
Introduce GPU Model Runner V1 (MRV1): a new `fastdeploy/worker/gpu/` package alongside the existing `gpu_model_runner.py`, providing a more modular and maintainable GPU inference path. Also add a CuTe-based `reshape_and_cache_flash` KV cache write op (supporting FP16/BF16 and dynamic FP8 E4M3 quantization) and a FlashInfer Attention Backend, and unify the CUDA error-check macro as `FD_CUDA_CHECK`.

## Modifications
- `custom_ops/gpu_ops/macros.h`: add a unified `FD_CUDA_CHECK` macro, removing the scattered per-file `CUDA_CHECK` / `CHECK` definitions
- `custom_ops/gpu_ops/cache_kv/reshape_and_cache_flash.cu`: add a CuTe-based KV cache write op supporting NHD/HND layouts and dynamic FP8 E4M3 quantization
- `custom_ops/setup_ops.py`: add `reshape_and_cache_flash.cu` to the build list
- `custom_ops/gpu_ops/cpp_extensions.cc`: register the new op; add `copy_array_to_tensor`, `get_cuda_view_from_cpu_tensor`, and `numpy_view_of_cpu_tensor` Python bindings
- `fastdeploy/worker/gpu/` (new package): contains `model_runner.py`, `input_batch.py`, `block_table.py`, `forward_meta.py`, `request_state.py`, `sampler/`, and other GPU Model Runner V1 core components
- `fastdeploy/model_executor/layers/attention/flashinfer_backend.py`: add a FlashInfer Paged Attention Backend
- `fastdeploy/config.py` / `fastdeploy/envs.py`: add `max_bad_words_num` / `bad_words_max_len` config fields
- `fastdeploy/worker/gpu_worker.py` / `worker_process.py`: adapt the MRV1 path

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Overall Assessment

The PR implements the core framework of GPU Model Runner V1; the code structure is clear and the CuTe kernel implementation is of high quality. However, two P0 bugs must be fixed before merging: the environment variable typo (BDA → BAD) and the silent zero-write on the FP8 path. Unit tests should also be added and the PR description completed for future maintainability.

Comment thread fastdeploy/envs.py
# Maximum length of stop sequences.
"FD_STOP_SEQS_MAX_LEN": lambda: int(os.getenv("FD_STOP_SEQS_MAX_LEN", "8")),
# Maximum number of bad words.
"FD_MAX_BDA_WORDS_NUM": lambda: int(os.getenv("FD_MAX_BDA_WORDS_NUM", "16")),


🔴 Bug Environment variable name typo: FD_MAX_BDA_WORDS_NUM should be FD_MAX_BAD_WORDS_NUM (BDA → BAD).

Likewise, FD_BDA_WORDS_MAX_LEN on the adjacent line should be FD_BAD_WORDS_MAX_LEN.

The field names in config.py already correctly use max_bad_words_num / bad_words_max_len, but the corresponding env var names are misspelled, so a user setting FD_MAX_BAD_WORDS_NUM=32 has no effect at all (the default value 16 is always used).
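As a quick illustration of the failure mode (this is not code from the PR, just a minimal self-contained sketch of what the typo does), reading the misspelled key silently falls back to the default even when the user sets the correctly spelled variable:

```python
import os

# What a user would reasonably set (the correctly spelled name).
os.environ["FD_MAX_BAD_WORDS_NUM"] = "32"

# envs.py currently reads the misspelled key, so the user's setting is
# ignored and the default wins:
broken = int(os.getenv("FD_MAX_BDA_WORDS_NUM", "16"))

# After the rename, the same user setting takes effect:
fixed = int(os.getenv("FD_MAX_BAD_WORDS_NUM", "16"))

print(broken, fixed)  # 16 32
```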

Suggested fix:

# Maximum number of bad words.
"FD_MAX_BAD_WORDS_NUM": lambda: int(os.getenv("FD_MAX_BAD_WORDS_NUM", "16")),
# Maximum length of bad words.
"FD_BAD_WORDS_MAX_LEN": lambda: int(os.getenv("FD_BAD_WORDS_MAX_LEN", "8")),

Update the references in config.py accordingly:

self.max_bad_words_num = envs.FD_MAX_BAD_WORDS_NUM
self.bad_words_max_len = envs.FD_BAD_WORDS_MAX_LEN

Comment thread custom_ops/setup_ops.py
]

# cache_kv
sources += ["gpu_ops/cache_kv/reshape_and_cache_flash.cu"]

🟡 Suggestion The new custom op reshape_and_cache_flash has no unit tests under tests/operators/ (checklist A3).

Consider adding tests under tests/operators/ covering the following scenarios:

  1. NHD layout (head_stride == head_dim) and HND layout (head_stride > head_dim)
  2. Both template paths: head_dim = 64 and head_dim = 128
  3. The kv_cache_dtype = 'auto' (non-FP8) path
  4. The kv_cache_dtype = 'fp8_e4m3' (dynamic quantization) path
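A minimal sketch of the parameter grid such tests could iterate over (illustrative only: `build_case_grid` is a hypothetical helper, and the op itself is not invoked here):

```python
import itertools

def build_case_grid():
    """Enumerate the combinations the suggested tests should cover."""
    layouts = ["NHD", "HND"]                 # head_stride == vs. > head_dim
    head_dims = [64, 128]                    # the two kernel template paths
    kv_cache_dtypes = ["auto", "fp8_e4m3"]   # direct-copy vs. dynamic-quant path
    return [
        {"layout": lay, "head_dim": hd, "kv_cache_dtype": dt}
        for lay, hd, dt in itertools.product(layouts, head_dims, kv_cache_dtypes)
    ]

cases = build_case_grid()
print(len(cases))  # 8 combinations
```

Each dict could then drive one parametrized test case (e.g. via pytest's `@pytest.mark.parametrize`) that compares the op output against a NumPy reference.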
