[XPU][Speculative Decoding] Enable CudaGraph capture for MTP draft model by Clarity256 · Pull Request #8061 · PaddlePaddle/FastDeploy

Clarity256 · 2026-06-17T05:19:17Z

Motivation

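Enable step_use_cudagraph gating for the draft model's forward pass, and capture only the first step during multi-step execution.
Pass forward_meta and use_cudagraph to xpu_pre_process in the draft model inference path, so that cu_seqlens_q_output / batch_id_per_token_output are updated in place with copy_ in cudagraph mode, keeping tensor addresses stable.
Add a padding_cudagraph_inputs() method to handle buffer padding for the draft model, and slice the model output by real_token_num during graph replay.
Adapt the target model's speculative decoding warmup flow (capture size calculation, passing the accept_all_drafts parameter, and fixing expected_decode_len when TP>1).
Replace padding_sampling_params (a Python-side CPU implementation) with the build_sampling_params XPU custom operator ([XPU][OP] Add build_sampling_params kernel for MTP speculative decoding #8032), which updates infer_seed in place inside the operator and avoids extra operations outside the cudagraph.
Tie increment_value to the number of speculative decoding tokens ((num_speculative_tokens + 1) * 4).
Use copy_() instead of clone() for last_seq_lens_this_time in the draft model, so that CUDAGraph replay does not create new tensors and cause memory to keep growing.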
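
The address-stability argument behind copy_() versus clone() can be illustrated without Paddle or XPU hardware. The sketch below is a toy model, not FastDeploy code: `CapturedGraph` and the list-based buffers are hypothetical stand-ins for a captured graph and its input tensors. It only shows why a graph that recorded one buffer at capture time sees new data after an in-place write, but not after the name is rebound to a fresh object.

```python
class CapturedGraph:
    """Toy stand-in for a captured graph (hypothetical, for illustration only).

    It keeps a reference to the buffer object that existed at capture time,
    the way a real graph keeps the device address of its input tensor.
    """

    def __init__(self, input_buffer):
        self._captured = input_buffer  # fixed at capture time

    def replay(self):
        # Replay always reads from the buffer recorded at capture time.
        return sum(self._captured)


# Case 1: clone-like update. The dict entry is rebound to a new list, so the
# graph keeps reading the old buffer and the new values never reach it.
inputs = {"seq_lens": [1, 2, 3]}
graph = CapturedGraph(inputs["seq_lens"])
inputs["seq_lens"] = [10, 20, 30]
stale = graph.replay()

# Case 2: copy_-like update. The existing buffer is overwritten in place, so
# the graph observes the new values on replay.
inputs = {"seq_lens": [1, 2, 3]}
graph = CapturedGraph(inputs["seq_lens"])
inputs["seq_lens"][:] = [10, 20, 30]
fresh = graph.replay()

print(stale, fresh)
```

In the rebinding case each step also leaves a new buffer behind, which is the analogue of the memory growth described for last_seq_lens_this_time.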

Modifications

fastdeploy/spec_decode/mtp_xpu.py: enable step_use_cudagraph gating for the draft model; add cudagraph padding logic and output slicing to _propose; pass cudagraph parameters in _initialize_forward_meta; update last_seq_lens_this_time in place with copy_().
fastdeploy/worker/xpu_model_runner.py: tie increment_value to the number of speculative decoding tokens; adapt the warmup capture flow to speculative decoding; move the infer_seed update into the build_sampling_params operator; pass step_use_cudagraph to the draft model propose call; fix expected_decode_len for dummy_prefill_inputs when TP>1.
fastdeploy/model_executor/layers/sample/sampler.py: forward_xpu now uses the build_sampling_params XPU operator instead of padding_sampling_params; add an increment_value parameter.
fastdeploy/model_executor/xpu_pre_and_post_process.py: in cudagraph mode, update cu_seqlens_q_output and batch_id_per_token_output in place with copy_, so that the tensor addresses captured by the graph stay stable.
tests/xpu_ci/4cards_cases/run_mtp_cudagraph.py → test_mtp_cudagraph.py: rename the test script to follow the CI naming convention.
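
The padding and output slicing described for _propose can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the PR's implementation: `pick_capture_size`, `pad_inputs`, `fake_model`, the capture size list, and the "smallest captured size that fits" rule are all hypothetical. Only the idea of padding inputs to a captured size and slicing the output by real_token_num comes from the description above.

```python
from bisect import bisect_left


def pick_capture_size(real_token_num, capture_sizes):
    """Return the smallest captured size that fits, or None if none fits.

    Assumption: graphs are captured for a fixed list of sizes.
    """
    sizes = sorted(capture_sizes)
    idx = bisect_left(sizes, real_token_num)
    return sizes[idx] if idx < len(sizes) else None


def pad_inputs(token_ids, capture_sizes, pad_id=0):
    """Pad token_ids up to a captured size.

    Returns (padded_ids, real_token_num, use_graph). When no captured size
    fits, the inputs are returned unpadded and use_graph is False.
    """
    real_token_num = len(token_ids)
    size = pick_capture_size(real_token_num, capture_sizes)
    if size is None:
        return list(token_ids), real_token_num, False
    padded = list(token_ids) + [pad_id] * (size - real_token_num)
    return padded, real_token_num, True


def fake_model(padded_ids):
    """Placeholder for graph replay: one output row per input slot."""
    return [t * 2 for t in padded_ids]


padded, real_token_num, use_graph = pad_inputs([5, 6, 7], [1, 2, 4, 8])
# Rows produced for padding slots are meaningless, so keep only the real ones.
output = fake_model(padded)[:real_token_num]
print(padded, use_graph, output)
```

Slicing after replay keeps downstream code unaware of the padding, at the cost of computing a few throwaway rows per step.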

Usage or Command

Accuracy Tests

- MTP with CUDAGraph: output matches the reference results (see PR screenshots)
- MTP without CUDAGraph: output matches the reference results (see PR screenshots)

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

CLAassistant · 2026-06-17T05:19:24Z

All committers have signed the CLA.

codecov-commenter · 2026-06-17T06:05:11Z

Codecov Report

❌ Patch coverage is 0% with 38 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@7a60f79). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
fastdeploy/worker/xpu_model_runner.py	0.00%	17 Missing ⚠️
fastdeploy/spec_decode/mtp_xpu.py	0.00%	13 Missing ⚠️
...tdeploy/model_executor/xpu_pre_and_post_process.py	0.00%	5 Missing ⚠️
fastdeploy/model_executor/layers/sample/sampler.py	0.00%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #8061   +/-   ##
==========================================
  Coverage           ?   68.01%           
==========================================
  Files              ?      475           
  Lines              ?    66931           
  Branches           ?    10326           
==========================================
  Hits               ?    45525           
  Misses             ?    18502           
  Partials           ?     2904

Flag	Coverage Δ
GPU	`78.12% <0.00%> (?)`
XPU	`6.95% <0.00%> (?)`


- Enable step_use_cudagraph for draft model with proper gating logic
- Pass forward_meta and use_cudagraph to xpu_pre_process in draft path
- Add padding_cudagraph_inputs() for draft model buffer management
- Slice model output by real_token_num when graph is active
- Adapt target model warmup and execute_model for MTP+CudaGraph
- Use build_sampling_params kernel in verify path (replaces padding_sampling_params)
- Fix memory issue by using copy_ instead of clone for seq_lens_this_time
- Fix expected_decode_len for TP>1 in dummy_prefill

Co-Authored-By: Clarity256 <1140021759@qq.com>

PaddlePaddle-bot · 2026-06-18T02:18:33Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-06-29 05:33:52 UTC+08:00

CI report generated from the following code (updated every 30 minutes):
PR commit: 2355d0e | Merge base: 74a363e (branch: develop)

1 Required jobs: 8/10 passed

Total runs (reruns)	Total jobs	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
42(0)	42	36	6	0	0	0

Job	Error type	Confidence	Log
`Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage`	PR issue: diff coverage below the 80% threshold	Medium	Job
`Approval`	Approval required	High	Job

2 Failure details

🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage - PR issue: diff coverage below the 80% threshold (confidence: medium)

Error type: PR issue | Confidence: medium
Analyzer: generic analysis (fallback)
Failed cases: no specific pytest failure was obtained; the failure occurred in the coverage check stage

Case	Error summary
`diff-cover python_coverage_all.xml --diff-file=diff.txt --fail-under=80`	PR diff coverage did not reach the 80% threshold; the workflow maps this state to exit code 9

Key logs:

[FAILURE]: Process completed with exit code 9.
.github/workflows/_unit_test_coverage.yml:254 diff-cover ... --fail-under=80 || COVERAGE_EXIT_CODE=9
.github/workflows/_unit_test_coverage.yml:387-404 exits with 9 when COVERAGE_EXIT_CODE=9

Root cause summary: PR diff coverage is below 80%
The CI deep log fetch did not return log_file_path / unittest_details / diff_coverage.json, so the exact uncovered lines cannot be listed; however, exit code 9 in the workflow is consistent with the diff-cover threshold failure path. The PR adds/modifies the XPU sampling path in fastdeploy/model_executor/layers/sample/sampler.py, which is not omitted by scripts/.coveragerc, whereas xpu_pre_and_post_process.py, spec_decode/mtp_xpu.py, and worker/xpu*.py are omitted. The more likely cause is therefore that the new XPU branches in sampler.py are not covered by the main unit tests.

Suggested fixes:

Add unit tests that can run in run_tests_with_coverage for the new XPU sampling/import paths in sampler.py, or, following the repository's coverage rules, exclude the pure XPU hardware paths from the main Python diff coverage or move them to the XPU coverage statistics.
Re-check the GPU Patch Coverage Details or diff_coverage.json in that job's output and confirm the exact uncovered lines before making adjustments.

Related changes: fastdeploy/model_executor/layers/sample/sampler.py:64-68, fastdeploy/model_executor/layers/sample/sampler.py:1233-1284, fastdeploy/model_executor/layers/sample/sampler.py:1339-1401; coverage config scripts/.coveragerc:26-52

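🔴 Approval - Approval required (confidence: high)

Error type: Approval required | Confidence: high
Analyzer: built-in approval_required
Failed cases: none

Case	Error summary
`Approval`	This job requires manual approval; CI continues only after approval is granted

Key logs:

[FAILURE]: Process completed with exit code 6.

Root cause summary: the workflow is waiting for manual approval
This job requires manual approval; CI continues only after approval is granted.

Suggested fixes:

Please complete the manual approval.

Related changes: not applicable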

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-07-02 14:06:49

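📋 Review Summary

PR overview: Introduces CUDAGraph capture for the XPU MTP draft model and adjusts parameter construction and seed updates for speculative sampling.
Scope of changes: XPU sampler, xpu_pre_process, MTP proposer, XPU model runner, XPU CI case
Impact tags: [XPU] [Speculative Decoding] [Graph Optimization] [OP]

Issues

Severity	File	Summary
🔴 Bug	`fastdeploy/worker/xpu_model_runner.py:1452`	The `SpecMethod.NAIVE` speculative path no longer advances `infer_seed`, so random sampling reuses the same set of seeds

Status of previous findings

Finding	Issue	Status
F1	Prefill capture with `cudagraph_only_prefill=True` is disabled	⚠️ Still present
F2	`moe_phase` of the MTP draft model is not synchronized	⚠️ Still present

📝 PR convention check

The title contains two official tags, while the current FastDeploy convention requires the title to contain exactly one official tag; in addition, the Usage or Command section is empty. The following is suggested instead.

Suggested title (can be copied directly):

[XPU] Enable CUDAGraph capture for MTP draft model

Suggested PR description (can be copied directly)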

## Motivation
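1. Enable `step_use_cudagraph` gating for the draft model's forward pass, and capture only the first step during multi-step execution.
2. Pass `forward_meta` and `use_cudagraph` to `xpu_pre_process` in the draft model inference path, so that `cu_seqlens_q_output` / `batch_id_per_token_output` are updated in place with `copy_` in cudagraph mode, keeping tensor addresses stable.
3. Add a `padding_cudagraph_inputs()` method to handle buffer padding for the draft model, and slice the model output by `real_token_num` during graph replay.
4. Adapt the target model's speculative decoding warmup flow: capture size calculation, passing the `accept_all_drafts` parameter, and fixing `expected_decode_len` when TP>1.
5. Replace `padding_sampling_params` with the `build_sampling_params` XPU custom operator, which updates `infer_seed` in place inside the operator and avoids extra operations outside the cudagraph.
6. Tie `increment_value` to the number of speculative decoding tokens: `(num_speculative_tokens + 1) * 4`.
7. Use `copy_()` instead of `clone()` for `last_seq_lens_this_time` in the draft model, so that CUDAGraph replay does not create new tensors and cause memory to keep growing.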

## Modifications
- `fastdeploy/spec_decode/mtp_xpu.py`: enable `step_use_cudagraph` gating for the draft model; add cudagraph padding logic and output slicing to `_propose`; pass cudagraph parameters in `_initialize_forward_meta`; update `last_seq_lens_this_time` in place with `copy_()`.
- `fastdeploy/worker/xpu_model_runner.py`: tie `increment_value` to the number of speculative decoding tokens; adapt the warmup capture flow to speculative decoding; move the `infer_seed` update into the `build_sampling_params` operator; pass `step_use_cudagraph` to the draft model propose call; fix `expected_decode_len` for `dummy_prefill_inputs` when TP>1.
- `fastdeploy/model_executor/layers/sample/sampler.py`: `forward_xpu` now uses the `build_sampling_params` XPU operator instead of `padding_sampling_params`; add an `increment_value` parameter.
- `fastdeploy/model_executor/xpu_pre_and_post_process.py`: in cudagraph mode, update `cu_seqlens_q_output` and `batch_id_per_token_output` in place with `copy_`, so that the tensor addresses captured by the graph stay stable.
- `tests/xpu_ci/4cards_cases/run_mtp_cudagraph.py` → `tests/xpu_ci/4cards_cases/test_mtp_cudagraph.py`: rename the test script to follow the CI naming convention.

## Usage or Command
N/A

## Accuracy Tests
- MTP with CUDAGraph: output matches the reference results (see PR screenshots).
- MTP without CUDAGraph: output matches the reference results (see PR screenshots).

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

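Overall assessment

The implementation direction of the main MTP CUDAGraph path is broadly clear, but after moving the infer_seed update for speculative scenarios entirely into the sampler/op, this change misses the NAIVE speculative path, which still depends on the post-step update. Fixing this random sampling regression first is recommended. This round of review is based only on the 5 PR diff files provided by the task; CI logs were not pulled.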

PaddlePaddle-bot · 2026-07-02T06:08:41Z

            # 7. Updata 'infer_seed' and step_paddle()
-            self.share_inputs["infer_seed"].add_(self.infer_seed_increment)
-            self.share_inputs["infer_seed"][:] %= self.MAX_INFER_SEED
+            if not self.speculative_decoding:


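🔴 Bug: infer_seed no longer advances for SpecMethod.NAIVE.

With the seed update restricted to the non-speculative case here, the NAIVE speculative path is skipped as well. However, NAIVE goes through _normal_sample_xpu(), where top_k_top_p_sampling(..., topp_seed=sampling_metadata.seed) only reads the seed and does not update infer_seed in place the way build_sampling_params does. SpecMethod.NAIVE is a valid mode (the config corrects num_speculative_tokens to 0), so every decode step reuses the same set of random seeds and the random sampling sequence degenerates.

Suggested fix: keep the post-step seed update for NAIVE, for example by changing the condition to if not self.speculative_decoding or self.spec_method == SpecMethod.NAIVE:; alternatively, advance infer_seed with equivalent logic inside _normal_sample_xpu().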
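
The gating condition suggested above can be checked in isolation. The following is a minimal sketch, not the runner's code: the `SpecMethod` enum here is a local stand-in with only two members, `MAX_INFER_SEED` is a placeholder value chosen to make wraparound visible, and the helper names are hypothetical. The increment formula `(num_speculative_tokens + 1) * 4` and the add-then-modulo update are taken from the PR description and the diff.

```python
from enum import Enum


class SpecMethod(Enum):
    # Local stand-in for illustration; the real enum lives in FastDeploy.
    NAIVE = "naive"
    MTP = "mtp"


MAX_INFER_SEED = 1_000  # placeholder value, not the real constant


def seed_increment(num_speculative_tokens):
    """Increment tied to the speculative token count, as the PR describes."""
    return (num_speculative_tokens + 1) * 4


def needs_post_step_update(speculative_decoding, spec_method):
    """Condition proposed in the review: NAIVE keeps the post-step update."""
    return (not speculative_decoding) or spec_method is SpecMethod.NAIVE


def advance(seeds, increment, max_seed=MAX_INFER_SEED):
    """In-place add followed by modulo, mirroring the two removed lines."""
    for i in range(len(seeds)):
        seeds[i] = (seeds[i] + increment) % max_seed


seeds = [998, 0]
# NAIVE has num_speculative_tokens corrected to 0, so the increment is 4.
if needs_post_step_update(True, SpecMethod.NAIVE):
    advance(seeds, seed_increment(0))
print(seeds)
```

With the condition from the diff (`if not self.speculative_decoding:`), the NAIVE case would skip `advance` entirely and `seeds` would stay unchanged from step to step.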

Clarity256 had a problem deploying to Metax_ci June 17, 2026 05:19 — with GitHub Actions Failure

Clarity256 force-pushed the feature/xpu-mtp-cudagraph-capture branch from 9e45d1d to 72c0f92 Compare June 17, 2026 05:28

Clarity256 had a problem deploying to Metax_ci June 17, 2026 05:28 — with GitHub Actions Failure


Clarity256 force-pushed the feature/xpu-mtp-cudagraph-capture branch from 72c0f92 to 2355d0e Compare June 17, 2026 07:27

Clarity256 had a problem deploying to Metax_ci June 17, 2026 07:27 — with GitHub Actions Failure


Merge branch 'develop' into feature/xpu-mtp-cudagraph-capture

7db4fc5

cmcamdy had a problem deploying to Metax_ci July 2, 2026 05:51 — with GitHub Actions Failure

PaddlePaddle-bot suggested changes Jul 2, 2026

Clarity256 wants to merge 2 commits into PaddlePaddle:develop from Clarity256:feature/xpu-mtp-cudagraph-capture


5 participants
