[XPU][Speculative Decoding] Enable CudaGraph capture for MTP draft model #8061
Clarity256 wants to merge 2 commits
9e45d1d to 72c0f92
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #8061 +/- ##
==========================================
Coverage ? 68.01%
==========================================
Files ? 475
Lines ? 66931
Branches ? 10326
==========================================
Hits ? 45525
Misses ? 18502
Partials ? 2904
- Enable step_use_cudagraph for draft model with proper gating logic
- Pass forward_meta and use_cudagraph to xpu_pre_process in draft path
- Add padding_cudagraph_inputs() for draft model buffer management
- Slice model output by real_token_num when graph is active
- Adapt target model warmup and execute_model for MTP+CudaGraph
- Use build_sampling_params kernel in verify path (replaces padding_sampling_params)
- Fix memory issue by using copy_ instead of clone for seq_lens_this_time
- Fix expected_decode_len for TP>1 in dummy_prefill

Co-Authored-By: Clarity256 <1140021759@qq.com>
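The padding and slicing steps in the commit message above can be pictured with a framework-free sketch. It assumes a graph captured for a fixed set of token counts. `CAPTURE_SIZES`, `pick_capture_size`, `pad_inputs`, and `run_padded` are hypothetical names for illustration, not FastDeploy APIs, and plain Python lists stand in for device tensors.

```python
from bisect import bisect_left

# Hypothetical capture sizes; a real runner would derive these from its config.
CAPTURE_SIZES = [1, 2, 4, 8, 16]


def pick_capture_size(real_token_num, sizes=CAPTURE_SIZES):
    """Return the smallest captured size that fits, or None to fall back to eager mode."""
    idx = bisect_left(sizes, real_token_num)
    return sizes[idx] if idx < len(sizes) else None


def pad_inputs(token_ids, padded_len, pad_id=0):
    """Pad the input up to the captured shape so the graph always sees a fixed size."""
    return token_ids + [pad_id] * (padded_len - len(token_ids))


def run_padded(model, token_ids):
    """Pad, run the model on the padded batch, then slice back to the real tokens."""
    real_token_num = len(token_ids)
    size = pick_capture_size(real_token_num)
    if size is None:
        return model(token_ids)  # too large for any captured graph
    padded_output = model(pad_inputs(token_ids, size))
    return padded_output[:real_token_num]  # drop outputs for padding slots


def toy_model(ids):
    # Stand-in for a replayed graph: one output per input slot.
    return [i * 10 for i in ids]


result = run_padded(toy_model, [3, 5, 7])
```

Here three real tokens are padded to the captured size of 4, and the output is cut back to three entries so downstream code never sees the padding slot.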
72c0f92 to 2355d0e
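CI report generated from the following code (refreshed every 30 minutes):
1. Required jobs: 8/10 passed
2. Failure details
🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: PR issue, diff coverage is below the 80% threshold (confidence: medium)
Error type: PR issue | Confidence: medium
Key logs:
Suggested fix:
Related changes:
🔴 Approval: approval required (confidence: high)
Error type: approval required | Confidence: high
Key logs:
Suggested fix:
Related changes: N/A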
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-07-02 14:06:49
📋 Review summary
PR overview: Introduces CUDAGraph capture for the XPU MTP draft model, and adjusts how sampling parameters are built and how seeds are updated for speculative sampling.
Scope of changes: XPU sampler, xpu_pre_process, MTP proposer, XPU model runner, XPU CI case
Impact tags: [XPU] [Speculative Decoding] [Graph Optimization] [OP]
Issues
| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/worker/xpu_model_runner.py:1452 | The SpecMethod.NAIVE speculative path no longer advances infer_seed, so random sampling reuses the same set of seeds |
Status of previous findings
| Finding | Issue | Status |
|---|---|---|
| F1 | Prefill capture with cudagraph_only_prefill=True is disabled | |
| F2 | moe_phase of the MTP draft model is not synchronized | |
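📝 PR convention check
The title contains two official tags, whereas the current FastDeploy convention requires the title to contain exactly one official tag. In addition, the Usage or Command section is empty. The following replacements are suggested.
Suggested title (can be copied directly):
[XPU] Enable CUDAGraph capture for MTP draft model
Suggested PR description (click to expand, can be copied directly)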
## Motivation
1. Enable `step_use_cudagraph` gating for the draft model forward pass, and capture only the first step in multi-step execution.
2. Pass `forward_meta` and `use_cudagraph` to `xpu_pre_process` in the draft model inference path, so that `cu_seqlens_q_output` / `batch_id_per_token_output` are updated in place with `copy_` in cudagraph mode, keeping tensor addresses stable.
3. Add a `padding_cudagraph_inputs()` method that handles buffer padding for the draft model, and slice the model output by `real_token_num` during graph replay.
4. Adapt the target model's speculative decoding warmup flow: capture size calculation, passing the `accept_all_drafts` parameter, and the `expected_decode_len` fix for TP>1.
5. Replace `padding_sampling_params` with the `build_sampling_params` XPU custom operator, which updates `infer_seed` in place inside the operator and avoids extra work outside the cudagraph.
6. Tie `increment_value` to the number of speculative decoding tokens: `(num_speculative_tokens + 1) * 4`.
7. Use `copy_()` instead of `clone()` for `last_seq_lens_this_time` in the draft model, so that CUDAGraph replay does not create new tensors and cause memory to grow continuously.
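As a worked example of item 6, the sketch below computes the per-step seed increment from the number of speculative tokens and applies it with wraparound. The formula comes from the text above. `MAX_INFER_SEED` is a placeholder value chosen for illustration, and `seed_increment` and `advance_seeds` are hypothetical helper names rather than FastDeploy functions.

```python
# Placeholder upper bound for illustration only; the real constant lives in the runner.
MAX_INFER_SEED = 2**63 - 2


def seed_increment(num_speculative_tokens):
    """Per-step increment: (num_speculative_tokens + 1) * 4, as stated in item 6."""
    return (num_speculative_tokens + 1) * 4


def advance_seeds(seeds, num_speculative_tokens, max_seed=MAX_INFER_SEED):
    """Advance every per-request seed in place and wrap it below max_seed."""
    step = seed_increment(num_speculative_tokens)
    for i, seed in enumerate(seeds):
        seeds[i] = (seed + step) % max_seed
    return seeds


seeds = [0, 100, MAX_INFER_SEED - 1]
advance_seeds(seeds, num_speculative_tokens=1)  # increment is (1 + 1) * 4 = 8
```

With one speculative token the increment is 8, so the three seeds become 8, 108, and 7 (the last one wraps around the placeholder bound).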
## Modifications
- `fastdeploy/spec_decode/mtp_xpu.py`: enable `step_use_cudagraph` gating for the draft model; add cudagraph padding logic and output slicing to `_propose`; pass cudagraph parameters through `_initialize_forward_meta`; update `last_seq_lens_this_time` in place with `copy_()`.
- `fastdeploy/worker/xpu_model_runner.py`: tie `increment_value` to the number of speculative decoding tokens; adapt the warmup capture flow to speculative decoding; move the `infer_seed` update into the `build_sampling_params` operator; pass `step_use_cudagraph` to the draft model propose call; fix `expected_decode_len` in `dummy_prefill_inputs` for TP>1.
- `fastdeploy/model_executor/layers/sample/sampler.py`: `forward_xpu` now uses the `build_sampling_params` XPU operator instead of `padding_sampling_params`; add an `increment_value` parameter.
- `fastdeploy/model_executor/xpu_pre_and_post_process.py`: in cudagraph mode, update `cu_seqlens_q_output` and `batch_id_per_token_output` in place with `copy_`, so that the tensor addresses captured by the graph stay stable.
- `tests/xpu_ci/4cards_cases/run_mtp_cudagraph.py` → `tests/xpu_ci/4cards_cases/test_mtp_cudagraph.py`: rename the test script to follow the CI naming convention.
## Usage or Command
N/A
## Accuracy Tests
- MTP with CUDAGraph: output matches the reference results (see the PR screenshots).
- MTP without CUDAGraph: output matches the reference results (see the PR screenshots).
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
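- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Overall assessment
The implementation direction of the main MTP CUDAGraph path is basically clear. However, after moving the infer_seed update for speculative scenarios wholesale into the sampler/op, this change misses the NAIVE speculative path, which still depends on the post-step update. Fixing this random-sampling regression first is recommended. This round of review is based only on the 5 PR diff files provided with the task; CI logs were not pulled.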

    # 7. Updata 'infer_seed' and step_paddle()
    self.share_inputs["infer_seed"].add_(self.infer_seed_increment)
    self.share_inputs["infer_seed"][:] %= self.MAX_INFER_SEED
    if not self.speculative_decoding:

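🔴 Bug: `infer_seed` is no longer advanced for `SpecMethod.NAIVE`.
Restricting the seed update to the non-speculative case here also skips the NAIVE speculative path. However, NAIVE goes through `_normal_sample_xpu()`, where `top_k_top_p_sampling(..., topp_seed=sampling_metadata.seed)` only reads the seed and does not update `infer_seed` in place the way `build_sampling_params` does. `SpecMethod.NAIVE` is a valid mode (the config corrects `num_speculative_tokens` to 0), so every decode step reuses the same set of random seeds and the random sampling sequence degenerates.
Suggested fix: keep the post-step seed update for NAIVE, for example by changing the condition to `if not self.speculative_decoding or self.spec_method == SpecMethod.NAIVE:`. Alternatively, advance `infer_seed` with equivalent logic inside `_normal_sample_xpu()`.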
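The regression and the first suggested fix can be illustrated with a small standalone sketch. `SpecMethod`, `Runner`, and `sample` below are simplified stand-ins written for this example: only the `NAIVE` member name and the gating condition come from the comment above, the increment of 4 is arbitrary, and Python's `random.Random` substitutes for the XPU sampling kernel.

```python
import random
from enum import Enum


class SpecMethod(Enum):
    # Simplified stand-in; only NAIVE is named in the review comment.
    NAIVE = "naive"
    MTP = "mtp"


class Runner:
    def __init__(self, speculative_decoding, spec_method, fixed_gate):
        self.speculative_decoding = speculative_decoding
        self.spec_method = spec_method
        self.fixed_gate = fixed_gate  # True: use the condition proposed in the review
        self.infer_seed = 0
        self.infer_seed_increment = 4  # arbitrary value for the sketch

    def sample(self):
        # Read-only use of the seed, mimicking a sampler that does not advance it.
        return random.Random(self.infer_seed).randrange(1000)

    def step(self):
        token = self.sample()
        if self.fixed_gate:
            advance = (not self.speculative_decoding
                       or self.spec_method == SpecMethod.NAIVE)
        else:
            advance = not self.speculative_decoding
        if advance:
            self.infer_seed += self.infer_seed_increment
        return token


buggy = Runner(True, SpecMethod.NAIVE, fixed_gate=False)
fixed = Runner(True, SpecMethod.NAIVE, fixed_gate=True)
buggy_tokens = [buggy.step() for _ in range(5)]
fixed_tokens = [fixed.step() for _ in range(5)]
```

With the original gate the seed never moves, so `buggy_tokens` holds five identical values. With the proposed condition the seed advances on every step for NAIVE, while a non-NAIVE speculative method would still leave the update to the sampling operator.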
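## Motivation
1. Enable `step_use_cudagraph` gating for the draft model forward pass, and capture only the first step in multi-step execution.
2. Pass `forward_meta` and `use_cudagraph` to `xpu_pre_process` in the draft model inference path, so that `cu_seqlens_q_output` / `batch_id_per_token_output` are updated in place with `copy_` in cudagraph mode, keeping tensor addresses stable.
3. Add a `padding_cudagraph_inputs()` method that handles buffer padding for the draft model, and slice the model output by `real_token_num` during graph replay.
4. Adapt the target model's speculative decoding warmup flow (capture size calculation, passing the `accept_all_drafts` parameter, and the `expected_decode_len` fix for TP>1).
5. Replace `padding_sampling_params` (a Python-side CPU implementation) with the `build_sampling_params` XPU custom operator ([XPU][OP] Add build_sampling_params kernel for MTP speculative decoding #8032), which updates `infer_seed` in place inside the operator and avoids extra work outside the cudagraph.
6. Tie `increment_value` to the number of speculative decoding tokens (`(num_speculative_tokens + 1) * 4`).
7. Use `copy_()` instead of `clone()` for `last_seq_lens_this_time`, so that CUDAGraph replay does not create new tensors and cause memory to grow continuously.
## Modifications
- `fastdeploy/spec_decode/mtp_xpu.py`: enable `step_use_cudagraph` gating for the draft model; add cudagraph padding logic and output slicing to `_propose`; pass cudagraph parameters through `_initialize_forward_meta`; update `last_seq_lens_this_time` in place with `copy_()`.
- `fastdeploy/worker/xpu_model_runner.py`: tie `increment_value` to the number of speculative decoding tokens; adapt the warmup capture flow to speculative decoding; move the `infer_seed` update into the `build_sampling_params` operator; pass `step_use_cudagraph` to the draft model propose call; fix `expected_decode_len` in `dummy_prefill_inputs` for TP>1.
- `fastdeploy/model_executor/layers/sample/sampler.py`: `forward_xpu` now uses the `build_sampling_params` XPU operator instead of `padding_sampling_params`; add an `increment_value` parameter.
- `fastdeploy/model_executor/xpu_pre_and_post_process.py`: in cudagraph mode, update `cu_seqlens_q_output` and `batch_id_per_token_output` in place with `copy_`, so that the tensor addresses captured by the graph stay stable.
- `tests/xpu_ci/4cards_cases/run_mtp_cudagraph.py` → `test_mtp_cudagraph.py`: rename the test script to follow the CI naming convention.
## Usage or Command
## Accuracy Tests
## Checklist
- Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`,`[Scheduler]`,`[PD Disaggregation]`,`[Executor]`,`[Graph Optimization]`,`[Speculative Decoding]`,`[RL]`,`[Models]`,`[Quantization]`,`[Loader]`,`[OP]`,`[KVCache]`,`[DataProcessor]`,`[BugFix]`,`[Docs]`,`[CI]`,`[Optimization]`,`[Feature]`,`[Benchmark]`,`[Others]`,`[XPU]`,`[HPU]`,`[GCU]`,`[DCU]`,`[Iluvatar]`,`[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- Format your code, run `pre-commit` before commit.
- Add unit tests. Please write the reason in this PR if no unit tests.
- Provide accuracy results.
- If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.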
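The in-place updates described in the Motivation and Modifications sections (`copy_` instead of `clone()`) can be shown without any tensor library. In this analogy a captured graph is modelled as a closure holding a reference to one specific buffer object. `capture`, `State`, `update_by_rebind`, and `update_in_place` are hypothetical names, and Python lists stand in for device tensors. The PR attributes the `clone()` problem to memory growth; this sketch only illustrates the related address-stability side.

```python
def capture(buffer):
    """Model a captured graph: it remembers this exact buffer object."""
    def replay():
        return sum(buffer)
    return replay


class State:
    def __init__(self):
        self.seq_lens = [1, 1, 1]


def update_by_rebind(state, new_values):
    # Analogous to clone(): a new object is created and the old one is left untouched.
    state.seq_lens = list(new_values)


def update_in_place(state, new_values):
    # Analogous to copy_(): same object, new contents.
    state.seq_lens[:] = new_values


rebound = State()
replay_rebound = capture(rebound.seq_lens)
update_by_rebind(rebound, [2, 3, 4])
stale = replay_rebound()       # still reads the old buffer

stable = State()
original_id = id(stable.seq_lens)
replay_stable = capture(stable.seq_lens)
update_in_place(stable, [2, 3, 4])
fresh = replay_stable()        # sees the updated contents
```

The rebinding variant leaves the captured closure reading the old values and allocates a fresh object on every update, whereas the in-place variant keeps a single buffer whose identity never changes.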