
[Feature] Support FP4 communication quantization and dense block_wise_fp8 and moe nvfp4#7817

Open
lizexu123 wants to merge 20 commits into PaddlePaddle:develop from lizexu123:kkc

Conversation

@lizexu123
Collaborator

@lizexu123 lizexu123 commented May 14, 2026

Motivation

1. Fix a bug where running FP4 on EB5 hit a `NoneType > 0` comparison because audio_token_num was None, and fix loading the EB5 flagship model.
Support FP4 communication quantization, using hidden_size = 7168 as the example.

2. FastDeploy currently captures CUDA Graphs at whole-model granularity, which is coarse and limits flexibility. This PR introduces a block-wise CUDA Graph mechanism that captures and replays CUDA Graphs independently at the single-operator/layer level (e.g. Linear, RMSNorm), enabling finer-grained graph optimization and improving prefill-stage inference performance.

3. Support a mixed configuration of block_wise_fp8 dense online quantization + nvfp4 offline quantization:
--quantization '{"quantization": "mix_quant", "dense_quant_type":"block_wise_fp8", "is_moe_quantized":true,"moe_quant_type":"modelopt_fp4"}' \

4. Support FP4 DeepEP communication.
Enable FP4 communication quantization with export FD_USE_NVFP4_COMM_QUANT=1
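For a rough sense of the communication savings, here is a back-of-the-envelope sketch. The assumptions (not taken from the PR's measurements) are: BF16 activations at 2 bytes per value, FP4 packed two values per byte, and one float8_e4m3fn scale per 16 elements (sf_vec_size=16):

```python
# Approximate per-token dispatch payload for hidden_size = 7168 (illustrative only)
HIDDEN = 7168
SF_VEC_SIZE = 16  # elements covered by one FP8 scale factor (assumption)

bf16_bytes = HIDDEN * 2               # 2 bytes per BF16 value -> 14336
fp4_payload = HIDDEN // 2             # two FP4 values packed per uint8 -> 3584
fp4_scales = HIDDEN // SF_VEC_SIZE    # one 1-byte scale per 16 values -> 448
fp4_bytes = fp4_payload + fp4_scales  # -> 4032

print(bf16_bytes, fp4_bytes)
```

Actual savings depend on what else travels with the dispatch (routing metadata, padding), so the measured reduction can differ from this raw payload ratio.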

# Enable block-wise CUDA Graph
export FD_USE_BLOCK_WISE_CUDA_GRAPH=1

# Customize the pre-captured token counts (optional)
export FD_BLOCK_WISE_CUDA_GRAPH_SIZES="1,2,4,8,16,32,64,128,256,512,1024,2048"

# To check which prefill Linear layers enter the CUDA Graph
export FD_BLOCK_WISE_DEBUG=1

# Enable FP4 communication quantization
export FD_USE_NVFP4_COMM_QUANT=1
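A minimal sketch of how such flags are typically consumed on the Python side; the environment-variable names match the exports above, but these helper functions are illustrative, not FastDeploy's actual envs.py implementation:

```python
import os

def use_nvfp4_comm_quant() -> bool:
    # FD_USE_NVFP4_COMM_QUANT=1 enables FP4 quantization before dispatch
    return os.getenv("FD_USE_NVFP4_COMM_QUANT", "0") == "1"

def block_wise_graph_sizes() -> list:
    # FD_BLOCK_WISE_CUDA_GRAPH_SIZES is a comma-separated list of token counts
    raw = os.getenv("FD_BLOCK_WISE_CUDA_GRAPH_SIZES", "1,2,4,8")
    return sorted(int(s) for s in raw.split(","))

os.environ["FD_USE_NVFP4_COMM_QUANT"] = "1"
os.environ["FD_BLOCK_WISE_CUDA_GRAPH_SIZES"] = "1,2,4,8"
print(use_nvfp4_comm_quant(), block_wise_graph_sizes())
```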

Prefill now enters the CUDA Graph and the gaps between kernels are reduced, as shown below.
image
The figure above shows the gaps before this change.
image
After the optimization there are essentially no gaps.

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot Bot commented May 14, 2026

Thanks for your contribution!

@lizexu123 lizexu123 changed the title Kkc [Feature] Support FP4 communication quantization and block_wise_cuda_graph May 14, 2026
@PaddlePaddle-bot

PaddlePaddle-bot commented May 14, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-16 09:21:54

This CI report is generated from the code below (updated every 30 minutes):


⚠️ Failed to fetch CI status

This CI status check could not retrieve the CI run status for PR #7817 because GitHub API network connections timed out (3 consecutive requests failed: TLS handshake timeout / command timeout / response parsing failure).

Possible causes:

  • Unstable network between the CI Agent environment and the GitHub API
  • Temporary GitHub API unavailability

Suggested actions:

  • Check the CI run status manually on the CI details page later
  • The CI Agent will retry automatically and update this comment in 30 minutes


@codecov-commenter

codecov-commenter commented May 14, 2026

Codecov Report

❌ Patch coverage is 28.84615% with 37 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@e3541c2). Learn more about missing BASE report.

Files with missing lines Patch % Lines
...deploy/model_executor/layers/quantization/nvfp4.py 0.00% 15 Missing ⚠️
...loy/model_executor/layers/quantization/__init__.py 16.66% 8 Missing and 2 partials ⚠️
...oy/model_executor/layers/quantization/mix_quant.py 33.33% 5 Missing and 1 partial ⚠️
fastdeploy/worker/gpu_model_runner.py 0.00% 3 Missing ⚠️
fastdeploy/model_executor/utils.py 81.81% 1 Missing and 1 partial ⚠️
..._executor/layers/moe/fused_moe_deepgemm_backend.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7817   +/-   ##
==========================================
  Coverage           ?   63.39%           
==========================================
  Files              ?      462           
  Lines              ?    64320           
  Branches           ?     9859           
==========================================
  Hits               ?    40773           
  Misses             ?    20773           
  Partials           ?     2774           
Flag Coverage Δ
GPU 72.51% <28.84%> (?)
XPU 7.11% <0.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

@lizexu123 lizexu123 changed the title [Feature] Support FP4 communication quantization and block_wise_cuda_graph [Feature] Support FP4 communication quantization and dense block_wise_fp8 and moe nvfp4 May 15, 2026

@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-15 18:23:26

📋 Review Summary

PR overview: supports FP4 communication quantization (DeepEP prefill dispatch), a mix_quant hybrid configuration overriding NVFP4 checkpoints, and related bug fixes (audio_token_num being None, block-wise CUDA Graph cleanup).
Change scope: custom_ops/gpu_ops/moe/, fastdeploy/model_executor/layers/quantization/, fastdeploy/model_executor/utils.py, fastdeploy/worker/gpu_model_runner.py
Impact tags: [Quantization] [OP] [Graph Optimization]

Issues

Severity File Summary
🟡 Suggestion nvfp4.py:671 fc1_latent_proj/fc2_latent_proj added to the signature but unused in the function body
🟡 Suggestion nvfp4.py:860 apply_ep_decode likewise adds unused fc1_latent_proj/fc2_latent_proj
❓ Question utils.py:136 weight._is_initialized() uses a private Paddle internal API
❓ Question prefill_permute_to_masked_gemm.cu the UINT8 switch case has no break after its closing brace and relies on a runtime return/throw to prevent fall-through, which is a maintenance risk

📝 PR template check

The title format is compliant (contains the [Feature] tag) ✓. However, the ## Modifications, ## Usage or Command, and ## Accuracy Tests sections are empty (containing only HTML comment placeholders), and no Checklist items are checked, which does not satisfy the description template.

Suggested title (ready to copy):

  • [Feature][BugFix][Quantization] Support FP4 comm quant, mix_quant+NVFP4 hybrid config, and fix audio_token_num NoneType bug

Suggested PR description (ready to copy):

## Motivation
1. Fix the `NoneType > 0` comparison error caused by `audio_token_num` being `None` when running FP4 on the EB5 flagship model, and fix loading the EB5 flagship weights
2. Add block-wise CUDA Graph resource cleanup (`clear_all_block_wise_graphs`) so that `clear_parameters` correctly releases the resources held by block-wise graphs
3. Support `mix_quant` configurations overriding offline NVFP4 checkpoints, enabling a hybrid of dense-layer block_wise_fp8 online quantization and MoE-layer nvfp4 offline quantization (for models such as EB5-800B-FP4)
4. Support FP4 DeepEP communication quantization (`FD_USE_NVFP4_COMM_QUANT=1`), quantizing BF16 activations to FP4 before the prefill dispatch and reducing communication volume by roughly 2x

## Modifications
- `custom_ops/gpu_ops/moe/prefill_permute_to_masked_gemm.cu`: add a `SWIZZLE_SCALE` template parameter so FP8 scales can be written directly in the swizzled layout flashinfer cutedsl expects; add a UINT8 x dtype dispatch branch (FP4 packed data)
- `custom_ops/gpu_ops/cpp_extensions.cc`: add a `swizzle_scale: bool = false` parameter to the `PrefillPermuteToMaskedGemm` binding
- `fastdeploy/model_executor/layers/quantization/nvfp4.py`: add an FP4 communication-quantization path to `apply_ep_prefill` that calls `fp4_quantize` on activations before dispatch; the `flashinfer_cutedsl_moe_masked` call distinguishes the FP4 pre-quantized path (transposed layout) from the BF16 path
- `fastdeploy/model_executor/layers/quantization/__init__.py`: add `mix_quant_overrides_nvfp4` logic so `--quantization mix_quant` can override the NVFP4 configuration in the model's config.json, keeping dense Linear layers bf16 instead of mislabeling them as is_quantized
- `fastdeploy/model_executor/layers/quantization/mix_quant.py`: add a `moe_quant_config` field and a `_build_moe_sub_config` method that correctly passes the original offline quantization config to MoE sublayers (e.g. modelopt_fp4)
- `fastdeploy/model_executor/utils.py`: add uninitialized-weight detection to `process_weight_transpose`; in the hybrid mix_quant scenario, skip the incremental hooks for offline-quantized sublayers and defer them to `process_final_after_loading`
- `fastdeploy/model_executor/forward_meta.py`: give `ForwardMeta.audio_token_num` a default value of `0`, fixing the None comparison bug
- `fastdeploy/model_executor/layers/linear.py`: add a comment explaining that `forward_cuda` must not be wrapped by the block-wise CUDA Graph decorator (it contains collective communication)
- `fastdeploy/worker/gpu_model_runner.py`: add a block-wise CUDA Graph cleanup entry point to `clear_parameters`
- `custom_ops/gpu_ops/helper.h`: unify the C++ pointer style (`int*` instead of `int *`)

## Usage or Command
```bash
# Enable FP4 communication quantization (DeepEP prefill dispatch)
export FD_USE_NVFP4_COMM_QUANT=1

# Enable block-wise CUDA Graph (optional)
export FD_USE_BLOCK_WISE_CUDA_GRAPH=1
export FD_BLOCK_WISE_CUDA_GRAPH_SIZES="1,2,4,8,16,32,64,128,256,512,1024,2048"

# mix_quant overriding an NVFP4 checkpoint (for models such as EB5-800B-FP4)
--quantization '{"quantization": "mix_quant", "dense_quant_type":"block_wise_fp8", "is_moe_quantized":true,"moe_quant_type":"modelopt_fp4"}'
```
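Since the `--quantization` value is a JSON string, it can be sanity-checked before launch. This snippet only parses the flag shown above and is not part of FastDeploy's CLI handling:

```python
import json

arg = (
    '{"quantization": "mix_quant", "dense_quant_type":"block_wise_fp8", '
    '"is_moe_quantized":true,"moe_quant_type":"modelopt_fp4"}'
)
cfg = json.loads(arg)

# JSON true becomes Python True; all four keys come from the flag above
assert cfg["quantization"] == "mix_quant"
assert cfg["dense_quant_type"] == "block_wise_fp8"
assert cfg["is_moe_quantized"] is True
assert cfg["moe_quant_type"] == "modelopt_fp4"
print(cfg["moe_quant_type"])
```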

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Overall assessment

The FP4 communication quantization and hybrid quantization configuration follow a clear design, and the critical paths have corresponding test cases. Please evaluate whether the two added but unused parameters (fc1_latent_proj/fc2_latent_proj) should be kept or removed, and add a break to the UINT8 switch case to make the code more robust. The Modifications/Usage/Accuracy Tests sections of the PR description need to be completed.


@PaddlePaddle-bot

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-15 18:37:52

This CI report is generated from the code below (updated every 30 minutes):


1 Task overview

1 required task is still running; waiting for it to finish.

Total runs (reruns) Total tasks ✅ Passed ❌ Failed ⏳ Running ⏸️ Pending Skipped
38(0) 38 35 1 1 1 0

2 Task status summary

2.1 Required tasks: 7/8 passed

Required tasks block merging; failures must be addressed first.

Status Task Duration Root cause Fix Log Rerun
Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage - running - Job -
The remaining 7 required tasks passed - - - - -

2.2 Optional tasks: 28/30 passed

Optional tasks do not block merging; failures are informational only.

Status Task Duration Log Rerun
Check PR Template 13s Job -
⏸️ CI_HPU - Job -
The remaining 28 optional tasks passed - - -

3 Failure details (required only)

No required tasks failed.


@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-15 18:52:32

📋 Review Summary

PR overview: supports FP4 communication quantization (DeepEP prefill dispatch) and dense block_wise_fp8 + MoE nvfp4 hybrid quantization, and fixes EB5 flagship model loading and the audio_token_num None bug.
Change scope: custom_ops/gpu_ops/moe/, fastdeploy/model_executor/layers/quantization/, fastdeploy/worker/gpu_model_runner.py
Impact tags: [Quantization] [OP] [Graph Optimization]

Issues

Severity File Summary
🟡 Suggestion fastdeploy/model_executor/layers/quantization/nvfp4.py:672 fc1_latent_proj/fc2_latent_proj added to apply_ep_prefill but unused
🟡 Suggestion fastdeploy/model_executor/layers/quantization/nvfp4.py:861 fc1_latent_proj/fc2_latent_proj added to apply_ep_decode but unused
🟡 Suggestion tests/operators/test_permute_prefill_masked_gemm.py:49 the swizzle_scale=True path (the core new feature) lacks test coverage
❓ Question fastdeploy/model_executor/layers/quantization/nvfp4.py:691 view(float32) relies on an implicit assumption with no runtime assertion
📝 PR template ## Modifications, ## Usage or Command, and ## Accuracy Tests are all empty; no Checklist items are checked

📝 PR template check

The required ## Modifications, ## Usage or Command, and ## Accuracy Tests sections of the PR body are empty (only comment placeholders remain), and no Checklist items are checked. The title format is compliant ([Feature] is an official tag).

Suggested title (ready to copy):

  • [Feature][BugFix][Quantization] Support FP4 comm quant, dense block_wise_fp8+MoE nvfp4 mix_quant, fix audio_token_num bug

Suggested PR description (ready to copy; it must reproduce the full structure of the checklist §D2 template):

## Motivation
1. Fix the bug where `audio_token_num` being `None` caused a `NoneType > 0` comparison when running FP4 on the EB5 flagship model, and fix loading the EB5 flagship model.
2. Support FP4 communication quantization: quantize activations to FP4 (UINT8 packed) before dispatch, reducing communication volume by roughly 2x.
3. Support the hybrid mix_quant configuration of `block_wise_fp8` dense online quantization + `nvfp4` MoE offline quantization (e.g. EB5-800B-FP4).
4. Introduce a block-wise CUDA Graph mechanism that captures and replays CUDA Graphs at the Linear-layer level during prefill, reducing the gaps between kernels.

## Modifications
- `fastdeploy/model_executor/forward_meta.py`: change the `audio_token_num` default from `None` to `0`
- `fastdeploy/envs.py`: add the `FD_USE_NVFP4_COMM_QUANT` environment variable
- `custom_ops/gpu_ops/moe/prefill_permute_to_masked_gemm.cu`: add a `SWIZZLE_SCALE` template parameter to write FP8 scales directly in the flashinfer cutedsl swizzled layout; add `UINT8 + FLOAT32` and `BF16 + UINT8` dtype dispatch branches
- `custom_ops/gpu_ops/cpp_extensions.cc`: update the `PrefillPermuteToMaskedGemm` signature accordingly, adding `swizzle_scale: bool = false`
- `fastdeploy/model_executor/layers/quantization/nvfp4.py`: add the FP4 communication-quantization path to `apply_ep_prefill`; pass the quantized dispatch_input/dispatch_scale into `ep_prefill_runner.dispatch`; pass the new `swizzle_scale` parameter through
- `fastdeploy/model_executor/layers/quantization/__init__.py`: add `mix_quant_overrides_nvfp4` logic so a CLI mix_quant configuration can override the model's NVFP4 checkpoint
- `fastdeploy/model_executor/layers/quantization/mix_quant.py`: add a `moe_quant_config` field; refactor `_build_moe_sub_config` to build the correct sub-config for offline MoE (modelopt_fp4)
- `fastdeploy/model_executor/utils.py`: fix premature triggering of the incremental weight-loading hook in the hybrid quantization scenario; add a defensive check for uninitialized weights in `process_weight_transpose`
- `fastdeploy/worker/gpu_model_runner.py`: add block-wise CUDA Graph cleanup to `clear_parameters`
- `fastdeploy/model_executor/layers/linear.py`: add a comment explaining why `forward_cuda` cannot use the block-wise CUDA Graph
- `tests/operators/test_permute_prefill_masked_gemm.py`: pass `swizzle_scale=False` explicitly
- `tests/quantization/test_modelopt_nvfp4.py`: add `TestFlashInferCuteDSLMoEHelpers` unit tests for the helper functions

## Usage or Command
```bash
# Enable FP4 communication quantization (DeepEP prefill dispatch)
export FD_USE_NVFP4_COMM_QUANT=1

# Enable block-wise CUDA Graph (prefill stage)
export FD_USE_BLOCK_WISE_CUDA_GRAPH=1
export FD_BLOCK_WISE_CUDA_GRAPH_SIZES="1,2,4,8,16,32,64,128,256,512,1024,2048"

# Dense block_wise_fp8 online quantization + MoE nvfp4 offline quantization (EB5-800B-FP4 etc.)
--quantization '{"quantization": "mix_quant", "dense_quant_type":"block_wise_fp8", "is_moe_quantized":true,"moe_quant_type":"modelopt_fp4"}'
```

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Overall assessment

The overall implementation is clear, and the handling of FP4 communication quantization and the mix_quant hybrid configuration is sound. The main follow-ups are to add unit tests for the swizzle_scale=True path, clean up the unused fc1_latent_proj/fc2_latent_proj parameters, and fill in the required sections of the PR description.

topk_ids_hookfunc: Callable = None,
shared_experts: nn.Layer = None,
fc1_latent_proj: nn.Layer = None,
fc2_latent_proj: nn.Layer = None,

🟡 Suggestion: the fc1_latent_proj and fc2_latent_proj parameters are never used in this function body, making them dead code.

If they are reserved for a future Latent MoE path, add a # TODO: comment above the parameters explaining their purpose; if they were added by mistake, simply delete them to avoid misleading later developers.

x_fp4, x_fp4_scale = fp4_quantize(
x, layer.up_gate_proj_input_scale_quant, sf_vec_size=16, is_sf_swizzled_layout=False
)
x_fp4_scale = x_fp4_scale.view(paddle.float32) # float8_e4m3fn -> float32

❓ Question: x_fp4_scale.view(paddle.float32) reinterprets a float8_e4m3fn tensor bytewise as float32, which requires the tensor's total byte count (num_tokens × hidden_scale) to be a multiple of 4; otherwise view raises a runtime exception.

The current typical scenario (hidden_size=7168, sf_vec_size=16 → hidden_scale=448, 448 % 4 = 0) is safe, but there is no explicit assertion guarding it. Consider adding, before the view:

assert x_fp4_scale.numel() % 4 == 0, (
    f"FP4 scale numel={x_fp4_scale.numel()} must be divisible by 4 for view(float32)"
)
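The byte-count constraint behind this suggestion can be demonstrated with NumPy, substituting uint8 for float8_e4m3fn (both are 1-byte types); the shapes here are made up for illustration:

```python
import numpy as np

# 3 tokens x 448 one-byte scale factors: 448 % 4 == 0, so the view succeeds
scales = np.zeros((3, 448), dtype=np.uint8)
as_f32 = scales.view(np.float32)  # every 4 bytes reinterpreted as one float32
print(as_f32.shape)               # (3, 112)

# A last axis not divisible by 4 makes the bytewise view fail
bad = np.zeros((3, 7), dtype=np.uint8)
try:
    bad.view(np.float32)
except ValueError:
    print("view raised ValueError")
```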

topk_ids_hookfunc: Callable = None,
shared_experts: nn.Layer = None,
fc1_latent_proj: nn.Layer = None,
fc2_latent_proj: nn.Layer = None,

🟡 Suggestion: as in apply_ep_prefill, the fc1_latent_proj and fc2_latent_proj parameters are also unused in the apply_ep_decode function body. Handle both consistently (add a TODO explaining them, or remove them).

topk_ids = topk_ids.cast(paddle.int64)

results = prefill_permute_to_masked_gemm(x, scale, topk_ids, num_local_experts, max_token_num)
results = prefill_permute_to_masked_gemm(x, scale, topk_ids, num_local_experts, max_token_num, False)

🟡 Suggestion: this change only makes the existing call pass swizzle_scale=False explicitly; there is no test covering the swizzle_scale=True path, which is exactly the core new CUDA kernel branch in this PR (the SWIZZLE_SCALE=true template instantiation plus the swizzled memory-write logic).

Consider adding a swizzle_scale=True test case with UINT8 + FLOAT32 inputs that verifies:

  1. the shape and values of the output permute_scale match the expected flashinfer cutedsl swizzled layout; and
  2. the result matches swizzle_scale=False followed by a manual swizzle.
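The comparison pattern suggested above can be sketched in pure Python with a toy permutation; the real flashinfer cutedsl scale layout is more involved, so `toy_swizzle` here is only a stand-in, and the kernel call is replaced by array operations:

```python
import numpy as np

def toy_swizzle(scale, tile=4):
    # Stand-in for a swizzled scale layout: permute entries within each
    # group of `tile` rows (NOT the real cutedsl layout)
    rows, cols = scale.shape
    assert rows % tile == 0
    return (scale.reshape(rows // tile, tile, cols)
                 .transpose(0, 2, 1)
                 .reshape(rows, cols))

rng = np.random.default_rng(0)
plain = rng.integers(0, 255, size=(8, 16)).astype(np.uint8)

# Test pattern: a hypothetical swizzle_scale=True kernel output should equal
# the swizzle_scale=False output followed by a manual swizzle
kernel_out = toy_swizzle(plain)   # pretend this came from the kernel
reference = toy_swizzle(plain)    # plain-path output + manual swizzle
np.testing.assert_array_equal(kernel_out, reference)
print("layouts match")
```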

@PaddlePaddle-bot

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-15 22:26:07

This CI report is generated from the code below (updated every 30 minutes):


1 Task overview

2 required tasks are currently failing and must be addressed before merging.

Total runs (reruns) Total tasks ✅ Passed ❌ Failed ⏳ Running ⏸️ Pending Skipped
41(0) 41 36 4 0 1 0

2 Task status summary

2.1 Required tasks: 8/10 passed

Required tasks block merging; failures must be addressed first.

Status Task Duration Root cause Fix Log Rerun
Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage 1h22m PR issue: diff coverage is only 36%, below the 80% threshold Add unit tests for nvfp4.py and the other new files Job -
Approval 8s PR issue: modifying fastdeploy/envs.py requires RD approval Request approval from an RD member such as @jiangjiajun or @liuyuanle Job -
The remaining 8 required tasks passed - - - - -

2.2 Optional tasks: 28/31 passed

Optional tasks do not block merging; failures are informational only.

Status Task Duration Log Rerun
Check PR Template 13s Job -
Trigger Jenkins for PR 1m43s Job -
⏸️ CI_HPU - - -
The remaining 28 optional tasks passed - - -

3 Failure details (required only)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: coverage below threshold (confidence: high)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage

  • Status: ❌ failed
  • Error type: coverage below threshold
  • Confidence: high
  • Root cause summary: coverage of the files added in this PR is only 36%, below the 80% threshold
  • Analyzer: ci_analyze_unittest_fastdeploy

Coverage details:

File Coverage Uncovered lines
fastdeploy/model_executor/layers/quantization/nvfp4.py 0% L109, L683-697, L777-802 (15 lines)
fastdeploy/model_executor/layers/moe/fused_moe_deepgemm_backend.py 0% L83
fastdeploy/worker/gpu_model_runner.py 0% L3056, L3057, L3061
fastdeploy/model_executor/layers/quantization/__init__.py 33.3% L101, L106, L109, L112-114, L116, L118
fastdeploy/model_executor/layers/quantization/mix_quant.py 44.4% L99-103
fastdeploy/model_executor/utils.py 90.9% L192
fastdeploy/model_executor/forward_meta.py 100%

Root cause details:
This PR adds FP4 communication quantization, dense block_wise_fp8, and MoE nvfp4 support, changing 700 lines in total (52 measurable). All unit tests pass (TEST_EXIT_CODE=0), but the diff coverage is only 36% (33 of 52 lines uncovered), far below the 80% threshold. The coverage gaps are concentrated in nvfp4.py, fused_moe_deepgemm_backend.py, and gpu_model_runner.py, all of which contain new feature code with no corresponding tests yet.

Key log:

COVERAGE_EXIT_CODE: 9
GPU Patch Coverage Details:
{
  "total_percent_covered": 36,
  "total_num_lines": 52,
  "total_num_violations": 33,
  "num_changed_lines": 700
}
##[error]Process completed with exit code 9.
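The figures in this log are consistent with the usual diff-coverage formula: covered changed lines over measurable changed lines (truncation to an integer percentage is an assumption here):

```python
total_num_lines = 52        # measurable changed lines
total_num_violations = 33   # uncovered changed lines

covered = total_num_lines - total_num_violations  # 19
percent = int(covered / total_num_lines * 100)    # 36.5... truncated to 36
print(percent)  # 36
```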

Suggested fixes:

  1. Add unit tests for the new code in fastdeploy/model_executor/layers/quantization/nvfp4.py (L109, L683-697, L777-802, etc.)
  2. Add coverage tests for the new FP4-related logic in fastdeploy/worker/gpu_model_runner.py L3056-3061
  3. Add a test for fastdeploy/model_executor/layers/moe/fused_moe_deepgemm_backend.py L83
  4. If the code above depends on specific hardware (e.g. H20 GPUs) and cannot run in the CI environment, add an exemption comment in the corresponding test or at the top of the file (see the project conventions for the exact exemption format)

Fix summary: add unit tests for nvfp4.py and the other new files, or request an exemption

Related changes: this PR introduces FP4/FP8 quantization support, touching nvfp4.py, mix_quant.py, gpu_model_runner.py, and other files
Link: view log

Approval: missing approval (confidence: high)

Approval

  • Status: ❌ failed
  • Error type: missing approval
  • Confidence: high
  • Root cause summary: modifying fastdeploy/envs.py requires approval from a FastDeploy RD member
  • Analyzer: ci_analyze_infra

Root cause details:
This PR modifies fastdeploy/envs.py, a protected file that requires approval from at least one designated FastDeploy RD member before it can be merged. The approval script scripts/check_approval.sh detected the missing approval and returned exit code 6.

Key log:

0. You must have one FastDeploy RD (Jiang-Jia-Jun(jiangjiajun), yuanlehome(liuyuanle),
   rainyfly(chenjian26), Wanglongzhi2001(wanglongzhi)) approval for modifying [fastdeploy/envs.py].
There are 1 approved errors.
##[error]Process completed with exit code 6.

Suggested fixes:

  1. Ask any one of the following FastDeploy RD members to approve this PR: @jiangjiajun, @liuyuanle, @chenjian26, @wanglongzhi

Fix summary: request an Approve from an RD member such as @jiangjiajun or @liuyuanle

Link: view log

@PaddlePaddle-bot

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-15 23:57:43

This CI report is generated from the code below (updated every 30 minutes):


1 Task overview

2 required tasks are currently failing, blocking the merge; address them first.

Total runs (reruns) Total tasks ✅ Passed ❌ Failed ⏳ Running ⏸️ Pending Skipped
28(0) 28 24 3 0 1 0

2 Task status summary

2.1 Required tasks: 6/8 passed

Required tasks block merging; failures must be addressed first.

Status Task Duration Root cause Fix Log Rerun
Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage 1h22m PR issue: new-code coverage below the 80% threshold Add unit tests for the new code or request an exemption Job -
Approval 8s PR issue: envs.py changes require approval from designated RD members Request approval from jiangjiajun/liuyuanle/rainyfly/Wanglongzhi2001 Job -
The remaining 6 required tasks passed - - - - -

2.2 Optional tasks: 18/20 passed

Optional tasks do not block merging; failures are informational only.

Status Task Duration Log Rerun
Trigger Jenkins for PR 1m43s Job -
⏸️ CI_HPU - - -
The remaining 18 optional tasks passed - - -

3 Failure details (required only)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: insufficient coverage (confidence: high)

run_tests_with_coverage

  • Status: ❌ failed
  • Error type: insufficient code coverage
  • Confidence: high
  • Root cause summary: new-code coverage did not reach the 80% threshold; the coverage-verification step failed
  • Analyzer: ci_analyze_unittest_fastdeploy

Root cause details:
This PR adds a large amount of new feature code (FP4 communication quantization, block-wise CUDA Graph, block_wise_fp8, nvfp4), but the corresponding unit-test coverage is insufficient. The "Run FastDeploy Unit Tests and Coverage" step itself passed (all unit tests ran), and "Check Unit Test Success" also passed, but the "Verify Code Coverage Threshold (80%)" step failed with exit code 9, meaning the diff coverage of the newly added code did not reach the 80% threshold.

Key log:

Step "Run FastDeploy Unit Tests and Coverage" → success
Step "Check Unit Test Success"               → success
Step "Verify Code Coverage Threshold (80%)"  → failure (exit code 9)

Suggested fixes:

  1. Add unit tests for the FP4 communication quantization, block_wise_fp8, and nvfp4 files added in this PR so that line coverage of the new code reaches at least 80%
  2. If some of the new code genuinely cannot be tested in CI (e.g. GPU-only paths), request a coverage exemption for the corresponding files in the CI configuration

Fix summary: add unit tests for the new FP4/fp8/nvfp4 code or request an exemption

Related changes: the PR adds code for FP4 communication quantization, block-wise CUDA Graph, block_wise_fp8, and MoE nvfp4
Link: view log

Approval: approval not granted (confidence: high)

Approval

  • Status: ❌ failed
  • Error type: approval not granted
  • Confidence: high
  • Root cause summary: modifying fastdeploy/envs.py requires approval from designated RD members
  • Analyzer: generic analysis (fallback)

Root cause details:
The PR modifies fastdeploy/envs.py, a protected file requiring an Approve from at least one FastDeploy core RD member. The approval script detected 1 approval error (exit code 6). One of the following must approve: Jiang-Jia-Jun(jiangjiajun), yuanlehome(liuyuanle), rainyfly(chenjian26), Wanglongzhi2001(wanglongzhi).

Key log:

0. You must have one FastDeploy RD (Jiang-Jia-Jun(jiangjiajun), yuanlehome(liuyuanle),
   rainyfly(chenjian26), Wanglongzhi2001(wanglongzhi)) approval for modifying
   [fastdeploy/envs.py].
There are 1 approved errors.
##[error]Process completed with exit code 6.

Suggested fixes:

  1. Ask any one of jiangjiajun / liuyuanle / chenjian26 / wanglongzhi to click "Approve" on this PR on GitHub; CI will re-check automatically

Fix summary: ask one of jiangjiajun/liuyuanle/rainyfly/Wanglongzhi2001 to approve this PR

Related changes: fastdeploy/envs.py (protected file, requires core RD approval)
Link: view log
