[Feature] Support FP4 communication quantization and dense block_wise_fp8 and moe nvfp4 #7817
lizexu123 wants to merge 20 commits into
Conversation
Thanks for your contribution!
CI report generated from the code below (updated every 30 minutes):
Codecov Report
❌ Patch coverage is
Additional details and impacted files

```
@@ Coverage Diff @@
##           develop   #7817   +/-  ##
==========================================
  Coverage         ?   63.39%
==========================================
  Files            ?      462
  Lines            ?    64320
  Branches         ?     9859
==========================================
  Hits             ?    40773
  Misses           ?    20773
  Partials         ?     2774
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review
2026-05-15 18:23:26
📋 Review Summary
PR overview: adds FP4 communication quantization (DeepEP prefill dispatch), a mix_quant override for NVFP4 checkpoints enabling hybrid quantization, and related bug fixes (audio_token_num being None; block-wise CUDA Graph cleanup).
Scope of changes: custom_ops/gpu_ops/moe/, fastdeploy/model_executor/layers/quantization/, fastdeploy/model_executor/utils.py, fastdeploy/worker/gpu_model_runner.py
Impact tags: [Quantization] [OP] [Graph Optimization]
Issues

| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | nvfp4.py:671 | fc1_latent_proj/fc2_latent_proj added to the signature but unused in the function body |
| 🟡 Suggestion | nvfp4.py:860 | apply_ep_decode likewise gains unused fc1_latent_proj/fc2_latent_proj parameters |
| ❓ Question | utils.py:136 | weight._is_initialized() relies on a private Paddle internal API |
| ❓ Question | prefill_permute_to_masked_gemm.cu | UINT8 switch case lacks a break after its closing brace, relying on runtime return/throw to prevent fall-through — a maintenance risk |
📝 PR Convention Check
The title format is compliant (includes the [Feature] tag) ✓. However, the ## Modifications, ## Usage or Command, and ## Accuracy Tests sections are empty (containing only HTML comment placeholders), and no Checklist items are ticked, which does not meet the description template requirements.
Suggested title (copy-paste ready):
[Feature][BugFix][Quantization] Support FP4 comm quant, mix_quant+NVFP4 hybrid config, and fix audio_token_num NoneType bug
Suggested PR description (copy-paste ready):
## Motivation
1. Fix the `NoneType > 0` comparison error caused by `audio_token_num` being `None` when running FP4 on the EB5 flagship model, and fix loading of EB5 flagship weights
2. Add block-wise CUDA Graph resource cleanup (`clear_all_block_wise_graphs`) so that resources held by block-wise graphs are released correctly during `clear_parameters`
3. Support `mix_quant` configs that override offline NVFP4 checkpoints, enabling hybrid quantization with online block_wise_fp8 for dense layers and offline nvfp4 for MoE layers (for models such as EB5-800B-FP4)
4. Support FP4 DeepEP communication quantization (`FD_USE_NVFP4_COMM_QUANT=1`): quantize BF16 activations to FP4 before prefill dispatch, cutting communication volume by roughly 2x
## Modifications
- `custom_ops/gpu_ops/moe/prefill_permute_to_masked_gemm.cu`: add a `SWIZZLE_SCALE` template parameter so FP8 scales can be written directly in the swizzled layout required by flashinfer cutedsl; add a UINT8 x dtype dispatch branch (FP4 packed data)
- `custom_ops/gpu_ops/cpp_extensions.cc`: add a `swizzle_scale: bool = false` parameter to the `PrefillPermuteToMaskedGemm` binding
- `fastdeploy/model_executor/layers/quantization/nvfp4.py`: add an FP4 communication-quantization path to `apply_ep_prefill` that calls `fp4_quantize` on activations before dispatch; the `flashinfer_cutedsl_moe_masked` call distinguishes the FP4 pre-quantized path (transposed layout) from the BF16 path
- `fastdeploy/model_executor/layers/quantization/__init__.py`: add `mix_quant_overrides_nvfp4` logic so that `--quantization mix_quant` can override the NVFP4 config in the model's config.json; dense Linear layers stay bf16 instead of being mislabeled as is_quantized
- `fastdeploy/model_executor/layers/quantization/mix_quant.py`: add a `moe_quant_config` field and a `_build_moe_sub_config` method to correctly pass the original offline quantization config (e.g. modelopt_fp4) to MoE sublayers
- `fastdeploy/model_executor/utils.py`: `process_weight_transpose` now detects uninitialized weights; in the hybrid mix_quant scenario, skip the incremental hooks of offline-quantized sublayers and defer them to `process_final_after_loading`
- `fastdeploy/model_executor/forward_meta.py`: give `ForwardMeta.audio_token_num` a default value of `0`, fixing the None comparison bug
- `fastdeploy/model_executor/layers/linear.py`: add a comment explaining why `forward_cuda` must not be wrapped by the block-wise CUDA Graph decorator (it contains collective communication)
- `fastdeploy/worker/gpu_model_runner.py`: add a block-wise CUDA Graph cleanup entry point to `clear_parameters`
- `custom_ops/gpu_ops/helper.h`: unify C++ pointer style (`int*` instead of `int *`)
## Usage or Command
```bash
# Enable FP4 communication quantization (DeepEP prefill dispatch)
export FD_USE_NVFP4_COMM_QUANT=1
# Enable block-wise CUDA Graph (optional)
export FD_USE_BLOCK_WISE_CUDA_GRAPH=1
export FD_BLOCK_WISE_CUDA_GRAPH_SIZES="1,2,4,8,16,32,64,128,256,512,1024,2048"
# mix_quant override of an NVFP4 checkpoint (for models such as EB5-800B-FP4)
--quantization '{"quantization": "mix_quant", "dense_quant_type":"block_wise_fp8", "is_moe_quantized":true,"moe_quant_type":"modelopt_fp4"}'
```
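As a rough illustration of how a CLI mix_quant JSON like the one above might take precedence over a checkpoint's quantization config: the helper name `apply_mix_quant_override` and any config keys not shown in the command are assumptions for this sketch, not FastDeploy's actual API.

```python
import json

def apply_mix_quant_override(checkpoint_quant: dict, cli_arg: str) -> dict:
    """Hypothetical sketch: let a CLI mix_quant config take precedence over
    the quantization block found in the model's config.json."""
    cli_cfg = json.loads(cli_arg)
    if cli_cfg.get("quantization") != "mix_quant":
        return checkpoint_quant  # no override requested
    return {
        "quantization": "mix_quant",
        # dense layers: online quantization type taken from the CLI
        "dense_quant_type": cli_cfg.get("dense_quant_type", "block_wise_fp8"),
        # MoE layers: keep the checkpoint's offline scheme (e.g. modelopt_fp4)
        "moe_quant_type": cli_cfg.get(
            "moe_quant_type", checkpoint_quant.get("quant_method")
        ),
        "is_moe_quantized": cli_cfg.get("is_moe_quantized", True),
    }

ckpt = {"quant_method": "modelopt_fp4"}
cli = (
    '{"quantization": "mix_quant", "dense_quant_type":"block_wise_fp8", '
    '"is_moe_quantized":true,"moe_quant_type":"modelopt_fp4"}'
)
merged = apply_mix_quant_override(ckpt, cli)
```

The point of the sketch: the offline MoE scheme from the checkpoint survives the merge, while the dense type comes from the CLI.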
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Overall Assessment
The FP4 communication quantization and hybrid quantization implementation is clearly structured, and the key paths are covered by test cases. Consider whether the two newly added but unused parameters (fc1_latent_proj/fc2_latent_proj) should be kept or removed; add a break to the UINT8 switch case for robustness. The Modifications/Usage/Accuracy Tests sections of the PR description need to be filled in.
CI report generated from the code below (updated every 30 minutes):
1 Task Overview: 1 required task is still running; waiting for completion.
2 Task Status Summary
2.1 Required tasks: 7/8 passed
2.2 Optional tasks — 28/30 passed
3 Failure Details (required only): no required task failures.
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review
2026-05-15 18:52:32
📋 Review Summary
PR overview: adds FP4 communication quantization (DeepEP prefill dispatch) and hybrid dense block_wise_fp8 + MoE nvfp4 quantization, and fixes EB5 flagship loading and the audio_token_num-is-None bug
Scope of changes: custom_ops/gpu_ops/moe/, fastdeploy/model_executor/layers/quantization/, fastdeploy/worker/gpu_model_runner.py
Impact tags: [Quantization] [OP] [Graph Optimization]
Issues

| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | fastdeploy/model_executor/layers/quantization/nvfp4.py:672 | fc1_latent_proj/fc2_latent_proj added to apply_ep_prefill but unused |
| 🟡 Suggestion | fastdeploy/model_executor/layers/quantization/nvfp4.py:861 | fc1_latent_proj/fc2_latent_proj added to apply_ep_decode but unused |
| 🟡 Suggestion | tests/operators/test_permute_prefill_masked_gemm.py:49 | the swizzle_scale=True path (the core new feature) has no test coverage |
| ❓ Question | fastdeploy/model_executor/layers/quantization/nvfp4.py:691 | view(float32) relies on an implicit assumption with no runtime assertion |
| 📝 PR convention | — | ## Modifications, ## Usage or Command, and ## Accuracy Tests are all empty; no Checklist items are ticked |
📝 PR Convention Check
The three required sections ## Modifications, ## Usage or Command, and ## Accuracy Tests in the PR body are empty (only comment placeholders remain), and no Checklist items are ticked. The title format is compliant ([Feature] is an official tag).
Suggested title (copy-paste ready):
[Feature][BugFix][Quantization] Support FP4 comm quant, dense block_wise_fp8+MoE nvfp4 mix_quant, fix audio_token_num bug
Suggested PR description (copy-paste ready; must follow the full structure of the checklist §D2 template):
## Motivation
1. Fix the `NoneType > 0` bug caused by `audio_token_num` being `None` when running FP4 on the EB5 flagship model, and fix loading of the EB5 flagship model.
2. Support FP4 communication quantization: quantize activations to FP4 (UINT8 packed) before dispatch, cutting communication volume by roughly 2x.
3. Support a hybrid mix_quant config of online `block_wise_fp8` dense quantization + offline `nvfp4` MoE quantization (e.g. EB5-800B-FP4).
4. Introduce a block-wise CUDA Graph mechanism supporting Linear-layer-level CUDA Graph capture and replay during prefill, reducing gaps between kernels.
## Modifications
- `fastdeploy/model_executor/forward_meta.py`: change the `audio_token_num` default from `None` to `0`
- `fastdeploy/envs.py`: add the `FD_USE_NVFP4_COMM_QUANT` environment variable
- `custom_ops/gpu_ops/moe/prefill_permute_to_masked_gemm.cu`: add a `SWIZZLE_SCALE` template parameter to write FP8 scales directly in the flashinfer cutedsl swizzled layout; add `UINT8 + FLOAT32` and `BF16 + UINT8` dtype dispatch branches
- `custom_ops/gpu_ops/cpp_extensions.cc`: update the `PrefillPermuteToMaskedGemm` signature accordingly, adding `swizzle_scale: bool = false`
- `fastdeploy/model_executor/layers/quantization/nvfp4.py`: add an FP4 communication-quantization path in `apply_ep_prefill`; pass the quantized dispatch_input/dispatch_scale to `ep_prefill_runner.dispatch`; thread through the new `swizzle_scale` parameter
- `fastdeploy/model_executor/layers/quantization/__init__.py`: add `mix_quant_overrides_nvfp4` logic so a CLI mix_quant config can override a model's NVFP4 checkpoint
- `fastdeploy/model_executor/layers/quantization/mix_quant.py`: add a `moe_quant_config` field; refactor the `_build_moe_sub_config` method to build the correct sub-config for offline MoE (modelopt_fp4)
- `fastdeploy/model_executor/utils.py`: fix incremental weight-loading hooks firing too early in the hybrid quantization scenario; add a defensive check for uninitialized weights in `process_weight_transpose`
- `fastdeploy/worker/gpu_model_runner.py`: add block-wise CUDA Graph cleanup in `clear_parameters`
- `fastdeploy/model_executor/layers/linear.py`: add a comment explaining why `forward_cuda` cannot use block-wise CUDA Graph
- `tests/operators/test_permute_prefill_masked_gemm.py`: pass `swizzle_scale=False` explicitly
- `tests/quantization/test_modelopt_nvfp4.py`: add `TestFlashInferCuteDSLMoEHelpers` unit tests for the helper functions
## Usage or Command
```bash
# Enable FP4 communication quantization (DeepEP prefill dispatch)
export FD_USE_NVFP4_COMM_QUANT=1
# Enable block-wise CUDA Graph (prefill phase)
export FD_USE_BLOCK_WISE_CUDA_GRAPH=1
export FD_BLOCK_WISE_CUDA_GRAPH_SIZES="1,2,4,8,16,32,64,128,256,512,1024,2048"
# Online dense block_wise_fp8 quantization + offline MoE nvfp4 quantization (EB5-800B-FP4, etc.)
--quantization '{"quantization": "mix_quant", "dense_quant_type":"block_wise_fp8", "is_moe_quantized":true,"moe_quant_type":"modelopt_fp4"}'
```
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Overall Assessment
The overall implementation is clear, and the handling of FP4 communication quantization and the mix_quant hybrid config is sound. The main follow-ups: add unit tests for the swizzle_scale=True path, clean up the unused fc1_latent_proj/fc2_latent_proj parameters, and fill in the required sections of the PR description.
```python
topk_ids_hookfunc: Callable = None,
shared_experts: nn.Layer = None,
fc1_latent_proj: nn.Layer = None,
fc2_latent_proj: nn.Layer = None,
```

🟡 Suggestion: the fc1_latent_proj and fc2_latent_proj parameters are never used inside this function body, making them dead code.
If they are reserved for a future Latent MoE path, add a # TODO: comment above the parameters explaining the intent; if they were added by mistake, remove them to avoid misleading future developers.
```python
x_fp4, x_fp4_scale = fp4_quantize(
    x, layer.up_gate_proj_input_scale_quant, sf_vec_size=16, is_sf_swizzled_layout=False
)
x_fp4_scale = x_fp4_scale.view(paddle.float32)  # float8_e4m3fn -> float32
```

❓ Question: x_fp4_scale.view(paddle.float32) reinterprets the bytes of a float8_e4m3fn tensor as float32, which requires the tensor's total byte count (num_tokens × hidden_scale) to be a multiple of 4; otherwise view raises a runtime error.
The current typical case (hidden_size=7168, sf_vec_size=16 → hidden_scale=448, and 448 % 4 == 0) is safe, but there is no explicit assertion guarding it. Consider adding, before the view:

```python
assert x_fp4_scale.numel() % 4 == 0, (
    f"FP4 scale numel={x_fp4_scale.numel()} must be divisible by 4 for view(float32)"
)
```

```python
topk_ids_hookfunc: Callable = None,
shared_experts: nn.Layer = None,
fc1_latent_proj: nn.Layer = None,
fc2_latent_proj: nn.Layer = None,
```

🟡 Suggestion: same as apply_ep_prefill — fc1_latent_proj and fc2_latent_proj are also unused inside apply_ep_decode. Handle both consistently (add a TODO note or remove them).
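The divisibility constraint behind the view(float32) concern above can be reproduced in plain NumPy (a paddle-free sketch: uint8 stands in for the raw float8_e4m3fn bytes, and the numbers follow the hidden_size=7168, sf_vec_size=16 example):

```python
import numpy as np

# 448 FP8 scale bytes per token reinterpret cleanly as 112 float32 values
scales = np.zeros(448, dtype=np.uint8)  # uint8 stands in for float8_e4m3fn bytes
as_f32 = scales.view(np.float32)
assert as_f32.size == 112

# a length not divisible by 4 fails -- exactly what the suggested assert guards
bad = np.zeros(446, dtype=np.uint8)
try:
    bad.view(np.float32)
    raised = False
except ValueError:
    raised = True
assert raised
```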
```diff
 topk_ids = topk_ids.cast(paddle.int64)

-results = prefill_permute_to_masked_gemm(x, scale, topk_ids, num_local_experts, max_token_num)
+results = prefill_permute_to_masked_gemm(x, scale, topk_ids, num_local_experts, max_token_num, False)
```

🟡 Suggestion: this change only makes the existing call pass swizzle_scale=False explicitly; there is no test covering the swizzle_scale=True path — which is precisely the core new CUDA kernel branch of this PR (the SWIZZLE_SCALE=true template instantiation plus the swizzled memory writes).
Consider adding a swizzle_scale=True test case with UINT8 + FLOAT32 inputs that verifies:
- the output permute_scale has the shape and values expected by the flashinfer cutedsl swizzled layout;
- the result matches swizzle_scale=False followed by a manual swizzle.
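A NumPy sketch of what the "manual swizzle" reference for such a test might look like. The 128×4-tile, 512-byte-block layout below follows a TensorRT-LLM-style NVFP4 scale-factor arrangement and is an assumption: the exact layout flashinfer cutedsl expects must be checked against the kernel before using this as a golden reference.

```python
import numpy as np

def swizzle_sf_index(m: int, k: int, num_k: int) -> int:
    """Map a (row m, scale-group k) pair to its offset in a swizzled buffer.
    Assumed layout: 128x4 tiles, each tile 512 bytes laid out as [32][4][4]."""
    m_tile, m_in = divmod(m, 128)
    k_tile, k_in = divmod(k, 4)
    tiles_per_row = num_k // 4
    inner = (m_in % 32) * 16 + (m_in // 32) * 4 + k_in
    return (m_tile * tiles_per_row + k_tile) * 512 + inner

def swizzle_scales(flat: np.ndarray, num_m: int, num_k: int) -> np.ndarray:
    """Rearrange row-major scales into the assumed swizzled layout."""
    out = np.empty_like(flat)
    for m in range(num_m):
        for k in range(num_k):
            out[swizzle_sf_index(m, k, num_k)] = flat[m * num_k + k]
    return out
```

Because the mapping is a pure permutation, a test can at least check bijectivity and element preservation even before the exact target layout is confirmed.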
CI report generated from the code below (updated every 30 minutes):
1 Task Overview: 2 required tasks are currently failing and must be resolved before merging.
2 Task Status Summary
2.1 Required tasks: 8/10 passed
2.2 Optional tasks — 28/31 passed
3 Failure Details (required only)
Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage — coverage below threshold (confidence: high)
Suggested fix: add unit tests for newly added files such as nvfp4.py, or request an exemption. Related changes: this PR introduces FP4/FP8 quantization support.
Approval — approval missing (confidence: high)
Suggested fix: have RD members such as @jiangjiajun / @liuyuanle approve. Link: view logs
CI report generated from the code below (updated every 30 minutes):
1 Task Overview: 2 required tasks are failing and block merging; they must be resolved first.
2 Task Status Summary
2.1 Required tasks: 6/8 passed
2.2 Optional tasks — 18/20 passed
3 Failure Details (required only)
Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage — insufficient coverage (confidence: high)
Suggested fix: add unit tests for the new FP4/fp8/nvfp4 code, or request an exemption. Related changes: the PR adds code for FP4 communication quantization, block-wise CUDA Graph, block_wise_fp8, and MoE nvfp4.
Approval — not approved (confidence: high)
Suggested fix: have one of jiangjiajun / liuyuanle / rainyfly / Wanglongzhi2001 approve this PR.
Motivation
1. Fix the bug where `audio_token_num` is `None` when running FP4 on the EB5 flagship model, which led to a `NoneType > 0` comparison, and fix loading of the EB5 flagship model. Also support FP4 communication quantization, taking hidden_size = 7168 as an example.
2. FastDeploy's current CUDA Graph capture works at whole-model granularity, which is coarse and limits flexibility. This PR introduces a block-wise CUDA Graph mechanism that captures and replays CUDA Graphs independently at the level of single operators/layers (e.g. Linear, RMSNorm), enabling finer-grained graph optimization and improving prefill inference performance.
3. Support online block_wise_fp8 dense quantization + offline nvfp4 quantization:
   `--quantization '{"quantization": "mix_quant", "dense_quant_type":"block_wise_fp8", "is_moe_quantized":true,"moe_quant_type":"modelopt_fp4"}'`
4. Support FP4 DeepEP communication. Enable FP4 communication quantization with `export FD_USE_NVFP4_COMM_QUANT=1`.

Prefill can now run inside cuda_graph, reducing the gaps between kernels, as shown in the figures below.


The first figure shows the gaps before this change; after the optimization there are essentially no gaps.
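A back-of-the-envelope check on the hidden_size = 7168 example above. The per-16-element FP8 scale granularity follows the fp4_quantize(sf_vec_size=16) call discussed in review; treating FP8 dispatch as the baseline for the "roughly 2x" figure is an assumption.

```python
# Per-token dispatch payload for hidden_size = 7168, assuming one FP8 scale
# byte per 16 elements (sf_vec_size=16).
hidden = 7168
bf16_bytes = hidden * 2                # 14336: unquantized BF16 activations
fp8_bytes = hidden * 1 + hidden // 16  # 7168 data + 448 scales = 7616
fp4_bytes = hidden // 2 + hidden // 16 # 3584 packed nibbles + 448 scales = 4032

print(bf16_bytes / fp4_bytes)  # ~3.6x vs BF16
print(fp8_bytes / fp4_bytes)   # ~1.9x vs an FP8 dispatch baseline
```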
Modifications
Usage or Command
Accuracy Tests
Checklist
- [ ] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.