[Cherry-Pick][Op][Optimization]Kernel fusion: cast+sigmoid+bias+noauxtc(#7777) by ShaneGZhu · Pull Request #7832 · PaddlePaddle/FastDeploy

ShaneGZhu · 2026-05-15T08:42:59Z

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Add the --enable-moe-scores-elementwise-fuse argument when starting the service.

Accuracy Tests

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

…dle#7777) [Cherry-Pick]

paddle-bot · 2026-05-15T08:51:49Z

Thanks for your contribution!

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-15 16:55:19

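📋 Review Summary

PR overview: Adds a fused grouped_topk CUDA kernel that merges the four steps cast+sigmoid+bias+noaux_tc into a single kernel launch, enabled through the enable_moe_scores_elementwise_fuse switch.
Scope of changes: custom_ops/gpu_ops/, fastdeploy/model_executor/layers/moe/, fastdeploy/scheduler/config.py, fastdeploy/engine/
Affected tags: [OP] [Optimization] [Engine] [Scheduler]

📝 PR Convention Check

Two convention issues: (1) the tag casing in the title deviates from the tag list; (2) every section of the PR description is empty.

Suggested title (can be copied directly):

[Cherry-Pick][OP][Optimization] Kernel fusion: cast+sigmoid+bias+noaux_tc(#7777)

Suggested PR description (can be copied directly):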

## Motivation

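Fuses the four MoE routing operations cast + sigmoid + bias + noaux_tc into a single CUDA kernel (`grouped_topk`), reducing GPU memory bandwidth usage and kernel launch overhead and improving MoE routing performance (applicable to models such as DeepSeek-V3/R1, GLM-4.5-Air, Qwen3-30B-A3B, and Kimi-K2).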

## Modifications

- `custom_ops/gpu_ops/grouped_topk_kernels.cu`: adds the fused kernel, which performs the cast+sigmoid+bias+grouped_topk computation in a single pass and supports float32/float16/bfloat16 inputs
- `custom_ops/gpu_ops/cpp_extensions.cc`: declares the `grouped_topk` custom op and registers it via `PD_BUILD_STATIC_OP`
- `custom_ops/setup_ops.py`: adds `grouped_topk_kernels.cu` to both compile source file lists
- `fastdeploy/model_executor/layers/moe/moe.py`: `get_moe_scores` gains a `use_fused_cast` parameter; with `use_fused_cast=True` it takes the fused `grouped_topk` path
- `fastdeploy/scheduler/config.py` + `engine/args_utils.py` + `worker/worker_process.py`: adds the `enable_moe_scores_elementwise_fuse` switch (default False)
- `fused_moe_{blackwell,cutlass,deepgemm,triton}_backend.py`: pass `use_fused_cast` through as well (effective only on CUDA with the flag enabled)
- `fastdeploy/model_executor/layers/moe/ep.py`: explicitly disables fusion on the EPLB path (TODO) and enables it on the non-EPLB path
- `tests/operators/test_grouped_topk_op.py`: adds unit tests for the grouped_topk operator, covering multiple model configurations and data types

## Usage or Command

```bash
# Enable MoE routing elementwise fusion (disabled by default)
fastdeploy serve ... --enable-moe-scores-elementwise-fuse
```

## Accuracy Tests

N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

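Issues

Severity	File	Summary
🟡 Suggestion	`fastdeploy/model_executor/layers/moe/ep.py:541`	`use_fused_cast=True` is hard-coded, bypassing the `enable_moe_scores_elementwise_fuse` config switch and behaving inconsistently with all other backends

Overall assessment

The new fused kernel is cleanly designed, and the custom op registration, build integration, and unit tests are all in place. The main issue is that the non-EPLB path in ep.py hard-codes use_fused_cast=True, bypassing the unified config switch; it should be made consistent with the other backends.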
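
As a reading aid, the four steps that the fused `grouped_topk` kernel is described as merging (cast, sigmoid, bias, noaux_tc) can be sketched unfused for a single token. This is a minimal pure-Python sketch and not the kernel's actual implementation: the function name `reference_moe_scores`, its parameters (`n_group`, `topk_group`, `top_k`, `routed_scaling_factor`), and the group-scoring rule (sum of the two best biased scores per group, as in DeepSeek-V3-style routing) are assumptions that this PR does not confirm.

```python
import math


def reference_moe_scores(logits, bias, n_group, topk_group, top_k,
                         renormalize=True, routed_scaling_factor=1.0):
    """Unfused reference for one token: cast -> sigmoid -> bias -> grouped top-k.

    Hypothetical helper for illustration only; names and the group-scoring
    rule are assumptions, not taken from the FastDeploy sources.
    """
    # Step 1 (cast): promote the gate logits to floats.
    logits = [float(x) for x in logits]
    # Step 2 (sigmoid): per-expert routing scores.
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    # Step 3 (bias): the correction bias only influences expert selection.
    biased = [s + b for s, b in zip(scores, bias)]

    # Step 4 (grouped top-k): rank groups by the sum of their two best
    # biased scores, keep the best `topk_group` groups.
    group_size = len(biased) // n_group
    groups = [biased[g * group_size:(g + 1) * group_size]
              for g in range(n_group)]
    group_rank = sorted(
        range(n_group),
        key=lambda g: sum(sorted(groups[g], reverse=True)[:2]),
        reverse=True,
    )
    kept = set(group_rank[:topk_group])

    # Pick the `top_k` experts with the highest biased score in kept groups.
    candidates = [i for i in range(len(biased)) if i // group_size in kept]
    topk_ids = sorted(candidates, key=lambda i: biased[i], reverse=True)[:top_k]

    # Routing weights come from the unbiased sigmoid scores.
    weights = [scores[i] for i in topk_ids]
    if renormalize:
        total = sum(weights) + 1e-20
        weights = [w / total for w in weights]
    weights = [w * routed_scaling_factor for w in weights]
    return topk_ids, weights
```

A fused kernel would produce the same kind of result in one launch without materializing the intermediate `scores` and `biased` tensors; a reference like this is mainly useful for checking a fused implementation in unit tests.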

PaddlePaddle-bot · 2026-05-15T08:57:26Z

                    layer.gate_correction_bias,
                    getattr(layer, "renormalize", True),
                    topk_reduce_func=getattr(layer, "topk_reduce_func", None),
+                    use_fused_cast=True,  # NOTE(zhushengguang): Kernel Fusion can be used in non-EPLB scenarios.


🟡 Suggestion: use_fused_cast=True is hard-coded and does not check the enable_moe_scores_elementwise_fuse config switch.

All other backends (blackwell/cutlass/deepgemm/triton) check this flag correctly:

    use_fused = (
        layer.fd_config.scheduler_config.enable_moe_scores_elementwise_fuse
        and current_platform.is_cuda()
    )

Here, however, True is written directly. As a result, when the user has not enabled --enable-moe-scores-elementwise-fuse (default False), the EP non-EPLB path still runs the fused kernel, which is inconsistent with the other backends and contradicts the semantics of the flag.

Suggested change, consistent with the other backends:

    use_fused = (
        layer.fd_config.scheduler_config.enable_moe_scores_elementwise_fuse
        and current_platform.is_cuda()
    )
    # ...
    use_fused_cast=use_fused,
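
One way to keep every call site consistent is to centralize the check in a small helper. The sketch below is illustrative only: `resolve_use_fused_cast` is a hypothetical name, and the `SimpleNamespace` objects are stand-ins for the real `layer` and `current_platform` objects, which are not reproduced here.

```python
from types import SimpleNamespace


def resolve_use_fused_cast(layer, platform) -> bool:
    """Return True only when the config switch is on AND the platform is CUDA.

    Hypothetical helper; mirrors the expression quoted in the review comment.
    """
    scheduler_config = layer.fd_config.scheduler_config
    flag = getattr(scheduler_config, "enable_moe_scores_elementwise_fuse", False)
    return bool(flag and platform.is_cuda())


def make_layer(flag: bool) -> SimpleNamespace:
    """Build a stand-in layer whose config carries only the fusion switch."""
    scheduler_config = SimpleNamespace(enable_moe_scores_elementwise_fuse=flag)
    return SimpleNamespace(fd_config=SimpleNamespace(scheduler_config=scheduler_config))


# Stand-in platforms: one reports CUDA, the other does not.
cuda_platform = SimpleNamespace(is_cuda=lambda: True)
other_platform = SimpleNamespace(is_cuda=lambda: False)

print(resolve_use_fused_cast(make_layer(False), cuda_platform))  # flag off
print(resolve_use_fused_cast(make_layer(True), other_platform))  # not CUDA
print(resolve_use_fused_cast(make_layer(True), cuda_platform))   # both conditions hold
```

With a helper like this, the EP path would pass `use_fused_cast=resolve_use_fused_cast(layer, current_platform)` instead of a literal `True`.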

PaddlePaddle-bot · 2026-05-15T09:43:06Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-16 02:53:08

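The CI report is generated from the following code (updated every 30 minutes):

PR commit: d43e7d0
Merge base: 88a7479 (branch: release/online/20260415)
View full diff
CI details

1 Task Overview

4 Required tasks failed and must be resolved before the PR can be merged.

Total runs (reruns)	Total tasks	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
31(0)	31	23	6	0	0	0

2 Task Status Summary

2.1 Required tasks: 2/8 passed

Required tasks block merging; failures must be handled first.

Status	Task	Duration	Root cause	Suggested fix	Log	Rerun
❌	`Run Base Tests / base_tests`	14m2s	PR issue: paddle.compat.enable_torch_proxy is None at `mxfp4.py:38`; service startup fails	Add a None check around the paddle.compat call at mxfp4.py:38	Job	-
❌	`Run Four Cards Tests / run_4_cards_tests`	25m32s	PR issue: enable_torch_proxy is None at `batch_invariant_ops.py:809`; 3 tests fail	Guard the paddle.compat call at batch_invariant_ops.py:809	Job	-
❌	`xpu_8cards_case_test / run_xpu_8cards_cases`	48m2s	PR issue: noaux_tc import fails at moe.py:45, plus the mxfp4.py TypeError; PD disaggregation service fails to start	Guard the noaux_tc import in moe.py with a CUDA platform check	Job	-
❌	`run_ce_cases`	8m4s	PR issue: same as base_tests; the TypeError at mxfp4.py:38 makes service startup fail	Fix paddle.compat compatibility at mxfp4.py:38 (same as above)	Job	-
🚫	`run_tests_with_coverage`	-	Cancelled (upstream dependency failed)	-	-	-
🚫	`xpu_4cards_case_test / run_xpu_4cards_cases`	-	Cancelled	-	-	-
✅	The other 2 required tasks passed	-	-	-	-	-

2.2 Optional tasks: 21/23 passed

Optional tasks do not block merging; failures are for reference only.

Status	Task	Duration	Log	Rerun
❌	`Run iluvatar Tests / run_iluvatar_cases`	12m28s	Job	-
❌	`Trigger Jenkins for PR`	1m4s	Job	-
✅	The other 21 optional tasks passed	-	-	-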

3 Failure Details (required only)

Run Base Tests / base_tests: test failure (confidence: high)

Status: ❌ Failed
Error type: test failure (service startup crash)
Confidence: high
Root cause summary: the call to paddle.compat.enable_torch_proxy at mxfp4.py:38 hits None, so worker startup fails
Analyzer: ci_analyze_unittest_fastdeploy

Root cause details:
Worker process initialization chain: initialize_fd_config → parse_quant_config → get_quantization_config → from .mxfp4 import MXFP4Config, which triggers the module-level code at mxfp4.py L38: paddle.compat.enable_torch_proxy(scope={"flashinfer"}). In the current Paddle version this function is None, so calling it raises TypeError: 'NoneType' object is not callable. The changes this PR cherry-picks from #7777 very likely modified mxfp4.py or the quantization config import chain, causing every GPU test that depends on quantization initialization to fail (exit code 8).

Key logs:

File "fastdeploy/model_executor/layers/quantization/mxfp4.py", line 38, in <module>
    paddle.compat.enable_torch_proxy(scope={"flashinfer"})
TypeError: 'NoneType' object is not callable
ERROR: Failed to launch worker processes
+ exit 8

Suggested fixes:

fastdeploy/model_executor/layers/quantization/mxfp4.py L38: add a None check: if callable(getattr(paddle.compat, 'enable_torch_proxy', None)): paddle.compat.enable_torch_proxy(scope={"flashinfer"})
Confirm whether this PR's changes to mxfp4.py depend on a specific Paddle version; if so, the CI Paddle version must be updated as well
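
The callable guard suggested above can be sketched in isolation. This is a minimal illustration, not the actual patch: `enable_torch_proxy_if_available` is a hypothetical wrapper name, and the `SimpleNamespace` objects stand in for `paddle.compat` so the example runs without Paddle installed.

```python
from types import SimpleNamespace


def enable_torch_proxy_if_available(compat, **kwargs) -> bool:
    """Call compat.enable_torch_proxy(**kwargs) only if it is callable.

    Returns True when the call was made, False when the attribute is
    missing or None (the situation reported in the traceback).
    """
    fn = getattr(compat, "enable_torch_proxy", None)
    if not callable(fn):
        return False
    fn(**kwargs)
    return True


# Stand-in for a Paddle build where the attribute exists but is None.
compat_without_proxy = SimpleNamespace(enable_torch_proxy=None)

# Stand-in for a build where the function exists; it records its arguments.
recorded_calls = []
compat_with_proxy = SimpleNamespace(
    enable_torch_proxy=lambda **kw: recorded_calls.append(kw)
)

print(enable_torch_proxy_if_available(compat_without_proxy, scope={"flashinfer"}))
print(enable_torch_proxy_if_available(compat_with_proxy, scope={"flashinfer"}))
```

Because the original call sits at module level, a guard of this kind turns an import-time crash into a silent skip; whether skipping is acceptable for the MXFP4 path is a separate question that the report does not settle.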

Fix summary: add a None check around the paddle.compat call at mxfp4.py:38

Related changes: fastdeploy/model_executor/layers/quantization/mxfp4.py
Link: view log

Run Four Cards Tests / run_4_cards_tests: test failure (confidence: high)

Status: ❌ Failed
Error type: test failure
Confidence: high
Root cause summary: enable_torch_proxy is None at batch_invariant_ops.py:809; 3 tests fail on timeout
Analyzer: ci_analyze_unittest_fastdeploy

Failed cases:

Test	Error	Root cause
`test_determinism_long.py::test_deterministic_long_sequence`	TypeError + Pytest timeout 10min	paddle.compat.enable_torch_proxy is None at batch_invariant_ops.py:809
`test_ernie_21b_tp1_dp4.py`	Service failed to start	mxfp4.py:38, same root cause as base_tests
`test_ernie_21b_tp1_dp4_mtp.py`	Service failed to start	mxfp4.py:38, same root cause as base_tests

Root cause details:
When test_determinism_long starts the worker, init_deterministic_mode → enable_batch_invariant_mode → paddle.compat.enable_torch_proxy() raises TypeError, and pytest fails on timeout after waiting 10 minutes. The ERNIE-21B tests likewise cannot start the service because of the TypeError at L38 when mxfp4.py is imported.

Key logs:

File "batch_invariant_ops/batch_invariant_ops.py", line 809, in enable_batch_invariant_mode
    paddle.compat.enable_torch_proxy()
TypeError: 'NoneType' object is not callable
test_determinism_long.py::test_deterministic_long_sequence
Pytest timeout (10 min)

Suggested fixes:

fastdeploy/model_executor/layers/batch_invariant_ops/batch_invariant_ops.py L809: add a None check for paddle.compat.enable_torch_proxy (same approach as the mxfp4.py fix)
Fix mxfp4.py:38 (same as the base_tests suggestion) to cover the ERNIE-21B test failures

Fix summary: add a None guard around the paddle.compat call at batch_invariant_ops.py:809

Related changes: fastdeploy/model_executor/layers/batch_invariant_ops/batch_invariant_ops.py L809
Link: view log

xpu_8cards_case_test / run_xpu_8cards_cases: test failure (confidence: high)

Status: ❌ Failed
Error type: test failure (service startup crash)
Confidence: high
Root cause summary: noaux_tc import fails at moe.py:45, plus the TypeError at mxfp4.py:38; the PD disaggregation service cannot start
Analyzer: ci_analyze_unittest_fastdeploy

Failed cases:

Test	Error	Root cause
`test_pd_21b_ep4tp1.py::test_pd_separation`	PD disaggregation service failed to start	noaux_tc has no XPU build artifact + mxfp4 TypeError
`test_pd_21b_ep4tp4.py::test_pd_separation`	PD disaggregation service failed to start	Same as above
`test_pd_21b_ep4tp4_cudagraph.py::test_pd_separation`	PD disaggregation service failed to start	Same as above
`test_pd_p_tp4ep4_d_tp1ep4.py::test_pd_separation`	PD disaggregation service failed to start	Same as above

Root cause details:
The XPU environment has two problems: (1) moe.py:45 reports import noaux_tc Failed!; this is the noaux_tc kernel introduced by this PR's cherry-pick of #7777, which has no build artifact in the XPU environment (CUDA/GPU only); (2) the worker process hits the same paddle.compat.enable_torch_proxy TypeError at mxfp4.py:38, so none of the 4 PD disaggregation services (EP4TP1/EP4TP4/CUDAGraph/mixed parallelism) can start, and all of them fail after 44 minutes.

Key logs:

WARNING  moe.py[line:45] import noaux_tc Failed!
File "fastdeploy/model_executor/layers/quantization/mxfp4.py", line 38, in <module>
    paddle.compat.enable_torch_proxy(scope={"flashinfer"})
TypeError: 'NoneType' object is not callable
pytest.fail("PD分离服务启动失败")  [4 times]
======================== 4 failed in 2653.31s (0:44:13) ========================

Suggested fixes:

fastdeploy/model_executor/layers/activation/moe.py L45: wrap the noaux_tc import in try/except and skip it in non-CUDA environments (add a platform check: if is_cuda_available(): from .noaux_tc import ...)
Fix paddle.compat compatibility at mxfp4.py:38 (same as the base_tests suggestion)
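
The import guard suggested above can be illustrated generically. The sketch below is an assumption-laden example rather than FastDeploy code: `import_optional_op` and its `enabled` parameter are hypothetical, and the standard-library `math` module stands in for the compiled custom-op module so the example runs anywhere. Whether moe.py already contains such protection is not shown in this report.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def import_optional_op(module_name: str, op_name: str, enabled: bool = True):
    """Return `op_name` from `module_name`, or None if it is unavailable.

    `enabled` models a platform check (for example "is this a CUDA build?");
    when it is False the import is not even attempted.
    """
    if not enabled:
        return None
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Downgrade a missing compiled op to a warning instead of a crash.
        logger.warning("import %s failed; falling back to the unfused path", op_name)
        return None
    return getattr(module, op_name, None)


# `math.sqrt` stands in for an op that exists on this platform.
sqrt_op = import_optional_op("math", "sqrt")
# A module that does not exist models the missing XPU build artifact.
missing_op = import_optional_op("hypothetical_missing_ops_module", "noaux_tc")

print(sqrt_op is not None, missing_op is None)
```

Callers then branch on `op is None` to choose the fallback path, which keeps the platform decision in one place.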

Fix summary: the noaux_tc import in moe.py needs a CUDA platform check

Related changes: fastdeploy/model_executor/layers/activation/moe.py L45, mxfp4.py L38
Link: view log

Extracted partial CE model tasks to run in CI. / run_ce_cases: test failure (confidence: high)

Status: ❌ Failed
Error type: test failure (service startup crash)
Confidence: high
Root cause summary: paddle.compat.enable_torch_proxy is None at mxfp4.py:38; the CE service fails to start
Analyzer: ci_analyze_unittest_fastdeploy

Root cause details:
When the CE model test test_EB_Lite_serving.py starts the service, quantization config parsing triggers the import of the mxfp4.py module, and paddle.compat.enable_torch_proxy(scope={"flashinfer"}) at L38 raises TypeError. The worker process crashes, the API service never becomes ready, and the test fails with exit_code=1. The root cause is identical to base_tests, so fixing mxfp4.py:38 also covers this test.

Key logs:

File "fastdeploy/model_executor/layers/quantization/mxfp4.py", line 38, in <module>
    paddle.compat.enable_torch_proxy(scope={"flashinfer"})
TypeError: 'NoneType' object is not callable
[ERROR] api_server: Failed to initialize FastDeploy LLM engine, service exit now!
[ERROR] test_EB_Lite_serving.py 起服务或执行异常，exit_code=1

Suggested fixes:

Fix fastdeploy/model_executor/layers/quantization/mxfp4.py L38 (identical to the base_tests suggestion)
This is the same root cause; a single fix covers all GPU-related tests (base_tests + ce_cases)

Fix summary: fix the paddle.compat call at mxfp4.py:38 by adding a None check

Related changes: fastdeploy/model_executor/layers/quantization/mxfp4.py L38
Link: view log

PaddlePaddle-bot · 2026-05-15T10:03:03Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-15 17:56:21

The CI report is generated from the following code (updated every 30 minutes):

PR commit: d43e7d0
Merge base: 88a7479 (branch: release/online/20260415)
View full diff
CI details

1 Task Overview

⚠️ 4 Required tasks failed and must be handled first; 2 Required tasks are still running.

Total runs (reruns)	Total tasks	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
30(0)	30	22	6	2	0	0

2 Task Status Summary

2.1 Required tasks: 2/8 passed

Required tasks block merging; failures must be handled first.

Status	Task	Duration	Root cause	Suggested fix	Log	Rerun
❌	`xpu_8cards_case_test / run_xpu_8cards_cases`	48m2s	Environment issue: paddle.compat.enable_torch_proxy is None at mxfp4.py:38	Add a callable check at mxfp4.py:38	Job	-
❌	`Run Base Tests / base_tests`	14m2s	Environment issue: TypeError at mxfp4.py:38; worker process crashes with exit code 8	Add a callable check at mxfp4.py:38 or upgrade Paddle	Job	-
❌	`Run Four Cards Tests / run_4_cards_tests`	25m32s	Environment issue: 3 e2e tests failed, inferred to share the mxfp4.py cause	Same mxfp4 environment fix as base_tests	Job	-
❌	`Extracted partial CE model tasks to run in CI. / run_ce_cases`	8m4s	Environment issue: the test_EB_Lite_serving service cannot start; TypeError at mxfp4.py:38	Add a callable check at mxfp4.py:38	Job	-
⏳	`xpu_4cards_case_test / run_xpu_4cards_cases`	-	Running	-	Job	-
⏳	`Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage`	-	Running	-	Job	-
✅	The other 2 required tasks passed (`run_tests_logprob`, `stable_tests`)	-	-	-	-	-

2.2 Optional tasks: 20/22 passed

Optional tasks do not block merging; failures are for reference only.

Status	Task	Duration	Log	Rerun
❌	`Run iluvatar Tests / run_iluvatar_cases`	12m28s	Job	-
❌	`Trigger Jenkins for PR` (CI_METAX)	1m4s	Job	-
✅	The other 20 optional tasks passed	-	-	-

3 Failure Details (required only)

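xpu_8cards_case_test / run_xpu_8cards_cases: environment issue (confidence: high)

Status: ❌ Failed
Error type: environment issue (module import crash, plus an XPU op warning introduced by the PR)
Confidence: high
Root cause summary: in the XPU environment, paddle.compat.enable_torch_proxy is None at mxfp4.py:38, which crashes the worker; the grouped_topk CUDA op added by this PR is not compiled on XPU (protected by try-except, not the direct cause of the crash)
Analyzer: ci_analyze_unittest_fastdeploy

Failed cases:

Test	Error	Root cause
`test_pd_21b_ep4tp1.py::test_pd_separation`	Failed: PD disaggregation service failed to start	mxfp4.py:38 TypeError
`test_pd_21b_ep4tp4.py::test_pd_separation`	Failed: PD disaggregation service failed to start	mxfp4.py:38 TypeError
`test_pd_21b_ep4tp4_cudagraph.py::test_pd_separation`	Failed: PD disaggregation service failed to start	mxfp4.py:38 TypeError
`test_pd_p_tp4ep4_d_tp1ep4.py::test_pd_separation`	Failed: PD disaggregation service failed to start	mxfp4.py:38 TypeError

Root cause details:
The worker process triggers the lazy import of mxfp4.py in the initialize_fd_config → parse_quant_config → get_quantization_config call chain. mxfp4.py:38 calls paddle.compat.enable_torch_proxy(scope={"flashinfer"}) at module level; in the XPU build of Paddle this function is None (not callable), which raises TypeError. None of the 4 PD disaggregation tests can start a worker, so all of them end with "PD disaggregation service failed to start". mxfp4.py is outside the scope of this PR's changes. The grouped_topk CUDA op added by this PR does trigger WARNING: import noaux_tc Failed! in the XPU environment, but that import is already protected by try-except and is not the direct cause of the crash.

Key logs:

WARNING  moe.py: import noaux_tc Failed!
File "/workspace/FastDeploy/fastdeploy/model_executor/layers/quantization/mxfp4.py", line 38, in <module>
    paddle.compat.enable_torch_proxy(scope={"flashinfer"})
TypeError: 'NoneType' object is not callable
FAILED tests/xpu_ci/8cards_cases/test_pd_21b_ep4tp1.py::test_pd_separation
4 failed in 2653.31s (0:44:13)

Suggested fixes:

fastdeploy/model_executor/layers/quantization/mxfp4.py:38: add a platform compatibility check: if callable(getattr(paddle.compat, 'enable_torch_proxy', None)):
Confirm whether grouped_topk needs to be compiled for XPU (if it is CUDA-only, the existing try-except protection is sufficient and no further action is needed)
Verify whether the merge base (88a7479) also shows this failure, to determine whether it is a pre-existing problem

Fix summary: add a callable check at mxfp4.py:38; the XPU grouped_topk warning can be ignored

Related changes: the PR adds a grouped_topk import at fastdeploy/model_executor/layers/moe/moe.py:36, which triggers a warning in the XPU environment (already protected)
Link: view log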

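Run Base Tests / base_tests: environment issue (confidence: high)

Status: ❌ Failed
Error type: environment issue (module import crash)
Confidence: high
Root cause summary: paddle.compat.enable_torch_proxy is None at mxfp4.py:38, which crashes the worker with exit code 8
Analyzer: ci_analyze_unittest_fastdeploy

Root cause details:
The test uses the installed FastDeploy package (/usr/local/lib/python3.10/dist-packages/fastdeploy/...). When it starts the ernie-4_5-21b-a3b-bf16-paddle model (wint4 quantization), the worker's initialize_fd_config → parse_quant_config → get_quantization_config call chain triggers the import of the mxfp4.py module. On line 38, paddle.compat.enable_torch_proxy(scope={"flashinfer"}) is None in the current CI Paddle version, which raises TypeError. mxfp4.py is not among the files changed by this PR, so this is judged to be an environment compatibility problem between the installed package and the Paddle version.

Key logs:

File ".../fastdeploy/model_executor/layers/quantization/mxfp4.py", line 38, in <module>
    paddle.compat.enable_torch_proxy(scope={"flashinfer"})
TypeError: 'NoneType' object is not callable
ERROR: Failed to launch worker processes
+ exit 8

Suggested fixes:

fastdeploy/model_executor/layers/quantization/mxfp4.py:38: add an if callable(getattr(paddle.compat, 'enable_torch_proxy', None)): guard
Check the Paddle version on the CI runner and confirm that paddle.compat.enable_torch_proxy is exported correctly
Verify whether the merge base also shows this failure

Fix summary: add a callable check at mxfp4.py:38 or upgrade the CI Paddle version

Related changes: this PR does not modify mxfp4.py; the failure is not directly related to this PR's code changes
Link: view log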

Run Four Cards Tests / run_4_cards_tests: environment issue (confidence: medium)

Status: ❌ Failed
Error type: environment issue (inferred); 3 e2e test files failed
Confidence: medium (no detailed traceback)
Root cause summary: 3 ERNIE 21B TP1-DP4 e2e tests failed, inferred to be the mxfp4.py:38 environment compatibility problem
Analyzer: ci_analyze_unittest_fastdeploy

Failed cases:

Test	Error	Root cause
`test_determinism_long.py`	exit code 1	Inferred: mxfp4.py:38 TypeError (no detailed log)
`test_ernie_21b_tp1_dp4.py`	exit code 1	Inferred: mxfp4.py:38 TypeError (no detailed log)
`test_ernie_21b_tp1_dp4_mtp.py`	exit code 1	Inferred: mxfp4.py:38 TypeError (no detailed log)

Root cause details:
Only 1 test passes in step_log (test_vocab_parallel_embedding_deterministic, which does not involve wint4 quantization). The other 3 are ERNIE 21B TP1-DP4 e2e service tests; starting a quantized model service in the same environment very likely triggers the same mxfp4.py:38 TypeError as base_tests. The detailed logs are not shown in step_log, so confidence is medium.

Key logs:

3 test file(s) failed:
/workspace/FastDeploy/tests/e2e/4cards_cases/test_determinism_long.py
/workspace/FastDeploy/tests/e2e/4cards_cases/test_ernie_21b_tp1_dp4.py
/workspace/FastDeploy/tests/e2e/4cards_cases/test_ernie_21b_tp1_dp4_mtp.py
##[error]Process completed with exit code 1.

Suggested fixes:

Add the callable check at mxfp4.py:38 (same as base_tests)
Inspect the detailed 4cards test logs (GitHub Artifacts) to confirm the root cause

Fix summary: fix mxfp4.py:38; the detailed logs can be downloaded from Artifacts

Link: view log

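Extracted partial CE model tasks to run in CI. / run_ce_cases: environment issue (confidence: high)

Status: ❌ Failed
Error type: environment issue (worker startup crash)
Confidence: high
Root cause summary: the test_EB_Lite_serving.py service cannot start; the TypeError at mxfp4.py:38 crashes the worker
Analyzer: ci_analyze_unittest_fastdeploy

Root cause details:
When the CE model service test test_EB_Lite_serving.py starts the FastDeploy LLM service, the worker process cannot finish initialization because of the TypeError at mxfp4.py:38. The service reports ERROR: Failed to initialize FastDeploy LLM engine, and the test framework reports an error after receiving the non-zero exit code. The root cause is identical to base_tests, and mxfp4.py is outside the scope of this PR's changes.

Key logs:

File ".../mxfp4.py", line 38, in <module>
    paddle.compat.enable_torch_proxy(scope={"flashinfer"})
TypeError: 'NoneType' object is not callable
[ERROR] api_server.py: Failed to initialize FastDeploy LLM engine, service exit now!
[ERROR] test_EB_Lite_serving.py 起服务或执行异常，exit_code=1

Suggested fixes:

Add a callable(getattr(paddle.compat, 'enable_torch_proxy', None)) check at mxfp4.py:38
Verify whether the merge base fails in the same way

Fix summary: add a callable check at mxfp4.py:38 to fix the environment compatibility problem

Related changes: this PR does not modify mxfp4.py
Link: view log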

ShaneGZhu added 2 commits May 15, 2026 10:09

[Op][Optimization]Kernel fusion: cast+sigmoid+bias+noauxtc (PaddlePad…

1ff1715

…dle#7777) [Cherry-Pick]

Kernel fusion for blackwell and deepgemm backend in non-EPLB scenarios

d43e7d0

ShaneGZhu had a problem deploying to Metax_ci May 15, 2026 08:43 — with GitHub Actions Failure

PaddlePaddle-bot reviewed May 15, 2026


[Cherry-Pick][Op][Optimization]Kernel fusion: cast+sigmoid+bias+noauxtc(#7777) #7832
ShaneGZhu wants to merge 2 commits into PaddlePaddle:release/online/20260415 from ShaneGZhu:cp-0415
