[unitest] small change in test_deepgemm_precision.py #7834
zhoutianzi666 wants to merge 4 commits into develop
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             develop    #7834   +/-   ##
==========================================
  Coverage           ?   63.27%
==========================================
  Files              ?      462
  Lines              ?    64276
  Branches           ?     9851
==========================================
  Hits               ?    40670
  Misses             ?    20841
  Partials           ?     2765
==========================================
```

Flags with carried forward coverage won't be shown.
CI report generated from the code below (updated every 30 minutes):
1 Task overview: all Required tasks passed ✅; 3 Optional tasks failed (does not block merging).
2 Task status summary
2.1 Required tasks: 8/8 passed
2.2 Optional tasks: 28/31 passed
3 Failure details (Required only): none
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-15 22:20:43
📋 Review Summary
PR overview: improves the performance timing in test_deepgemm_precision.py and test_flashmla_precision.py by replacing the simple loop with CUDA Events, adding warmup logic, and printing TFLOPS/bandwidth metrics.
Scope of change: tests/operators/
Impact tag: [CI]
Issues

| Level | File | Summary |
|---|---|---|
| 📝 PR convention | — | Title tag [unitest] is not an official tag (typo); should be [CI] |
| 📝 PR convention | — | All PR description sections are empty, containing only template placeholders |
| ❓ Question | tests/operators/test_deepgemm_precision.py:454 | Warmup loop runs 100 times vs. 10 for flashmla; OOM risk |
| ❓ Question | tests/operators/test_deepgemm_precision.py:471 | [-1:] keeps only the last measurement; the minimum or mean is usually preferred |
| ❓ Question | tests/operators/test_flashmla_precision.py:37 | kv_len cut sharply from 128K to 8K; bandwidth results may not be representative |
📝 PR convention check
The title tag [unitest] is misspelled (should be unittest) and is not in the official tag list; since this change only touches test files, [CI] should be used. Every section of the PR description (Motivation / Modifications / Usage or Command / Accuracy Tests) contains only template placeholders with no actual content.
Suggested title (copy directly):
[CI] Improve timing accuracy in test_deepgemm_precision and test_flashmla_precision
Suggested PR description (copy directly; it must reproduce the full structure of the checklist §D2 template):
## Motivation
Improve the performance timing in `test_deepgemm_precision.py` and `test_flashmla_precision.py` by replacing the original simple loop with CUDA Events for higher timing precision, and add warmup logic to reduce the impact of the GPU's first execution.
## Modifications
- `tests/operators/test_deepgemm_precision.py`: time with `paddle.device.cuda.Event`, add a 100-iteration warmup loop, compute and print TFLOPS; uncomment the three `one_invoke` calls in `test_main`; remove the commented-out paddle profiler code
- `tests/operators/test_flashmla_precision.py`: time with `paddle.device.cuda.Event`, add a 10-iteration warmup loop, compute and print bandwidth (TB/s); change `kv_len` from `1024 * 128` to `1024 * 8`; remove the commented-out paddle profiler code
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [ ] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Overall assessment
The overall direction is sound: CUDA Event timing is more accurate than simple loop timing. Fix the title tag, fill in the PR description, and confirm that the warmup-count difference and the kv_len reduction are justified.
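For reference, the TFLOPS number such a timing loop prints follows directly from the GEMM shape. A minimal sketch in plain Python (no Paddle dependency); the shape is one of those listed in the suggested Modifications, and the 1 ms timing is purely hypothetical:

```python
def gemm_tflops(m: int, n: int, k: int, elapsed_ms: float) -> float:
    """TFLOPS of an (m x k) @ (k x n) GEMM: 2*m*n*k FLOPs (one multiply
    and one add per multiply-accumulate) over elapsed_ms milliseconds."""
    return (2 * m * n * k) / (elapsed_ms * 1e-3) / 1e12

# One of the shapes added to test_main, with a hypothetical 1 ms per call.
print(round(gemm_tflops(4096, 4096, 4096, 1.0), 2))  # 137.44
```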
```python
for i in range(test_loops):
    # This line is placed here so the event timing is more accurate!
    # Awesome!
    for j in range(100):
```
```python
print(baseline_out - deepgemm_output)
end_events[i].record()

total_time = np.array([round(s.elapsed_time(e), 10) for s, e in zip(start_events, end_events)])[-1:]
```
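The event-pair structure in this hunk (record before and after each measured iteration, after warmup) can be mirrored on CPU with `time.perf_counter`. This is a hedged analogue only; the test itself uses `paddle.device.cuda.Event`, which this sketch does not depend on:

```python
import time

def time_loops(fn, test_loops: int = 5, warmup: int = 3):
    """Warm up first, then keep one (start, end) timing pair per measured
    iteration, mirroring the start_events/end_events lists in the diff."""
    for _ in range(warmup):              # warmup: absorb first-run overheads
        fn()
    times_ms = []
    for _ in range(test_loops):
        start = time.perf_counter()      # analogue of start_events[i].record()
        fn()
        end = time.perf_counter()        # analogue of end_events[i].record()
        times_ms.append((end - start) * 1e3)
    return times_ms

times = time_loops(lambda: sum(range(10_000)))
print(len(times))  # 5: one measurement per timed iteration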
```diff
  bsz = 128
- kv_len = 1024 * 128
+ kv_len = 1024 * 8
  page_size = 64
```
CI report generated from the code below (updated every 30 minutes):
1 Task overview: ❌ 1 Required task failed and needs attention; the remaining Required tasks passed.
2 Task status summary
2.1 Required tasks: 9/10 passed
2.2 Optional tasks: 29/32 passed
3 Failure details (Required only): Pre Commit — code style (confidence: high) Pre Commit
Root cause details: Key log: Suggested fix:
Fix summary: delete or rename the unused variable at test_flashmla_precision.py:60. Related change: files modified by this PR. Link: view log
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-16 12:09:26
📋 Review Summary
PR overview: improves the timing in the deepgemm and flashmla operator unit tests, replacing the fixed loop with precise CUDA Event timing and adding TFLOPS / bandwidth metric output
Scope of change: tests/operators/
Impact tag: [CI]
Issues

| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | tests/operators/test_deepgemm_precision.py:476 | Precision output is commented out; one_invoke no longer prints any precision result, contradicting the "precision" in the filename |
| 🟡 Suggestion | tests/operators/test_deepgemm_precision.py:471 | Only the last timing is kept via [-1:] rather than a mean; timing stability is not guaranteed |
| ❓ Question | tests/operators/test_flashmla_precision.py:36 | kv_len cut from 1024×128 to 1024×8; representativeness of the bandwidth measurement is questionable |
📝 PR convention check
The PR title uses the unofficial tag [unitest] (not in the official tag list); every PR description section is empty, keeping only the template comment placeholders, which violates the template requirements.
Suggested title (copy directly):
[CI] Improve CUDA Event timing precision in deepgemm and flashmla unit tests
Suggested PR description (copy directly):
## Motivation
Improve operator timing in `test_deepgemm_precision.py` and `test_flashmla_precision.py` by replacing the fixed loop with precise CUDA Event timing, output TFLOPS / bandwidth metrics, and delete the obsolete commented-out profiler code.
## Modifications
- `tests/operators/test_deepgemm_precision.py`:
  - Add `import numpy as np`
  - Time `fp8_gemm_nt` precisely with `paddle.device.cuda.Event`
  - Change warmup to an inner 100-iteration loop to ensure GPU memory is fully exercised
  - Add TFLOPS computation and output
  - Add several M/N/K configurations to `test_main` (4096×4096×4096, 4096×2048×7168, 4096×65536×1536)
  - Delete obsolete commented-out profiler code
- `tests/operators/test_flashmla_precision.py`:
  - Add `import numpy as np`
  - Reduce `kv_len` from `1024 * 128` to `1024 * 8`
  - Time `mla_blackwell` precisely with `paddle.device.cuda.Event`
  - Add memory bandwidth computation and output (TB/s)
  - Delete obsolete commented-out profiler code
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [ ] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Overall assessment
The approach is sound; CUDA Event timing is more precise. Restore the precision output in test_deepgemm_precision.py and fill in every section of the PR description.
```python
print(total_time)
print(flops)

# print(baseline_out - deepgemm_output)
```
🟡 Suggestion: precision output is fully commented out
The file is named test_deepgemm_precision.py, yet this change comments out print(baseline_out - deepgemm_output) (the original assert was already commented out), so one_invoke no longer produces any precision output, contradicting the "precision" in the filename.
Suggestion: at least keep the print to preserve precision observability:

```python
print(baseline_out - deepgemm_output)  # keep precision observable
# assert (baseline_out - deepgemm_output).abs().max().item() < 0.1
```

```python
print(baseline_out - deepgemm_output)
end_events[i].record()

total_time = np.array([round(s.elapsed_time(e), 10) for s, e in zip(start_events, end_events)])[-1:]
```
🟡 Suggestion: only the last timing value is kept, lacking statistical stability
[-1:] keeps only the last iteration's time; if that iteration happens to hit scheduling jitter, the result is unreliable. Take the median or mean instead (excluding the first warmup iteration):

```python
total_time = float(np.median(np.array([round(s.elapsed_time(e), 10) for s, e in zip(start_events, end_events)])[1:]))
```
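The reviewer's point can be checked numerically: a single jittered iteration dominates the last-only slice, while the median stays stable. A small sketch with synthetic timings (the values are made up, not real measurements):

```python
import numpy as np

# Synthetic per-iteration timings in ms: a cold first run and a jittered last one.
timings = np.array([5.2, 1.01, 0.99, 1.00, 1.02, 3.7])

last_only = timings[-1:]                   # what the PR currently keeps
median_ms = float(np.median(timings[1:]))  # suggested: drop warmup, take the median

print(last_only)   # [3.7] -- dominated by the jittered final iteration
print(median_ms)   # 1.01 -- robust central value
```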
```diff
  bsz = 128
- kv_len = 1024 * 128
+ kv_len = 1024 * 8
```
❓ Question: the large kv_len reduction may hurt the representativeness of the bandwidth measurement
kv_len shrinks from 1024 * 128 (131072) to 1024 * 8 (8192), a 16× reduction. With short sequences the KV data volume is small and the L2 cache hit rate rises, so the measured bandwidth may be inflated and fail to represent the real bandwidth bottleneck of long-sequence serving. If the goal is only to shorten CI runtime, add a code comment explaining why this value was chosen.
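The concern can be quantified: KV-cache traffic scales linearly with kv_len, so the 16× cut shrinks the streamed working set by the same factor. A rough sketch assuming a generic K+V cache layout; the num_kv_heads=1 and head_dim=576 values are hypothetical, not taken from the test:

```python
def kv_bytes(bsz: int, kv_len: int, num_kv_heads: int, head_dim: int,
             dtype_bytes: int = 2) -> int:
    """Approximate KV-cache bytes read per decode step (K and V, hence the 2*)."""
    return 2 * bsz * kv_len * num_kv_heads * head_dim * dtype_bytes

old = kv_bytes(128, 1024 * 128, 1, 576)  # kv_len = 131072
new = kv_bytes(128, 1024 * 8, 1, 576)    # kv_len = 8192
print(old // new)  # 16: a 16x smaller working set, so L2 reuse can inflate TB/s
```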
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- [ ] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- [ ] Format your code, run `pre-commit` before commit.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.