
[unitest] small change in test_deepgemm_precision.py#7834

Open
zhoutianzi666 wants to merge 4 commits into PaddlePaddle:develop from zhoutianzi666:make_time_more_precisi

Conversation

@zhoutianzi666
Collaborator

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@codecov-commenter

codecov-commenter commented May 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@79dd64a). Learn more about missing BASE report.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7834   +/-   ##
==========================================
  Coverage           ?   63.27%           
==========================================
  Files              ?      462           
  Lines              ?    64276           
  Branches           ?     9851           
==========================================
  Hits               ?    40670           
  Misses             ?    20841           
  Partials           ?     2765           
Flag Coverage Δ
GPU 72.38% <ø> (?)
XPU 7.12% <ø> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

@PaddlePaddle-bot

PaddlePaddle-bot commented May 15, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-16 17:36:22

CI report generated from the code below (refreshed every 30 minutes):


1 Task overview

All Required tasks passed ✅; 3 Optional tasks failed (does not block merging).

Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped
39 (0) | 39 | 36 | 3 | 0 | 0 | 0

2 Task status summary

2.1 Required tasks: 8/8 passed

Required tasks block merging; failures must be addressed first.

Status | Task | Duration | Root cause | Suggested fix | Log | Rerun
All 8 required tasks passed | - | - | - | - | - | -

2.2 Optional tasks — 28/31 passed

Optional tasks do not block merging; failures are informational only.

Status | Task | Duration | Log | Rerun
❌ | Run iluvatar Tests / run_iluvatar_cases | 10m35s | Job | -
❌ | Check PR Template | 12s | Job | -
❌ | CI_HPU | 1h13m | Job | -
All remaining 28 optional tasks passed | - | - | -

3 Failure details (required only)

@zhoutianzi666 zhoutianzi666 changed the title commit [unitest] small change in test_deepgemm_precision.py May 15, 2026

@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-15 22:20:43

📋 Review summary

PR overview: improves the performance timing in test_deepgemm_precision.py and test_flashmla_precision.py, replacing a simple loop with CUDA Events, adding warmup logic, and printing TFLOPS / bandwidth metrics.
Scope of change: tests/operators/
Impact Tag: [CI]

Issues

Severity | File | Summary
📝 PR convention | title | [unitest] is not an official Tag (typo); should be [CI]
📝 PR convention | PR description | all sections are empty, containing only template placeholders
❓ Question | tests/operators/test_deepgemm_precision.py:454 | 100 warmup iterations vs flashmla's 10; potential OOM risk
❓ Question | tests/operators/test_deepgemm_precision.py:471 | [-1:] keeps only the last measurement; the minimum or mean is usually preferred
❓ Question | tests/operators/test_flashmla_precision.py:37 | kv_len sharply reduced from 128K to 8K; bandwidth results may not be representative

📝 PR convention check

The title tag [unitest] is misspelled (should be unittest) and is not in the official Tag list; since this change only touches test files, [CI] should be used. All PR description sections (Motivation / Modifications / Usage or Command / Accuracy Tests) contain only template placeholders, with no actual content.

Suggested title (copy-paste ready):

  • [CI] Improve timing accuracy in test_deepgemm_precision and test_flashmla_precision

Suggested PR description (copy-paste ready; must reproduce the full structure of the checklist §D2 template):

## Motivation
Improve the performance timing in `test_deepgemm_precision.py` and `test_flashmla_precision.py`: use CUDA Events instead of the original simple loop for better timing precision, and add warmup logic to reduce the impact of first GPU execution.

## Modifications
- `tests/operators/test_deepgemm_precision.py`: time with `paddle.device.cuda.Event`, add a 100-iteration warmup loop, compute and print TFLOPS; uncomment the three `one_invoke` calls in `test_main`; clean up commented-out paddle profiler code
- `tests/operators/test_flashmla_precision.py`: time with `paddle.device.cuda.Event`, add a 10-iteration warmup loop, compute and print bandwidth (TB/s); change `kv_len` from `1024 * 128` to `1024 * 8`; clean up commented-out paddle profiler code

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [ ] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Overall assessment

The overall direction is sound: CUDA Event timing is more precise than simple loop timing. Recommend fixing the title Tag, completing the PR description, and justifying the warmup-count difference and the kv_len reduction.
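The warmup-then-measure pattern the review endorses can be sketched portably. This is an illustrative harness only: the PR itself times with `paddle.device.cuda.Event` on GPU, while `time_kernel` and the `time.perf_counter` stand-in below are assumptions, not the PR's code.

```python
import time
import statistics


def time_kernel(fn, warmup=10, loops=30):
    """Warm up, then time repeated calls and return the median seconds.

    time.perf_counter stands in for CUDA events here; real GPU timing
    would record start/end events around fn() and synchronize before
    reading elapsed times.
    """
    for _ in range(warmup):            # warmup: amortize one-off setup costs
        fn()
    samples = []
    for _ in range(loops):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)  # median resists scheduling jitter


median_s = time_kernel(lambda: sum(range(10_000)))
print(median_s >= 0.0)
```

The same shape applies with CUDA events: record a start event, launch, record an end event, synchronize, then aggregate the per-iteration elapsed times.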


@PaddlePaddle-bot

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-16 02:47:05

CI report generated from the code below (refreshed every 30 minutes):


1 Task overview

❌ 1 Required task failed and needs attention; all other Required tasks passed.

Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped
42 (0) | 42 | 38 | 3 | 0 | 1 | 0

2 Task status summary

2.1 Required tasks: 9/10 passed

Required tasks block merging; failures must be addressed first.

Status | Task | Duration | Root cause | Suggested fix | Log | Rerun
❌ | Pre Commit | 44s | PR issue: flake8 F841, unused variable a at test_flashmla_precision.py:60 | delete or rename the unused variable at test_flashmla_precision.py:60 | Job | -
All remaining 9 required tasks passed | - | - | - | - | - | -

2.2 Optional tasks — 29/32 passed

Optional tasks do not block merging; failures are informational only.

Status | Task | Duration | Log | Rerun
❌ | Run iluvatar Tests / run_iluvatar_cases | 34m40s | Job | -
❌ | Check PR Template | 18s | Job | -
⏸️ | CI_HPU | - | - | -
All remaining 29 optional tasks passed | - | - | -

3 Failure details (required only)

Pre Commit — code style (confidence: high)

Pre Commit

  • Status: ❌ failed
  • Error type: code style
  • Confidence: high
  • Root-cause summary: flake8 F841: unused variable a at test_flashmla_precision.py:60
  • Analyzer: generic analysis (fallback)

Root-cause details:
Pre-commit detected a flake8 F841 error at line 60 of tests/operators/test_flashmla_precision.py: the local variable a is assigned but never used. This file is one of the files modified by this PR. All other style checks (black, isort, ruff, etc.) passed.

Key log:

flake8...................................................................Failed
- hook id: flake8
- exit code: 1

tests/operators/test_flashmla_precision.py:60:21: F841 local variable 'a' is assigned to but never used

Suggested fix:

  1. Delete the unused assignment at line 60 of tests/operators/test_flashmla_precision.py, or rename the variable to _ to signal that the value is deliberately ignored
  2. Verify locally: pre-commit run --files tests/operators/test_flashmla_precision.py tests/operators/test_deepgemm_precision.py

Fix summary: delete or rename the unused variable a at test_flashmla_precision.py:60

Related changes: files modified by the PR: tests/operators/test_flashmla_precision.py and tests/operators/test_deepgemm_precision.py

Link: View log
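For illustration, the F841 pattern and the two fixes the bot suggests look like this (a hypothetical snippet, not the PR's actual line 60):

```python
def compute():
    """Stand-in for a call whose side effect matters but whose value may not."""
    return 42


def bad():
    a = compute()   # flake8 F841: local variable 'a' is assigned to but never used
    return 0


def fixed_drop_binding():
    compute()       # fix 1: call for the side effect, keep no binding
    return 0


def fixed_underscore():
    _ = compute()   # fix 2: bind to "_" to signal the value is deliberately ignored
    return 0


print(fixed_drop_binding(), fixed_underscore())  # prints "0 0"
```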


@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-16 12:09:26

📋 Review summary

PR overview: improves the timing in the deepgemm and flashmla operator unit tests, replacing the original fixed loop with precise CUDA Event timing, and adds TFLOPS / bandwidth metric output

Scope of change: tests/operators/

Impact Tag: [CI]

Issues

Severity | File | Summary
🟡 Suggestion | tests/operators/test_deepgemm_precision.py:476 | precision output is commented out; one_invoke no longer prints any precision result, at odds with the file name precision
🟡 Suggestion | tests/operators/test_deepgemm_precision.py:471 | only the last timing is kept via [-1:] instead of a mean; timing stability is not guaranteed
❓ Question | tests/operators/test_flashmla_precision.py:36 | kv_len reduced from 1024×128 to 1024×8; bandwidth measurement may not be representative

📝 PR convention check

The PR title uses the unofficial Tag [unitest] (not in the official Tag list); all PR description sections are empty, containing only template placeholder comments, which does not meet the template requirements.

Suggested title (copy-paste ready):

  • [CI] Improve CUDA Event timing precision in deepgemm and flashmla unit tests

Suggested PR description (copy-paste ready):

## Motivation

Improve the operator timing in `test_deepgemm_precision.py` and `test_flashmla_precision.py`: replace the original fixed loop with precise CUDA Event timing, print TFLOPS / bandwidth metrics, and remove obsolete commented-out profiler code.

## Modifications

- `tests/operators/test_deepgemm_precision.py`:
  - Add `import numpy as np`
  - Time `fp8_gemm_nt` precisely with `paddle.device.cuda.Event`
  - Change warmup to an inner 100-iteration loop to ensure the GPU is fully exercised
  - Add TFLOPS computation and output
  - Add calls for several M/N/K configurations to `test_main` (4096×4096×4096, 4096×2048×7168, 4096×65536×1536)
  - Remove obsolete commented-out profiler code

- `tests/operators/test_flashmla_precision.py`:
  - Add `import numpy as np`
  - Reduce `kv_len` from `1024 * 128` to `1024 * 8`
  - Time `mla_blackwell` precisely with `paddle.device.cuda.Event`
  - Add memory-bandwidth computation and output (TB/s)
  - Remove obsolete commented-out profiler code

## Usage or Command

N/A

## Accuracy Tests

N/A

## Checklist

- [ ] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Overall assessment

The approach is sound; CUDA Event timing is more precise. Recommend restoring the precision output in test_deepgemm_precision.py and completing the PR description sections.
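The TFLOPS figure the updated tests print follows from the standard 2·M·N·K FLOP count for a GEMM. A minimal sketch of the arithmetic (formula only, with a made-up elapsed time; this is not the PR's exact code):

```python
def gemm_tflops(m, n, k, elapsed_ms):
    """TFLOPS for an (m x k) @ (k x n) matmul timed at elapsed_ms milliseconds."""
    flops = 2 * m * n * k                     # one multiply + one add per output MAC
    return flops / (elapsed_ms * 1e-3) / 1e12


# e.g. the 4096x4096x4096 case finishing in a hypothetical 0.5 ms:
print(round(gemm_tflops(4096, 4096, 4096, 0.5), 1))  # prints 274.9
```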

print(total_time)
print(flops)

# print(baseline_out - deepgemm_output)


🟡 Suggestion: precision output fully commented out

The file is named test_deepgemm_precision.py, but this change comments out print(baseline_out - deepgemm_output) (the original assert was already commented out), so one_invoke no longer produces any precision output, at odds with the file name precision.

Suggest at least keeping the print for precision observability:

print(baseline_out - deepgemm_output)  # keep precision observable
# assert (baseline_out - deepgemm_output).abs().max().item() < 0.1

print(baseline_out - deepgemm_output)
end_events[i].record()

total_time = np.array([round(s.elapsed_time(e), 10) for s, e in zip(start_events, end_events)])[-1:]


🟡 Suggestion: only the last timing is kept, with no statistical stability

[-1:] keeps only the last iteration's time; if that iteration happens to hit scheduling jitter, the result is unreliable. Suggest taking the median or mean (excluding the first, warmup-affected sample):

total_time = float(np.median(np.array([round(s.elapsed_time(e), 10) for s, e in zip(start_events, end_events)])[1:]))
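A small numeric illustration of why `[-1:]` is fragile (the timing values below are made up):

```python
import statistics

# Simulated per-iteration kernel times in ms: the first sample is inflated
# by warmup effects and the last one hits scheduling jitter.
timings = [5.20, 1.01, 0.99, 1.00, 1.02, 3.70]

last_only = timings[-1:][0]              # what [-1:] reports: 3.7 (jittered)
robust = statistics.median(timings[1:])  # drop warmup, take the median

print(last_only, robust)  # prints 3.7 1.01
```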


bsz = 128
kv_len = 1024 * 128
kv_len = 1024 * 8


❓ Question: the large kv_len reduction may hurt bandwidth-measurement representativeness

kv_len is reduced from 1024 * 128 (131072) to 1024 * 8 (8192), a 16× cut. With short sequences the KV working set is small, L2 cache hit rates rise, and the measured bandwidth may read high, so it poorly represents the real bandwidth bottleneck of long-sequence serving. If the goal is only to shorten CI runtime, add a code comment explaining the chosen value.
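The working-set arithmetic behind this concern can be sketched; the head count, head dimension, and dtype size below are illustrative assumptions, not values taken from the PR:

```python
def kv_cache_bytes(bsz, kv_len, kv_heads, head_dim, dtype_bytes=2):
    """Approximate bytes of KV cache read per decode step."""
    return bsz * kv_len * kv_heads * head_dim * dtype_bytes


# Hypothetical MLA-like layout: one latent KV head of dim 512, 2-byte dtype.
long_seq = kv_cache_bytes(128, 1024 * 128, 1, 512)
short_seq = kv_cache_bytes(128, 1024 * 8, 1, 512)

print(long_seq // 2**30, "GiB vs", short_seq // 2**30, "GiB")  # prints "16 GiB vs 1 GiB"
```

The 16× smaller working set is far easier to keep resident in cache, which is why the measured bandwidth can look better than a long-sequence workload would achieve.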
