[unitest] small change in test_deepgemm_precision.py #7834
zhoutianzi666 wants to merge 4 commits into develop
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             develop    #7834   +/-   ##
==========================================
  Coverage           ?   63.27%
==========================================
  Files              ?      462
  Lines              ?    64276
  Branches           ?     9851
==========================================
  Hits               ?    40670
  Misses             ?    20841
  Partials           ?     2765
==========================================
```

Flags with carried forward coverage won't be shown.
CI report generated from the code below (updated every 30 minutes):
1 Task overview: all Required tasks passed ✅; 3 Optional tasks failed (does not block merging).
2 Task status summary
2.1 Required tasks: 8/8 passed
2.2 Optional tasks: 28/31 passed
3 Failure details (Required only): none
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-15 22:20:43
📋 Review Summary
PR overview: improves the performance timing in test_deepgemm_precision.py and test_flashmla_precision.py by replacing the simple loop with CUDA Events, adding warmup logic, and printing TFLOPS/bandwidth metrics.
Scope of change: tests/operators/
Impact tag: [CI]
Issues

| Level | File | Summary |
|---|---|---|
| 📝 PR convention | — | Title tag [unitest] is not an official tag (typo); should be [CI] |
| 📝 PR convention | — | All PR description sections are empty, containing only template placeholders |
| ❓ Question | tests/operators/test_deepgemm_precision.py:454 | Warmup loop runs 100 times vs. 10 for flashmla; OOM risk |
| ❓ Question | tests/operators/test_deepgemm_precision.py:471 | [-1:] keeps only the last measurement; the minimum or mean is usually preferred |
| ❓ Question | tests/operators/test_flashmla_precision.py:37 | kv_len cut sharply from 128K to 8K; bandwidth results may not be representative |
📝 PR convention check
The title tag [unitest] is misspelled (should be unittest) and is not in the official tag list; since this change only touches test files, [CI] should be used. Every section of the PR description (Motivation / Modifications / Usage or Command / Accuracy Tests) contains only template placeholders with no actual content.
Suggested title (copy directly):
[CI] Improve timing accuracy in test_deepgemm_precision and test_flashmla_precision
Suggested PR description (copy directly; it must reproduce the full structure of the checklist §D2 template):
## Motivation
Improve the performance timing in `test_deepgemm_precision.py` and `test_flashmla_precision.py` by replacing the original simple loop with CUDA Events for higher timing precision, and add warmup logic to reduce the impact of the GPU's first execution.
## Modifications
- `tests/operators/test_deepgemm_precision.py`: time with `paddle.device.cuda.Event`, add a 100-iteration warmup loop, compute and print TFLOPS; uncomment the three `one_invoke` calls in `test_main`; remove the commented-out paddle profiler code
- `tests/operators/test_flashmla_precision.py`: time with `paddle.device.cuda.Event`, add a 10-iteration warmup loop, compute and print bandwidth (TB/s); change `kv_len` from `1024 * 128` to `1024 * 8`; remove the commented-out paddle profiler code
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [ ] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Overall assessment
The overall direction is sound: CUDA Event timing is more accurate than simple loop timing. Fix the title tag, fill in the PR description, and confirm that the warmup-count difference and the kv_len reduction are justified.
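For reference, the TFLOPS number such a timing loop prints follows directly from the GEMM shape. A minimal sketch in plain Python (no Paddle dependency); the shape is one of those listed in the suggested Modifications, and the 1 ms timing is purely hypothetical:

```python
def gemm_tflops(m: int, n: int, k: int, elapsed_ms: float) -> float:
    """TFLOPS of an (m x k) @ (k x n) GEMM: 2*m*n*k FLOPs (one multiply
    and one add per multiply-accumulate) over elapsed_ms milliseconds."""
    return (2 * m * n * k) / (elapsed_ms * 1e-3) / 1e12

# One of the shapes added to test_main, with a hypothetical 1 ms per call.
print(round(gemm_tflops(4096, 4096, 4096, 1.0), 2))  # 137.44
```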
```python
for i in range(test_loops):
    # This line is placed here so the event timing is more accurate!
    # Awesome!
    for j in range(100):
```
```python
print(baseline_out - deepgemm_output)
end_events[i].record()

total_time = np.array([round(s.elapsed_time(e), 10) for s, e in zip(start_events, end_events)])[-1:]
```
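The event-pair structure in this hunk (record before and after each measured iteration, after warmup) can be mirrored on CPU with `time.perf_counter`. This is a hedged analogue only; the test itself uses `paddle.device.cuda.Event`, which this sketch does not depend on:

```python
import time

def time_loops(fn, test_loops: int = 5, warmup: int = 3):
    """Warm up first, then keep one (start, end) timing pair per measured
    iteration, mirroring the start_events/end_events lists in the diff."""
    for _ in range(warmup):              # warmup: absorb first-run overheads
        fn()
    times_ms = []
    for _ in range(test_loops):
        start = time.perf_counter()      # analogue of start_events[i].record()
        fn()
        end = time.perf_counter()        # analogue of end_events[i].record()
        times_ms.append((end - start) * 1e3)
    return times_ms

times = time_loops(lambda: sum(range(10_000)))
print(len(times))  # 5: one measurement per timed iteration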
```diff
  bsz = 128
- kv_len = 1024 * 128
+ kv_len = 1024 * 8
  page_size = 64
```
CI report generated from the code below (updated every 30 minutes):
1 Task overview: ❌ 1 Required task failed and needs attention; the remaining Required tasks passed.
2 Task status summary
2.1 Required tasks: 9/10 passed
2.2 Optional tasks: 29/32 passed
3 Failure details (Required only): Pre Commit — code style (confidence: high) Pre Commit
Root cause details: Key log: Suggested fix:
Fix summary: delete or rename the unused variable at test_flashmla_precision.py:60. Related change: files modified by this PR. Link: view log
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-16 12:09:26
📋 Review Summary
PR overview: improves the timing in the deepgemm and flashmla operator unit tests, replacing the fixed loop with precise CUDA Event timing and adding TFLOPS / bandwidth metric output
Scope of change: tests/operators/
Impact tag: [CI]
Issues

| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | tests/operators/test_deepgemm_precision.py:476 | Precision output is commented out; one_invoke no longer prints any precision result, contradicting the "precision" in the filename |
| 🟡 Suggestion | tests/operators/test_deepgemm_precision.py:471 | Only the last timing is kept via [-1:] rather than a mean; timing stability is not guaranteed |
| ❓ Question | tests/operators/test_flashmla_precision.py:36 | kv_len cut from 1024×128 to 1024×8; representativeness of the bandwidth measurement is questionable |
📝 PR convention check
The PR title uses the unofficial tag [unitest] (not in the official tag list); every PR description section is empty, keeping only the template comment placeholders, which violates the template requirements.
Suggested title (copy directly):
[CI] Improve CUDA Event timing precision in deepgemm and flashmla unit tests
Suggested PR description (copy directly):
## Motivation
Improve operator timing in `test_deepgemm_precision.py` and `test_flashmla_precision.py` by replacing the fixed loop with precise CUDA Event timing, output TFLOPS / bandwidth metrics, and delete the obsolete commented-out profiler code.
## Modifications
- `tests/operators/test_deepgemm_precision.py`:
  - Add `import numpy as np`
  - Time `fp8_gemm_nt` precisely with `paddle.device.cuda.Event`
  - Change warmup to an inner 100-iteration loop to ensure GPU memory is fully exercised
  - Add TFLOPS computation and output
  - Add several M/N/K configurations to `test_main` (4096×4096×4096, 4096×2048×7168, 4096×65536×1536)
  - Delete obsolete commented-out profiler code
- `tests/operators/test_flashmla_precision.py`:
  - Add `import numpy as np`
  - Reduce `kv_len` from `1024 * 128` to `1024 * 8`
  - Time `mla_blackwell` precisely with `paddle.device.cuda.Event`
  - Add memory bandwidth computation and output (TB/s)
  - Delete obsolete commented-out profiler code
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [ ] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Overall assessment
The approach is sound; CUDA Event timing is more precise. Restore the precision output in test_deepgemm_precision.py and fill in every section of the PR description.
```python
print(total_time)
print(flops)

# print(baseline_out - deepgemm_output)
```
🟡 Suggestion: precision output is fully commented out
The file is named test_deepgemm_precision.py, yet this change comments out print(baseline_out - deepgemm_output) (the original assert was already commented out), so one_invoke no longer produces any precision output, contradicting the "precision" in the filename.
Suggestion: at least keep the print to preserve precision observability:

```python
print(baseline_out - deepgemm_output)  # keep precision observable
# assert (baseline_out - deepgemm_output).abs().max().item() < 0.1
```

```python
print(baseline_out - deepgemm_output)
end_events[i].record()

total_time = np.array([round(s.elapsed_time(e), 10) for s, e in zip(start_events, end_events)])[-1:]
```
🟡 Suggestion: only the last timing value is kept, lacking statistical stability
[-1:] keeps only the last iteration's time; if that iteration happens to hit scheduling jitter, the result is unreliable. Take the median or mean instead (excluding the first warmup iteration):

```python
total_time = float(np.median(np.array([round(s.elapsed_time(e), 10) for s, e in zip(start_events, end_events)])[1:]))
```
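The reviewer's point can be checked numerically: a single jittered iteration dominates the last-only slice, while the median stays stable. A small sketch with synthetic timings (the values are made up, not real measurements):

```python
import numpy as np

# Synthetic per-iteration timings in ms: a cold first run and a jittered last one.
timings = np.array([5.2, 1.01, 0.99, 1.00, 1.02, 3.7])

last_only = timings[-1:]                   # what the PR currently keeps
median_ms = float(np.median(timings[1:]))  # suggested: drop warmup, take the median

print(last_only)   # [3.7] -- dominated by the jittered final iteration
print(median_ms)   # 1.01 -- robust central value
```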
```diff
  bsz = 128
- kv_len = 1024 * 128
+ kv_len = 1024 * 8
```
❓ Question: the large kv_len reduction may hurt the representativeness of the bandwidth measurement
kv_len shrinks from 1024 * 128 (131072) to 1024 * 8 (8192), a 16× reduction. With short sequences the KV data volume is small and the L2 cache hit rate rises, so the measured bandwidth may be inflated and fail to represent the real bandwidth bottleneck of long-sequence serving. If the goal is only to shorten CI runtime, add a code comment explaining why this value was chosen.
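The concern can be quantified: KV-cache traffic scales linearly with kv_len, so the 16× cut shrinks the streamed working set by the same factor. A rough sketch assuming a generic K+V cache layout; the num_kv_heads=1 and head_dim=576 values are hypothetical, not taken from the test:

```python
def kv_bytes(bsz: int, kv_len: int, num_kv_heads: int, head_dim: int,
             dtype_bytes: int = 2) -> int:
    """Approximate KV-cache bytes read per decode step (K and V, hence the 2*)."""
    return 2 * bsz * kv_len * num_kv_heads * head_dim * dtype_bytes

old = kv_bytes(128, 1024 * 128, 1, 576)  # kv_len = 131072
new = kv_bytes(128, 1024 * 8, 1, 576)    # kv_len = 8192
print(old // new)  # 16: a 16x smaller working set, so L2 reuse can inflate TB/s
```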
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- [ ] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- [ ] Format your code, run `pre-commit` before commit.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.