fix: resolve GPU memory leak in pipeline parallel training #148
Merged
kilinchange merged 1 commit into master on Apr 24, 2026
Conversation
The author force-pushed from cfcc049 to 96341ef.
kilinchange approved these changes on Apr 24, 2026.
Cause:

The receive tensors created in ReceiveFromPrev form a reference cycle through the autograd graph:

recv_tensors → grad_fn(IRecv) → IRecv.next_functions → AccumulateGrad.tensor_ → recv_tensors

Normally the autograd graph is a DAG, and after backward completes the function objects are destroyed in a cascade. The cycle, however, keeps the reference count of every object on it from ever reaching 0, so the whole graph can never be released. Each training step leaves one more closed loop behind, and used GPU memory grows continuously.
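A minimal sketch of how such a cycle can arise, using hypothetical shared_ptr-based stand-ins for the framework's types (Tensor, Function, AccumulateGrad, the is_leaf flag, and the Apply signature are all assumptions modeled on the description above, not the project's actual declarations):

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical autograd node base: edges to the next functions to run
// during backward.
struct Function {
    std::vector<std::shared_ptr<Function>> next_functions;
    virtual ~Function() = default;
};

struct Tensor {
    bool is_leaf = true;
    std::shared_ptr<Function> grad_fn;  // producer node, if any
    void set_is_leaf(bool v) { is_leaf = v; }
};

// Leaf tensors get an AccumulateGrad node that must hold the tensor
// so backward can write the gradient into it.
struct AccumulateGrad : Function {
    std::shared_ptr<Tensor> tensor_;
    explicit AccumulateGrad(std::shared_ptr<Tensor> t) : tensor_(std::move(t)) {}
};

struct IRecv : Function {
    // Sketch of the problematic path: the output both records IRecv as its
    // grad_fn and, being a leaf, gets an AccumulateGrad pointing back at it.
    static void Apply(const std::shared_ptr<Tensor>& recv_tensor,
                      const std::shared_ptr<IRecv>& self) {
        recv_tensor->grad_fn = self;  // recv_tensors -> grad_fn(IRecv)
        if (recv_tensor->is_leaf) {
            // IRecv.next_functions -> AccumulateGrad.tensor_ -> recv_tensors
            self->next_functions.push_back(
                std::make_shared<AccumulateGrad>(recv_tensor));
        }
    }
};

int main() {
    auto t = std::make_shared<Tensor>();
    auto fn = std::make_shared<IRecv>();
    IRecv::Apply(t, fn);

    std::weak_ptr<Tensor> watch = t;
    t.reset();
    fn.reset();
    // The three objects keep each other alive even though no external
    // reference remains: one such loop leaks per training step.
    assert(!watch.expired());
}
```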
Fix:

When creating the tensors used to receive communication data between PP chunks, mark them as non-leaf via set_is_leaf(false). IRecv::Apply then no longer creates an AccumulateGrad node for them; the cycle is cut at its source, and the graph can be destroyed normally.
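A sketch of the fix under the same assumed types as above (the body of ReceiveFromPrev is illustrative, not the actual source):

```cpp
// Hypothetical ReceiveFromPrev: the tensor receiving activations from the
// previous pipeline stage is produced by a communication op, not created by
// the user, so it should not be an autograd leaf.
std::shared_ptr<Tensor> ReceiveFromPrev(/* ... */) {
    auto recv_tensor = std::make_shared<Tensor>();
    recv_tensor->set_is_leaf(false);  // cut the cycle at its source

    auto recv_fn = std::make_shared<IRecv>();
    IRecv::Apply(recv_tensor, recv_fn);
    // With is_leaf == false, Apply skips the AccumulateGrad, so the only
    // strong chain is recv_tensor -> IRecv. Once backward finishes and
    // recv_tensor is dropped, the graph is freed as an ordinary DAG.
    return recv_tensor;
}
```

This also matches the usual autograd convention (as in PyTorch) that AccumulateGrad nodes exist only for leaf tensors; an intermediate tensor produced by an operation carries a grad_fn instead.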