
fix: resolve GPU memory leak in pipeline parallel training#148

Merged
kilinchange merged 1 commit into master from fix/PP_memory_leak
Apr 24, 2026

Conversation

@JYMiracle305 (Contributor) commented Apr 23, 2026

Cause:

The receive tensors created in `ReceiveFromPrev` form an autograd reference cycle:
recv_tensors → grad_fn(IRecv) → IRecv.next_functions → AccumulateGrad.tensor_ → recv_tensors

  1. `ReceiveFromPrev` creates the tensor with the default `is_leaf=true`.
  2. When `IRecv::Apply` is called, because `is_leaf=true` the autograd engine creates an `AccumulateGrad` node for the tensor and hooks it into `IRecv`'s `next_functions_`.
  3. At the same time, `IRecv::Apply` sets the tensor's `grad_fn_` to the `IRecv` node itself.
  4. This closes the cycle: the tensor points to `IRecv` via `grad_fn_`, `IRecv` points to `AccumulateGrad` via `next_functions_`, and `AccumulateGrad` points back to the tensor via `tensor_`.

Normally the autograd graph is a DAG, so once backward finishes the function objects are destroyed in a cascade. With the cycle present, however, the reference counts of the objects inside it can never drop to 0, so the whole graph is never released. Every training step leaves one more cycle behind, and used GPU memory keeps growing.

Fix:

When creating the tensors used to receive communication data between PP chunks, mark them as non-leaf via `set_is_leaf(false)`. `IRecv::Apply` then no longer creates an `AccumulateGrad` node, the cycle is cut off at its source, and the graph can be destroyed normally.

@JYMiracle305 JYMiracle305 changed the title from "fix: resolve GPU memory leak in pipeline parallel training" to "[WIP] fix: resolve GPU memory leak in pipeline parallel training" Apr 23, 2026
@JYMiracle305 (Contributor, Author) commented Apr 23, 2026

Tested in a PP scenario; after the change, peak used memory no longer grows step by step:

[screenshots: peak used memory over training steps]

@JYMiracle305 JYMiracle305 changed the title from "[WIP] fix: resolve GPU memory leak in pipeline parallel training" to "fix: resolve GPU memory leak in pipeline parallel training" Apr 23, 2026
@JYMiracle305 JYMiracle305 requested review from Chamberlain0w0 and kilinchange and removed request for kilinchange April 23, 2026 09:45
@JYMiracle305 (Contributor, Author) commented:

[screenshot]

@kilinchange kilinchange merged commit b594867 into master Apr 24, 2026
2 checks passed
@kilinchange kilinchange deleted the fix/PP_memory_leak branch April 24, 2026 09:35