[WIP] feat: Support ZeRO-2 based on DistributedOptimizer #110
Open
Chamberlain0w0 wants to merge 3 commits into master from
This PR implements the ZeRO-2 gradient-sharding memory optimization on top of the framework's existing DistOpt (DistributedOptimizer) infrastructure.
User-interface changes: a new zero_stage gflag is added, so that the ZeRO level can be specified alongside --use_distributed_optimizer (ZeRO-3 is currently a placeholder); the zero_stage value is also stored as a member variable in the DDPConfig class. A minimal sketch of the wiring follows.
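Below is a hedged sketch of what the flag and config plumbing could look like. Only zero_stage, use_distributed_optimizer, and DDPConfig come from this PR; the field layout, MakeDDPConfigFromFlags, and the flag defaults are assumptions, and --use_distributed_optimizer is defined here only to keep the sketch self-contained.

```cpp
#include <gflags/gflags.h>

// Assumed to already exist in the framework; repeated here so the
// sketch compiles on its own.
DEFINE_bool(use_distributed_optimizer, false,
            "Enable the DistributedOptimizer (DistOpt).");
// New flag introduced by this PR.
DEFINE_int32(zero_stage, 1,
             "ZeRO stage used together with --use_distributed_optimizer "
             "(2 = gradient sharding; 3 is currently a placeholder).");

struct DDPConfig {
  bool use_distributed_optimizer = false;
  int zero_stage = 1;  // new member carrying the chosen ZeRO stage
};

// Hypothetical helper translating parsed flags into the config object.
DDPConfig MakeDDPConfigFromFlags() {
  DDPConfig config;
  config.use_distributed_optimizer = FLAGS_use_distributed_optimizer;
  config.zero_stage = FLAGS_zero_stage;
  return config;
}
```

Implementation changes: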
a. Core logic: the heart of ZeRO-2 is to also shard the gradients of the model parameters across the data-parallel (dp) group, with each rank keeping only the slice it is responsible for. Since the original DistOpt implementation relies on one large, contiguous one-dimensional ParamAndGradBuffer, supporting ZeRO-2 means not constructing the full-size grad_buffer at initialization and instead constructing only a shard-sized grad_buffer per rank (see the buffer sketch after this list).
b. The grads are constructed when the ParamAndGradBucketGroup is created (see the ParamAndGradBucketGroup constructor): each group builds the shard grad buffer for its own rank, stored in the member variable grad_shard_buffer_list_ (a list with one entry per bucket; by default each group has exactly one bucket, so the list has size() == 1).
c. During the autograd backward pass, the full grad is allocated on demand as temporary memory and freed once it has been consumed. Because tensor->grad was previously changed to lazy initialization, and ZeroGrad(set_to_none=true) may run every iteration, the full grad is created inside AccumulateGrad::Backward.
d. Details supplementing c: to avoid polluting AccumulateGrad::Backward with a pile of zero2-specific if/else branches, a pre-accumulate-grad bypass function is defined that hijacks the original AccumulateGrad::Backward and rewrites its flow: create the full grad when needed; redirect every operation that previously touched tensor->grad to the full grad; then finish the gradient accumulation as usual (see the bypass sketch after this list).
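The following sketch illustrates the shard-sized allocation from a/b. Only ParamAndGradBucketGroup and grad_shard_buffer_list_ are names from this PR; GradShardBuffer, the constructor signature, and the float-vector storage are simplified stand-ins for the real buffer types.

```cpp
#include <cstddef>
#include <vector>

// Simplified stand-in for a contiguous shard-sized gradient buffer.
struct GradShardBuffer {
  std::vector<float> storage;
};

class ParamAndGradBucketGroup {
 public:
  ParamAndGradBucketGroup(const std::vector<size_t>& bucket_numels,
                          int dp_world_size) {
    for (size_t numel : bucket_numels) {
      // ZeRO-2: allocate only this rank's 1/dp_world_size slice of each
      // bucket instead of the full bucket-sized grad buffer. Buckets are
      // assumed to be padded so numel divides evenly.
      size_t shard_numel = numel / dp_world_size;
      grad_shard_buffer_list_.push_back(
          GradShardBuffer{std::vector<float>(shard_numel)});
    }
    // By default each group holds exactly one bucket, so this list
    // typically has size() == 1.
  }

 private:
  std::vector<GradShardBuffer> grad_shard_buffer_list_;
};
```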
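And here is a hedged sketch of the bypass flow from c/d. The Tensor type and all helpers are simplified stand-ins (the real framework allocates on the CUDA device and performs a collective reduce-scatter); only the overall create/redirect/free sequence reflects the PR.

```cpp
#include <memory>
#include <vector>

// Simplified stand-in for the framework's tensor type.
struct Tensor {
  std::vector<float> data;
};
using TensorPtr = std::shared_ptr<Tensor>;

// Allocate a temporary full-size grad matching the parameter (stand-in
// for the on-demand device allocation described in c).
TensorPtr AllocateFullGradLike(const TensorPtr& param) {
  return std::make_shared<Tensor>(
      Tensor{std::vector<float>(param->data.size(), 0.f)});
}

// Accumulate the incoming autograd gradient into the full grad; these are
// the operations that previously targeted tensor->grad.
void AccumulateInto(const TensorPtr& dst, const TensorPtr& src) {
  for (size_t i = 0; i < dst->data.size(); ++i) dst->data[i] += src->data[i];
}

// Stand-in for the reduce-scatter that moves this rank's slice of the
// full grad into its shard grad buffer.
void ReduceScatterIntoShard(const TensorPtr& /*full_grad*/) {}

// Bypass body used in place of AccumulateGrad::Backward when
// zero_stage >= 2, keeping zero2 branches out of the original function.
void PreAccumulateGradBypass(const TensorPtr& param,
                             const TensorPtr& incoming_grad) {
  TensorPtr full_grad = AllocateFullGradLike(param);  // created on demand
  AccumulateInto(full_grad, incoming_grad);
  ReduceScatterIntoShard(full_grad);
  // full_grad goes out of scope here and is freed promptly, so it only
  // lives as activation-like memory during the backward pass.
}
```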
Limitations: the current ZeRO-2 implementation effectively changes the role of the full grad from "permanently resident in device memory" to "activation-like memory that is allocated on demand during the autograd backward pass and freed promptly", which is where the memory savings come from. In the worst case, however, the savings can still be modest: if computation/communication is slow, the cudaFree of the full grad is queued far behind on the stream and stays unreleased for a long time, so the full grad occupies memory like a long-lived activation. The current mitigation is to shrink the bucket size appropriately, which makes the reduce-scatter over the full grad finer-grained and closer to the allocate-on-use, free-immediately goal.