fix(dist): preserve new_group options across reloadable group reload#2095
Open
EazyReal wants to merge 3 commits into
Open
fix(dist): preserve new_group options across reloadable group reload#2095EazyReal wants to merge 3 commits into
EazyReal wants to merge 3 commits into
Conversation
1d3971f to
b0b1e1e
Compare
b0b1e1e to
40249e6
Compare
40249e6 to
3b939c7
Compare
Contributor
Author
|
@zhuzilin could you review this one? Reloadable distributed groups were recreating with default new_group options after reload; this preserves the original group options so timeout/backend semantics survive reload. |
# Conflicts: # .github/workflows/pr-test.yml
# Conflicts: # .github/workflows/pr-test.yml
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changed
ReloadableProcessGroupnow captures the original positional args and kwargs passed totorch.distributed.new_groupand replays them verbatim inreload_process_groups(), instead of recreating each group with onlyold_new_group(ranks=..., backend="nccl").The gloo-skip and rank-extraction logic in the
new_groupmonkeypatch is factored into_is_gloo_group()and_get_group_ranks()helpers (behavior-preserving; also removes the operator-precedence ambiguity in the olda and b or c and dgloo check).Adds a CPU regression test (
test_reload_process_groups_preserves_new_group_timeout) that drives a create -> destroy -> reload cycle and assertstimeoutandpg_optionssurvive the reload, and registers the test file in the cpu-unittest CI list.Why
On offload/reload,
reload_process_groups()recreated each NCCL group with onlyranksandbackend="nccl", dropping the originaltimeout,pg_options, and any other construction options. A job that launches with a longer distributed timeout would create its initial NCCL groups correctly but silently fall back to PyTorch's default timeout after a reload, so long-running actor train/update collectives could trip the NCCL watchdog even though the operator configured a larger timeout. The reload lifecycle is owned by this wrapper, so the wrapper is the only place that can preserve and replay the originalnew_groupcontract.Gloo and single-rank groups keep their existing bypass (returned directly, not wrapped/registered).
Validation
CPU-only, no GPU required:
pytest tests/test_reloadable_process_group_memory_check.py -q-- the new test fakesdist.new_group/old_new_group, runs new_group -> destroy_process_groups -> reload_process_groups, and asserts the reloaded inner group is recreated with the sameranks,backend,timeout, andpg_optionsobjects. It fails against the pre-fix code (which hard-codesranks=..., backend="nccl")..github/workflows/pr-test.yml.j2's cpu-unittest job so it runs in CI.