fix(dist): preserve new_group options across reloadable group reload #2095

Open
EazyReal wants to merge 3 commits into THUDM:main from EazyReal:codex/upstream-rpg-timeout-preserve-new-group

@EazyReal EazyReal commented Jun 17, 2026

Contributor

What changed

ReloadableProcessGroup now captures the original positional args and kwargs passed to torch.distributed.new_group and replays them verbatim in reload_process_groups(), instead of recreating each group with only old_new_group(ranks=..., backend="nccl").
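
To illustrate the capture-and-replay idea, here is a minimal sketch that does not use PyTorch at all. `RecordingGroupFactory` and `fake_new_group` are hypothetical names invented for this example and are not the code in this PR; the point is only that storing `(args, kwargs)` at creation time lets a later reload call the original factory with exactly the same options.

```python
from datetime import timedelta


class RecordingGroupFactory:
    """Hypothetical stand-in for the wrapper: records each creation call and replays it."""

    def __init__(self, new_group_fn):
        self._new_group_fn = new_group_fn
        self._calls = []   # one (args, kwargs) pair per created group
        self.groups = []   # live inner groups, None while destroyed

    def new_group(self, *args, **kwargs):
        group = self._new_group_fn(*args, **kwargs)
        # Keep the call exactly as the caller made it; copy kwargs so later
        # mutation of the caller's dict cannot change what gets replayed.
        self._calls.append((args, dict(kwargs)))
        self.groups.append(group)
        return group

    def destroy(self):
        # Drop the inner groups but keep the recorded calls.
        self.groups = [None] * len(self._calls)

    def reload(self):
        # Replay every recorded call verbatim.
        self.groups = [self._new_group_fn(*a, **k) for a, k in self._calls]


def fake_new_group(*args, **kwargs):
    """Fake factory that just echoes what it was called with."""
    return {"args": args, "kwargs": kwargs}


factory = RecordingGroupFactory(fake_new_group)
factory.new_group(ranks=[0, 1], backend="nccl", timeout=timedelta(minutes=30))
factory.destroy()
factory.reload()
reloaded = factory.groups[0]
```

After `reload()`, `reloaded["kwargs"]` still contains the 30-minute timeout, because the sketch never rebuilds the call from a reduced set of arguments.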

The gloo-skip and rank-extraction logic in the new_group monkeypatch is factored into _is_gloo_group() and _get_group_ranks() helpers. This is behavior-preserving, and it also removes the operator-precedence ambiguity in the old gloo check, which had the form "a and b or c and d".
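
For readers who have not opened the diff, helpers of this kind could look roughly like the sketch below. The bodies are guesses for illustration, not the PR's implementation: `get_call_arg`, `get_group_ranks_sketch`, and `is_gloo_group_sketch` are hypothetical names, the positional order (ranks first, backend third) is assumed from the documented `torch.distributed.new_group` parameter order, and the gloo test assumes the backend is passed as a plain string.

```python
def get_call_arg(args, kwargs, position, name, default=None):
    """Return an argument whether it was passed by keyword or by position."""
    if name in kwargs:
        return kwargs[name]
    if len(args) > position:
        return args[position]
    return default


def get_group_ranks_sketch(args, kwargs):
    # Assumption: ranks is the first positional parameter of new_group.
    return get_call_arg(args, kwargs, 0, "ranks")


def is_gloo_group_sketch(args, kwargs):
    # Assumption: backend is the third positional parameter of new_group
    # and is given as a string such as "gloo" or "nccl".
    backend = get_call_arg(args, kwargs, 2, "backend")
    return isinstance(backend, str) and backend.lower() == "gloo"


# Why a chained check is easy to misread: Python binds `and` tighter than `or`,
# so `a and b or c and d` means `(a and b) or (c and d)`.
a, b, c, d = False, True, True, True
chained = a and b or c and d
explicit = (a and b) or (c and d)
```

Splitting the check into one named predicate per question avoids relying on the reader remembering that precedence rule.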

Adds a CPU regression test (test_reload_process_groups_preserves_new_group_timeout) that drives a create -> destroy -> reload cycle and asserts timeout and pg_options survive the reload, and registers the test file in the cpu-unittest CI list.

Why

On offload/reload, reload_process_groups() recreated each NCCL group with only ranks and backend="nccl", dropping the original timeout, pg_options, and any other construction options. A job that launches with a longer distributed timeout would create its initial NCCL groups correctly but silently fall back to PyTorch's default timeout after a reload, so long-running actor train/update collectives could trip the NCCL watchdog even though the operator configured a larger timeout. The reload lifecycle is owned by this wrapper, so the wrapper is the only place that can preserve and replay the original new_group contract.
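
The silent fallback can be reproduced with a fake factory. In this sketch `fake_new_group` and `ASSUMED_DEFAULT_TIMEOUT` are hypothetical; the default value is a placeholder chosen for the example and is not a statement about PyTorch's actual default.

```python
from datetime import timedelta

# Placeholder default for the example only.
ASSUMED_DEFAULT_TIMEOUT = timedelta(minutes=10)


def fake_new_group(ranks=None, timeout=None, backend=None, pg_options=None):
    """Fake factory: falls back to a default timeout when none is given."""
    effective = timeout if timeout is not None else ASSUMED_DEFAULT_TIMEOUT
    return {"ranks": ranks, "backend": backend,
            "timeout": effective, "pg_options": pg_options}


configured = timedelta(hours=2)
original_kwargs = {"ranks": [0, 1], "backend": "nccl", "timeout": configured}

# Initial creation honours the operator's configured timeout.
initial = fake_new_group(**original_kwargs)

# Pre-fix reload: only ranks and backend are passed, so timeout is dropped.
reloaded_pre_fix = fake_new_group(ranks=initial["ranks"], backend="nccl")

# Post-fix reload: the original call is replayed, so timeout survives.
reloaded_post_fix = fake_new_group(**original_kwargs)

print("initial: ", initial["timeout"])
print("pre-fix: ", reloaded_pre_fix["timeout"])
print("post-fix:", reloaded_post_fix["timeout"])
```

Nothing raises in the pre-fix path, which is why the regression only shows up later as a watchdog timeout rather than at reload time.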

Gloo and single-rank groups keep their existing bypass (returned directly, not wrapped/registered).

Validation

CPU-only, no GPU required:

  • pytest tests/test_reloadable_process_group_memory_check.py -q -- the new test fakes dist.new_group/old_new_group, runs new_group -> destroy_process_groups -> reload_process_groups, and asserts the reloaded inner group is recreated with the same ranks, backend, timeout, and pg_options objects. It fails against the pre-fix code (which hard-codes ranks=..., backend="nccl").
  • The test file is registered in .github/workflows/pr-test.yml.j2's cpu-unittest job so it runs in CI.
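
The shape of such a test can be sketched with `unittest.mock` alone. `TinyReloadableGroup` below is a hypothetical stand-in written for this example, not the repository's `ReloadableProcessGroup`, and the real test patches `dist.new_group`/`old_new_group` rather than passing a factory in. The sketch shows the create -> destroy -> reload cycle and identity checks on the option objects.

```python
from datetime import timedelta
from unittest import mock


class TinyReloadableGroup:
    """Hypothetical minimal wrapper, used only to exercise the test shape."""

    def __init__(self, create_fn, args, kwargs):
        self._create_fn = create_fn
        self._args, self._kwargs = args, kwargs
        self.inner = create_fn(*args, **kwargs)

    def destroy(self):
        self.inner = None

    def reload(self):
        # Recreate the inner group from the recorded call.
        self.inner = self._create_fn(*self._args, **self._kwargs)


def check_reload_preserves_options():
    # Each call to the fake factory returns a fresh opaque "group".
    fake_old_new_group = mock.Mock(side_effect=lambda *a, **k: object())
    timeout = timedelta(minutes=45)
    pg_options = object()  # opaque sentinel standing in for backend options

    group = TinyReloadableGroup(
        fake_old_new_group,
        (),
        {"ranks": [0, 1], "backend": "nccl",
         "timeout": timeout, "pg_options": pg_options},
    )
    group.destroy()
    group.reload()

    # One call for creation, one for reload.
    assert fake_old_new_group.call_count == 2
    reload_kwargs = fake_old_new_group.call_args.kwargs
    assert reload_kwargs["ranks"] == [0, 1]
    assert reload_kwargs["backend"] == "nccl"
    # Identity, not just equality: the very same objects must be replayed.
    assert reload_kwargs["timeout"] is timeout
    assert reload_kwargs["pg_options"] is pg_options
    return reload_kwargs


reload_kwargs = check_reload_preserves_options()
```

A wrapper that rebuilt the reload call from only `ranks` and `backend` would fail the `timeout` and `pg_options` assertions with a `KeyError`, which matches the PR's note that the test fails against the pre-fix code.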

@EazyReal EazyReal changed the title from "Preserve reloadable process group options" to "fix: preserve new_group options (timeout, pg_options) across reloadable process group reload" Jun 24, 2026
@EazyReal EazyReal force-pushed the codex/upstream-rpg-timeout-preserve-new-group branch from 1d3971f to b0b1e1e Compare June 24, 2026 03:18
@EazyReal EazyReal marked this pull request as ready for review June 24, 2026 03:18
@EazyReal EazyReal changed the title from "fix: preserve new_group options (timeout, pg_options) across reloadable process group reload" to "fix(dist): preserve new_group options across reloadable group reload" Jun 24, 2026
@EazyReal EazyReal force-pushed the codex/upstream-rpg-timeout-preserve-new-group branch from b0b1e1e to 40249e6 Compare June 24, 2026 04:20
@EazyReal EazyReal force-pushed the codex/upstream-rpg-timeout-preserve-new-group branch from 40249e6 to 3b939c7 Compare June 25, 2026 06:56
@EazyReal

Contributor Author

@zhuzilin could you review this one? Reloadable distributed groups were being recreated with default new_group options after reload; this PR preserves the original group options so that timeout/backend semantics survive a reload.

EazyReal added 2 commits June 29, 2026 00:59
# Conflicts:
#	.github/workflows/pr-test.yml
# Conflicts:
#	.github/workflows/pr-test.yml