feat(p2p): add shard-level weight update with automatic broadcast fallback by CalvinXKY · Pull Request #2146 · THUDM/slime

CalvinXKY · 2026-06-29T07:04:24Z

Summary

Add an opt-in P2P shard weight sync path for non-colocate RL training with --update-weight-mode full. Each Megatron TP rank sends its local HF shard to the matching SGLang TP rank via dist.send/recv, avoiding the default all_gather + NCCL broadcast.

If preconditions are not met, slime automatically falls back to NCCL broadcast and logs the reason on rank 0. No breaking changes — default behavior is unchanged without --use-p2p-weight-update.
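The TP-rank pairing described above can be sketched in a few lines. This is only an illustration: the function name `p2p_pairs` and the flat global-rank layout (training TP ranks first, then each rollout engine's TP ranks) are assumptions made for the example, not the layout used by this PR.

```python
from typing import List, Tuple


def p2p_pairs(tp_size: int, num_engines: int) -> List[Tuple[int, int]]:
    """Return (src_global_rank, dst_global_rank) pairs for shard-level sync.

    Hypothetical layout: global ranks [0, tp_size) are the Megatron TP ranks,
    followed by num_engines SGLang engines with tp_size ranks each.
    """
    if tp_size < 1 or num_engines < 1:
        raise ValueError("tp_size and num_engines must be positive")
    pairs = []
    for engine in range(num_engines):
        engine_base = tp_size * (1 + engine)  # first global rank of this engine
        for tp_rank in range(tp_size):
            # Megatron TP rank i sends its local HF shard to SGLang TP rank i.
            pairs.append((tp_rank, engine_base + tp_rank))
    return pairs
```

With `tp_size=4` and one engine, training rank 0 pairs with global rank 4, rank 1 with rank 5, and so on; no rank ever needs another rank's shard, which is why the all_gather step can be skipped.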

Motivation

In non-colocate mode, gathering full weights on the training side and broadcasting to rollout engines dominates update_weights time for large models. P2P shard sync keeps weights sharded end-to-end.
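A back-of-the-envelope estimate shows the size difference between the full weights and one TP shard. The helper below is hypothetical and assumes 2 bytes per parameter (bf16) and that every parameter is split evenly across TP ranks; replicated tensors such as norms make the real saving smaller.

```python
from typing import Tuple


def sync_bytes(num_params: int, tp_size: int, bytes_per_param: int = 2) -> Tuple[int, float]:
    """Return (full_model_bytes, per_rank_shard_bytes) under an even TP split."""
    if tp_size < 1:
        raise ValueError("tp_size must be positive")
    full = num_params * bytes_per_param  # size of the gathered full weights
    shard = full / tp_size               # size of one TP rank's local shard
    return full, shard
```

For an assumed round figure of 8 billion parameters at TP=4, this gives about 16 GB for the gathered weights versus about 4 GB per shard.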

Changes

- New --use-p2p-weight-update flag (requires --update-weight-mode full)
- New UpdateWeightFromDistributedP2P updater, plus Megatron→HF shard conversion for Qwen2/Qwen3 dense models
- Fallback helpers in common.py
- sglang_p2p.patch, applied in the Dockerfile after sglang.patch / sglang-top_p.patch
- Docs: docs/en/advanced/p2p-weight-sync.md, docs/zh/advanced/p2p-weight-sync.md

Usage

python3 train.py \
  --megatron-to-hf-mode bridge \
  --update-weight-mode full \
  --use-p2p-weight-update \
  --tensor-model-parallel-size 4 \
  --rollout-num-gpus-per-engine 4 \
  ...

Preconditions (auto-fallback)

| Condition | Notes |
| --- | --- |
| Non-colocate | Colocate uses `UpdateWeightFromTensor` |
| `--megatron-to-hf-mode bridge` | Raw mode not supported |
| Model support | Qwen2 / Qwen3 dense only (MoE falls back) |
| TP alignment | Megatron TP == SGLang TP |
| SGLang PP == 1 | `sglang_pp_size > 1` falls back (P2P send/recv pairing is TP-only in Phase 1) |

Megatron training PP is not constrained: as long as SGLang PP is 1, the training-side and inference-side PP sizes need not match.
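The precondition table can be read as an ordered series of checks that yields a fallback reason. The sketch below is a minimal illustration of that logic; `SyncConfig`, its field names, and `p2p_fallback_reason` are hypothetical and are not the helpers added to common.py.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SyncConfig:
    # All field names are hypothetical stand-ins for the real slime arguments.
    colocate: bool
    megatron_to_hf_mode: str  # "bridge" or "raw"
    model_family: str         # e.g. "qwen3"
    is_moe: bool
    megatron_tp: int
    sglang_tp: int
    sglang_pp: int = 1


SUPPORTED_DENSE_FAMILIES = {"qwen2", "qwen3"}


def p2p_fallback_reason(cfg: SyncConfig) -> Optional[str]:
    """Return None if P2P can be used, else a human-readable fallback reason."""
    if cfg.colocate:
        return "colocate mode uses UpdateWeightFromTensor"
    if cfg.megatron_to_hf_mode != "bridge":
        return "requires --megatron-to-hf-mode bridge"
    if cfg.is_moe or cfg.model_family not in SUPPORTED_DENSE_FAMILIES:
        return "only Qwen2 / Qwen3 dense models are supported"
    if cfg.megatron_tp != cfg.sglang_tp:
        return f"TP mismatch: Megatron TP={cfg.megatron_tp}, SGLang TP={cfg.sglang_tp}"
    if cfg.sglang_pp != 1:
        return f"sglang_pp_size={cfg.sglang_pp} > 1 is not supported"
    return None
```

Returning the reason as a string, rather than a bare boolean, makes it straightforward to log it once on rank 0 before switching to the broadcast path.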

Manual SGLang patch (for reviewers only)

Once this PR is merged, the patch is applied at Docker build time. To validate on an existing slimerl/slime:latest container (which already has sglang.patch + sglang-top_p.patch) without rebuilding:

# Clone this PR branch inside the container, then:
export SLIME_ROOT=/root/slime
export SGLANG_ROOT=/sgl-workspace/sglang
PATCH="${SLIME_ROOT}/docker/patch/latest/sglang_p2p.patch"

# tp_tensor_counts is only present in io_struct.py once the P2P patch is applied.
if grep -q 'tp_tensor_counts' "${SGLANG_ROOT}/python/sglang/srt/managers/io_struct.py"; then
  echo "SGLang P2P patch already applied"
else
  cd "${SGLANG_ROOT}" || exit 1
  git update-index --refresh
  # Dry run: reports problems before the three-way apply below modifies the tree.
  git apply --check "${PATCH}"
  git apply --3way "${PATCH}"
  # A three-way apply can leave conflict markers behind; treat them as failure.
  if grep -R -n '^<<<<<<< ' python/sglang/srt/managers/io_struct.py \
      python/sglang/srt/managers/tp_worker.py \
      python/sglang/srt/model_executor/model_runner.py; then
    echo "Patch conflict: please resolve before testing"
    exit 1
  fi
  # Only report success if both patched markers are actually present.
  if grep -q 'tp_tensor_counts' python/sglang/srt/managers/io_struct.py \
      && grep -q 'load_format == "presharded"' python/sglang/srt/model_executor/model_runner.py; then
    echo "SGLang P2P patch applied OK"
  else
    echo "SGLang P2P patch verification failed"
    exit 1
  fi
fi

Then run a non-colocate smoke test with --use-p2p-weight-update (see docs).

Testing

Validated on 8×A100, slimerl/slime:latest + manual patch above, Qwen3-4B TP=4:

- End-to-end smoke test completed (exit_code=0)
- Steady-state update_weights ~0.33–0.40s
- sglang_p2p.patch passes git apply --check on top of slime SGLang patches

Test plan

- docker build with ENABLE_SGLANG_PATCH=1 succeeds (includes sglang_p2p.patch)
- Non-colocate Qwen3-4B smoke test with --use-p2p-weight-update
- MoE or --megatron-to-hf-mode raw falls back to broadcast with [P2P] log
- sglang_pp_size > 1 falls back to broadcast with [P2P] log
- Default path unchanged when flag omitted

CalvinXKY · 2026-06-29T07:18:19Z

Benchmark: Qwen3-8B P2P vs NCCL broadcast (non-colocate, TP=4)

Setup

| Item | Value |
| --- | --- |
| Date | 2026-06-26 |
| Hardware | 8× NVIDIA A100 80GB |
| Model | Qwen3-8B |
| Layout | Non-colocate: 4 training GPUs (Megatron TP=4) + 4 rollout GPUs (SGLang TP=4) |
| Workload | GSM8K GRPO, `num-rollout=50`, `rollout-batch-size=32`, `n-samples-per-prompt=8`, `global-batch-size=256`, `rollout-max-response-len=8192` |
| Compared runs | P2P: `--use-p2p-weight-update` · Baseline: default NCCL broadcast (flag omitted) |
| Isolation | Two sequential runs on clean containers (`slime-dev6` → `slime-dev6-cmp`); all other flags/checkpoints/data matched |
| Outcome | Both runs finished with `exit_code=0` |

Logs (local copy): logs_8b_compare/20260626_122703/
(p2p_8b_tp4.{log,err}, baseline_8b_tp4.{log,err}, master.log)

Procedure

1. Apply SGLang P2P support and launch the P2P run with --use-p2p-weight-update.
2. Tear down processes/GPU state and switch to a fresh comparison container.
3. Run the baseline with the same train.py command, without --use-p2p-weight-update.
4. Parse perf/update_weights_time from the train_metric_utils.py logs.
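The parsing step can be sketched as a small regex pass over a log file. The log line shape assumed here (the metric name followed by `:` or `=` and a number) is a guess made for illustration; the actual output of train_metric_utils.py may differ, and `parse_update_weights_times` is a hypothetical helper.

```python
import re
from typing import List

# Assumed log shape: metric name, optional closing quote, ':' or '=', then a float,
# e.g. "'perf/update_weights_time': 0.484". Adjust the pattern to the real format.
_PATTERN = re.compile(
    r"perf/update_weights_time['\"]?\s*[:=]\s*"
    r"([0-9]+(?:\.[0-9]+)?(?:[eE][-+]?[0-9]+)?)"
)


def parse_update_weights_times(log_text: str) -> List[float]:
    """Return every logged perf/update_weights_time value, in log order."""
    return [float(match.group(1)) for match in _PATTERN.finditer(log_text)]
```

Reading a file such as p2p_8b_tp4.log into a string and passing it to this function would yield one value per logged step, ready for the summary statistics.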

Note for reviewers: This benchmark was collected before this PR’s final sglang_p2p.patch landed; the P2P run used an equivalent ad-hoc SGLang patch during development. Reproduction should use the patch in this PR (or a Docker build with ENABLE_SGLANG_PATCH=1).

Results (`perf/update_weights_time`)

Metric source: train_metric_utils.py → perf/update_weights_time.

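| Run | Step 1 (warmup) | Steady-state mean (steps 2–49) | Median | Min | Max | Std dev |
| --- | --- | --- | --- | --- | --- | --- |
| P2P | 3.17 s | 0.484 s | 0.483 s | 0.406 s | 0.733 s | 0.054 s |
| NCCL broadcast | 3.77 s | 0.755 s | 0.751 s | 0.696 s | 0.840 s | 0.034 s |
| Delta (P2P vs. broadcast) | n/a | 1.56× faster (−35.9% sync time) | n/a | n/a | n/a | n/a |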

49 logged rollout steps (steps 1–49); step 1 excluded from steady-state stats due to first-connection warmup.
Speedup is on the weight sync phase only, not end-to-end step time (rollout + train still dominate total step latency).
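The summary columns and the delta row follow from simple arithmetic on the per-step times. The sketch below shows one way to compute them; the function names are hypothetical, and the use of the sample standard deviation is an assumption, since the comment does not say which estimator was used.

```python
import statistics
from typing import Dict, List, Tuple


def steady_state_stats(times: List[float], warmup_steps: int = 1) -> Dict[str, float]:
    """Summarize per-step sync times after dropping the leading warmup steps."""
    steady = times[warmup_steps:]
    if len(steady) < 2:
        raise ValueError("need at least two steady-state samples")
    return {
        "mean": statistics.mean(steady),
        "median": statistics.median(steady),
        "min": min(steady),
        "max": max(steady),
        "stdev": statistics.stdev(steady),  # sample standard deviation (assumed)
    }


def compare(p2p_mean: float, baseline_mean: float) -> Tuple[float, float]:
    """Return (speedup factor, percent reduction in sync time)."""
    speedup = baseline_mean / p2p_mean
    reduction_pct = (baseline_mean - p2p_mean) / baseline_mean * 100.0
    return speedup, reduction_pct
```

Plugging in the reported means, `compare(0.484, 0.755)` gives a speedup of about 1.56 and a reduction of about 35.9%, matching the delta row.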

Takeaway

On Qwen3-8B non-colocate TP=4, P2P shard sync reduces steady-state update_weights_time from ~0.76s to ~0.48s versus the existing NCCL broadcast path, with stable training to completion over 49 rollout steps.

CalvinXKY · 2026-06-29T07:19:02Z

A100 tests

…lback

Add P2P dist.send/recv path for non-colocate full weight sync when Megatron Bridge, shard conversion, and TP alignment preconditions are met; otherwise fall back to NCCL broadcast with a rank-0 log message.

- Apply sglang_p2p.patch in Docker build after slime sglang patches
- Add Qwen2/Qwen3 shard Megatron→HF conversion and updater implementation
- Wire --use-p2p-weight-update and SGLang tp_tensor_counts payload
- Document usage in docs/zh|en/advanced/p2p-weight-sync.md

CalvinXKY force-pushed the feat/p2p-shard-weight-update branch from 43a842a to 7571ef4 (June 29, 2026 07:08)

CalvinXKY mentioned this pull request on Jun 29, 2026 in "[RFC] Shard-level P2P weight sync for non-colocate training #2147" (Open, 16 tasks)

CalvinXKY force-pushed the feat/p2p-shard-weight-update branch from 7571ef4 to 839df78 (June 29, 2026 07:59)

CalvinXKY wants to merge 1 commit into THUDM:main from CalvinXKY:feat/p2p-shard-weight-update
