
feat(p2p): add shard-level weight update with automatic broadcast fallback #2146

Open
CalvinXKY wants to merge 1 commit into THUDM:main from CalvinXKY:feat/p2p-shard-weight-update


@CalvinXKY CalvinXKY commented Jun 29, 2026

Contributor

Summary

Add an opt-in P2P shard weight sync path for non-colocate RL training with --update-weight-mode full. Each Megatron TP rank sends its local HF shard to the matching SGLang TP rank via dist.send/recv, avoiding the default all_gather + NCCL broadcast.

If preconditions are not met, slime automatically falls back to NCCL broadcast and logs the reason on rank 0. No breaking changes — default behavior is unchanged without --use-p2p-weight-update.
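
The rank pairing described above can be sketched in plain Python. This is only an illustration, not the PR's implementation: `Transfer`, `build_p2p_plan`, and the global rank numbering (training ranks 0–3, rollout ranks 4–7) are hypothetical names and values chosen for the example. In a real run, each pair would exchange tensors with `torch.distributed.send` / `torch.distributed.recv`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transfer:
    src_rank: int  # assumed global rank of the sending Megatron TP rank
    dst_rank: int  # assumed global rank of the receiving SGLang TP rank
    tp_rank: int   # TP index shared by both sides of the pair


def build_p2p_plan(train_ranks: list[int], rollout_ranks: list[int]) -> list[Transfer]:
    """Pair the i-th training TP rank with the i-th rollout TP rank."""
    if len(train_ranks) != len(rollout_ranks):
        # Mismatched TP sizes have no one-to-one pairing; the PR falls back
        # to NCCL broadcast in that situation.
        raise ValueError("TP sizes differ; P2P pairing is undefined")
    return [
        Transfer(src_rank=s, dst_rank=d, tp_rank=i)
        for i, (s, d) in enumerate(zip(train_ranks, rollout_ranks))
    ]


# Hypothetical layout: 4 training GPUs followed by 4 rollout GPUs.
plan = build_p2p_plan(train_ranks=[0, 1, 2, 3], rollout_ranks=[4, 5, 6, 7])
for t in plan:
    print(f"TP{t.tp_rank}: send from rank {t.src_rank} -> recv on rank {t.dst_rank}")
```

Because each shard travels only between its own pair of ranks, no rank ever needs to hold the full gathered weights.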

Motivation

In non-colocate mode, gathering full weights on the training side and broadcasting to rollout engines dominates update_weights time for large models. P2P shard sync keeps weights sharded end-to-end.

Changes

  • New --use-p2p-weight-update flag (requires --update-weight-mode full)
  • UpdateWeightFromDistributedP2P + Qwen2/Qwen3 dense shard Megatron→HF conversion
  • Fallback helpers in common.py
  • sglang_p2p.patch applied in Dockerfile after sglang.patch / sglang-top_p.patch
  • Docs: docs/en/advanced/p2p-weight-sync.md, docs/zh/advanced/p2p-weight-sync.md

Usage

python3 train.py \
  --megatron-to-hf-mode bridge \
  --update-weight-mode full \
  --use-p2p-weight-update \
  --tensor-model-parallel-size 4 \
  --rollout-num-gpus-per-engine 4 \
  ...

Preconditions (auto-fallback)

| Condition | Notes |
| --- | --- |
| Non-colocate | Colocate uses UpdateWeightFromTensor |
| --megatron-to-hf-mode bridge | Raw mode not supported |
| Model support | Qwen2 / Qwen3 dense only (MoE falls back) |
| TP alignment | Megatron TP == SGLang TP |
| SGLang PP == 1 | sglang_pp_size > 1 falls back (P2P send/recv pairing is TP-only in Phase 1) |

Megatron training PP and SGLang inference PP need not match when SGLang PP=1.
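
The preconditions in the table can be read as a single check that returns either nothing (P2P is usable) or a reason for falling back. The sketch below assumes that shape; `p2p_fallback_reason`, its parameter names, and the message wording are hypothetical and are not taken from the PR's `common.py` helpers.

```python
from typing import Optional


def p2p_fallback_reason(
    colocate: bool,
    megatron_to_hf_mode: str,
    model_family: str,
    is_moe: bool,
    megatron_tp: int,
    sglang_tp: int,
    sglang_pp: int,
) -> Optional[str]:
    """Return None when P2P can be used, otherwise a fallback reason."""
    if colocate:
        return "colocate mode uses UpdateWeightFromTensor"
    if megatron_to_hf_mode != "bridge":
        return f"--megatron-to-hf-mode {megatron_to_hf_mode} is not supported"
    if model_family not in {"qwen2", "qwen3"} or is_moe:
        return "only Qwen2 / Qwen3 dense models are supported"
    if megatron_tp != sglang_tp:
        return f"Megatron TP ({megatron_tp}) != SGLang TP ({sglang_tp})"
    if sglang_pp != 1:
        return f"sglang_pp_size={sglang_pp} > 1"
    return None


# Example: a TP mismatch triggers the broadcast fallback.
reason = p2p_fallback_reason(
    colocate=False, megatron_to_hf_mode="bridge", model_family="qwen3",
    is_moe=False, megatron_tp=4, sglang_tp=2, sglang_pp=1,
)
if reason is not None:
    # Hypothetical log line; the PR logs the reason on rank 0 only.
    print(f"[P2P] falling back to NCCL broadcast: {reason}")
```

Megatron training PP is deliberately not an argument here, since it does not need to match SGLang PP when SGLang PP is 1.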

Manual SGLang patch (for reviewers only)

Once this PR is merged, the patch is applied at Docker build time. To validate on an existing slimerl/slime:latest container (which already has sglang.patch + sglang-top_p.patch) without rebuilding:

# Clone this PR branch inside the container, then:
export SLIME_ROOT=/root/slime
export SGLANG_ROOT=/sgl-workspace/sglang
PATCH="${SLIME_ROOT}/docker/patch/latest/sglang_p2p.patch"

if grep -q 'tp_tensor_counts' "${SGLANG_ROOT}/python/sglang/srt/managers/io_struct.py"; then
  echo "SGLang P2P patch already applied"
else
  cd "${SGLANG_ROOT}"
  git update-index --refresh
  git apply --check "${PATCH}"
  git apply --3way "${PATCH}"
  if grep -R -n '^<<<<<<< ' python/sglang/srt/managers/io_struct.py \
      python/sglang/srt/managers/tp_worker.py \
      python/sglang/srt/model_executor/model_runner.py; then
    echo "Patch conflict — please resolve before testing"
    exit 1
  fi
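  # Only report success if both patch markers are actually present.
  if grep -q 'tp_tensor_counts' python/sglang/srt/managers/io_struct.py \
      && grep -q 'load_format == "presharded"' python/sglang/srt/model_executor/model_runner.py; then
    echo "SGLang P2P patch applied OK"
  else
    echo "Patch markers not found after apply; please inspect before testing"
    exit 1
  fi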
fi

Then run a non-colocate smoke test with --use-p2p-weight-update (see docs).

Testing

Validated on 8×A100, slimerl/slime:latest + manual patch above, Qwen3-4B TP=4:

  • End-to-end smoke completed (exit_code=0)
  • Steady-state update_weights ~0.33–0.40s
  • sglang_p2p.patch passes git apply --check on top of slime SGLang patches

Test plan

  • docker build with ENABLE_SGLANG_PATCH=1 succeeds (includes sglang_p2p.patch)
  • Non-colocate Qwen3-4B smoke with --use-p2p-weight-update
  • MoE or --megatron-to-hf-mode raw falls back to broadcast with [P2P] log
  • sglang_pp_size > 1 falls back to broadcast with [P2P] log
  • Default path unchanged when flag omitted

@CalvinXKY CalvinXKY force-pushed the feat/p2p-shard-weight-update branch from 43a842a to 7571ef4 Compare June 29, 2026 07:08
@CalvinXKY

Contributor Author

Benchmark: Qwen3-8B P2P vs NCCL broadcast (non-colocate, TP=4)

Setup

| Item | Value |
| --- | --- |
| Date | 2026-06-26 |
| Hardware | 8× NVIDIA A100 80GB |
| Model | Qwen3-8B |
| Layout | Non-colocate: 4 training GPUs (Megatron TP=4) + 4 rollout GPUs (SGLang TP=4) |
| Workload | GSM8K GRPO, num-rollout=50, rollout-batch-size=32, n-samples-per-prompt=8, global-batch-size=256, rollout-max-response-len=8192 |
| Compared runs | P2P: --use-p2p-weight-update · Baseline: default NCCL broadcast (flag omitted) |
| Isolation | Two sequential runs on clean containers (slime-dev6 → slime-dev6-cmp); all other flags/checkpoints/data matched |
| Outcome | Both runs finished with exit_code=0 |

Logs (local copy): logs_8b_compare/20260626_122703/
(p2p_8b_tp4.{log,err}, baseline_8b_tp4.{log,err}, master.log)

Procedure

  1. Apply SGLang P2P support and launch P2P run with --use-p2p-weight-update.
  2. Tear down processes/GPU state, switch to a fresh comparison container.
  3. Run baseline with the same train.py command without --use-p2p-weight-update.
  4. Parse perf/update_weights_time from train_metric_utils.py logs.
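
Once the per-step values have been extracted, the steady-state statistics can be computed as in the sketch below. The function name `steady_state_stats` is hypothetical, the sample values are placeholders rather than benchmark data, and the use of the sample standard deviation is an assumption, since the write-up does not say which estimator was used.

```python
import statistics


def steady_state_stats(times_s: list[float], warmup_steps: int = 1) -> dict[str, float]:
    """Summarize per-step update_weights times, skipping warmup steps."""
    steady = times_s[warmup_steps:]
    if len(steady) < 2:
        raise ValueError("need at least two steady-state steps")
    return {
        "warmup": times_s[0],
        "mean": statistics.mean(steady),
        "median": statistics.median(steady),
        "min": min(steady),
        "max": max(steady),
        # Sample standard deviation (n - 1 denominator); an assumption.
        "std": statistics.stdev(steady),
    }


# Placeholder values only: step 1 is slow because of first-connection warmup.
example_times_s = [3.0, 0.5, 0.4, 0.6]
stats = steady_state_stats(example_times_s)
for name, value in stats.items():
    print(f"{name}: {value:.3f} s")
```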

Note for reviewers: This benchmark was collected before this PR’s final sglang_p2p.patch landed; the P2P run used an equivalent ad-hoc SGLang patch during development. Reproduction should use the patch in this PR (or a Docker build with ENABLE_SGLANG_PATCH=1).

Results (perf/update_weights_time)

Metric source: train_metric_utils.py → perf/update_weights_time.

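| | Step 1 (warmup) | Steady-state mean (steps 2–49) | Median | Min | Max | Std dev |
| --- | --- | --- | --- | --- | --- | --- |
| P2P | 3.17 s | 0.484 s | 0.483 s | 0.406 s | 0.733 s | 0.054 s |
| NCCL broadcast | 3.77 s | 0.755 s | 0.751 s | 0.696 s | 0.840 s | 0.034 s |
| Delta | | 1.56× faster (−35.9% sync time) | | | | |
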
  • 49 logged rollout steps (steps 1–49); step 1 excluded from steady-state stats due to first-connection warmup.
  • Speedup is on the weight sync phase only, not end-to-end step time (rollout + train still dominate total step latency).
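
The Delta row follows directly from the two steady-state means reported in the table, as this short check shows (variable names are illustrative):

```python
# Steady-state means (steps 2-49) taken from the results table, in seconds.
p2p_mean_s = 0.484
nccl_mean_s = 0.755

# Speedup: how many times faster P2P is than NCCL broadcast.
speedup = nccl_mean_s / p2p_mean_s

# Relative reduction in weight sync time, as a percentage of the baseline.
reduction_pct = (nccl_mean_s - p2p_mean_s) / nccl_mean_s * 100

print(f"{speedup:.2f}x faster, {reduction_pct:.1f}% less sync time")
# prints "1.56x faster, 35.9% less sync time"
```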

Takeaway

On Qwen3-8B non-colocate TP=4, P2P shard sync reduces steady-state update_weights_time from ~0.76s to ~0.48s versus the existing NCCL broadcast path, with stable training to completion over 49 rollout steps.

@CalvinXKY

Contributor Author

A100 tests

[Image attachment]

…lback

Add P2P dist.send/recv path for non-colocate full weight sync when Megatron Bridge, shard conversion, and TP alignment preconditions are met; otherwise fall back to NCCL broadcast with a rank-0 log message.

- Apply sglang_p2p.patch in Docker build after slime sglang patches
- Add Qwen2/Qwen3 shard Megatron→HF conversion and updater implementation
- Wire --use-p2p-weight-update and SGLang tp_tensor_counts payload
- Document usage in docs/zh|en/advanced/p2p-weight-sync.md
@CalvinXKY CalvinXKY force-pushed the feat/p2p-shard-weight-update branch from 7571ef4 to 839df78 Compare June 29, 2026 07:59