feat(lora): fix LoRA weight-sync deadlock and add stable LoRA RL recipe #20
Open
WWWjiahui wants to merge 2 commits into
Make LoRA RL (FSDP2 trainer + SGLang inference) train end-to-end.
Weight-sync deadlock (RaaS engine):
- Each sync used to unload+reload the adapter under a fixed name (lora_1).
SGLang's /unload_lora_adapter blocks in wait_for_unload until every
in-flight request releases the adapter's usage counter; paused/aborted
requests at a sync never release it, so the unload hangs forever and the
pipeline deadlocks.
- Fix: load each sync under a fresh versioned name (lora_v{seq}) and never
unload; SGLang's mem-pool LRU evicts stale adapters. Track the active
name (_current_lora_name), thread it through generation requests, and
mirror it to the eval engines that share the server.
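The deadlock and the fix above can be illustrated with a toy model. This is a minimal sketch, not SGLang or RaaS code: `ToyAdapterPool` and `VersionedLoraNames` are hypothetical names, and `can_unload` only models the condition that a blocking unload would wait on (usage counter at zero).

```python
import itertools


class ToyAdapterPool:
    """Toy stand-in for a server-side adapter registry (hypothetical, not SGLang)."""

    def __init__(self):
        self.usage = {}  # adapter name -> number of requests still holding it

    def load(self, name):
        self.usage.setdefault(name, 0)

    def acquire(self, name):
        self.usage[name] += 1

    def can_unload(self, name):
        # A blocking unload would wait until this becomes True.
        return self.usage[name] == 0


class VersionedLoraNames:
    """Hands out lora_v1, lora_v2, ... and remembers the active name."""

    def __init__(self):
        self._seq = itertools.count(1)
        self.current = None

    def next_name(self):
        self.current = f"lora_v{next(self._seq)}"
        return self.current


pool = ToyAdapterPool()

# Old scheme: one fixed name, so the next sync must unload it first.
pool.load("lora_1")
pool.acquire("lora_1")  # a request that is aborted later and never releases
old_scheme_would_hang = not pool.can_unload("lora_1")

# New scheme: each sync loads a fresh name and never waits on the old one.
names = VersionedLoraNames()
first = names.next_name()
pool.load(first)
pool.acquire(first)  # the same kind of leaked hold, now pinned to lora_v1
second = names.next_name()
pool.load(second)  # proceeds regardless of lora_v1's usage count
```

In this model the leaked hold still exists under the new scheme, but nothing waits on it: the stale name is simply left for eviction.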
Stable LoRA RL recipe (examples/math/qwen3-1.7b-m2po-2gpus-lora):
- LoRA's alpha/rank scaling makes each weight update much larger than
full-FT at the same lr, so the off-policy / multi-minibatch settings
full-FT tolerates collapse the policy under LoRA. Use near-on-policy:
ppo_n_minibatches=1, max_staleness=1, recompute_logprob=true (lr 5e-6).
- Validated: clean rising eval, math500 avg@4 70 -> 77.5, overall 30 -> 36.6.
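To see why near-on-policy settings keep the importance weight at 1, here is a minimal sketch of the standard PPO-style token ratio exp(logp_new - logp_behavior). The function name and the log-probability values are made up for illustration; the recipe's actual loss code is not shown in this PR text.

```python
import math


def importance_ratios(new_logprobs, behavior_logprobs):
    """Per-token importance ratio: exp(logp_new - logp_behavior)."""
    return [math.exp(n - b) for n, b in zip(new_logprobs, behavior_logprobs)]


# Near-on-policy: log-probs are recomputed under the weights being updated,
# and there is a single minibatch, so both sides are identical.
recomputed = [-1.2, -0.4, -2.3]
on_policy = importance_ratios(recomputed, recomputed)

# Off-policy: the samples came from older weights (illustrative values).
stale = [-1.5, -0.3, -2.9]
off_policy = importance_ratios(recomputed, stale)
```

With identical log-probs every ratio is exactly 1.0; with stale samples the ratios drift away from 1, and that drift is what the larger LoRA updates amplify.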
5f9c0a1 to
cbd7f97
Add a LoRA subsection to the math recipes doc: the 2-GPU LoRA variant, the near-on-policy settings LoRA needs (ppo_n_minibatches=1, max_staleness=1, recompute_logprob=true), and the pause/drain/abort/load sequence used to swap the adapter on each weight sync.
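The swap order can be sketched as below. This is an illustration under stated assumptions, not the engine's code: `RecordingServer`, its method names, and the adapter path are hypothetical, and the `try`/`finally` that guarantees a resume is a design choice of this sketch rather than something the PR states.

```python
class RecordingServer:
    """Hypothetical fake inference server that records the order of calls."""

    def __init__(self):
        self.calls = []

    def pause(self):
        self.calls.append("pause")

    def drain_or_abort(self):
        self.calls.append("drain_or_abort")

    def load_adapter(self, name, path):
        self.calls.append(f"load:{name}")

    def flush_cache(self):
        self.calls.append("flush_cache")

    def resume(self):
        self.calls.append("resume")


def swap_adapter(server, name, path):
    """Pause, drain/abort, load, flush, resume. There is no unload step."""
    server.pause()
    try:
        server.drain_or_abort()
        server.load_adapter(name, path)
        server.flush_cache()
    finally:
        # Resume even if loading fails, so the server is not left paused.
        server.resume()
    return name


server = RecordingServer()
active = swap_adapter(server, "lora_v3", "/tmp/adapter")  # path is a placeholder
```

Returning the name lets the caller record it as the active adapter and pass it along with later generation requests.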
haizhongzheng added a commit that referenced this pull request, Jun 19, 2026: … add stable LoRA RL recipe
Summary
Makes LoRA RL train end-to-end. Two independent fixes: a weight-sync deadlock in the RaaS engine, and a training recipe that keeps LoRA from collapsing. Validated on Qwen3-1.7B / math with a clean rising eval curve.
1. Weight-sync
On every training step the freshly trained LoRA adapter is pushed to the SGLang inference server. The swap follows a fixed sequence so that no request ever runs against half-updated weights or stale cache:
1. Pause generation (/pause_generation).
2. Drain or abort in-flight requests.
3. Load the new adapter.
4. Flush the cache.
5. Resume generation.

The bug. The old code reloaded the adapter under a fixed name (lora_1), which required unloading the previous one before step 3. SGLang's /unload_lora_adapter calls wait_for_unload, which blocks until the adapter's usage counter reaches zero. But the requests aborted during the drain never released their hold on that counter, so it stayed non-zero and the unload hung forever, freezing the entire pipeline. This only struck LoRA runs under load (full fine-tuning takes a different, unload-free path), which made it look intermittent.

The fix. Never unload. Each sync loads the new adapter under a fresh versioned name (lora_v1, lora_v2, …) and lets SGLang's memory-pool LRU reclaim the stale ones. The deadlock-prone unload is removed entirely, while the pause → drain/abort → load → flush → resume sequence above is unchanged, so the swap stays correct. A lingering old adapter only costs GPU memory, which is bounded by max_loras_per_batch / max_loaded_loras.

Validation
Qwen3-1.7B, math DP-scaled SGLang: clean monotonic rise over the first ~100 steps with importance weight pinned at 1.0 throughout. Metrics reported: eval-avg/math500/avg@4 and eval-avg/overall_avg.