LocalBackend fork_checkpoint doesn't update vLLM's initial LoRA #651

@arcticfly

Problem

When using LocalBackend._experimental_fork_checkpoint with PipelineTrainer, the forked LoRA weights are not loaded by the vLLM inference server at startup.

Root cause: model.register(backend) creates an empty LoRA checkpoint at checkpoints/0000. _experimental_fork_checkpoint then copies the source checkpoint to checkpoints/{source_step} (e.g. 0686). When vLLM starts, it still loads the adapter at @0, which is the empty 0000 checkpoint rather than the forked one.

Sequence:

  1. model.register(backend) → creates checkpoints/0000/ (empty LoRA)
  2. backend._experimental_fork_checkpoint(model, from_model="kl-000-1") → creates checkpoints/0686/ (real weights)
  3. vLLM starts with lora_modules=[LoRAModulePath(name='model@0', path='checkpoints/0000')]
  4. Training begins from an empty adapter instead of the forked checkpoint

Verification:

$ md5sum checkpoints/0000/adapter_model.safetensors checkpoints/0686/adapter_model.safetensors
3fb4a12a...  checkpoints/0000/adapter_model.safetensors   # empty
98dd58ba...  checkpoints/0686/adapter_model.safetensors   # forked
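
The same check can be scripted. This is a minimal sketch that assumes only the directory layout shown above; `file_md5` and `adapters_differ` are hypothetical helper names for illustration, not part of ART or vLLM.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def adapters_differ(checkpoints_dir: Path, step_a: int, step_b: int) -> bool:
    """True if adapter_model.safetensors differs between two checkpoint steps."""
    name = "adapter_model.safetensors"  # file name taken from the md5sum output
    a = checkpoints_dir / f"{step_a:04d}" / name
    b = checkpoints_dir / f"{step_b:04d}" / name
    return file_md5(a) != file_md5(b)
```

If `adapters_differ(checkpoints_dir, 0, 686)` still returns True after the fork, the adapter that vLLM loads as `model@0` is not the forked one.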

Current workaround

After calling _experimental_fork_checkpoint, copy the forked checkpoint files over 0000:

import shutil

await backend._experimental_fork_checkpoint(model, from_model=src, ...)

# Overwrite the empty 0000 checkpoint with the forked weights.
# art_path, project, and fork_step come from the caller's own setup.
checkpoints_dir = art_path / project / "models" / model.name / "checkpoints"
step0_dir = checkpoints_dir / "0000"
forked_dir = checkpoints_dir / f"{fork_step:04d}"
for f in forked_dir.iterdir():
    if f.is_file():  # shutil.copy2 only handles regular files
        shutil.copy2(f, step0_dir / f.name)

Suggested fix

_experimental_fork_checkpoint on LocalBackend should either:

  1. Copy the forked checkpoint to 0000 (overwriting the empty one), or
  2. Update the model's state so vLLM knows to load the forked step instead of @0
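
Option 2 could look roughly like the following. This is a sketch under stated assumptions, not ART's actual code: `latest_checkpoint_step` and `lora_module_for_latest` are hypothetical helpers, the returned name/path pair only mirrors the `model@0` and `checkpoints/0000` values from the sequence above, and checkpoint directories are assumed to be named with zero-padded step numbers.

```python
from pathlib import Path
from typing import Optional, Tuple


def latest_checkpoint_step(checkpoints_dir: Path) -> Optional[int]:
    """Return the highest step among numerically named subdirectories, or None."""
    steps = [
        int(p.name)
        for p in checkpoints_dir.iterdir()
        if p.is_dir() and p.name.isdecimal()
    ]
    return max(steps, default=None)


def lora_module_for_latest(model_name: str, checkpoints_dir: Path) -> Tuple[str, str]:
    """Build the (name, path) pair for the newest checkpoint instead of step 0."""
    step = latest_checkpoint_step(checkpoints_dir)
    if step is None:
        raise FileNotFoundError(f"no checkpoints under {checkpoints_dir}")
    # Naming follows the 'model@0' pattern seen in the issue (assumption).
    return f"{model_name}@{step}", str(checkpoints_dir / f"{step:04d}")
```

With both 0000 and 0686 present, this resolves to step 686, so the server would start from the forked weights rather than the empty adapter.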

This issue is specific to LocalBackend. ServerlessBackend handles fork differently (it uploads the checkpoint as a W&B artifact with the correct step alias).
