Regression: forked GRPO run can produce malformed checkpoint outputs on newer ART ref #672

@arcticfly

Summary

We are seeing a regression when moving from the fork-fix branch (fix/fork-on-pre-v5) to a newer ART ref. With the same training setup and the same forked checkpoint, some runs on the newer ref produce malformed/repetitive generations from saved checkpoint artifacts after a short amount of GRPO training. The older branch has been clean in the same probe so far.

This does not look like an obvious failure to load the initial fork: initial eval metrics are in the same range across the two refs. The divergence appears after training continues from the forked checkpoint.

ART refs compared

  • Newer/current ref: codex/save-checkpoint-artifact at 4f972aa4328f16ad2d2b64a135e45a808c549e7c
  • Older comparison ref: fix/fork-on-pre-v5 at 6ecd16c8fcdfd2d2076eb1f0475e5b727ba4428c
  • Main at time of comparison: 48b2e5f6c384a62b44f34e1472e5fb1eeaa3474a
  • Package lock resolved both as openpipe-art==0.5.17 from the corresponding git refs.

Training setup

Sanitized setup details (a sketch of the training-loop shape follows the list):

  • Base model: unsloth/Meta-Llama-3.1-8B-Instruct
  • Training starts by forking from an existing LoRA checkpoint at approximately step 686.
  • Infrastructure: SkyPilot on Kubernetes, H200:2.
  • GPU split: trainer on GPU 0, inference on GPU 1.
  • Rollout workers: 12.
  • GRPO-style training with group_size=6, batch_size=4.
  • learning_rate=1.2e-5
  • max_tokens=512
  • max_steps=716 for the comparison runs.
  • train_limit=768
  • eval_every=10, eval_samples=25
  • KL penalty enabled with kl_penalty_coef=1.0, kl_window_size=10.
  • Reward mode uses scorer-derived rewards plus a RULER judge.
  • RULER judge model: openrouter/google/gemini-2.5-flash.
  • save_checkpoint_artifact=true.
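
For orientation, the loop has roughly the following shape. This is a minimal sketch assuming the public openpipe-art API (art.TrainableModel, art.TrajectoryGroup, art.gather_trajectory_groups, model.train); rollout(), SCENARIOS, and the name/project strings are hypothetical stand-ins, not our actual code, and the fork wiring is deliberately elided because it is the mechanism under test.

```python
import asyncio
import art
from art.local import LocalBackend

SCENARIOS = ["..."] * 768  # train_limit=768 (contents redacted)

async def rollout(model: art.TrainableModel, scenario: str) -> art.Trajectory:
    # Hypothetical stand-in: one agent episode plus scorer/RULER reward.
    raise NotImplementedError

async def main() -> None:
    model = art.TrainableModel(
        name="transcript-cleanup-agent",  # hypothetical name
        project="repro-672",              # hypothetical project
        base_model="unsloth/Meta-Llama-3.1-8B-Instruct",
    )
    await model.register(LocalBackend())
    # The run then continues from the existing LoRA checkpoint (~step 686);
    # that fork/copy step differs between the two refs, so it is elided here.

    for step in range(716):  # max_steps=716
        start = (step * 4) % len(SCENARIOS)  # batch_size=4
        batch = SCENARIOS[start : start + 4]
        groups = await art.gather_trajectory_groups(
            art.TrajectoryGroup(rollout(model, sc) for _ in range(6))  # group_size=6
            for sc in batch
        )
        await model.train(groups, config=art.TrainConfig(learning_rate=1.2e-5))

asyncio.run(main())
```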

Observed behavior

After training from the same forked checkpoint and saving checkpoint artifacts, we probe the saved artifacts through W&B Inference using a small fixed set of generic transcript-cleanup prompts (a sketch of the probe follows the list). The probe uses:

  • 3 generic prompts, not domain-specific.
  • n=6 completions per prompt.
  • temperature=0.7
  • max_tokens=512
  • System instruction asks the model to return only a corrected transcript wrapped in output tags.
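
A minimal sketch of the probe, using the standard OpenAI-compatible client. The base_url, model id, tag name, and prompt text shown here are placeholders/assumptions, not the private ones:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inference.wandb.ai/v1",  # assumed W&B Inference endpoint
    api_key="...",                                 # W&B API key
)

SYSTEM = (  # generic placeholder, not one of the private prompts
    "Clean up the following transcript. Return only the corrected "
    "transcript wrapped in <output></output> tags."
)
PROMPTS = ["...", "...", "..."]  # 3 generic transcript snippets (redacted)

for prompt in PROMPTS:
    resp = client.chat.completions.create(
        model="<checkpoint-artifact-id>",  # saved checkpoint under test
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": prompt},
        ],
        n=6,
        temperature=0.7,
        max_tokens=512,
    )
    for choice in resp.choices:
        print(choice.message.content)
```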

We are intentionally not including the private prompts or outputs here. The failure mode is that the model starts producing malformed/repetitive/code-like text instead of a short cleaned transcript. In one run we saw repeated HeaderCode; in other failing outputs the literal token was not always present, but the responses were still clearly malformed or much longer than expected.
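
For reproducibility, the "malformed" call can be made mechanical with a heuristic along these lines. This is an illustrative sketch (the thresholds and the tag name are assumptions), not our private checker:

```python
def looks_malformed(text: str, max_expected_chars: int = 2000) -> bool:
    """Rough heuristic for flagging bad completions; thresholds are illustrative."""
    if "HeaderCode" in text:            # literal junk token seen in one failing run
        return True
    if "<output>" not in text:          # expected wrapper missing (tag name assumed)
        return True
    if len(text) > max_expected_chars:  # far longer than a short cleaned transcript
        return True
    # Crude repetition check: any non-overlapping 20-char chunk repeated 5+ times.
    chunks = [text[i : i + 20] for i in range(0, len(text) - 19, 20)]
    if not chunks:
        return False
    return max(chunks.count(c) for c in set(chunks)) >= 5
```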

Current aggregate from the comparison probes (a rough significance check follows the list):

  • Older ref (fix/fork-on-pre-v5): 6 runs completed/probed, 0 malformed responses out of 108 sampled completions.
  • Newer ref (codex/save-checkpoint-artifact): 4 runs completed/probed so far, 35 malformed responses out of 72 sampled completions.
  • Some newer-ref runs are clean while others fail badly, so this looks run-dependent rather than deterministic.
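
Completions within a run are correlated, so treating them as independent overstates significance, but even a rough 2x2 test makes the gap hard to attribute to sampling noise (sketch assumes scipy is available):

```python
from scipy.stats import fisher_exact

# rows = refs, cols = [malformed, clean] completions
table = [[0, 108],   # fix/fork-on-pre-v5: 0 of 108
         [35, 37]]   # codex/save-checkpoint-artifact: 35 of 72
_, p = fisher_exact(table)
print(f"Fisher exact p = {p:.2e}")  # tiny; the per-run split (0/6 vs 2-3/4) tells the same story
```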

The initial eval score immediately after loading/forking is comparable between refs, which makes a simple fork-not-loaded explanation less likely. The issue appears after additional GRPO training and checkpoint save/reload.

Why this seems ART-related

The training config, forked checkpoint, data shape, inference probe, and infrastructure are held constant between the two refs. The older fix/fork-on-pre-v5 ref is consistently clean in the probe, while the newer ref intermittently produces corrupted outputs from its saved checkpoints.

Potential areas worth checking (an adapter-diff sketch follows the list):

  • Changes around forked checkpoint loading/copying since fix/fork-on-pre-v5.
  • Save/reload behavior for LoRA checkpoint artifacts.
  • vLLM adapter lifecycle after checkpoint save/reload.
  • KL reference checkpoint/window handling when training from a forked checkpoint.
  • Any interaction between the dedicated trainer/inference split and adapter reloads.
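
As a first narrowing step on the save/reload suspects, diffing adapter tensors before save and after reload should localize any corruption. A minimal sketch, assuming the artifact contains a PEFT-style adapter_model.safetensors (which may not match ART's actual artifact layout):

```python
import torch
from safetensors.torch import load_file

def diff_adapters(path_before: str, path_after: str, atol: float = 0.0) -> None:
    """Compare two saved LoRA adapters tensor-by-tensor and report mismatches."""
    a = load_file(path_before)
    b = load_file(path_after)
    if set(a) != set(b):
        print("key mismatch:", sorted(set(a) ^ set(b)))
    for key in sorted(set(a) & set(b)):
        if a[key].shape != b[key].shape:
            print(f"{key}: shape {tuple(a[key].shape)} vs {tuple(b[key].shape)}")
        elif not torch.allclose(a[key], b[key], atol=atol):
            max_err = (a[key] - b[key]).abs().max().item()
            print(f"{key}: max abs diff {max_err:.3e}")

# Hypothetical paths, e.g.:
# diff_adapters("ckpt_step_700/adapter_model.safetensors",
#               "reloaded/adapter_model.safetensors")
```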
