Summary
We are seeing a regression when moving from the fork-fix branch to a newer ART ref. With the same training setup and the same forked checkpoint, some runs on the newer ART ref produce malformed/repetitive generations from saved checkpoint artifacts after a short amount of GRPO training. The older branch has been clean in the same probe so far.
This does not look like an obvious failure to load the initial fork: initial eval metrics are in the same range across the two refs. The divergence appears after training continues from the forked checkpoint.
ART refs compared
- Newer/current ref: `codex/save-checkpoint-artifact` at `4f972aa4328f16ad2d2b64a135e45a808c549e7c`
- Older comparison ref: `fix/fork-on-pre-v5` at `6ecd16c8fcdfd2d2076eb1f0475e5b727ba4428c`
- Main at time of comparison: `48b2e5f6c384a62b44f34e1472e5fb1eeaa3474a`
- The package lock resolved both as `openpipe-art==0.5.17` from the corresponding git refs.
Training setup
Sanitized setup details:
- Base model: `unsloth/Meta-Llama-3.1-8B-Instruct`
- Training starts by forking from an existing LoRA checkpoint at approximately step 686.
- Infrastructure: SkyPilot on Kubernetes, `H200:2`.
- GPU split: trainer on GPU 0, inference on GPU 1.
- Rollout workers: 12.
- GRPO-style training with `group_size=6`, `batch_size=4`.
- `learning_rate=1.2e-5`
- `max_tokens=512`
- `max_steps=716` for the comparison runs.
- `train_limit=768`
- `eval_every=10`, `eval_samples=25`
- KL penalty enabled with `kl_penalty_coef=1.0`, `kl_window_size=10`.
- Reward mode uses scorer-derived rewards plus a RULER judge.
- RULER judge model: `openrouter/google/gemini-2.5-flash`.
- `save_checkpoint_artifact=true`.
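For convenience, the sanitized hyperparameters above can be collected into a single dict. This is a plain-Python sketch whose key names simply mirror the bullet list; they are not tied to any specific openpipe-art API:

```python
# Sanitized comparison-run configuration, mirroring the bullets above.
# Key names are illustrative only, not openpipe-art parameter names.
COMPARISON_CONFIG = {
    "base_model": "unsloth/Meta-Llama-3.1-8B-Instruct",
    "fork_from_step": 686,          # approximate LoRA checkpoint step
    "gpus": "H200:2",               # trainer on GPU 0, inference on GPU 1
    "rollout_workers": 12,
    "group_size": 6,
    "batch_size": 4,
    "learning_rate": 1.2e-5,
    "max_tokens": 512,
    "max_steps": 716,
    "train_limit": 768,
    "eval_every": 10,
    "eval_samples": 25,
    "kl_penalty_coef": 1.0,
    "kl_window_size": 10,
    "ruler_judge_model": "openrouter/google/gemini-2.5-flash",
    "save_checkpoint_artifact": True,
}
```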
Observed behavior
After training from the same forked checkpoint and saving checkpoint artifacts, we probe the saved artifacts through W&B Inference using a small fixed set of generic transcript-cleanup prompts. The probe uses:
- 3 generic prompts, not domain-specific.
- `n=6` completions per prompt.
- `temperature=0.7`
- `max_tokens=512`
- System instruction asks the model to return only a corrected transcript wrapped in output tags.
We are intentionally not including the private prompts or outputs here. The failure mode is that the model starts producing malformed/repetitive/code-like text instead of a short cleaned transcript. In one run we saw repeated `HeaderCode`; in other failing outputs the literal token was not always present, but the responses were still clearly malformed or much longer than expected.
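The classifier behind the "malformed" counts can be sketched as a simple heuristic. Everything here is illustrative (the `<output>` tag name, the length and repetition thresholds): the actual probe's logic may differ, but it flags the same three symptoms described above.

```python
import re

def looks_malformed(text: str, max_chars: int = 2000) -> bool:
    """Heuristic check for the failure mode described above.

    Flags a completion when (a) it omits the expected output-tag wrapper
    (assumed here to be <output>...</output>), (b) it is much longer than
    a short cleaned transcript should be, or (c) one short token repeats
    pathologically (e.g. runs of `HeaderCode`). Thresholds are illustrative.
    """
    if len(text) > max_chars:
        return True
    if not re.search(r"<output>.*?</output>", text, re.DOTALL):
        return True
    # Heavy repetition: a single whitespace-delimited token making up
    # more than half of a reasonably long completion.
    tokens = text.split()
    if len(tokens) >= 20:
        most_common = max(set(tokens), key=tokens.count)
        if tokens.count(most_common) / len(tokens) > 0.5:
            return True
    return False
```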
Current aggregate from the comparison probes:
- Older ref (`fix/fork-on-pre-v5`): 6 runs completed/probed, 0 malformed responses out of 108 sampled completions.
- Newer ref (`codex/save-checkpoint-artifact`): 4 runs completed/probed so far, 35 malformed responses out of 72 sampled completions.
- Some newer-ref runs are clean while others fail badly, so this looks run-dependent rather than deterministic.
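Even with the run-to-run variance, the aggregate split (0/108 vs 35/72) is stark. A quick back-of-the-envelope two-proportion z statistic (a rough sanity check on the tallies above, not a formal claim about the underlying distribution, since completions within a run are correlated):

```python
import math

# Probe tallies from the comparison above.
older = {"malformed": 0, "total": 108}    # fix/fork-on-pre-v5
newer = {"malformed": 35, "total": 72}    # codex/save-checkpoint-artifact

def malformed_rate(tally: dict) -> float:
    return tally["malformed"] / tally["total"]

def two_proportion_z(a: dict, b: dict) -> float:
    """Pooled two-proportion z statistic for the difference in rates."""
    p_pool = (a["malformed"] + b["malformed"]) / (a["total"] + b["total"])
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / a["total"] + 1 / b["total"]))
    return (malformed_rate(b) - malformed_rate(a)) / se

print(f"older ref rate: {malformed_rate(older):.1%}")   # 0.0%
print(f"newer ref rate: {malformed_rate(newer):.1%}")   # 48.6%
print(f"z statistic:    {two_proportion_z(older, newer):.1f}")
```

The caveat about correlated completions matters because all 6 samples for a prompt come from the same checkpoint; the per-run clean/failed split in the bullet above is the more honest unit of comparison.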
The initial eval score immediately after loading/forking is comparable between refs, which makes a simple fork-not-loaded explanation less likely. The issue appears after additional GRPO training and checkpoint save/reload.
Why this seems ART-related
The training config, forked checkpoint, data shape, inference probe, and infrastructure are held constant between the two refs. The older `fix/fork-on-pre-v5` ref is consistently clean in the probe, while the newer ref intermittently produces corrupted checkpoint behavior.
Potential areas worth checking:
- Changes around forked checkpoint loading/copying since `fix/fork-on-pre-v5`.
- Save/reload behavior for LoRA checkpoint artifacts.
- vLLM adapter lifecycle after checkpoint save/reload.
- KL reference checkpoint/window handling when training from a forked checkpoint.
- Any interaction between the dedicated trainer/inference split and adapter reloads.
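One cheap way to split the suspects above: checksum the checkpoint artifact files at save time and again after reload, then diff the two manifests. A generic sketch with no ART-specific paths (point it at whatever directory holds the LoRA adapter files):

```python
import hashlib
from pathlib import Path

def checkpoint_manifest(root: str) -> dict[str, str]:
    """Map each file under `root` (by relative path) to its SHA-256 digest."""
    manifest = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[str(path.relative_to(root))] = digest
    return manifest

def diff_manifests(saved: dict[str, str], reloaded: dict[str, str]) -> list[str]:
    """Report files that are missing or whose bytes changed across reload."""
    problems = []
    for name in sorted(set(saved) | set(reloaded)):
        if name not in reloaded:
            problems.append(f"missing after reload: {name}")
        elif name not in saved:
            problems.append(f"new after reload: {name}")
        elif saved[name] != reloaded[name]:
            problems.append(f"bytes changed: {name}")
    return problems
```

If the adapter files are bit-identical across save/reload but generations still diverge, that would point at the vLLM adapter lifecycle or the KL reference handling rather than the artifact bytes themselves.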