Goal
Get the VL-JEPA captioning loop to produce instance-discriminative outputs on real-world images (COCO and similar), not the per-image-identical caption observed in current validation runs.
Current state
From VALIDATION_FINDINGS.md: the 18M-param VL-JEPA hits 100% on synthetic colored shapes but mode-collapses on real COCO across all 4 runs (baseline, stronger conditioning, prefix tokens + FiLM, and word dropout all collapse). The K=16 latent bottleneck carries category-level information (32.1% on the 20-class task, 6.4× above chance) but not instance-level detail; the decoder learns to ignore the latent plan.
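For context on the "prefix tokens + FiLM" run above: FiLM conditions the decoder by scale-and-shift modulation of its hidden features, with the scale/shift predicted from the latent plan. A minimal sketch of the modulation step (plain Python, illustrative names, not the repo's implementation):

```python
def film(features, gamma, beta):
    """Feature-wise Linear Modulation: per-channel scale-and-shift of
    decoder features, where (gamma, beta) would be predicted from the
    pooled latent plan by a small linear head (not shown here)."""
    return [g * f + b for f, g, b in zip(features, gamma, beta)]

# If gamma collapses to 1 and beta to 0 across images, the decoder is
# effectively unconditioned, which matches the observed mode collapse.
out = film([1.0, 2.0], gamma=[2.0, 0.5], beta=[0.0, 1.0])
```

The collapse symptom in this framing is that the predicted (gamma, beta) become image-independent, so the modulation carries no instance information.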
Hypotheses to test
- Scale. Train a 200M+ param model end-to-end on a larger dataset; see whether the bottleneck enriches enough to break collapse.
- Bottleneck capacity. Increase K (16 → 64+) and/or per-vector dim; observe whether decoder starts using the plan.
- Auxiliary loss. Add a contrastive or reconstructive auxiliary on the latent plan to force per-image distinctiveness before the decoder sees it.
- Decoder regularization. Schedule prefix-token dropout / word dropout differently to prevent the decoder from learning the unconditional caption distribution.
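The auxiliary-loss hypothesis can be sketched as an InfoNCE-style contrastive objective on pooled latent plans: two views of the same image are positives, all other images in the batch are negatives. This is a minimal stdlib sketch under assumed shapes (one L2-normalized pooled vector per image), not the project's actual loss:

```python
import math

def info_nce(plans_a, plans_b, temperature=0.1):
    """InfoNCE over pooled latent plans. plans_a[i] and plans_b[i] are
    two views of image i; every other pairing is a negative. Returns the
    mean -log softmax probability of the positive pair."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    n = len(plans_a)
    total = 0.0
    for i in range(n):
        logits = [dot(plans_a[i], b) / temperature for b in plans_b]
        m = max(logits)  # stabilize the log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / n
```

A useful property for this ticket: if all plans are identical (full collapse), the loss sits exactly at log(batch_size), so the objective directly penalizes the failure mode observed in the validation runs.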
Acceptance
- A configuration where decoder produces image-specific captions on a held-out COCO subset, with caption diversity above a defined threshold (e.g. distinct-2 score > X).
- Findings written up in VALIDATION_FINDINGS.md (or a successor doc).
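The distinct-2 threshold above can be computed as the fraction of unique bigrams across all generated captions. A minimal sketch (function name and tokenization are illustrative; whitespace tokenization assumed):

```python
def distinct_n(captions, n=2):
    """Distinct-n: unique n-grams / total n-grams over a caption set.
    1.0 means every n-gram is unique; values near 0 indicate heavy
    repetition, e.g. the per-image-identical captions seen in collapse."""
    ngrams = []
    for cap in captions:
        toks = cap.lower().split()
        ngrams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

# A fully collapsed run (same caption repeated) scores 1/num_captions
# times the per-caption distinct-2, making collapse easy to flag.
score = distinct_n(["a man riding a horse"] * 100)
```

Whatever threshold X is chosen, it should be fixed on held-out data before the runs, so the acceptance check is not tuned post hoc.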
Notes
- This is research, not pure implementation. Budget accordingly.
- The native captioning capability is downstream of this. Sponsor/dependent alignment scoring is not — that work uses the LLM backbone instead.