Resolve VL-JEPA mode collapse on real images #10

@EightRice

Goal

Get the VL-JEPA captioning loop to produce instance-discriminative outputs on real-world images (COCO and similar), rather than the same caption for every image, as observed in current validation runs.

Current state

From VALIDATION_FINDINGS.md: the 18M-param VL-JEPA hits 100% on synthetic colored shapes but mode-collapses on real COCO in all 4 runs (baseline, stronger conditioning, prefix tokens + FiLM, word dropout). The K=16 latent bottleneck carries category-level information (32.1% on a 20-class task, about 6.4× the 5% chance rate) but not instance-level detail. The decoder learns to ignore the latent plan.
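To make the two numbers above easy to re-derive on new runs, here is a minimal sketch of a chance-ratio check and a simple collapse indicator. Both helper names (`chance_ratio`, `modal_caption_share`) are hypothetical and not taken from the repository, and the collapse indicator is an assumption about how "same caption for every image" could be quantified, not the measure used in VALIDATION_FINDINGS.md.

```python
from collections import Counter


def chance_ratio(accuracy: float, num_classes: int) -> float:
    """Accuracy as a multiple of uniform-guess accuracy (1 / num_classes)."""
    if num_classes <= 0:
        raise ValueError("num_classes must be positive")
    return accuracy * num_classes


def modal_caption_share(captions: list[str]) -> float:
    """Fraction of outputs equal to the single most frequent caption.

    1.0 means every image received the same caption (full collapse).
    Captions are lowercased and whitespace-normalized before comparison.
    """
    if not captions:
        raise ValueError("captions must be non-empty")
    normalized = [" ".join(c.lower().split()) for c in captions]
    most_common_count = Counter(normalized).most_common(1)[0][1]
    return most_common_count / len(normalized)


if __name__ == "__main__":
    # 32.1% on a 20-class task, as reported in the issue.
    print(round(chance_ratio(0.321, 20), 2))
    print(modal_caption_share(["a man on a bike", "a man on a bike", "a cat"]))
```

A share near 1.0 on a held-out set would match the collapse described here; a lower value alone does not prove the captions are image-specific, only that they differ.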

Hypotheses to test

  1. Scale. Train a 200M+ param model end-to-end on a larger dataset and check whether the bottleneck becomes rich enough to break the collapse.
  2. Bottleneck capacity. Increase K (16 → 64+) and/or the per-vector dimension, and observe whether the decoder starts using the plan.
  3. Auxiliary loss. Add a contrastive or reconstructive auxiliary loss on the latent plan to force per-image distinctiveness before the decoder sees it.
  4. Decoder regularization. Change the schedule for prefix-token dropout and word dropout to keep the decoder from learning the unconditional caption distribution.
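As one possible shape for the contrastive option in hypothesis 3, the sketch below computes an InfoNCE-style loss in plain Python. It assumes each image's K latent vectors have already been pooled into a single plan vector, and that `targets` holds a per-image reference embedding (for example a pooled image feature); both are assumptions for illustration, as are the function name and the temperature value. A real training run would use the framework's tensor ops instead.

```python
import math


def _normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def info_nce(plans: list[list[float]],
             targets: list[list[float]],
             temperature: float = 0.1) -> float:
    """Mean InfoNCE loss over a batch.

    plans[i] is pulled toward targets[i] and pushed away from targets[j != i],
    so plans that are identical across images are penalized.
    """
    if not plans or len(plans) != len(targets):
        raise ValueError("plans and targets must be non-empty and equal length")
    p = [_normalize(v) for v in plans]
    t = [_normalize(v) for v in targets]
    total = 0.0
    for i, plan in enumerate(p):
        # Cosine similarity of this plan to every target, scaled by temperature.
        logits = [sum(a * b for a, b in zip(plan, tgt)) / temperature for tgt in t]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]  # cross-entropy with the matching index
    return total / len(p)


if __name__ == "__main__":
    targets = [[1.0, 0.0], [0.0, 1.0]]
    distinct = info_nce([[1.0, 0.0], [0.0, 1.0]], targets)
    collapsed = info_nce([[1.0, 0.0], [1.0, 0.0]], targets)
    print(distinct < collapsed)
```

When every plan is the same vector the loss cannot fall below log(batch size), which is the pressure toward per-image distinctiveness the hypothesis describes.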

Acceptance

  • A configuration in which the decoder produces image-specific captions on a held-out COCO subset, with caption diversity above a defined threshold (e.g. distinct-2 score > X).
  • Findings written up in VALIDATION_FINDINGS.md (or a successor doc).
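For the diversity criterion, a minimal sketch of distinct-n (unique n-grams divided by total n-grams across all generated captions) is given below. Whitespace tokenization and lowercasing are assumptions; the function name is hypothetical, and the threshold X is left open as in the criterion above.

```python
def distinct_n(captions: list[str], n: int = 2) -> float:
    """Ratio of unique n-grams to total n-grams over a set of captions.

    Returns 0.0 when no caption is long enough to contain an n-gram.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    total = 0
    unique: set[tuple[str, ...]] = set()
    for caption in captions:
        tokens = caption.lower().split()  # assumption: whitespace tokenization
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


if __name__ == "__main__":
    collapsed = ["a dog runs", "a dog runs"]
    varied = ["a dog runs", "the cat sleeps"]
    print(distinct_n(collapsed))  # 2 unique bigrams out of 4 -> 0.5
    print(distinct_n(varied))     # 4 unique bigrams out of 4 -> 1.0
```

Because a fully collapsed model repeats the same n-grams for every image, its score shrinks toward zero as the evaluation set grows, so the threshold should be fixed together with the subset size.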

Notes

  • This is research, not pure implementation. Budget accordingly.
  • The native captioning capability depends on this work. Sponsor/dependent alignment scoring does not; that work uses the LLM backbone instead.

Metadata

  • Assignees: none
  • Labels: track:training (JEPA, LLM backbone, FedAvg, native model); type:research (Investigation needed before implementation)
  • Status: Backlog
  • Milestone: none