
Multimodal: Add Cross-Attention Arch #77

Merged
amazloumi merged 3 commits into feat/multimodal from feat/multimodal-cross-attn on May 7, 2026

Conversation

@amazloumi (Member)

Summary

Adds the Cross-Attention VLM arch (arch = "cross_attention", Llama-3-V style) on top of the VLM foundation that landed in PR #76. The residual stream stays text-only; image features flow in as K/V through separate CrossAttentionBlocks inserted at a configurable cadence. CA blocks are zero-initialized, so adding the arch on top of a text-only checkpoint is an identity mapping at step 0 and learns from there.
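A minimal PyTorch sketch of the zero-init idea, assuming a Flamingo/Llama-3-V-style tanh gate (module and argument names here are illustrative, not this repo's actual CrossAttentionBlock API, which may instead zero-init the output projection):

```python
import torch
from torch import nn


class CrossAttentionBlockSketch(nn.Module):
    """Illustrative zero-init gated cross-attention block (names are assumptions)."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Gate starts at 0, so tanh(gate) == 0 and the block is the
        # identity at step 0; it learns to open during training.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # Text tokens are the queries; image features supply K/V only,
        # so the residual stream itself stays text-only.
        attn_out, _ = self.attn(
            self.norm(text), image_feats, image_feats, need_weights=False
        )
        return text + torch.tanh(self.gate) * attn_out
```

"Configurable cadence" then just means instantiating one such block after, say, every Nth decoder layer.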

This PR stacks on PR #76, so the diff shows only the CA additions.
_RESERVED_ARCHS shrinks to ("mot",) — Mixture-of-Transformers will land in the next PR.
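For context, a hypothetical sketch of what the reserved-arch guard might look like; only the _RESERVED_ARCHS tuple comes from this PR, while the validate_arch helper and its error message are assumptions:

```python
_RESERVED_ARCHS = ("mot",)  # "cross_attention" is implemented here; "mot" lands in PR 3


def validate_arch(arch: str) -> None:
    if arch in _RESERVED_ARCHS:
        raise NotImplementedError(f"arch={arch!r} is reserved for an upcoming PR")
```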

Testing

  • ruff check kempnerforge/ tests/ scripts/ — clean
  • ruff format --check kempnerforge/ tests/ scripts/ — clean
  • pyright kempnerforge/ — 0 errors / 0 warnings
  • sphinx-build -W --keep-going -b html docs docs/_build/html — clean (no Sphinx warnings)
  • pytest tests/unit/ tests/integration/ --cov --cov-branch — 1130 passed, 1 skipped, 81% coverage
  • torchrun --nproc_per_node=4 -m pytest tests/distributed/ --slow — 81 passed on 4× H200 (full distributed suite incl. 6 JD + 8 CA VLM tests)
  • CA + MoE smoke (tests/distributed/test_vlm_cross_attn_fsdp.py::TestMoEWithVLM) — passes on 2-GPU FSDP2

Follow-ups

  • PR 3 — Mixture-of-Transformers arch. MoTConfig + MoTBlock + MoTStrategy + JD→MoT warm-start helper. Per-modality Q/K/V/O + per-modality FFN at every layer, single global SDPA (see the sketch below). Reserved in _RESERVED_ARCHS here.
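PR 3 has not landed, so the following is only a rough sketch of the planned shape (all class and argument names are hypothetical, assuming two modalities; the per-modality FFN is omitted for brevity):

```python
import torch
import torch.nn.functional as F
from torch import nn


class MoTAttentionSketch(nn.Module):
    """Hypothetical per-modality Q/K/V/O attention with one global SDPA."""

    def __init__(self, dim: int, n_heads: int, n_modalities: int = 2):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads

        def proj_set() -> nn.ModuleList:
            # One projection per modality (e.g. 0 = text, 1 = image).
            return nn.ModuleList(
                nn.Linear(dim, dim, bias=False) for _ in range(n_modalities)
            )

        self.q, self.k, self.v, self.o = proj_set(), proj_set(), proj_set(), proj_set()

    def forward(self, x: torch.Tensor, modality: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D) mixed-modality sequence; modality: (B, T) int ids.
        B, T, D = x.shape
        q, k, v = torch.zeros_like(x), torch.zeros_like(x), torch.zeros_like(x)
        for m, (wq, wk, wv) in enumerate(zip(self.q, self.k, self.v)):
            # Route each token through its own modality's projections
            # (dense compute per modality here; a real kernel would gather/scatter).
            sel = (modality == m).unsqueeze(-1)
            q = torch.where(sel, wq(x), q)
            k = torch.where(sel, wk(x), k)
            v = torch.where(sel, wv(x), v)
        # Single global SDPA over the full sequence, all modalities together.
        q, k, v = (
            t.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
            for t in (q, k, v)
        )
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(B, T, D)
        y = torch.zeros_like(out)
        for m, wo in enumerate(self.o):
            # Per-modality output projection mirrors the input routing.
            sel = (modality == m).unsqueeze(-1)
            y = torch.where(sel, wo(out), y)
        return y
```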

amazloumi merged commit 5d151b5 into feat/multimodal on May 7, 2026
amazloumi deleted the feat/multimodal-cross-attn branch on May 7, 2026 at 18:17