feat(layers): implement custom shape-aligned attention and MoE primit… by katyaoussar · Pull Request #4200 · AI-Hypercomputer/maxtext

katyaoussar · 2026-06-18T18:19:37Z

This PR adds a custom JAX/Flax NNX implementation of the DeepSeek-V4 attention mechanism and its core primitives.

Summary of the changes to the four core components:

1- RoPE (Rotary Embeddings): Implemented interleaved channel pairing for the rotation ([x0, x1, x2, x3] maps to [-x1, x0, -x3, x2]) and partial-dimension rotation, so only a subset of each head's channels carries position information.
2- Grouped Linear: Created parallel multi-group projection layers that mix attention head outputs in a single compilable step.
3- MoE (Mixture of Experts): Built the learned Top-K expert routing mechanism along with the custom SqrtSoftplus load-balancing loss to keep routing stable during training.
4- Attention Block: Built a unified, TPU-optimized module that combines local sliding-window attention, overlapping compressed sparse attention (CSA) with a causal indexer, and heavily compressed history attention (HCA). It uses block-bias masking to avoid memory stalls from dynamic gathers.
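
For reviewers, the interleaved pairing and partial rotation from item 1 can be sketched roughly as follows. This is an illustration, not the code in this PR: the function names, the `rope_dim` and `base` parameters, and the choice to rotate the leading channels are all assumptions.

```python
import jax.numpy as jnp


def rotate_interleaved(x):
    """Map [x0, x1, x2, x3, ...] to [-x1, x0, -x3, x2, ...] along the last axis."""
    x_even = x[..., 0::2]
    x_odd = x[..., 1::2]
    return jnp.stack([-x_odd, x_even], axis=-1).reshape(x.shape)


def apply_partial_rope(x, positions, rope_dim, base=10000.0):
    """Rotate the first `rope_dim` channels of x and pass the rest through.

    x: [seq, heads, head_dim], positions: [seq], rope_dim even and <= head_dim.
    Which channels get rotated is an assumption made for this sketch.
    """
    x_rot, x_pass = x[..., :rope_dim], x[..., rope_dim:]
    inv_freq = 1.0 / (base ** (jnp.arange(0, rope_dim, 2) / rope_dim))
    angles = positions[:, None] * inv_freq[None, :]      # [seq, rope_dim // 2]
    # Adjacent channels (2i, 2i + 1) share one frequency, so repeat each angle.
    angles = jnp.repeat(angles, 2, axis=-1)[:, None, :]  # [seq, 1, rope_dim]
    rotated = x_rot * jnp.cos(angles) + rotate_interleaved(x_rot) * jnp.sin(angles)
    return jnp.concatenate([rotated, x_pass], axis=-1)
```

Each adjacent pair (x0, x1) is rotated as a 2D vector, so the per-token norm is preserved and position 0 leaves the input unchanged.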
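
The grouped projection in item 2 can be pictured as one einsum over a leading group axis. A minimal sketch, assuming a `[batch, groups, d_in]` input layout and one weight matrix per group; the function name and shapes are hypothetical and may not match the layer in this PR.

```python
import jax
import jax.numpy as jnp


def grouped_linear(x, w):
    """Apply an independent projection to each group with a single einsum.

    x: [batch, groups, d_in], w: [groups, d_in, d_out] -> [batch, groups, d_out]
    """
    return jnp.einsum("bgi,gio->bgo", x, w)


key_x, key_w = jax.random.split(jax.random.PRNGKey(0))
x = jax.random.normal(key_x, (2, 4, 8))   # 2 rows, 4 groups, 8 features per group
w = jax.random.normal(key_w, (4, 8, 16))  # one 8x16 weight matrix per group
y = jax.jit(grouped_linear)(x, w)         # all groups are projected in one jitted call
```

Expressing the groups as an einsum axis rather than a Python loop gives the compiler a single contraction to schedule.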
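
The Top-K routing in item 3 could look something like the sketch below. The description does not define SqrtSoftplus or the form of the load-balancing loss, so `sqrt_softplus` here is only one reading of the name, sqrt(softplus(x)), used as a positive gate score. The loss itself is not sketched, and `route_top_k` is a hypothetical helper.

```python
import jax
import jax.numpy as jnp


def sqrt_softplus(x):
    """Assumed definition for illustration: sqrt(softplus(x)), always positive."""
    return jnp.sqrt(jax.nn.softplus(x))


def route_top_k(logits, k):
    """Pick k experts per token and normalize their gate weights.

    logits: [tokens, experts] -> weights [tokens, k], expert indices [tokens, k]
    """
    scores = sqrt_softplus(logits)
    top_scores, top_idx = jax.lax.top_k(scores, k)  # sorted by descending score
    weights = top_scores / jnp.sum(top_scores, axis=-1, keepdims=True)
    return weights, top_idx


logits = jnp.array([[0.0, 2.0, -1.0, 1.0]])
weights, top_idx = route_top_k(logits, k=2)
```

Because `jax.lax.top_k` returns fixed-shape outputs, the routing step stays compatible with `jax.jit`.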
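
The bias-masking idea in item 4 can be illustrated for the sliding-window case alone: instead of gathering the keys each query may see, add a static bias that is 0 for allowed positions and a large negative number elsewhere. This sketch does not cover CSA, HCA, or the indexer, and the names `sliding_window_bias`, `masked_attention`, and `MASK_VALUE` are assumptions, not identifiers from this PR.

```python
import jax
import jax.numpy as jnp

MASK_VALUE = -1e30  # assumed constant; large enough to zero out masked weights in float32


def sliding_window_bias(seq_len, window):
    """Additive bias: 0 where key k is causal and within `window` of query q."""
    q = jnp.arange(seq_len)[:, None]
    k = jnp.arange(seq_len)[None, :]
    allowed = (k <= q) & (k > q - window)
    return jnp.where(allowed, 0.0, MASK_VALUE)


def masked_attention(q, k, v, bias):
    """Single-head attention over [seq, dim] inputs with an additive mask."""
    logits = q @ k.T / jnp.sqrt(q.shape[-1]) + bias
    probs = jax.nn.softmax(logits, axis=-1)
    return probs @ v
```

All shapes are static, so the mask becomes an elementwise add before the softmax rather than a data-dependent gather.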

Commit 7b4996b: feat(layers): implement custom shape-aligned attention and MoE primitives for DeepSeek-V4

katyaoussar wants to merge 1 commit into AI-Hypercomputer:main from katyaoussar:feature/deepseek-v4-custom-attention
