Serving Ling-2.6-1T on TPU with SGLang-JAX #348
e044db3 to 08d02a1
- Retitle to highlight the Pallas fused MoE kernel core (hiding data movement behind compute)
- Fix non-native phrasing, dangling modifiers, and naming consistency (Fused MoE V1/V2, FusedEPMoE, fp8, hidden-dimension slices)
- Credit Fused MoE V1 authors (tpu-inference) and add SGLang-JAX adapted-kernel reference; renumber references
- Move full TPU-vs-GPU comparison (incl. prefill gap) to benchmarks next to the GLA prefill note; renumber figures
- Deduplicate: GLA section, memory pools, DP, future-work bullets, AIME result; fold Accuracy section into appendix
- Clarify measurement scopes (in-kernel vs standalone all-to-all) and per-device vs per-chip specs; restore TPU v7x spec note in appendix

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
| type: blog | ||
| --- | ||
|
|
||
| SGLang-JAX now supports efficient serving of inclusionAI's Ling-2.6-1T on TPU v7x. With a working baseline in place, profiling pointed to the MoE path as the main bottleneck: each layer scatters tokens across 32 JAX devices, runs expert FFNs, and gathers the outputs back. This post focuses first on Fused MoE V2, a new Pallas kernel that fuses scatter, expert FFN, and gather while overlapping TPU compute and data movement. |
"profiling pointed to the Mixture of Experts (MoE) path"
nit: define MoE once
edit: this is basically just a terminology comment; it can live on the single discussion thread linked on L137
| With Fused MoE V2, MoE prefill latency drops from **5.16 ms to 2.42 ms** — and on the same SGLang decode benchmark, **16 TPU v7x chips reach 1.29×–1.77× the output throughput of 16 H200 GPUs**. The full numbers are below. | ||
|
|
||
| <img src="/images/blog/2026-06-11-ling-2-6-tpu/hero.png" alt="Ling-2.6-1T decode throughput, TPU v7x vs GPU H200" style="display:block; margin: 2.5em auto 0.5em auto; width: 100%; max-width: 900px;" /> | ||
| <p style="text-align: center; color: #666; font-style: italic;">Figure 1. Ling-2.6-1T decode throughput on TPU v7x-16 vs H200×16, using the same SGLang benchmark with 16,384-token input and 1,024-token output.</p> |
| - **TPU vs H200 decode:** TPU v7x-16 delivers **1.29×** the decode output throughput of H200×16 at `mc=128`, and **1.77×** at `mc=512`. | ||
| - **Beyond MoE:** The full Ling-2.6-1T bring-up also includes hybrid KV/recurrent memory pools, GLA linear attention, and single-controller data parallelism. | ||
|
|
||
| For the rest of the post, only a few Ling-2.6-1T facts matter: it is a **1T sparse MoE** model with **63B activated parameters per token**, **256 routed experts with top-8 routing plus one shared expert**, **per-channel fp8 MoE weights**, and a hybrid **MLA + Lightning Linear** backbone. The MoE structure drives the kernel work in the first half of this post; the hybrid backbone motivates the later memory-pool and GLA bring-up sections. |
Minor / open to feedback: "Ling-2.6-1T is a 1T sparse MoE model with ..."
"For the rest of the post, only a few" is not really needed. We are basically just adding a tl;dr model description for Ling-2.6-1T.
Let's look at previous LMSys blogs? (I think day-0 posts plus some model-optimization posts are good reference material to review against.)
|
|
||
|
|
||
| ## Optimizing the Fused MoE Kernel |
open to feedback: this restates the tl;dr and the introduction; let's maybe make this a bit tighter? (edit: move some of the background info up, like the Fused MoE V1 explainer?)
something like: `## The Setup: Optimizing the Fused MoE Kernel`
other options:
- maybe a better title
- have this be under the tl;dr as a `###` subsection? (would make `### Model Explanation` or something a subsection of the tl;dr / background info too)
|
|
||
| Our Ling-2.6-1T support is intentionally scoped for this release — several items remain as follow-ups we're actively working on: | ||
|
|
||
| - **GLA / Linear-Attention prefill kernel.** As flagged in the benchmark section, the GLA (Lightning Linear) prefill kernel is now the dominant prefill cost. Bringing it up to par — better chunking/tiling, fusing the gating and recurrent-state updates, and the same MXU/VPU/DMA-overlap treatment applied to the MoE kernel — is the most direct remaining lever for end-to-end prefill. |
drop the dash
Bringing it up to par by considering methods such as
|
|
||
| The important design choice is that DP is part of the SPMD runtime, not a fleet of independent server replicas. SGLang-JAX runs one logical scheduler, and `dp_rank` is attached to requests, KV allocation, and prefix-cache keys. That gives global admission control from one load snapshot, deterministic batch construction across hosts, and one global prefix-cache structure with entries keyed by `(dp_rank, prefix)`. | ||
|
|
||
| This also composes cleanly with the rest of the hybrid runtime. Moving between DP × EP and DP × TP × EP is a mesh-shape change rather than a scheduler fork, so the memory pools, batching path, and attention backends keep the same mental model. |
DP × EP and DP × TP × EP can be phrased as continuing to scale for larger individual experts.
This may be too technical an explanation, so feel free to say no.
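A minimal sketch of the `(dp_rank, prefix)` keying quoted above, to make the "one global prefix-cache structure" idea concrete. The class, method names, and integer KV handles are hypothetical illustrations, not SGLang-JAX's actual cache (which may well use a radix tree rather than a flat dict):

```python
from typing import Dict, Optional, Tuple

# Hypothetical illustration only: names and layout are assumptions,
# not SGLang-JAX's real prefix cache.
CacheKey = Tuple[int, Tuple[int, ...]]  # (dp_rank, token-id prefix)


class GlobalPrefixCache:
    """One cache structure owned by the single logical scheduler."""

    def __init__(self) -> None:
        self._entries: Dict[CacheKey, int] = {}

    def insert(self, dp_rank: int, prefix: Tuple[int, ...], kv_handle: int) -> None:
        # The dp_rank is part of the key, so ranks never share entries.
        self._entries[(dp_rank, prefix)] = kv_handle

    def longest_match(self, dp_rank: int, tokens: Tuple[int, ...]) -> Optional[int]:
        # Try progressively shorter prefixes, restricted to this dp_rank.
        for end in range(len(tokens), 0, -1):
            handle = self._entries.get((dp_rank, tokens[:end]))
            if handle is not None:
                return handle
        return None


cache = GlobalPrefixCache()
cache.insert(dp_rank=0, prefix=(1, 2, 3), kv_handle=100)
hit = cache.longest_match(dp_rank=0, tokens=(1, 2, 3, 4))   # same rank: reuses the entry
miss = cache.longest_match(dp_rank=1, tokens=(1, 2, 3, 4))  # other rank: no entry
```

Because the rank is part of the key, a single scheduler can reason about all ranks from one structure while KV reuse stays local to the rank that owns the pages.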
| scatter tokens -> local expert FFN -> gather results | ||
| ``` | ||
|
|
||
| With this structure, MoE cost is more than GEMM FLOPs. The kernel has to move data through three expensive paths: token routing across chips, expert weight reads from HBM into VMEM, and fp8 layout / scale handling around the MXU. |
Matrix Multiplication Unit (MXU)
This is a term everyone who knows TPUs will know, but let's link https://docs.cloud.google.com/tpu/docs/system-architecture-tpu-vm in the appendix?
If you feel there are too many of these inline term explanations in the blog post, maybe we include a small glossary at the end.
edit: setting aside the suggestion to add the TPU architecture link to the appendix, this is basically just a terminology comment; it can live on the single discussion thread linked on L137
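For readers following the `scatter tokens -> local expert FFN -> gather results` line quoted above, here is an unfused reference of that dataflow in plain Python. It is a toy under stated assumptions: 4 scalar-scaling "experts" and top-2 routing stand in for the post's 256 experts and top-8 routing, and there are no devices, fp8 weights, or overlap. The function name and data layout are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Token = List[float]
Expert = Callable[[Token], Token]


def moe_reference(
    tokens: List[Token],
    routes: List[List[Tuple[int, float]]],
    experts: List[Expert],
) -> List[Token]:
    """Unfused scatter -> expert FFN -> gather, one step at a time.

    routes[i] lists the (expert_id, weight) pairs chosen for token i.
    """
    # 1. Scatter: group token indices by the expert they are routed to.
    buckets: Dict[int, List[Tuple[int, float]]] = {}
    for tok_idx, picks in enumerate(routes):
        for expert_id, weight in picks:
            buckets.setdefault(expert_id, []).append((tok_idx, weight))

    # 2. Expert FFN on each bucket, 3. gather the weighted outputs
    #    back into the original token order.
    out = [[0.0] * len(t) for t in tokens]
    for expert_id, items in buckets.items():
        for tok_idx, weight in items:
            y = experts[expert_id](tokens[tok_idx])
            for d, v in enumerate(y):
                out[tok_idx][d] += weight * v
    return out


# Toy experts: expert k scales its input by (k + 1).
experts = [lambda x, k=k: [(k + 1) * v for v in x] for k in range(4)]
tokens = [[1.0, 2.0], [3.0, 4.0]]
routes = [[(0, 0.5), (1, 0.5)], [(3, 1.0)]]
result = moe_reference(tokens, routes, experts)
```

In this reference the three steps run strictly one after another. A fused kernel as described in the post would instead overlap the data movement of steps 1 and 3 with the compute in step 2.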
| type: blog | ||
| --- | ||
|
|
||
| SGLang-JAX now supports efficient serving of inclusionAI's Ling-2.6-1T on TPU v7x. With a working baseline in place, profiling pointed to the MoE path as the main bottleneck: each layer scatters tokens across 32 JAX devices, runs expert FFNs, and gathers the outputs back. This post focuses first on Fused MoE V2, a new Pallas kernel that fuses scatter, expert FFN, and gather while overlapping TPU compute and data movement. |
Does "runs the expert FFNs" read more naturally?
| 824.6 GFLOP / 2307 TFLOP/s = 0.36 ms | ||
| ``` | ||
|
|
||
| This is an ideal lower bound that excludes data movement, fp8 packing/unpacking, and VPU-side scale handling. The measured **2.42 ms** production trace is still about **7×** above this bound, so pure GEMM FLOPs do not explain the latency. |
same comment as for MoE / MXU: Vector Processing Unit (VPU)
I think FFN and TPU are fine
edit: this is basically just a terminology comment; it can live on the single discussion thread linked on L137
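The lower-bound arithmetic quoted in this hunk can be checked with a few lines of Python. All three inputs are taken from the post as quoted; the variable names are illustrative, and the excerpt does not say what hardware scope the 2307 TFLOP/s peak refers to, so it is treated here as a given.

```python
# Inputs as quoted in the post; names are illustrative.
GEMM_FLOP = 824.6e9        # total GEMM work for the MoE prefill step
PEAK_FLOP_PER_S = 2307e12  # peak throughput figure used in the post
MEASURED_MS = 2.42         # measured production trace latency

# Ideal compute-only bound: work divided by peak rate, in milliseconds.
lower_bound_ms = GEMM_FLOP / PEAK_FLOP_PER_S * 1e3

# How far the measured latency sits above that bound.
gap = MEASURED_MS / lower_bound_ms

print(f"lower bound: {lower_bound_ms:.2f} ms, measured/bound: {gap:.1f}x")
```

The bound rounds to the 0.36 ms in the post, and the ratio comes out a little under 7, consistent with the "about 7×" wording.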
|
|
||
| ### 2. Why this needs a Pallas fused kernel | ||
|
|
||
| The rest of this section uses some TPU terminology. The simplified picture is: a TensorCore contains MXU, VPU, and VMEM; HBM sits outside the chip; chips communicate over ICI. |
haha, see my other comment: we already use this terminology above. Open to moving this up?
Summary
Validation
npm run build