Serving Ling-2.6-1T on TPU with SGLang-JAX by JamesBrianD · Pull Request #348 · lm-sys/lm-sys.github.io

JamesBrianD · 2026-06-11T18:49:52Z

Summary

Add a new Ling-2.6-1T TPU serving blog post.
Document Fused MoE V2, the TPU execution model, V1/V2 kernel changes, and benchmark results.
Add PNG assets for the hero image, throughput charts, TPU execution model, MoE pipeline timelines, and overlap breakdown.

Validation

npm run build

- Retitle to highlight the Pallas fused MoE kernel core (hiding data movement behind compute)
- Fix non-native phrasing, dangling modifiers, and naming consistency (Fused MoE V1/V2, FusedEPMoE, fp8, hidden-dimension slices)
- Credit Fused MoE V1 authors (tpu-inference) and add SGLang-JAX adapted-kernel reference; renumber references
- Move full TPU-vs-GPU comparison (incl. prefill gap) to benchmarks next to the GLA prefill note; renumber figures
- Deduplicate: GLA section, memory pools, DP, future-work bullets, AIME result; fold Accuracy section into appendix
- Clarify measurement scopes (in-kernel vs standalone all-to-all) and per-device vs per-chip specs; restore TPU v7x spec note in appendix

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

alexnails · 2026-06-13T16:46:47Z

+type: blog
+---
+
+SGLang-JAX now supports efficient serving of inclusionAI's Ling-2.6-1T on TPU v7x. With a working baseline in place, profiling pointed to the MoE path as the main bottleneck: each layer scatters tokens across 32 JAX devices, runs expert FFNs, and gathers the outputs back. This post focuses first on Fused MoE V2, a new Pallas kernel that fuses scatter, expert FFN, and gather while overlapping TPU compute and data movement.


"profiling pointed to the Mixture of Experts (MoE) path"

nit: define MoE once

edit: this is basically just a terminology comment; we can keep these on one discussion thread, the one linked on L137

alexnails · 2026-06-13T16:47:54Z

+With Fused MoE V2, MoE prefill latency drops from **5.16 ms to 2.42 ms** — and on the same SGLang decode benchmark, **16 TPU v7x chips reach 1.29×–1.77× the output throughput of 16 H200 GPUs**. The full numbers are below.
+
+<img src="/images/blog/2026-06-11-ling-2-6-tpu/hero.png" alt="Ling-2.6-1T decode throughput, TPU v7x vs GPU H200" style="display:block; margin: 2.5em auto 0.5em auto; width: 100%; max-width: 900px;" />
+<p style="text-align: center; color: #666; font-style: italic;">Figure 1. Ling-2.6-1T decode throughput on TPU v7x-16 vs H200×16, using the same SGLang benchmark with 16,384-token input and 1,024-token output.</p>


Is this the random dataset?

alexnails · 2026-06-13T16:50:22Z

+- **TPU vs H200 decode:** TPU v7x-16 delivers **1.29×** the decode output throughput of H200×16 at `mc=128`, and **1.77×** at `mc=512`.
+- **Beyond MoE:** The full Ling-2.6-1T bring-up also includes hybrid KV/recurrent memory pools, GLA linear attention, and single-controller data parallelism.
+
+For the rest of the post, only a few Ling-2.6-1T facts matter: it is a **1T sparse MoE** model with **63B activated parameters per token**, **256 routed experts with top-8 routing plus one shared expert**, **per-channel fp8 MoE weights**, and a hybrid **MLA + Lightning Linear** backbone. The MoE structure drives the kernel work in the first half of this post; the hybrid backbone motivates the later memory-pool and GLA bring-up sections.


Minor / open to feedback: start directly with "Ling-2.6-1T is a 1T sparse MoE model with ...."

"For the rest of the post, only a few" is not really needed. We are basically just adding a tl;dr model description for Ling-2.6-1T.

Let's look at previous LMSYS blogs? (I think day-0 posts plus some model optimization posts are good reference material to review against.)

alexnails · 2026-06-13T16:51:02Z

+
+For the rest of the post, only a few Ling-2.6-1T facts matter: it is a **1T sparse MoE** model with **63B activated parameters per token**, **256 routed experts with top-8 routing plus one shared expert**, **per-channel fp8 MoE weights**, and a hybrid **MLA + Lightning Linear** backbone. The MoE structure drives the kernel work in the first half of this post; the hybrid backbone motivates the later memory-pool and GLA bring-up sections.
+
+## Optimizing the Fused MoE Kernel


Open to feedback: this restates the tl;dr and the introduction; let's maybe make it a bit tighter? (edit: move some of the background info up, like the MoE V1 explainer?)

Something like: `## The Setup: Optimizing the Fused MoE Kernel`

Other options:

- maybe a better title
- have this be under the tl;dr as a `###` subsection? (That would make `### Model Explanation` or something similar a subsection of the tl;dr / background info too.)

alexnails · 2026-06-13T16:56:40Z

+
+Our Ling-2.6-1T support is intentionally scoped for this release — several items remain as follow-ups we're actively working on:
+
+- **GLA / Linear-Attention prefill kernel.** As flagged in the benchmark section, the GLA (Lightning Linear) prefill kernel is now the dominant prefill cost. Bringing it up to par — better chunking/tiling, fusing the gating and recurrent-state updates, and the same MXU/VPU/DMA-overlap treatment applied to the MoE kernel — is the most direct remaining lever for end-to-end prefill.


drop the dash

Bringing it up to par by considering methods such as

alexnails · 2026-06-13T17:01:39Z

+
+The important design choice is that DP is part of the SPMD runtime, not a fleet of independent server replicas. SGLang-JAX runs one logical scheduler, and `dp_rank` is attached to requests, KV allocation, and prefix-cache keys. That gives global admission control from one load snapshot, deterministic batch construction across hosts, and one global prefix-cache structure with entries keyed by `(dp_rank, prefix)`.
+
+This also composes cleanly with the rest of the hybrid runtime. Moving between DP × EP and DP × TP × EP is a mesh-shape change rather than a scheduler fork, so the memory pools, batching path, and attention backends keep the same mental model.
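For readers of this thread, the single-scheduler idea in the hunk above can be sketched in a few lines of Python. This is only an illustration, not SGLang-JAX code: the class name `SingleControllerScheduler`, the methods `admit` and `cached`, and the least-loaded placement policy are all assumptions made for the example. The post itself only states that `dp_rank` is attached to requests, KV allocation, and prefix-cache keys, and that entries are keyed by `(dp_rank, prefix)`.

```python
class SingleControllerScheduler:
    """Hypothetical sketch: one logical scheduler that owns all DP ranks."""

    def __init__(self, dp_size):
        self.load = [0] * dp_size   # running requests per dp_rank
        self.prefix_cache = {}      # (dp_rank, prefix) -> request id

    def admit(self, request_id, prompt_tokens):
        # Admission uses one global load snapshot. Ties go to the lowest
        # rank, so batch construction is deterministic.
        # (Least-loaded placement is an assumption, not the real policy.)
        snapshot = list(self.load)
        dp_rank = min(range(len(snapshot)), key=lambda r: snapshot[r])
        self.load[dp_rank] += 1
        # One global cache structure, with dp_rank as part of the key.
        self.prefix_cache[(dp_rank, tuple(prompt_tokens))] = request_id
        return dp_rank

    def cached(self, dp_rank, prompt_tokens):
        return (dp_rank, tuple(prompt_tokens)) in self.prefix_cache


sched = SingleControllerScheduler(dp_size=2)
rank_a = sched.admit("a", [1, 2, 3])
rank_b = sched.admit("b", [1, 2, 3])
rank_c = sched.admit("c", [4])
hit_same_rank = sched.cached(rank_a, [1, 2, 3])
hit_other_rank = sched.cached(rank_b, [4])
```

Because the rank is part of the cache key, the same prompt prefix can be resident on one rank and absent on another, while the scheduler still sees both through a single structure.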


Moving between DP × EP and DP × TP × EP could be phrased as a way to keep scaling as individual experts get larger.

This may be too technical an explanation, so feel free to say no.

alexnails · 2026-06-13T17:05:45Z

+scatter tokens -> local expert FFN -> gather results
+```
+
+With this structure, MoE cost is more than GEMM FLOPs. The kernel has to move data through three expensive paths: token routing across chips, expert weight reads from HBM into VMEM, and fp8 layout / scale handling around the MXU.
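As a reference point for the three-stage structure quoted above, here is a minimal single-process sketch in plain Python. It is not the Pallas kernel and it does not model devices, HBM/VMEM traffic, or fp8. The helper names `route_top_k` and `moe_layer`, the toy experts, and the choice to normalize the selected router scores are assumptions made for illustration.

```python
from collections import defaultdict


def route_top_k(scores, k):
    """Pick the k highest-scoring experts; normalize their weights to sum to 1."""
    top = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]
    total = sum(scores[e] for e in top)
    return [(e, scores[e] / total) for e in top]


def moe_layer(tokens, router_scores, experts, k):
    # 1. Scatter: group token indices by the experts they are routed to.
    inbox = defaultdict(list)
    for t, scores in enumerate(router_scores):
        for expert_id, weight in route_top_k(scores, k):
            inbox[expert_id].append((t, weight))

    out = [[0.0] * len(tok) for tok in tokens]
    for expert_id, assigned in inbox.items():
        ffn = experts[expert_id]
        for t, weight in assigned:
            # 2. Local expert FFN on each token this expert received.
            y = ffn(tokens[t])
            # 3. Gather: weighted sum back into the token's output slot.
            for i, value in enumerate(y):
                out[t][i] += weight * value
    return out


# Toy experts: expert i scales its input by (i + 1).
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
tokens = [[1.0, 2.0], [3.0, 4.0]]
router_scores = [[0.25, 0.75, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
result = moe_layer(tokens, router_scores, experts, k=2)
```

In the distributed setting described in the post, step 1 and step 3 cross chips, which is why routing shows up as one of the expensive data paths rather than as bookkeeping.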


Matrix Multiplication Unit (MXU)

This is a term everyone who knows TPUs will know, but let's include https://docs.cloud.google.com/tpu/docs/system-architecture-tpu-vm in the appendix?

If you feel there are too many of these term explanations inline in the blog post, maybe we include a small glossary at the end.

edit: apart from adding the TPU architecture link to the appendix, this is basically just a terminology comment; we can keep these on one discussion thread, the one linked on L137

alexnails · 2026-06-13T17:06:29Z

+type: blog
+---
+
+SGLang-JAX now supports efficient serving of inclusionAI's Ling-2.6-1T on TPU v7x. With a working baseline in place, profiling pointed to the MoE path as the main bottleneck: each layer scatters tokens across 32 JAX devices, runs expert FFNs, and gathers the outputs back. This post focuses first on Fused MoE V2, a new Pallas kernel that fuses scatter, expert FFN, and gather while overlapping TPU compute and data movement.


Does "runs the experts' FFNs" read more naturally?

alexnails · 2026-06-13T17:07:54Z

+824.6 GFLOP / 2307 TFLOP/s = 0.36 ms
+```
+
+This is an ideal lower bound that excludes data movement, fp8 packing/unpacking, and VPU-side scale handling. The measured **2.42 ms** production trace is still about **7×** above this bound, so pure GEMM FLOPs do not explain the latency.
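The lower-bound arithmetic in the hunk above can be reproduced directly. The sketch below uses only the three numbers quoted in the post (824.6 GFLOP, 2307 TFLOP/s, and the 2.42 ms measured trace); the variable names are illustrative.

```python
# Numbers quoted in the post.
GEMM_FLOP = 824.6e9          # total GEMM work, in FLOP
PEAK_FLOP_PER_S = 2307e12    # peak throughput used for the bound, in FLOP/s
MEASURED_MS = 2.42           # measured production trace, in ms

# Ideal compute-only time: work divided by peak rate, converted to ms.
ideal_ms = GEMM_FLOP / PEAK_FLOP_PER_S * 1e3

# How far the measured latency sits above the compute-only bound.
gap = MEASURED_MS / ideal_ms

print(f"ideal lower bound: {ideal_ms:.2f} ms")
print(f"measured / ideal:  {gap:.1f}x")
```

The ratio comes out a little under 7, consistent with the "about 7×" wording: the remaining time has to come from data movement and scale handling rather than from GEMM FLOPs.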


Same comment as for MoE / MXU: Vector Processing Unit (VPU)

I think FFN and TPU are fine as-is.

edit: this is basically just a terminology comment; we can keep these on one discussion thread, the one linked on L137

alexnails · 2026-06-13T17:10:09Z

+
+### 2. Why this needs a Pallas fused kernel
+
+The rest of this section uses some TPU terminology. The simplified picture is: a TensorCore contains MXU, VPU, and VMEM; HBM sits outside the chip; chips communicate over ICI.


Haha, see my other comment: we already use this terminology above. Open to moving this up?

blog: add Ling-2.6 TPU serving post

08d02a1

JamesBrianD force-pushed the blog/ling-2-6-tpu branch from e044db3 to 08d02a1 Compare June 11, 2026 18:58

JamesBrianD marked this pull request as ready for review June 11, 2026 18:58

JamesBrianD and others added 2 commits June 12, 2026 17:06

blog: update Ling-2.6 TPU post acknowledgments

3f7e7d1

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

alexnails reviewed Jun 13, 2026

JamesBrianD wants to merge 3 commits into lm-sys:main from JamesBrianD:blog/ling-2-6-tpu


2 participants

