Serving Ling-2.6-1T on TPU with SGLang-JAX #348

Open
JamesBrianD wants to merge 3 commits into lm-sys:main from JamesBrianD:blog/ling-2-6-tpu

@JamesBrianD

Summary

  • Add a new Ling-2.6-1T TPU serving blog post.
  • Document Fused MoE V2, the TPU execution model, V1/V2 kernel changes, and benchmark results.
  • Add PNG assets for the hero image, throughput charts, TPU execution model, MoE pipeline timelines, and overlap breakdown.

Validation

  • npm run build

@JamesBrianD JamesBrianD marked this pull request as ready for review June 11, 2026 18:58
JamesBrianD and others added 2 commits June 12, 2026 17:06
- Retitle to highlight the Pallas fused MoE kernel core (hiding data
  movement behind compute)
- Fix non-native phrasing, dangling modifiers, and naming consistency
  (Fused MoE V1/V2, FusedEPMoE, fp8, hidden-dimension slices)
- Credit Fused MoE V1 authors (tpu-inference) and add SGLang-JAX
  adapted-kernel reference; renumber references
- Move full TPU-vs-GPU comparison (incl. prefill gap) to benchmarks
  next to the GLA prefill note; renumber figures
- Deduplicate: GLA section, memory pools, DP, future-work bullets,
  AIME result; fold Accuracy section into appendix
- Clarify measurement scopes (in-kernel vs standalone all-to-all) and
  per-device vs per-chip specs; restore TPU v7x spec note in appendix

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
type: blog
---

SGLang-JAX now supports efficient serving of inclusionAI's Ling-2.6-1T on TPU v7x. With a working baseline in place, profiling pointed to the MoE path as the main bottleneck: each layer scatters tokens across 32 JAX devices, runs expert FFNs, and gathers the outputs back. This post focuses first on Fused MoE V2, a new Pallas kernel that fuses scatter, expert FFN, and gather while overlapping TPU compute and data movement.

@alexnails alexnails Jun 13, 2026

"profiling pointed to the Mixture of Experts (MoE) path"

nit: define MoE once

edit: this is basically just a terminology comment; we can keep these on one discussion thread, linked on L137

With Fused MoE V2, MoE prefill latency drops from **5.16 ms to 2.42 ms** — and on the same SGLang decode benchmark, **16 TPU v7x chips reach 1.29×–1.77× the output throughput of 16 H200 GPUs**. The full numbers are below.

<img src="/images/blog/2026-06-11-ling-2-6-tpu/hero.png" alt="Ling-2.6-1T decode throughput, TPU v7x vs GPU H200" style="display:block; margin: 2.5em auto 0.5em auto; width: 100%; max-width: 900px;" />
<p style="text-align: center; color: #666; font-style: italic;">Figure 1. Ling-2.6-1T decode throughput on TPU v7x-16 vs H200×16, using the same SGLang benchmark with 16,384-token input and 1,024-token output.</p>

Is this a random dataset?

- **TPU vs H200 decode:** TPU v7x-16 delivers **1.29×** the decode output throughput of H200×16 at `mc=128`, and **1.77×** at `mc=512`.
- **Beyond MoE:** The full Ling-2.6-1T bring-up also includes hybrid KV/recurrent memory pools, GLA linear attention, and single-controller data parallelism.

For the rest of the post, only a few Ling-2.6-1T facts matter: it is a **1T sparse MoE** model with **63B activated parameters per token**, **256 routed experts with top-8 routing plus one shared expert**, **per-channel fp8 MoE weights**, and a hybrid **MLA + Lightning Linear** backbone. The MoE structure drives the kernel work in the first half of this post; the hybrid backbone motivates the later memory-pool and GLA bring-up sections.

Minor / open to feedback: Ling-2.6-1T is a 1T sparse MoE model with ....

"For the rest of the post, only a few" is not really needed. We are basically just adding a tl;dr model description for Ling-2.6-1T.

Let's look at previous LMSys blogs? (I think day-0 posts plus some model-optimization posts are good reference material to review against.)


## Optimizing the Fused MoE Kernel

open to feedback: this restates the tl;dr and the introduction, so let's maybe make this a bit tighter? (edit: move some of the background info up, like the MoE V1 explainer?)

something like: ## The Setup: Optimizing the Fused MoE Kernel

other options

  1. maybe a better title
  2. have this be under the tl;dr as a `###` subsection? (would make `### Model Explanation` or something a subsection of the tl;dr / background info too)


Our Ling-2.6-1T support is intentionally scoped for this release — several items remain as follow-ups we're actively working on:

- **GLA / Linear-Attention prefill kernel.** As flagged in the benchmark section, the GLA (Lightning Linear) prefill kernel is now the dominant prefill cost. Bringing it up to par — better chunking/tiling, fusing the gating and recurrent-state updates, and the same MXU/VPU/DMA-overlap treatment applied to the MoE kernel — is the most direct remaining lever for end-to-end prefill.

@alexnails alexnails Jun 13, 2026

drop the dash

Bringing it up to par by considering methods such as


The important design choice is that DP is part of the SPMD runtime, not a fleet of independent server replicas. SGLang-JAX runs one logical scheduler, and `dp_rank` is attached to requests, KV allocation, and prefix-cache keys. That gives global admission control from one load snapshot, deterministic batch construction across hosts, and one global prefix-cache structure with entries keyed by `(dp_rank, prefix)`.

This also composes cleanly with the rest of the hybrid runtime. Moving between DP × EP and DP × TP × EP is a mesh-shape change rather than a scheduler fork, so the memory pools, batching path, and attention backends keep the same mental model.
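The single-scheduler design above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not SGLang-JAX code: `Request`, `SingleControllerScheduler`, `admit`, `cache_prefix`, and `lookup` are hypothetical names, the least-loaded admission policy is an assumption, and a flat dictionary stands in for whatever prefix-cache structure the runtime actually uses.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Request:
    request_id: str
    token_ids: Tuple[int, ...]
    dp_rank: Optional[int] = None  # attached once, by the single scheduler


@dataclass
class SingleControllerScheduler:
    """Hypothetical sketch: one logical scheduler, dp_rank as request metadata."""

    dp_size: int
    # Running-request count per DP rank: the single load snapshot.
    load: List[int] = field(default_factory=list)
    # One global structure; entries are keyed by (dp_rank, prefix).
    prefix_cache: Dict[Tuple[int, Tuple[int, ...]], str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if not self.load:
            self.load = [0] * self.dp_size

    def admit(self, req: Request) -> int:
        # Assumed policy: least-loaded rank, ties broken by lowest rank,
        # so the same inputs always produce the same assignment.
        rank = min(range(self.dp_size), key=lambda r: (self.load[r], r))
        self.load[rank] += 1
        req.dp_rank = rank
        return rank

    def cache_prefix(self, req: Request, prefix_len: int) -> None:
        self.prefix_cache[(req.dp_rank, req.token_ids[:prefix_len])] = req.request_id

    def lookup(self, dp_rank: int, token_ids: Tuple[int, ...]) -> int:
        # Length of the longest cached prefix visible to this dp_rank.
        for n in range(len(token_ids), 0, -1):
            if (dp_rank, token_ids[:n]) in self.prefix_cache:
                return n
        return 0
```

Because `dp_rank` is part of the key, a prefix cached for one rank is not a hit for another, even though both live in the same structure.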

DP × EP and DP × TP × EP can be phrased as a way to keep scaling to larger individual experts.

This may be too technical of an explanation so feel free to say no

scatter tokens -> local expert FFN -> gather results
```

With this structure, MoE cost is more than GEMM FLOPs. The kernel has to move data through three expensive paths: token routing across chips, expert weight reads from HBM into VMEM, and fp8 layout / scale handling around the MXU.
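To make the scatter, local FFN, gather structure concrete, here is an unfused reference sketch in plain Python. The device and expert counts come from the post; everything else is an assumption for illustration: the contiguous expert-to-device split in `owner_device`, the stand-in `toy_expert_ffn`, and the function names themselves. It shows only the data flow, not the Pallas kernel or its compute/DMA overlap.

```python
NUM_DEVICES = 32   # from the post: tokens are scattered across 32 JAX devices
NUM_EXPERTS = 256  # from the post: 256 routed experts
EXPERTS_PER_DEVICE = NUM_EXPERTS // NUM_DEVICES  # assumption: even, contiguous split


def owner_device(expert_id):
    # Assumption: expert e lives on device e // EXPERTS_PER_DEVICE.
    return expert_id // EXPERTS_PER_DEVICE


def toy_expert_ffn(expert_id, x):
    # Stand-in for a real expert FFN: scale the activation by (expert_id + 1).
    return [(expert_id + 1) * v for v in x]


def moe_layer(tokens, routing):
    """tokens: list of vectors; routing: per-token list of (expert_id, weight)."""
    # 1) Scatter: bucket (token index, expert, weight) by the owning device.
    inbox = {d: [] for d in range(NUM_DEVICES)}
    for t, choices in enumerate(routing):
        for expert_id, weight in choices:
            inbox[owner_device(expert_id)].append((t, expert_id, weight))

    # 2) Local expert FFN on each device, then
    # 3) gather: accumulate the weighted outputs back per token.
    out = [[0.0] * len(x) for x in tokens]
    for work in inbox.values():
        for t, expert_id, weight in work:
            y = toy_expert_ffn(expert_id, tokens[t])
            out[t] = [o + weight * v for o, v in zip(out[t], y)]

    # Per-device token counts show how much routing traffic each device receives.
    traffic = {d: len(w) for d, w in inbox.items() if w}
    return out, traffic
```

In this unfused form the three stages run one after another, which is exactly the serialization a fused kernel is meant to hide.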

@alexnails alexnails Jun 13, 2026

Matrix Multiplication Unit (MXU)

This is a term everyone who knows TPUs will know, but let's include https://docs.cloud.google.com/tpu/docs/system-architecture-tpu-vm in the appendix?

If you feel there are too many of these term explanations inline in the blog post, maybe we include a small glossary at the end.

edit: apart from adding the TPU architecture link to the appendix, this is basically just a terminology comment; we can keep these on one discussion thread, linked on L137

type: blog
---

SGLang-JAX now supports efficient serving of inclusionAI's Ling-2.6-1T on TPU v7x. With a working baseline in place, profiling pointed to the MoE path as the main bottleneck: each layer scatters tokens across 32 JAX devices, runs expert FFNs, and gathers the outputs back. This post focuses first on Fused MoE V2, a new Pallas kernel that fuses scatter, expert FFN, and gather while overlapping TPU compute and data movement.

Does "runs the experts' FFNs" read more naturally?

824.6 GFLOP / 2307 TFLOP/s = 0.36 ms
```

This is an ideal lower bound that excludes data movement, fp8 packing/unpacking, and VPU-side scale handling. The measured **2.42 ms** production trace is still about **7×** above this bound, so pure GEMM FLOPs do not explain the latency.
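The arithmetic behind that comparison can be reproduced in a few lines. This is a worked check using only numbers quoted in the post (824.6 GFLOP, 2307 TFLOP/s, and the 5.16 ms and 2.42 ms MoE prefill latencies); the variable names are illustrative.

```python
gemm_flops = 824.6e9        # 824.6 GFLOP of GEMM work, as quoted
peak_flops_per_s = 2307e12  # 2307 TFLOP/s peak, as quoted
before_ms = 5.16            # MoE prefill latency before Fused MoE V2
measured_ms = 2.42          # MoE prefill latency with Fused MoE V2

# Ideal GEMM-only time: FLOPs divided by peak rate, converted to milliseconds.
ideal_ms = gemm_flops / peak_flops_per_s * 1e3

# How far the measured trace sits above the ideal bound.
gap = measured_ms / ideal_ms

# Speedup of the new kernel over the previous latency.
speedup = before_ms / measured_ms

print(f"ideal {ideal_ms:.2f} ms, gap {gap:.1f}x, speedup {speedup:.2f}x")
```

The bound works out to roughly 0.36 ms, so the measured latency is a little under 7× above it, while the drop from 5.16 ms to 2.42 ms is a bit more than a 2× speedup.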

@alexnails alexnails Jun 13, 2026

Same comment as for MoE / MXU: Vector Processing Unit (VPU).

I think FFN and TPU are fine.

edit: this is basically just a terminology comment; we can keep these on one discussion thread, linked on L137


### 2. Why this needs a Pallas fused kernel

The rest of this section uses some TPU terminology. The simplified picture is: a TensorCore contains MXU, VPU, and VMEM; HBM sits outside the chip; chips communicate over ICI.

haha see my other comment, we already use the terminology above this. open to moving this up?
