docs(benchmarks): Gemma3n decode profile, no fusion justified (#329) by inureyes · Pull Request #345 · lablup/mlxcel

inureyes · 2026-06-17T21:48:58Z

Summary

Profile-first investigation for #329: profile Gemma3n decode, then add a compiled fusion only if the profile justifies one. It does not, so this PR ships the profiling finding as documentation and adds no kernel. A documented "no fusion justified" outcome is the result the issue explicitly calls for.

Findings (Apple M1 Ultra, MLX 0.31.2, mlx-vlm 0.4.4)

Decode is ~92% GPU-bound. The Rust graph-build step (forward), where an FFI-crossing fusion would land, is ~7% per token and fully hidden behind async_eval. A fusion that only collapses FFI crossings would regress here, the same outcome as MLXCEL_FUSED_QK_NORM on Qwen3 (1-3.4% slower on M1 Ultra).
Decode tok/s: e2b-4bit 83.1, e4b-4bit 62.9 (text), 58.7 (image). mlxcel already leads mlx-vlm (68.9 / 52.6 / 47.9) by 1.20-1.23x, and mlx-lm cannot load these multimodal checkpoints, so there is no parity gap to chase.
The worthwhile Gemma3n fusions already shipped in perf(gemma3n): reduce bf16 decode AltUp/MLP graph overhead #60 (fused MLP bridge, compiled gelu_topk/GeGLU, stacked AltUp) and are active on M1 Ultra (use_fused_decode_path() is true on non-NA hardware).
Pure weight streaming is ~43% of GPU time; the ~57% structure overhead is dominated by command-buffer dispatch gaps, not fusable kernel compute. Raising MLX_MAX_OPS_PER_BUFFER toward 1000 recovers +11-13% with no code (e2b 82.7 -> 93.2, e4b 63.0 -> 70.0). That is a scheduling knob, not a kernel, and must be hardware-gated because M5 regresses with larger buffers, so it belongs in a separate follow-up issue.
Text vs VLM: the vision tower is a prefill cost only; decode is the same per-layer structure either way.
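The ratios quoted in the findings can be rechecked from the raw tok/s figures. The sketch below is only that arithmetic: every number is copied from the findings above, and the dictionary and function names are made up for illustration. It is not part of mlxcel or of the benchmark tooling.

```python
# Decode tok/s on Apple M1 Ultra, copied from the findings above.
MLXCEL = {"e2b-4bit text": 83.1, "e4b-4bit text": 62.9, "e4b-4bit image": 58.7}
MLX_VLM = {"e2b-4bit text": 68.9, "e4b-4bit text": 52.6, "e4b-4bit image": 47.9}

# Command-buffer sweep: (baseline, MLX_MAX_OPS_PER_BUFFER raised toward 1000).
BUFFER_SWEEP = {"e2b": (82.7, 93.2), "e4b": (63.0, 70.0)}

# "~7% per token" for the Rust graph-build step, from the pipeline split.
GRAPH_BUILD_SHARE = 0.07


def speedups():
    """mlxcel decode rate divided by the mlx-vlm rate, per checkpoint."""
    return {name: MLXCEL[name] / MLX_VLM[name] for name in MLXCEL}


def buffer_gains_pct():
    """Percentage decode gain from the larger command buffer."""
    return {
        name: (after / before - 1.0) * 100.0
        for name, (before, after) in BUFFER_SWEEP.items()
    }


def per_token_ms(tok_per_s):
    """Average wall-clock time per decoded token, in milliseconds."""
    return 1000.0 / tok_per_s


if __name__ == "__main__":
    for name, ratio in speedups().items():
        print(f"{name}: {ratio:.2f}x over mlx-vlm")
    for name, gain in buffer_gains_pct().items():
        print(f"{name}: +{gain:.1f}% from the larger command buffer")
    token_ms = per_token_ms(MLXCEL["e2b-4bit text"])
    print(f"e2b: {token_ms:.2f} ms/token, "
          f"about {token_ms * GRAPH_BUILD_SHARE:.2f} ms of graph build")
```

At 83.1 tok/s a token takes roughly 12 ms, so the ~7% graph-build share is under a millisecond per token. That is the budget an FFI-collapsing fusion could touch at most, and the profile shows it is already hidden.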

What changed

Add docs/benchmark_results/gemma3n-decode-profile.md: full numbers, pipeline split, one-decode-token op histogram, pure-GEMV streaming floor, command-buffer sweep, text-vs-VLM table, and reproduce commands.
Link it from the "Decode-gap investigations" list in docs/benchmarks.md.

Test plan

Decode best-of-3 (caffeinate, cool) on gemma3n-e2b/e4b-4bit via mlxcel-bench-decode.
Pipeline split via MLXCEL_PROFILE_PIPELINE_DETAIL; op histogram via MLXCEL_EXPORT_DECODE_DOT.
mlx-vlm reference via mlx_vlm.stream_generate (wall-clock).
Docs only; no code paths changed.
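For readers who want to reproduce the "best-of-3, wall-clock" methodology against another runtime, a minimal sketch follows. It assumes only that the runtime exposes some callable returning an iterator of tokens; `generate`, `decode_tok_per_s`, `best_of`, and `fake_stream` are hypothetical names, not the mlxcel-bench-decode or mlx_vlm API. Note that this simple version times the whole stream, so prefill is included unless the caller excludes it.

```python
import time
from typing import Callable, Iterable


def decode_tok_per_s(generate: Callable[[], Iterable[object]]) -> float:
    """Wall-clock rate for one run: tokens yielded / elapsed seconds."""
    start = time.perf_counter()
    count = sum(1 for _ in generate())
    elapsed = time.perf_counter() - start
    if count == 0 or elapsed <= 0.0:
        raise ValueError("run produced no tokens or no measurable time")
    return count / elapsed


def best_of(generate: Callable[[], Iterable[object]], runs: int = 3) -> float:
    """Best (highest) rate over several runs, which filters out noisy runs."""
    return max(decode_tok_per_s(generate) for _ in range(runs))


def fake_stream(n: int = 20, delay_s: float = 0.001):
    """Stand-in token stream for trying the harness without a model."""
    for i in range(n):
        time.sleep(delay_s)
        yield i


if __name__ == "__main__":
    rate = best_of(lambda: fake_stream(), runs=3)
    print(f"best of 3: {rate:.1f} tok/s")
```

Taking the maximum rather than the mean is a deliberate choice for throughput benchmarks on a desktop machine: background activity and thermal throttling can only slow a run down, so the fastest run is the least contaminated one.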

Closes #329

Profile-first deliverable for #329. Measured Gemma3n decode on M1 Ultra (e2b-4bit 83 tok/s, e4b-4bit 63 tok/s) with mlxcel-bench-decode and the mlxcel-gpu-profiling hooks, and compared against mlx-vlm, the only Python runtime that loads these multimodal checkpoints (mlx-lm cannot).

Conclusion: no compiled fusion is justified now. Decode is ~92% GPU-bound and the Rust graph-build step where an FFI-crossing fusion would land is ~7% and fully hidden behind async_eval, so a fusion that only collapses FFI crossings would regress as MLXCEL_FUSED_QK_NORM did on Qwen3. The worthwhile Gemma3n fusions already shipped in #60 (fused MLP bridge, compiled gelu_topk/GeGLU, stacked AltUp) and are active on M1 Ultra. mlxcel already leads mlx-vlm by 1.20-1.23x on these checkpoints.

The decode-time overhead above pure weight streaming is dominated by command-buffer dispatch gaps, not small-kernel compute: raising MLX_MAX_OPS_PER_BUFFER toward 1000 recovers +11-13% with no code. That scheduling knob, hardware-gated since M5 regresses with larger buffers, is the recommended follow-up and belongs in its own issue, not a fusion.

Adds docs/benchmark_results/gemma3n-decode-profile.md with the full numbers, per-layer op analysis, text-vs-VLM comparison, and reproduce commands; links it from docs/benchmarks.md.
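The hardware gating described for the follow-up could look roughly like the sketch below. It is an illustration only: the chip-name matching, the exact value 1000, and the helper names `ops_per_buffer_for` and `apply_gate` are assumptions, and this PR ships no such code. The only facts taken from the text are that M1 Ultra gains from a value near 1000 and that M5 regresses with larger buffers. It also assumes the variable has to be set before MLX first reads it.

```python
import os
from typing import MutableMapping, Optional

ENV_VAR = "MLX_MAX_OPS_PER_BUFFER"


def ops_per_buffer_for(chip_name: str) -> Optional[int]:
    """Return a larger buffer size only for hardware measured to benefit.

    Hypothetical policy: None means "leave the MLX default untouched".
    """
    name = chip_name.lower()
    if "m5" in name:
        return None  # reported to regress with larger buffers
    if "m1 ultra" in name:
        return 1000  # assumed value; the sweep says "toward 1000"
    return None  # unmeasured hardware: do not guess


def apply_gate(chip_name: str,
               env: Optional[MutableMapping[str, str]] = None) -> bool:
    """Set the knob if the chip qualifies and the user has not set it."""
    env = os.environ if env is None else env
    value = ops_per_buffer_for(chip_name)
    if value is None or ENV_VAR in env:
        return False  # never override an explicit user setting
    env[ENV_VAR] = str(value)
    return True
```

Defaulting to "do nothing" for unmeasured chips keeps the knob opt-in per hardware generation, which matches the reason given for splitting it into its own issue.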

The no-fusion decode-profile finding is unchanged; these are accuracy fixes to its supporting claims, verified against the code on this branch. The fused MLP bridge (gemma3n_mlp_forward) is gated on regular_weight(), so it is bf16-only and not active on the 4-bit e2b/e4b checkpoints profiled here; the op-histogram QuantizedMatmul nodes are the unfused gate/up/down. Extending the bridge to quantized weights would only collapse the FFI crossings the profile already shows are hidden, so the conclusion stands. The compiled gelu_topk and GeGLU activation kernels predate #60 (added 2026-04-02 and 2026-04-24) and are reused by it; #60 itself shipped the fused MLP bridge and stacked AltUp. Corrected the three spots that attributed the activation kernels to #60 or implied the bridge was active on the 4-bit path.
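The reasoning that collapsing already-hidden FFI crossings cannot help can be illustrated with a toy overlap model. This is a deliberate simplification, not how mlxcel schedules work: it assumes that with asynchronous evaluation the CPU graph build for the next token runs fully in parallel with the GPU work for the current one, so the per-token time is the larger of the two. The millisecond figures are illustrative values derived from the ~92% / ~7% split at roughly 12 ms per token.

```python
def token_time_ms(gpu_ms: float, graph_build_ms: float,
                  overlapped: bool = True) -> float:
    """Per-token time under a toy pipeline model.

    overlapped=True: graph build hides behind GPU work (take the max).
    overlapped=False: the two steps run back to back (take the sum).
    """
    if overlapped:
        return max(gpu_ms, graph_build_ms)
    return gpu_ms + graph_build_ms


if __name__ == "__main__":
    gpu_ms, build_ms = 11.07, 0.84  # illustrative split of a ~12 ms token

    baseline = token_time_ms(gpu_ms, build_ms)
    # A fusion that only halves the CPU-side graph build changes nothing.
    fused_cpu_only = token_time_ms(gpu_ms, build_ms / 2)
    # If the fused kernel costs even slightly more GPU time, decode regresses.
    fused_slower_gpu = token_time_ms(gpu_ms + 0.1, build_ms / 2)

    print(baseline, fused_cpu_only, fused_slower_gpu)
```

Under this model the graph-build term only matters once it exceeds the GPU term, which is far from the measured situation.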

inureyes added the status:review, type:docs, area:models, area:core, and priority:backlog labels Jun 17, 2026

inureyes added the status:done label and removed the status:review label Jun 17, 2026

inureyes merged commit 9cb93a3 into main Jun 17, 2026
5 checks passed

inureyes deleted the perf/329-gemma3n-decode-profile branch June 17, 2026 22:01

inureyes merged 2 commits into main from perf/329-gemma3n-decode-profile
