feat(bb/msm): GLV endomorphism decomposition for laptop/mobile BN254 MSM by AztecBot · Pull Request #23731 · AztecProtocol/aztec-packages

AztecBot · 2026-05-30T00:34:33Z

GLV endomorphism MSM for laptop/mobile GPUs

GLV decomposition for BN254 MSM on the stream-walker branch. φ(x,y)=(βx,y)=[λ]P splits s = s₁+λs₂ with |sᵢ|≈2¹²⁷, so an n-point/254-bit MSM becomes 2n-point/127-bit, halving the window count T and with it the n-independent BPR + Horner stages. Full writeup: GLV_ANALYSIS.md.
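As a rough illustration of the split s = s₁ + λ·s₂, the sketch below builds a short lattice basis with the extended Euclidean algorithm and rounds against it. This is the textbook GLV recipe, not necessarily what this branch's host code does. The helper names (`cubeRootOfUnity`, `shortBasis`, `decompose`) are invented for the example. λ is recomputed here as some nontrivial cube root of unity mod r rather than taken from the codebase, so it may be the conjugate of the λ that pairs with the β used on the GPU.

```javascript
// BN254 scalar-field order r (a public curve constant).
const R = 21888242871839275222246405745257275088548364400416034343698204186575808495617n;

const abs = (x) => (x < 0n ? -x : x);

// Square-and-multiply modular exponentiation on BigInt.
function modPow(base, exp, mod) {
  let result = 1n;
  let b = base % mod;
  for (let e = exp; e > 0n; e >>= 1n) {
    if (e & 1n) result = (result * b) % mod;
    b = (b * b) % mod;
  }
  return result;
}

// Some nontrivial cube root of unity mod `mod` (requires mod ≡ 1 mod 3).
function cubeRootOfUnity(mod) {
  for (let g = 2n; ; g++) {
    const w = modPow(g, (mod - 1n) / 3n, mod);
    if (w !== 1n) return w;
  }
}

// Nearest-integer division for BigInt; the divisor may be negative.
function roundDiv(a, d) {
  if (d < 0n) [a, d] = [-a, -d];
  const num = 2n * a + d;
  const den = 2n * d;
  const q = num / den; // BigInt division truncates toward zero
  return num % den !== 0n && num < 0n ? q - 1n : q; // convert to floor
}

// Extended Euclid on (n, lambda): every row satisfies r ≡ t·lambda (mod n),
// so (r, -t) is a lattice vector with r + (-t)·lambda ≡ 0 (mod n).
function shortBasis(n, lambda) {
  const step = (p, c) => {
    const q = p.r / c.r;
    return { r: p.r - q * c.r, t: p.t - q * c.t };
  };
  let prev = { r: n, t: 0n };
  let cur = { r: lambda, t: 1n };
  while (cur.r * cur.r >= n) {
    const next = step(prev, cur);
    prev = cur;
    cur = next;
  }
  const next = step(prev, cur);
  const norm2 = (v) => v.r * v.r + v.t * v.t;
  const second = norm2(prev) <= norm2(next) ? prev : next;
  return [[cur.r, -cur.t], [second.r, -second.t]];
}

const LAMBDA = cubeRootOfUnity(R);
const [[A1, B1], [A2, B2]] = shortBasis(R, LAMBDA);

// Returns signed [k1, k2] with k ≡ k1 + k2·LAMBDA (mod R).
function decompose(k) {
  const det = A1 * B2 - A2 * B1; // equals +R or -R
  const c1 = roundDiv(B2 * k, det);
  const c2 = roundDiv(-B1 * k, det);
  return [k - c1 * A1 - c2 * A2, -c1 * B1 - c2 * B2];
}
```

`decompose` returns signed halves; a real pipeline still has to carry each sign alongside the magnitude, which this PR does by folding it into scalar bit 255.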

TL;DR — a both-axis win (real-hardware M2)

The first revisions presented GLV as a time↔memory knob: stored-φ was faster but cost memory; on-the-fly-φ saved memory but cost time. This revision adds precomputed-φ, which beats the baseline and stored-φ on both axes — so that tradeoff was an implementation artifact, not a fundamental limit.

variant (Apple M2, 2¹⁷, c=13)	time (ms)	Δtime vs base	memory (MiB)	Δmem vs base
baseline (pickC)	88.8	—	68.4	—
GLV stored-φ	84.2	−5.1%	72.3	+5.7%
GLV on-the-fly-φ	98.3	+10.7%	64.3	−6.0%
GLV precomputed-φ	81.6	−8.1%	68.3	−0.2%

(BrowserStack Apple M2, n=2¹⁷; time = min of 6 steady-state dispatches; memory = statsBytes.) Precomputed-φ is faster than stored-φ (−8.1% vs −5.1%) and stays at baseline memory (−0.2%, vs stored-φ's +5.7%). It Pareto-dominates the baseline and stored-φ; only on-the-fly-φ uses less memory (64.3 MiB), and it pays +10.7% in time for it.

Why precomputed-φ wins (and why it is binding-safe)

φ(P)=(β·x, y) shares its y with P. The three variants differ only in how the φ-term's coordinates reach the bucket-accumulate gather:

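stored-φ materialises a full 2n-point pool, but point_y[n+i] == point_y[i], so its upper y half duplicates the lower. The φ half costs +8 MiB of pool at n=2¹⁷, of which 4 MiB is this duplicated y (stored-φ 72.3 MiB vs precomputed-φ 68.3 MiB); the measured net is +3.9 MiB / +5.7% vs baseline.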
on-the-fly-φ stores only n points and recomputes β·x in the gather. Each φ-point is gathered ≈T/2 times, so it pays T/2 Montgomery β-multiplies per point — not hidden under memory latency on the M2 (+10.7%).
precomputed-φ stores β·x once in the upper half of the existing point_x buffer (a 2n-slot x, an n-slot y) and the gather reads it directly — no per-gather β-multiply. β·x is filled once at pool-build time by a new glv_phi_fill kernel. y is reused from the lower half — no duplication.

So precomputed-φ keeps stored-φ's β-multiply-free gather and halves stored-φ's point_y, which on the memory-bandwidth-bound stream-walker trims even the accumulate (less y-traffic) — hence faster than stored-φ.
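The three layouts can be modelled in a few lines to show that they feed the gather identical coordinates while storing different numbers of slots. This is only a toy model: the field is p = 7 with β = 2 (since 2³ ≡ 1 mod 7) standing in for the BN254 base field, the "points" are arbitrary pairs, and the names `loadStored`, `loadOtf`, `loadPre` and `slots` are hypothetical rather than the kernel's own.

```javascript
// Toy stand-ins for the base field and beta: 2^3 = 8 ≡ 1 (mod 7).
const P = 7n;
const BETA = 2n;

// n toy points; they need not lie on a curve for the indexing argument.
const xs = [3n, 5n, 6n, 1n];
const ys = [2n, 4n, 1n, 6n];
const n = xs.length;
const betaXs = xs.map((x) => (BETA * x) % P);

// stored-phi: 2n x-slots and 2n y-slots (upper y half duplicates the lower).
const stored = { x: [...xs, ...betaXs], y: [...ys, ...ys] };
// on-the-fly-phi: n x-slots and n y-slots; beta*x is recomputed per gather.
const otf = { x: xs, y: ys };
// precomputed-phi: 2n x-slots (beta*x filled once), n y-slots (reused).
const pre = { x: [...xs, ...betaXs], y: ys };

// An index >= n denotes the phi-term of point (index - n).
const loadStored = (i) => [stored.x[i], stored.y[i]];
const loadOtf = (i) =>
  i >= n ? [(BETA * otf.x[i - n]) % P, otf.y[i - n]] : [otf.x[i], otf.y[i]];
const loadPre = (i) => [pre.x[i], pre.y[i >= n ? i - n : i]];

// Coordinate slots held by each layout.
const slots = (pool) => pool.x.length + pool.y.length;
```

Per point the layouts hold 4, 2 and 3 coordinate slots respectively. Assuming 32-byte coordinates, 2¹⁷ slots are 4 MiB, which lines up with the 8 MiB and 4 MiB gaps between stored-φ, on-the-fly-φ and precomputed-φ in the memory column.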

Crucially it adds no new GPU binding: ba_stream_walker already uses 10 storage buffers (0–9) + 1 uniform, and an 11th storage binding is invalid on Mali/Adreno. Precomputed-φ reuses the existing point_x/point_y bindings (resized), staying within the mobile 10-binding limit. The half-scalar sign is folded into scalar bit 255 (decompose XORs it), same as on-the-fly.
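The sign folding can be pictured as below. This is a sketch under stated assumptions: the host keeps the half-scalar's magnitude in the low bits and sets bit 255 for a negative value, and the gather negates the point when the Booth digit sign and the half-scalar sign differ. `foldSign`, `unfoldSign` and `negatePoint` are illustrative names, not functions from this PR.

```javascript
const SIGN_BIT = 1n << 255n;

// Pack a signed half-scalar: magnitude in the low bits, sign in bit 255.
// Bit 255 is free because the magnitude is far below 2^255.
function foldSign(k) {
  const magnitude = k < 0n ? -k : k;
  if (magnitude >= SIGN_BIT) throw new RangeError("magnitude overlaps the sign bit");
  return k < 0n ? magnitude | SIGN_BIT : magnitude;
}

// Recover magnitude and sign from the packed word.
function unfoldSign(word) {
  return {
    magnitude: word & (SIGN_BIT - 1n),
    negative: (word & SIGN_BIT) !== 0n,
  };
}

// XOR of the two signs: the point is negated when exactly one is negative.
const negatePoint = (boothNegative, halfScalarNegative) =>
  boothNegative !== halfScalarNegative;
```

Because a baseline 254-bit Fr scalar never sets bit 255, the same decode path leaves non-GLV inputs unchanged.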

Code: MsmV2Pool.createGlvPrecomputed, MsmConfig.glvPrecomputed, the glv_phi_fill kernel, and a glv_precomp branch in ba_size1 / ba_stream_walker's load_pt_x.

Phase profile (M2 GPU timestamps, c=13) — where the win comes from

phase	base (ms)	glv-pre (ms)
stream_walker (accumulate)	71.7	69.2
reduce (BPR)	8.26	6.68
walker_combine	6.16	4.46
preprocess	3.54	3.01
planner	0.39	0.20

stream_walker is ~80% of the profiled time (71.7 of the ≈90 ms summed over the baseline phases). It is the memory-bound, GLV-invariant accumulate (T·n digits unchanged). GLV's gain is in the n-independent tail (reduce + walker_combine + planner, ∝ T·B), which the halved window count (T 20→10) shrinks from 14.81 ms to 11.34 ms; precomputed-φ's smaller point_y additionally trims stream_walker (71.7→69.2 ms). This matches the stage accounting and the ground truth that the walker is bandwidth-bound.
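The stage accounting can be checked with a few lines of arithmetic. The sketch assumes T = ⌈bits/c⌉ and B = 2^(c−1) buckets per window (signed Booth digits); both are assumptions chosen to be consistent with the T 20→10 figure at c=13, and `stageWork` is a hypothetical helper.

```javascript
// Windows needed to cover `bits` scalar bits with c-bit digits.
const windowCount = (bits, c) => Math.ceil(bits / c);

// Work proxies for the two kinds of stage:
//   accumulateDigits ~ T * points   (stream_walker, scales with n)
//   tailBuckets      ~ T * B        (reduce / combine / planner, n-independent)
function stageWork(n, c, glv) {
  const points = glv ? 2 * n : n;   // GLV doubles the point count
  const bits = glv ? 127 : 254;     // and halves the scalar width
  const T = windowCount(bits, c);
  const B = 2 ** (c - 1);           // assumed buckets per window
  return { T, accumulateDigits: T * points, tailBuckets: T * B };
}

const base = stageWork(2 ** 17, 13, false);
const glv = stageWork(2 ** 17, 13, true);
```

Under these assumptions `base` and `glv` report the same `accumulateDigits`, while `tailBuckets` is halved, which is the shape the phase table shows.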

Pipeline note — where BPR actually lives

On this branch the legacy V2 pair-tree kernels (ba_fused_super/ba_carry/ba_finalize) are no longer dispatched (their buffers are 4-byte stubs). There is a single accumulator, the stream-walker, followed by a reduce (bucket-point-reduction) stage that loops over numWindows. GLV's halving of T therefore halves the trip count of that reduce loop and shrinks the window/bucket-bound planner stages, while the O(T·n) accumulate is GLV-invariant.

Budget-batch composition

The budget-batch lever splits the windows into numBatches groups when a single batch would exceed the 65 000-workgroup dispatch cap: numBatches is the smallest nb with ⌈(⌈T/nb⌉·n)/WGI⌉ < 65 000 (WGI=128). GLV's halved T composes with it on both fronts:

n	base T	base numBatches	GLV T	GLV numBatches
2¹⁷	20	1	10	1
2²⁰	20	3 (`⌈20/3⌉·8192 < 65 000`)	10	2 (`⌈10/2⌉·8192 < 65 000`)

At 2¹⁷ neither batches; at 2²⁰ GLV needs one fewer batch than the baseline (fewer redundant preprocess/planner passes) and each batch's bucket working set (∝ batchWindows·BW) is halved. So GLV's memory win and the budget-batch lever are complementary, not competing.
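The batch-count rule quoted above is small enough to write out directly. The sketch below transcribes that formula; the function name `numBatches` and the constant names are chosen for the example and may not match the code.

```javascript
const DISPATCH_CAP = 65000; // workgroup cap quoted in the text
const WGI = 128;            // value of WGI quoted in the text

// Smallest nb such that ceil(ceil(T / nb) * n / WGI) < DISPATCH_CAP.
function numBatches(T, n) {
  for (let nb = 1; nb <= T; nb++) {
    const batchWindows = Math.ceil(T / nb);
    const workgroups = Math.ceil((batchWindows * n) / WGI);
    if (workgroups < DISPATCH_CAP) return nb;
  }
  throw new Error("a single window already exceeds the dispatch cap");
}

// Rows of the table: [n, base T, GLV T].
const rows = [
  [2 ** 17, 20, 10],
  [2 ** 20, 20, 10],
].map(([n, baseT, glvT]) => ({
  n,
  base: numBatches(baseT, n),
  glv: numBatches(glvT, n),
}));
```

Evaluating `rows` gives 1 and 1 batch at n=2¹⁷, and 3 and 2 batches at n=2²⁰, matching the table.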

Correctness — SwiftShader, GLV vs noble CPU reference (all GREEN)

path	logn	result
baseline (non-GLV)	8, 10	✅
GLV stored-φ	8, 10	✅
GLV on-the-fly-φ	8, 10	✅
GLV precomputed-φ	8, 10	✅
GLV precomputed-φ forced c=11	10, 12	✅

Memory — measured working set (`statsBytes`)

variant	c=13 (2¹⁷), MiB	c=15 (2¹⁷), MiB
baseline	68.4	101.4
GLV stored-φ	72.3	89.6
GLV on-the-fly-φ	64.3	81.6
GLV precomputed-φ	68.3	85.6

At c=13 precomputed-φ ≈ baseline (68.3 vs 68.4 MiB); at c=15 it is −15.6% vs baseline (85.6 vs 101.4 MiB). At both window sizes it stays under the 100 MB budget, which the c=15 baseline exceeds.

Adreno / Mali

The mobile A/B (Snapdragon S25-Ultra Adreno / Pixel 9 Pro XL Mali; no timestamp-query, so GPU wall time only) is wired (--target s25-ultra --profile 0), but no mobile numbers have been collected yet: queued workers starved, which was attributed to another agent's batch holding both concurrency slots of the shared BrowserStack pool. The win is expected to be architecture-general: it removes per-gather ALU work (the β-multiply) and memory traffic (the duplicated y), and the n-independent tail that GLV halves is a larger fraction of the run on weaker mobile GPUs. It is therefore expected to be at least as large there as on the M2, pending measurement. Numbers will be added when a mobile seat frees.

Reproduce

cd barretenberg/ts && yarn install
# Correctness (SwiftShader); GLV_OTF=1 / GLV_PRE=1 select on-the-fly / precomputed φ:
CHROMIUM_PATH=/opt/ms-playwright/chromium-1148/chrome-linux/chrome \
  GLV_PRE=1 node dev/msm-webgpu/swiftshader-noble.mjs --glv 8 10
# Real-HW c-sweep (base/stored/on-the-fly/precomputed φ) + M2 phase profile:
node dev/msm-webgpu/scripts/run-browserstack.mjs --target macos --autorun glv-csweep --n 17 --reps 6 --cs 13,15
# Adreno / Mali: --target s25-ultra --profile 0   (or --target pixel-9-pro-xl --profile 0)

…ork (WIP)

- swiftshader-noble.mjs: add --use-vulkan=swiftshader; without it headless Chromium returns a null WebGPU adapter on the lavapipe-less dev box and the GLV cross-check cannot run.
- main.ts: add glv-csweep autorun. It sweeps GLV window size c against the real production baseline (pickC) and matched-c baselines, with warm-up + min-of-reps steady-state timing and algo+pool+total memory. The earlier glv-compare forced both sides to c=15, a config the n=2^17 baseline never uses (pickC=13), hiding GLV's real lever (fewer windows at its own optimal c).
- run-browserstack.mjs: forward --cs to the page query string.

The first GLV pass stored both n original points and n phi(P)=(beta*x,y) points, doubling the SRS pool (+8 MiB at n=2^17), the only axis where GLV lost to the production baseline. Compute phi in the point gather instead, so the pool holds only the original n points:
- decompose_scalars_booth: XOR the half-scalar sign (folded by the host into bit 255 of the sub-2^127 scalar, always 0 for a 254-bit Fr scalar) into the stored Booth sign, so a negative half-scalar negates its point.
- gather kernels (stream_walker, size1, stream_accum(+debug), recompute_split): a value index >= GLV_HALF is a phi-term; gather point (idx - GLV_HALF) and scale x by Montgomery(beta). GLV_HALF is baked = pool.srsN, so a baseline pool never triggers it. No new bindings (Mali 10-binding safe).
- buildGlvInputsOnTheFly: n points + 2n signed-bit scalars.
- MsmConfig.glvOnTheFly relaxes the n<=srsN guard to n<=2*srsN.

Cross-checked GREEN vs noble under SwiftShader (logn 8,10), on-the-fly path.

…) + robust median timing

Android Chrome on BrowserStack real devices launches the URL via an intent that truncates it at the first shell/intent delimiter (& and ;), so ?coi=1&autorun=glv-csweep&... arrived as just ?coi=1: autorun/logn/cs were silently dropped and the page idled after boot. This blocked every prior mobile run (misattributed to seat contention). Bundle the whole query into a single base64url q param (only [A-Za-z0-9-_], no delimiters); main.ts expands it via history.replaceState before any param is read, and the vite COI middleware base64-decodes q to still set COOP/COEP. run-browserstack.mjs emits the bundled form for Android targets.

Also switch the glv-csweep timing from min-of-6 to median: min latches onto fast mobile-GPU glitches (a dropped/early-returning dispatch resolving in tens of ms), yielding impossible sub-100ms readings; median rejects both tails. Min and raw samples are retained per row for transparency.
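The query bundling described in this commit message can be sketched as follows for a Node script. `bundleQuery` and `expandQuery` are hypothetical names, the host `example.invalid` and the value `logn=17` are placeholders, and a browser page would use `atob`/`btoa` instead of Node's `Buffer`.

```javascript
// Encode a query string as base64url: only [A-Za-z0-9-_] remains,
// so no '&' or ';' is left for the Android intent to truncate at.
function bundleQuery(query) {
  return Buffer.from(query, "utf8")
    .toString("base64")
    .replace(/\+/g, "-")
    .replace(/\//g, "_")
    .replace(/=+$/, "");
}

// Reverse the mapping; Node's base64 decoder tolerates missing padding.
function expandQuery(q) {
  const b64 = q.replace(/-/g, "+").replace(/_/g, "/");
  return Buffer.from(b64, "base64").toString("utf8");
}

// Sender side: the whole query travels as one delimiter-free parameter.
const original = "coi=1&autorun=glv-csweep&logn=17&cs=13,15";
const bundledUrl = `https://example.invalid/?q=${bundleQuery(original)}`;

// Receiver side: read q, expand it, then parse the parameters as usual.
const q = new URL(bundledUrl).searchParams.get("q");
const params = new URLSearchParams(expandQuery(q));
```

After expansion, `params.get("autorun")` and the other keys are available exactly as if the original query string had arrived intact.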

feat(bb/msm): GPU-less SwiftShader correctness harness + mobile MSM w…

f8b7d3e

…ork (WIP)

AztecBot added the ci-barretenberg (Run all barretenberg/cpp checks.) and claudebox (Owned by claudebox. it can push to this PR.) labels May 30, 2026

AztecBot added 3 commits May 30, 2026 01:02

update PR #23731

592a51c

update PR #23731

6d09044

update PR #23731

e77201d

AztecBot changed the title ~~feat(bb/msm): GPU-less SwiftShader correctness harness + mobile MSM work (WIP)~~ feat(bb/msm): GLV endomorphism decomposition for laptop/mobile BN254 MSM May 30, 2026

AztecBot added 7 commits May 30, 2026 05:47

docs(bb/msm): GLV_ANALYSIS — note measured M2 results supersede proje…

484a49e

…ctions The +7% was a c=15 artifact; at production c=13 stored-phi GLV is ~5% faster. GLV is a time<->memory knob (store the 2n points or recompute phi on-the-fly), not a free both-axis win.

feat(bb/msm): GLV precomputed-φ — both-axis win (store β·x, reuse y, …

5920f85

…no per-gather β-mul)

test(bb/msm): glv-csweep warmup override for memory-only measurement

c06b81e

test(bb/msm): run-browserstack profile passthrough (skip M2-only time…

e7d1d08

…stamp profiling on Adreno)

AztecBot wants to merge 11 commits into stream-walker-impl from cb/msm-opt-bucket-overhaul-f065
