feat(bb/msm): GLV endomorphism decomposition for laptop/mobile BN254 MSM #23731

Draft
AztecBot wants to merge 11 commits into stream-walker-impl from cb/msm-opt-bucket-overhaul-f065

@AztecBot AztecBot commented May 30, 2026

GLV endomorphism MSM for laptop/mobile GPUs

GLV decomposition for BN254 MSM on the stream-walker branch. The endomorphism φ(x,y) = (βx, y) = [λ]P lets each scalar be written s = s₁ + λs₂ (mod r) with |sᵢ| ≈ 2¹²⁷, so s·P = s₁·P + s₂·φ(P) and an n-point, 254-bit MSM becomes a 2n-point, 127-bit MSM. That halves the window count T and with it the n-independent BPR + Horner stages. Full writeup: GLV_ANALYSIS.md.
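The split s = s₁ + λs₂ can be illustrated on the CPU with BigInt. The sketch below is not the PR's decompose_scalars_booth: it is a generic lattice-rounding decomposition, and every function name in it is hypothetical. It assumes only the BN254 scalar-field modulus r; λ is derived as a cube root of unity rather than hard-coded, so it may be either of the two non-trivial roots and need not match the λ paired with the PR's β.

```typescript
// BN254 scalar-field modulus r (the group order).
const R = 21888242871839275222246405745257275088548364400416034343698204186575808495617n;

const mod = (a: bigint, m: bigint): bigint => ((a % m) + m) % m;
const abs = (a: bigint): bigint => (a < 0n ? -a : a);

function modPow(base: bigint, exp: bigint, m: bigint): bigint {
  let result = 1n;
  let b = mod(base, m);
  for (let e = exp; e > 0n; e >>= 1n) {
    if ((e & 1n) === 1n) result = (result * b) % m;
    b = (b * b) % m;
  }
  return result;
}

// g^((r-1)/3) has order dividing 3; the first value != 1 is a primitive cube root.
function cubeRootOfUnity(r: bigint): bigint {
  for (let g = 2n; g < 100n; g++) {
    const w = modPow(g, (r - 1n) / 3n, r);
    if (w !== 1n) return w;
  }
  throw new Error("no cube root of unity found");
}

// Round-to-nearest of n/d, for d > 0.
function roundDiv(n: bigint, d: bigint): bigint {
  const num = 2n * n + d;
  const den = 2n * d;
  let q = num / den;           // BigInt division truncates toward zero
  if (num % den < 0n) q -= 1n; // convert truncation to floor for negative values
  return q;
}

// Extended Euclid on (r, lambda), stopped at the first remainder below sqrt(r).
// Returns (a, b) with a + b*lambda ≡ 0 (mod r) and |a|, |b| <= sqrt(r).
function shortVector(r: bigint, lambda: bigint): [bigint, bigint] {
  let r0 = r, r1 = lambda, t0 = 0n, t1 = 1n; // invariant: r_i ≡ t_i * lambda (mod r)
  while (r1 * r1 >= r) {
    const q = r0 / r1;
    [r0, r1] = [r1, r0 - q * r1];
    [t0, t1] = [t1, t0 - q * t1];
  }
  return [r1, -t1];
}

function glvDecompose(s: bigint, r: bigint, lambda: bigint): [bigint, bigint] {
  const [a1, b1] = shortVector(r, lambda);
  // Second vector: lambda * (a1 + b1*lambda), using lambda^2 = -lambda - 1.
  const a2 = -b1;
  const b2 = a1 - b1;
  const det = a1 * b2 - a2 * b1; // a1^2 - a1*b1 + b1^2, always > 0
  const c1 = roundDiv(b2 * s, det);
  const c2 = roundDiv(-b1 * s, det);
  return [s - c1 * a1 - c2 * a2, -c1 * b1 - c2 * b2];
}

const lambda = cubeRootOfUnity(R);
const s = 123456789012345678901234567890123456789012345678901234567890123456789012345n;
const [s1, s2] = glvDecompose(s, R, lambda);
console.log({ s1, s2, recombines: mod(s1 + lambda * s2 - s, R) === 0n });
```

Both halves come out signed and below 2¹²⁸ in magnitude, which is what lets a 254-bit scalar be processed with roughly half as many windows.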

TL;DR — a both-axis win (real-hardware M2)

The first revisions presented GLV as a time↔memory knob: stored-φ was faster but cost memory; on-the-fly-φ saved memory but cost time. This revision adds precomputed-φ, which beats the baseline and stored-φ on both axes — so that tradeoff was an implementation artifact, not a fundamental limit.

| variant (Apple M2, 2¹⁷, c=13) | time (ms) | Δtime vs base | memory (MiB) | Δmem vs base |
|---|---|---|---|---|
| baseline (pickC) | 88.8 | | 68.4 | |
| GLV stored-φ | 84.2 | −5.1% | 72.3 | +5.7% |
| GLV on-the-fly-φ | 98.3 | +10.7% | 64.3 | −6.0% |
| GLV precomputed-φ | 81.6 | −8.1% | 68.3 | −0.2% |

(BrowserStack Apple M2, n=2¹⁷, time = min of 6 steady-state dispatches; memory = statsBytes.) Precomputed-φ is faster than stored-φ (−8.1% vs −5.1%) at baseline memory (vs stored-φ's +5.7%), so it Pareto-dominates both the baseline and stored-φ. Only on-the-fly-φ uses less memory (64.3 MiB), and it pays for that with the slowest time in the table (98.3 ms).

Why precomputed-φ wins (and why it is binding-safe)

φ(P)=(β·x, y) shares its y with P. The three variants differ only in how the φ-term's coordinates reach the bucket-accumulate gather:

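  • stored-φ materialises a full 2n-point pool, but point_y[n+i] == point_y[i], so n of its y-coordinates are duplicates. The φ half of the pool adds 8 MiB at n=2¹⁷ (a net +5.7% vs baseline); the duplicated y accounts for the 4.0 MiB gap between stored-φ and precomputed-φ in the table above.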
  • on-the-fly-φ stores only n points and recomputes β·x in the gather. Each φ-point is gathered ≈T/2 times, so it pays T/2 Montgomery β-multiplies per point — not hidden under memory latency on the M2 (+10.7%).
  • precomputed-φ stores β·x once in the upper half of the existing point_x buffer (a 2n-slot x, an n-slot y) and the gather reads it directly — no per-gather β-multiply. β·x is filled once at pool-build time by a new glv_phi_fill kernel. y is reused from the lower half — no duplication.

So precomputed-φ keeps stored-φ's β-multiply-free gather and halves stored-φ's point_y, which on the memory-bandwidth-bound stream-walker trims even the accumulate (less y-traffic) — hence faster than stored-φ.
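The three layouts can be compared with a small CPU model. This is a sketch only: the real gathers are GPU kernels operating on Montgomery-form BN254 coordinates, whereas the code below uses a toy field (p = 13, β = 3) and hypothetical names (`Pool`, `gatherStored`, `gatherOnTheFly`, `gatherPrecomputed`). It shows that all three variants return the same coordinates for every index while occupying 4n, 2n and 3n coordinate slots respectively.

```typescript
const P = 13n;   // toy prime field standing in for the BN254 base field
const BETA = 3n; // cube root of unity mod 13 (3^3 = 27 ≡ 1)

interface Pool { pointX: bigint[]; pointY: bigint[]; n: number }
type Pt = [bigint, bigint];

const phiX = (x: bigint): bigint => (BETA * x) % P;

// stored-φ: full 2n-point pool, y duplicated in the upper half.
const storedPool = (xs: bigint[], ys: bigint[]): Pool =>
  ({ pointX: [...xs, ...xs.map(phiX)], pointY: [...ys, ...ys], n: xs.length });

// on-the-fly-φ: only the n original points; β·x is recomputed per gather.
const onTheFlyPool = (xs: bigint[], ys: bigint[]): Pool =>
  ({ pointX: [...xs], pointY: [...ys], n: xs.length });

// precomputed-φ: 2n-slot x (upper half holds β·x), n-slot y.
const precomputedPool = (xs: bigint[], ys: bigint[]): Pool =>
  ({ pointX: [...xs, ...xs.map(phiX)], pointY: [...ys], n: xs.length });

const gatherStored = (p: Pool, i: number): Pt => [p.pointX[i], p.pointY[i]];

const gatherOnTheFly = (p: Pool, i: number): Pt =>
  i >= p.n
    ? [phiX(p.pointX[i - p.n]), p.pointY[i - p.n]] // one β-multiply per gather
    : [p.pointX[i], p.pointY[i]];

const gatherPrecomputed = (p: Pool, i: number): Pt =>
  [p.pointX[i], p.pointY[i >= p.n ? i - p.n : i]]; // y reused from the lower half

const slots = (p: Pool): number => p.pointX.length + p.pointY.length;

const xs = [1n, 5n, 7n, 11n];
const ys = [2n, 3n, 4n, 6n];
const pools = [storedPool(xs, ys), onTheFlyPool(xs, ys), precomputedPool(xs, ys)];
console.log(pools.map(slots)); // coordinate slots for stored, on-the-fly, precomputed
```

The slot counts mirror the measured ordering in the table: on-the-fly is smallest, stored is largest, and precomputed sits halfway between while keeping the multiply-free gather.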

Crucially it adds no new GPU binding: ba_stream_walker already uses 10 storage buffers (0–9) + 1 uniform, and an 11th storage binding is invalid on Mali/Adreno. Precomputed-φ reuses the existing point_x/point_y bindings (resized), staying within the mobile 10-binding limit. The half-scalar sign is folded into scalar bit 255 (decompose XORs it), same as on-the-fly.

Code: MsmV2Pool.createGlvPrecomputed, MsmConfig.glvPrecomputed, the glv_phi_fill kernel, and a glv_precomp branch in ba_size1 / ba_stream_walker's load_pt_x.

Phase profile (M2 GPU timestamps, c=13) — where the win comes from

| phase | base (ms) | glv-pre (ms) |
|---|---|---|
| stream_walker (accumulate) | 71.7 | 69.2 |
| reduce (BPR) | 8.26 | 6.68 |
| walker_combine | 6.16 | 4.46 |
| preprocess | 3.54 | 3.01 |
| planner | 0.39 | 0.20 |

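stream_walker dominates the profile: it accounts for 71.7 of the roughly 90 ms summed over the baseline phases (about 80%). It is the memory-bound accumulate, which GLV leaves unchanged (T·n digits are the same, because T halves while the point count doubles). GLV's gain is in the n-independent tail (reduce + walker_combine + planner, ∝ T·B), which the halved window count (T 20→10) shrinks from 14.81 ms to 11.34 ms; precomputed-φ's smaller point_y additionally trims stream_walker (71.7 → 69.2 ms). This matches the stage accounting and the observation that the walker is bandwidth-bound.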

Pipeline note — where BPR actually lives

On this branch the legacy V2 pair-tree kernels (ba_fused_super/ba_carry/ba_finalize) are no longer dispatched (their buffers are 4-byte stubs); there is a single accumulator — the stream-walker — followed by a reduce (bucket-point-reduction) stage that loops over numWindows. GLV halving T therefore halves that reduce plus the window/bucket-bound planner stages, while the O(T·n) accumulate is GLV-invariant.

Budget-batch composition

The budget-batch lever splits the windows into numBatches groups when a single batch would exceed the 65 000-workgroup dispatch cap: numBatches is the smallest nb with ⌈(⌈T/nb⌉·n)/WGI⌉ < 65 000 (WGI=128). GLV's halved T composes with it on both fronts:

| n | base T | base numBatches | GLV T | GLV numBatches |
|---|---|---|---|---|
| 2¹⁷ | 20 | 1 | 10 | 1 |
| 2²⁰ | 20 | 3 (⌈20/3⌉·8192 < 65 000) | 10 | 2 (⌈10/2⌉·8192 < 65 000) |

At 2¹⁷ neither configuration needs batching; at 2²⁰ GLV needs one fewer batch than the baseline (fewer redundant preprocess/planner passes) and each batch's bucket working set (∝ batchWindows·BW) is halved. So GLV's memory win and the budget-batch lever are complementary, not competing.
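The batch counts in the table can be reproduced directly from the stated rule. The sketch below is a minimal restatement, with hypothetical function names, that assumes T = ⌈bits/c⌉ (which matches T = 20 and T = 10 at c = 13) and takes WGI = 128 and the 65 000 cap from the text.

```typescript
// Assumed: number of c-bit windows needed to cover a scalar of `bits` bits.
function numWindows(bits: number, c: number): number {
  return Math.ceil(bits / c);
}

// Smallest nb such that ceil(ceil(T / nb) * n / wgi) < cap.
function numBatches(T: number, n: number, wgi = 128, cap = 65000): number {
  for (let nb = 1; nb <= T; nb++) {
    const batchWindows = Math.ceil(T / nb);
    const workgroups = Math.ceil((batchWindows * n) / wgi);
    if (workgroups < cap) return nb;
  }
  throw new Error("even one window per batch exceeds the dispatch cap");
}

const baseT = numWindows(254, 13); // 254-bit scalars
const glvT = numWindows(127, 13);  // 127-bit half-scalars
for (const logn of [17, 20]) {
  const n = 2 ** logn;
  console.log(`n=2^${logn}: base T=${baseT}, batches=${numBatches(baseT, n)}; ` +
              `GLV T=${glvT}, batches=${numBatches(glvT, n)}`);
}
```

As in the table, n here is the original point count for both the baseline and the GLV column.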

Correctness — SwiftShader, GLV vs noble CPU reference (all GREEN)

| path | logn | result |
|---|---|---|
| baseline (non-GLV) | 8, 10 | GREEN |
| GLV stored-φ | 8, 10 | GREEN |
| GLV on-the-fly-φ | 8, 10 | GREEN |
| GLV precomputed-φ | 8, 10 | GREEN |
| GLV precomputed-φ forced c=11 | 10, 12 | GREEN |

Memory — measured working set (statsBytes)

| variant | c=13 (2¹⁷), MiB | c=15 (2¹⁷), MiB |
|---|---|---|
| baseline | 68.4 | 101.4 |
| GLV stored-φ | 72.3 | 89.6 |
| GLV on-the-fly-φ | 64.3 | 81.6 |
| GLV precomputed-φ | 68.3 | 85.6 |

At c=13 precomputed-φ matches the baseline (68.3 vs 68.4 MiB); at c=15 it is 15.6% below the baseline (85.6 vs 101.4 MiB). Both precomputed-φ figures are under the 100 MB budget, which the c=15 baseline exceeds.

Adreno / Mali

The mobile A/B (Snapdragon S25-Ultra Adreno / Pixel 9 Pro XL Mali; no timestamp-query, so GPU wall time is used) is wired (--target s25-ultra --profile 0), but no numbers are available yet: the shared BrowserStack pool has been continuously saturated by another agent's batch holding both concurrency slots, and the queued workers starved. The win should be architecture-general, because it removes per-gather ALU work (the β-multiply) and memory traffic (the duplicated y), and the n-independent tail that GLV halves is a larger fraction of the runtime on weaker mobile GPUs. It is therefore expected, though not yet measured, to be at least as large there as on the M2. Numbers will be added when a mobile seat frees.

Reproduce

```sh
cd barretenberg/ts && yarn install
# Correctness (SwiftShader); GLV_OTF=1 / GLV_PRE=1 select on-the-fly / precomputed φ:
CHROMIUM_PATH=/opt/ms-playwright/chromium-1148/chrome-linux/chrome \
  GLV_PRE=1 node dev/msm-webgpu/swiftshader-noble.mjs --glv 8 10
# Real-HW c-sweep (base/stored/on-the-fly/precomputed φ) + M2 phase profile:
node dev/msm-webgpu/scripts/run-browserstack.mjs --target macos --autorun glv-csweep --n 17 --reps 6 --cs 13,15
# Adreno / Mali: --target s25-ultra --profile 0   (or --target pixel-9-pro-xl --profile 0)
```

AztecBot added the ci-barretenberg (Run all barretenberg/cpp checks) and claudebox (Owned by claudebox; it can push to this PR) labels May 30, 2026
AztecBot changed the title from "feat(bb/msm): GPU-less SwiftShader correctness harness + mobile MSM work (WIP)" to "feat(bb/msm): GLV endomorphism decomposition for laptop/mobile BN254 MSM" May 30, 2026
AztecBot added 7 commits May 30, 2026 05:47
- swiftshader-noble.mjs: add --use-vulkan=swiftshader; without it headless
  Chromium returns a null WebGPU adapter on the lavapipe-less dev box and the
  GLV cross-check cannot run.
- main.ts: add glv-csweep autorun — sweeps GLV window size c against the real
  production baseline (pickC) and matched-c baselines, with warm-up + min-of-
  reps steady-state timing and algo+pool+total memory. The earlier glv-compare
  forced both sides to c=15, a config the n=2^17 baseline never uses
  (pickC=13), hiding GLV's real lever (fewer windows at its own optimal c).
- run-browserstack.mjs: forward --cs to the page query string.
The first GLV pass stored both n original points and n phi(P)=(beta*x,y)
points, doubling the SRS pool (+8 MiB at n=2^17) — the only axis where GLV
lost to the production baseline. Compute phi in the point gather instead, so
the pool holds only the original n points:

- decompose_scalars_booth: XOR the half-scalar sign (folded by the host into
  bit 255 of the sub-2^127 scalar, always 0 for a 254-bit Fr scalar) into the
  stored Booth sign, so a negative half-scalar negates its point.
- gather kernels (stream_walker, size1, stream_accum(+debug), recompute_split):
  a value index >= GLV_HALF is a phi-term — gather point (idx - GLV_HALF) and
  scale x by Montgomery(beta). GLV_HALF is baked = pool.srsN, so a baseline
  pool never triggers it. No new bindings (Mali 10-binding safe).
- buildGlvInputsOnTheFly: n points + 2n signed-bit scalars.
- MsmConfig.glvOnTheFly relaxes the n<=srsN guard to n<=2*srsN.
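
The sign folding described in the first bullet can be sketched on the host side as follows. The function names are hypothetical and the real decompose kernel works on Booth-recoded digits in GPU buffers; this only illustrates why bit 255 is free to carry the sign and how the XOR combines the two signs.

```typescript
const SIGN_BIT = 1n << 255n;

// Pack a signed half-scalar into a 256-bit word: magnitude in the low bits, sign in bit 255.
function packHalfScalar(s: bigint): bigint {
  const negative = s < 0n;
  const magnitude = negative ? -s : s;
  if (magnitude >= SIGN_BIT) throw new Error("magnitude overlaps the sign bit");
  return negative ? magnitude | SIGN_BIT : magnitude;
}

function unpackHalfScalar(word: bigint): { magnitude: bigint; negative: boolean } {
  return { magnitude: word & (SIGN_BIT - 1n), negative: (word & SIGN_BIT) !== 0n };
}

// XOR of the two signs: the point is negated when exactly one of them is set.
function pointIsNegated(boothNegative: boolean, halfScalarNegative: boolean): boolean {
  return boothNegative !== halfScalarNegative;
}

const word = packHalfScalar(-(1n << 126n));
const { magnitude, negative } = unpackHalfScalar(word);
console.log(magnitude === 1n << 126n, negative, pointIsNegated(true, negative));
```

Because a 254-bit Fr scalar never sets bit 255, a baseline (non-GLV) scalar decodes as non-negative and the XOR leaves its Booth sign unchanged.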

Cross-checked GREEN vs noble under SwiftShader (logn 8,10), on-the-fly path.
…ctions

The +7% was a c=15 artifact; at production c=13 stored-phi GLV is ~5% faster.
GLV is a time<->memory knob (store the 2n points or recompute phi on-the-fly),
not a free both-axis win.
…) + robust median timing

Android Chrome on BrowserStack real devices launches the URL via an intent
that truncates it at the first shell/intent delimiter (& and ;), so
?coi=1&autorun=glv-csweep&... arrived as just ?coi=1 — autorun/logn/cs were
silently dropped and the page idled after boot. This blocked every prior
mobile run (misattributed to seat contention). Bundle the whole query into a
single base64url q param (only [A-Za-z0-9-_], no delimiters); main.ts expands
it via history.replaceState before any param is read, and the vite COI
middleware base64-decodes q to still set COOP/COEP. run-browserstack.mjs emits
the bundled form for Android targets.
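
A minimal sketch of the bundling, assuming an ASCII query string and the global btoa/atob available in browsers and recent Node. The function names are hypothetical and this is not the code in main.ts or run-browserstack.mjs.

```typescript
// Encode a query string as base64url: only [A-Za-z0-9-_], no padding, no '&' or ';'.
function bundleQuery(query: string): string {
  return btoa(query).replace(/\+/g, "-").replace(/\//g, "_").replace(/=+$/, "");
}

// Reverse the encoding: restore the standard alphabet and padding, then decode.
function expandQuery(q: string): string {
  const b64 = q.replace(/-/g, "+").replace(/_/g, "/");
  const padded = b64 + "=".repeat((4 - (b64.length % 4)) % 4);
  return atob(padded);
}

const original = "coi=1&autorun=glv-csweep&logn=17&cs=13,15";
const q = bundleQuery(original);
console.log(`?q=${q}`);
console.log(expandQuery(q) === original);
```

The page would then expand q back into ordinary parameters before anything reads them, which is the role the commit message assigns to history.replaceState.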

Also switch the glv-csweep timing from min-of-6 to median: min latches onto
fast mobile-GPU glitches (a dropped/early-returning dispatch resolving in tens
of ms), yielding impossible sub-100ms readings; median rejects both tails. Min
and raw samples are retained per row for transparency.
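
The difference between the two statistics is easy to see on a made-up sample set. The timings below are illustrative only, not measurements from this PR, and `median` is a hypothetical helper.

```typescript
function median(samples: number[]): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  // Odd count: middle element. Even count: mean of the two middle elements.
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Illustrative samples (ms): five plausible dispatches plus one early-returning glitch.
const samplesMs = [81.6, 82.0, 12.3, 83.1, 82.5, 84.0];
console.log({ min: Math.min(...samplesMs), median: median(samplesMs) });
```

Here the minimum latches onto the 12.3 ms glitch, while the median stays inside the cluster of real dispatch times.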