feat(bb/msm): GLV endomorphism decomposition for laptop/mobile BN254 MSM#23731
Draft
AztecBot wants to merge 11 commits into
- swiftshader-noble.mjs: add --use-vulkan=swiftshader; without it headless Chromium returns a null WebGPU adapter on the lavapipe-less dev box and the GLV cross-check cannot run.
- main.ts: add glv-csweep autorun, which sweeps GLV window size c against the real production baseline (pickC) and matched-c baselines, with warm-up + min-of-reps steady-state timing and algo+pool+total memory. The earlier glv-compare forced both sides to c=15, a config the n=2^17 baseline never uses (pickC=13), hiding GLV's real lever (fewer windows at its own optimal c).
- run-browserstack.mjs: forward --cs to the page query string.
The first GLV pass stored both n original points and n phi(P)=(beta*x,y) points, doubling the SRS pool (+8 MiB at n=2^17), the only axis where GLV lost to the production baseline. Compute phi in the point gather instead, so the pool holds only the original n points:
- decompose_scalars_booth: XOR the half-scalar sign (folded by the host into bit 255 of the sub-2^127 scalar, always 0 for a 254-bit Fr scalar) into the stored Booth sign, so a negative half-scalar negates its point.
- gather kernels (stream_walker, size1, stream_accum(+debug), recompute_split): a value index >= GLV_HALF is a phi-term, so gather point (idx - GLV_HALF) and scale x by Montgomery(beta). GLV_HALF is baked = pool.srsN, so a baseline pool never triggers it. No new bindings (Mali 10-binding safe).
- buildGlvInputsOnTheFly: n points + 2n signed-bit scalars.
- MsmConfig.glvOnTheFly relaxes the n<=srsN guard to n<=2*srsN.
Cross-checked GREEN vs noble under SwiftShader (logn 8,10), on-the-fly path.
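The sign folding described in this commit can be modelled on the CPU. The sketch below is illustrative only: it assumes a scalar is held as eight little-endian 32-bit limbs (so scalar bit 255 is bit 31 of limb 7), and the function names are hypothetical, not the identifiers used by decompose_scalars_booth.

```typescript
// Assumed layout: 8 little-endian u32 limbs; scalar bit 255 is bit 31 of limb 7.
const TOP_LIMB = 7;
const SIGN_MASK = 0x80000000;

// Host side: store the half-scalar's magnitude and fold its sign into bit 255.
// The magnitude is below 2^127, so bit 255 is free to carry the sign.
function foldSign(magnitude: Uint32Array, negative: boolean): Uint32Array {
  if (magnitude.length !== 8) throw new RangeError("expected 8 limbs");
  if ((magnitude[TOP_LIMB] & SIGN_MASK) !== 0) {
    throw new RangeError("bit 255 is already set");
  }
  const packed = Uint32Array.from(magnitude);
  if (negative) packed[TOP_LIMB] = (packed[TOP_LIMB] | SIGN_MASK) >>> 0;
  return packed;
}

// Decompose side: read the folded sign back out of bit 255.
function halfScalarIsNegative(packed: Uint32Array): boolean {
  return (packed[TOP_LIMB] & SIGN_MASK) !== 0;
}

// XOR the half-scalar sign into the Booth digit sign: the point is negated
// when exactly one of the two signs is negative.
function pointIsNegated(boothNegative: boolean, packed: Uint32Array): boolean {
  return boothNegative !== halfScalarIsNegative(packed);
}
```

A baseline 254-bit Fr scalar never sets bit 255, so under this model it always reads back as non-negative and the Booth sign passes through unchanged.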
…ctions
The +7% was a c=15 artifact; at production c=13 stored-phi GLV is ~5% faster. GLV is a time<->memory knob (store the 2n points or recompute phi on-the-fly), not a free both-axis win.
…no per-gather β-mul)
…stamp profiling on Adreno)
…) + robust median timing
Android Chrome on BrowserStack real devices launches the URL via an intent that truncates it at the first shell/intent delimiter (& and ;), so ?coi=1&autorun=glv-csweep&... arrived as just ?coi=1: autorun/logn/cs were silently dropped and the page idled after boot. This blocked every prior mobile run (misattributed to seat contention). Bundle the whole query into a single base64url q param (only [A-Za-z0-9-_], no delimiters); main.ts expands it via history.replaceState before any param is read, and the vite COI middleware base64-decodes q to still set COOP/COEP. run-browserstack.mjs emits the bundled form for Android targets.
Also switch the glv-csweep timing from min-of-6 to median: min latches onto fast mobile-GPU glitches (a dropped/early-returning dispatch resolving in tens of ms), yielding impossible sub-100ms readings; median rejects both tails. Min and raw samples are retained per row for transparency.
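The q-param bundling can be sketched as a round trip. This is a minimal illustration, not the code in main.ts or run-browserstack.mjs: the helper names are hypothetical, the query is assumed to be plain ASCII, the logn value is only an example, and btoa/atob are assumed to be available (browsers and recent Node versions).

```typescript
// Encode an ASCII query string as unpadded base64url, whose alphabet
// [A-Za-z0-9-_] contains no '&' or ';' for the intent launcher to cut at.
function bundleQuery(query: string): string {
  return btoa(query)
    .replace(/\+/g, "-")
    .replace(/\//g, "_")
    .replace(/=+$/, "");
}

// Reverse the mapping: restore the standard alphabet and padding, then decode.
function expandQuery(q: string): string {
  const b64 = q.replace(/-/g, "+").replace(/_/g, "/");
  const padded = b64 + "=".repeat((4 - (b64.length % 4)) % 4);
  return atob(padded);
}

const originalQuery = "coi=1&autorun=glv-csweep&logn=17";
const bundledQuery = bundleQuery(originalQuery);
const androidSearch = "?q=" + bundledQuery; // single param, no delimiters
const restoredQuery = expandQuery(bundledQuery);
```

On the page side, the restored string would be what gets written back with history.replaceState before any parameter is read.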
GLV endomorphism MSM for laptop/mobile GPUs
GLV decomposition for BN254 MSM on the stream-walker branch. The endomorphism φ(x,y)=(βx,y)=[λ]P splits each scalar as s = s₁+λs₂ with |sᵢ|≈2¹²⁷, so an n-point/254-bit MSM becomes a 2n-point/127-bit MSM, halving the window count T and with it the n-independent BPR + Horner stages. Full writeup: GLV_ANALYSIS.md.
TL;DR: a both-axis win (real-hardware M2)
The first revisions presented GLV as a time↔memory knob: stored-φ was faster but cost memory; on-the-fly-φ saved memory but cost time. This revision adds precomputed-φ, which beats the baseline and stored-φ on both axes — so that tradeoff was an implementation artifact, not a fundamental limit.
(BrowserStack Apple M2, n=2¹⁷; time = min of 6 steady-state dispatches; memory = statsBytes.) Precomputed-φ is faster than stored-φ (−8.1% vs −5.1%) at baseline memory (vs stored-φ's +5.7%). It dominates every other variant on the Pareto front.
Why precomputed-φ wins (and why it is binding-safe)
φ(P)=(β·x, y) shares its y with P. The three variants differ only in how the φ-term's coordinates reach the bucket-accumulate gather:
- stored-φ: the pool stores all 2n points, but point_y[n+i] == point_y[i], so it duplicates n y-coordinates (the +8 MiB / +5.7%).
- on-the-fly-φ: the pool stores only the n original points, and the gather multiplies x by β for every φ-term.
- precomputed-φ: β·x lives in the upper half of the point_x buffer (a 2n-slot x, an n-slot y) and the gather reads it directly, with no per-gather β-multiply. β·x is filled once at pool-build time by a new glv_phi_fill kernel. y is reused from the lower half, so nothing is duplicated.
So precomputed-φ keeps stored-φ's β-multiply-free gather and halves stored-φ's point_y, which on the memory-bandwidth-bound stream-walker trims even the accumulate (less y-traffic); hence it is faster than stored-φ.
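To make the difference between the three gathers concrete, here is a toy CPU model. It is a sketch under loud assumptions: it uses the prime 13 with β=3 (a cube root of unity mod 13) instead of the BN254 base field, the coordinates are arbitrary residues rather than curve points, and the function names are hypothetical stand-ins for the WGSL kernels.

```typescript
// Toy field, NOT BN254: 3^3 = 27 ≡ 1 (mod 13), so 3 is a cube root of unity.
const TOY_P = 13;
const TOY_BETA = 3;

interface Coords { x: number; y: number; }

// stored-φ: 2n-slot x and 2n-slot y; the upper half of y duplicates the lower.
function gatherStored(px: number[], py: number[], idx: number): Coords {
  return { x: px[idx], y: py[idx] };
}

// on-the-fly-φ: n-slot x and y; every φ-term pays one β-multiply in the gather.
function gatherOnTheFly(px: number[], py: number[], idx: number): Coords {
  const glvHalf = py.length;
  if (idx < glvHalf) return { x: px[idx], y: py[idx] };
  const i = idx - glvHalf;
  return { x: (TOY_BETA * px[i]) % TOY_P, y: py[i] };
}

// precomputed-φ: 2n-slot x (upper half filled once), n-slot y reused.
function gatherPrecomputed(px2n: number[], py: number[], idx: number): Coords {
  const glvHalf = py.length;
  return { x: px2n[idx], y: py[idx < glvHalf ? idx : idx - glvHalf] };
}

// Model of the one-off fill at pool-build time: append β·x for every point.
function phiFill(xs: number[]): number[] {
  return xs.concat(xs.map((x) => (TOY_BETA * x) % TOY_P));
}

const toyXs = [1, 5, 7];
const toyYs = [2, 4, 6];
const toyXs2n = phiFill(toyXs);
const toyYs2n = toyYs.concat(toyYs); // only stored-φ needs this duplicate
```

All three gathers return the same coordinates for every index in 0..2n-1; they differ only in where the β-multiply happens and how many y slots the pool holds.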
Crucially it adds no new GPU binding: ba_stream_walker already uses 10 storage buffers (0–9) + 1 uniform, and an 11th storage binding is invalid on Mali/Adreno. Precomputed-φ reuses the existing point_x/point_y bindings (resized), staying within the mobile 10-binding limit. The half-scalar sign is folded into scalar bit 255 (decompose XORs it), same as on-the-fly.
Code: MsmV2Pool.createGlvPrecomputed, MsmConfig.glvPrecomputed, the glv_phi_fill kernel, and a glv_precomp branch in ba_size1/ba_stream_walker's load_pt_x.
Phase profile (M2 GPU timestamps, c=13): where the win comes from
stream_walker is ~77% of the wall: it is the memory-bound, GLV-invariant accumulate (T·n digits unchanged). GLV's gain is in the n-independent tail (reduce + walker_combine + planner, ∝ T·B), which the halved window count (T 20→10) shrinks; precomputed-φ's smaller point_y additionally trims stream_walker. This matches the stage accounting and the ground truth that the walker is bandwidth-bound.
Pipeline note: where BPR actually lives
On this branch the legacy V2 pair-tree kernels (ba_fused_super/ba_carry/ba_finalize) are no longer dispatched (their buffers are 4-byte stubs); there is a single accumulator, the stream-walker, followed by a reduce (bucket-point-reduction) stage that loops over numWindows. GLV halving T therefore halves that reduce plus the window/bucket-bound planner stages, while the O(T·n) accumulate is GLV-invariant.
Budget-batch composition
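As a rough sketch of the batch-count rule this section discusses, the code below searches for the smallest nb that keeps one batch under the dispatch cap. The function and constant names are hypothetical; the constants (WGI=128, a 65 000-workgroup cap) and the window counts (T=20 for 254-bit scalars and T=10 for 127-bit half-scalars at c=13) are taken from this description.

```typescript
const WGI = 128;            // from the description (WGI=128)
const DISPATCH_CAP = 65000; // workgroup cap for a single dispatch

// Number of c-bit windows needed to cover a scalar of the given bit length.
function windowCount(scalarBits: number, c: number): number {
  return Math.ceil(scalarBits / c);
}

// Smallest nb such that ceil(ceil(T / nb) * n / WGI) < DISPATCH_CAP.
function numBatches(T: number, n: number): number {
  for (let nb = 1; nb <= T; nb++) {
    const workgroups = Math.ceil((Math.ceil(T / nb) * n) / WGI);
    if (workgroups < DISPATCH_CAP) return nb;
  }
  throw new RangeError("even one window per batch exceeds the dispatch cap");
}

const baselineT = windowCount(254, 13); // full-width scalars
const glvT = windowCount(127, 13);      // GLV half-scalars

const batchesBaseline17 = numBatches(baselineT, 1 << 17);
const batchesGlv17 = numBatches(glvT, 1 << 17);
const batchesBaseline20 = numBatches(baselineT, 1 << 20);
const batchesGlv20 = numBatches(glvT, 1 << 20);
```

Under these assumptions neither configuration batches at n=2¹⁷, while at n=2²⁰ the baseline needs three batches and GLV two.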
The budget-batch lever splits the windows into numBatches groups when a single batch would exceed the 65 000-workgroup dispatch cap: numBatches is the smallest nb with ⌈(⌈T/nb⌉·n)/WGI⌉ < 65 000 (WGI=128). GLV's halved T composes with it on both fronts:
- n=2²⁰, baseline (T=20): numBatches = 3 (⌈20/3⌉·8192 < 65 000)
- n=2²⁰, GLV (T=10): numBatches = 2 (⌈10/2⌉·8192 < 65 000)
At 2¹⁷ neither batches; at 2²⁰ GLV needs one fewer batch than the baseline (fewer redundant preprocess/planner passes) and each batch's bucket working set (∝ batchWindows·BW) is halved. So GLV's memory win and the budget-batch lever are complementary, not competing.
Correctness: SwiftShader, GLV vs noble CPU reference (all GREEN)
Memory: measured working set (statsBytes)
At c=13 precomputed-φ ≈ baseline; at c=15 it is −15.6% vs baseline. Both are under the 100 MB budget that the c=15 baseline exceeds.
Adreno / Mali
The mobile A/B (Snapdragon S25-Ultra Adreno / Pixel 9 Pro XL Mali; no timestamp-query → GPU wall time) is wired (--target s25-ultra --profile 0), but the shared BrowserStack pool has been continuously saturated by another agent's batch holding both concurrency slots, so queued workers starved. The win is architecture-general: it removes per-gather ALU (the β-multiply) and memory traffic (the duplicated y), and the n-independent tail that GLV halves is a larger fraction on weaker mobile GPUs, so it is expected to be at least as large there as on the M2. Numbers will be added when a mobile seat frees.
Reproduce