
feat(bb/msm): mobile-first MSM — SwiftShader correctness harness + GLV design#23728

Draft
AztecBot wants to merge 4 commits into stream-walker-impl from cb/msm-opt/glv-mobile-7f3a

AztecBot commented May 30, 2026

Mobile-first BN254 WebGPU MSM — GLV decomposition

Autonomous work toward a memory- and time-optimal BN254 MSM for laptop/mobile GPUs (Apple TBDR, Adreno, Mali) under the ≤100 MB algorithm-buffer budget and 16/32 KB workgroup-memory limits. Full design + budget: barretenberg/ts/src/msm_webgpu/MOBILE_MSM_DESIGN.md.

Algorithmic contribution: GLV endomorphism (cuzk/glv.ts + scalarBits knob)

MSM_DESIGN_ANALYSIS.md §3.6 flags GLV as the top unexploited win ("wins everywhere", used by neither baseline). BN254 has the endomorphism φ(x,y)=(βx,y)=[λ]P for P=(x,y), where β and λ are nontrivial cube roots of unity in Fq and modulo r respectively, so every scalar splits as k ≡ k₁ + λ·k₂ (mod r) with |kᵢ| < 2¹²⁷. An n-pair, 254-bit MSM therefore becomes a 2n-pair, 128-bit MSM Σ k₁ᵢPᵢ + Σ k₂ᵢφPᵢ. The constants were derived offline (Tonelli-Shanks cube roots, Gauss lattice reduction), verified against the group law, and are re-asserted at module load.
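The decomposition described above can be sketched in plain BigInt TypeScript. This is a minimal illustration, not the code in cuzk/glv.ts: it recomputes λ and the reduced lattice basis at runtime instead of using offline constants, and every function name here is hypothetical. It also does not check which of the two nontrivial cube roots of unity pairs with β in φ; the scalar identity k ≡ k₁ + λ·k₂ (mod r) holds for either.

```typescript
// BN254 scalar-field (group) order r.
const R = 21888242871839275222246405745257275088548364400416034343698204186575808495617n;

type Vec = [bigint, bigint];

const mod = (a: bigint, m: bigint): bigint => ((a % m) + m) % m;
const dot = (u: Vec, v: Vec): bigint => u[0] * v[0] + u[1] * v[1];

// Square-and-multiply modular exponentiation.
function powMod(base: bigint, exp: bigint, m: bigint): bigint {
  let acc = 1n;
  let b = mod(base, m);
  for (let e = exp; e > 0n; e >>= 1n) {
    if ((e & 1n) === 1n) acc = (acc * b) % m;
    b = (b * b) % m;
  }
  return acc;
}

// Round a/b to the nearest integer (BigInt division truncates toward zero).
function divRound(a: bigint, b: bigint): bigint {
  if (b < 0n) { a = -a; b = -b; }
  const num = 2n * a + b;
  const den = 2n * b;
  const q = num / den;
  return num % den < 0n ? q - 1n : q;
}

// A nontrivial cube root of unity mod m (assumes m is prime and m ≡ 1 mod 3).
function cubeRootOfUnity(m: bigint): bigint {
  for (let g = 2n; g < 64n; g++) {
    const w = powMod(g, (m - 1n) / 3n, m);
    if (w !== 1n) return w;
  }
  throw new Error("no nontrivial cube root of unity found");
}

// Lagrange (Gauss) reduction of a two-dimensional lattice basis.
function reduceBasis(u: Vec, v: Vec): [Vec, Vec] {
  for (;;) {
    if (dot(u, u) > dot(v, v)) [u, v] = [v, u];
    const q = divRound(dot(u, v), dot(u, u));
    v = [v[0] - q * u[0], v[1] - q * u[1]];
    if (dot(v, v) >= dot(u, u)) return [u, v];
  }
}

const LAMBDA = cubeRootOfUnity(R);
// Lattice of pairs (a, b) with a + b·λ ≡ 0 (mod r), reduced to a short basis.
const [V1, V2] = reduceBasis([R, 0n], [LAMBDA, -1n]);

// Babai rounding: subtract the lattice vector nearest to (k, 0).
function glvDecompose(k: bigint): Vec {
  const det = V1[0] * V2[1] - V1[1] * V2[0];
  const c1 = divRound(k * V2[1], det);
  const c2 = divRound(-k * V1[1], det);
  return [k - c1 * V1[0] - c2 * V2[0], -c1 * V1[1] - c2 * V2[1]];
}

// Usage: split a scalar and recombine it.
const [k1, k2] = glvDecompose(12345678901234567890123456789n);
console.log(mod(k1 + LAMBDA * k2, R) === 12345678901234567890123456789n);
```

The halves are signed, so a consumer of this sketch would negate the corresponding point whenever k₁ or k₂ is negative and feed the absolute value into the 128-bit MSM.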

Total accumulation work is invariant (2n·T′ ≈ n·T nonzero digits, where T′ = T/2 is the GLV window count), but halving the scalar bit length halves the window count T. The effect on each term:

| term | baseline | GLV | effect |
|---|---|---|---|
| BPR + transpose-scan (T·2ᶜ, the n-independent term; 37 % of GPU wall @2¹⁶) | T | T/2 | time −, halved |
| Horner (T) | T | T/2 | time − |
| accumulation (n·T) | n·T | n·T | flat (no regression) |
| algorithm buffers @2¹⁷ | 12.52 MB | 8.02 MB | −36 % |
| algorithm buffers @2²⁰ | 43.33 MB | 29.56 MB | −32 % |

(Buffer totals computed from the real allocation formulas in ba_stream_plan.ts, holding thread count at the work-justified baseline.) Mobile-agnostic: φ is one Fq-multiply + a free coordinate copy, zero extra workgroup memory.
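The window-count and buffer rows can be sanity-checked with a few lines. This is an illustrative sketch: numWindows follows the ceil(scalarBits / c) rule stated in the commit message, while the window width c = 13 and the helper names are assumptions chosen for the example, not values taken from this PR.

```typescript
// Number of c-bit windows needed to cover a scalar of the given bit length.
function numWindows(scalarBits: number, c: number): number {
  return Math.ceil(scalarBits / c);
}

// Size of the n-independent bucket term T * 2^c.
function bucketTerm(scalarBits: number, c: number): number {
  return numWindows(scalarBits, c) * 2 ** c;
}

// Relative change between two buffer totals, in whole percent (negative = saving).
function percentChange(baseline: number, glv: number): number {
  return Math.round(((glv - baseline) / baseline) * 100);
}

const c = 13; // hypothetical window width
console.log(numWindows(254, c), numWindows(128, c)); // 20 10
console.log(bucketTerm(254, c) / bucketTerm(128, c)); // 2
console.log(percentChange(12.52, 8.02), percentChange(43.33, 29.56)); // -36 -32
```

For window widths where ceil(254 / c) is odd, the GLV window count is only approximately half the baseline, so the exact ratio depends on c.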

Correctness — validated under SwiftShader (GPU-less host)

New WASM-free, network-free WebGPU correctness oracle vs noble (dev/msm-webgpu/xcheck.*), run under SwiftShader:

baseline    logN=8 PASS (c=4)   logN=10 PASS (c=8)
GLV (128b)  logN=8 PASS (c=5)   logN=10 PASS (c=8)   max |kᵢ|=126 bits

Commits

  1. SwiftShader correctness harness (xcheck.{html,ts} + driver)
  2. cuzk/glv.ts GLV decomposition + MsmV2.scalarBits knob
  3. MOBILE_MSM_DESIGN.md design + memory/time budget
  4. ?glv=1 knob in the bench harness for on-device runs

Status / blockers (honest)

  • No local GPU: correctness validated only under SwiftShader at logN=8 and logN=10 (per task constraints).
  • BrowserStack: blocked. Both seats (2, shared across ~10 agents) were occupied throughout this session, and this work never interfered with running jobs. The on-device path is turnkey: cloudflared is installed and ?glv=1 is wired into the bench autorun (the WASM/noble cross-check then also confirms GLV correctness on the real device). To reproduce on a free seat, run
    node dev/msm-webgpu/scripts/run-browserstack.mjs --target macos --n 17 --autorun msm-bench
    and then run the same command with &glv=1 appended to the page URL.
  • Peak GPU memory is not measurable via WebGPU (no API; performance.memory is JS heap). The credible memory evidence is therefore the analytical buffer budget above, derived from the real allocation formulas.
  • Designed, not yet wired (full memory-optimality): work-invariant thread allocation under GLV, and on-the-fly φ point-fetch to avoid doubling the input SRS — see design doc §4/§6.
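
The on-the-fly φ point-fetch listed in the last bullet could look roughly like the following. This is a CPU-side sketch under stated assumptions, not the design from the design doc: fetchPoint, phi and the virtual 2n-entry indexing are hypothetical names and conventions, β is recomputed at runtime rather than taken from cuzk/glv.ts, and the pairing of this β with a particular λ is not checked.

```typescript
// BN254 base-field prime p; the curve is y^2 = x^3 + 3 over Fp.
const P = 21888242871839275222246405745257275088696311157297823662689037894645226208583n;

interface AffinePoint { x: bigint; y: bigint; }

// Square-and-multiply modular exponentiation.
function powMod(base: bigint, exp: bigint, m: bigint): bigint {
  let acc = 1n;
  let b = ((base % m) + m) % m;
  for (let e = exp; e > 0n; e >>= 1n) {
    if ((e & 1n) === 1n) acc = (acc * b) % m;
    b = (b * b) % m;
  }
  return acc;
}

// A nontrivial cube root of unity in Fp (assumes p ≡ 1 mod 3).
function cubeRootOfUnity(m: bigint): bigint {
  for (let g = 2n; g < 64n; g++) {
    const w = powMod(g, (m - 1n) / 3n, m);
    if (w !== 1n) return w;
  }
  throw new Error("no nontrivial cube root of unity found");
}

const BETA = cubeRootOfUnity(P);

// φ(x, y) = (βx, y): one field multiplication plus a copy of y.
function phi(pt: AffinePoint): AffinePoint {
  return { x: (BETA * pt.x) % P, y: pt.y };
}

// Virtual 2n-entry point table backed by an n-entry SRS:
// indices [0, n) return P_i, indices [n, 2n) return φ(P_{i-n}) computed on demand.
function fetchPoint(srs: readonly AffinePoint[], i: number): AffinePoint {
  const n = srs.length;
  if (i < 0 || i >= 2 * n) throw new RangeError(`index ${i} outside [0, ${2 * n})`);
  return i < n ? srs[i] : phi(srs[i - n]);
}

function isOnCurve(pt: AffinePoint): boolean {
  return (pt.y * pt.y) % P === (pt.x * pt.x * pt.x + 3n) % P;
}

// Usage: the BN254 G1 generator (1, 2) and its image under φ are both on the curve.
const srs: AffinePoint[] = [{ x: 1n, y: 2n }];
console.log(isOnCurve(fetchPoint(srs, 0)), isOnCurve(fetchPoint(srs, 1))); // true true
```

Because the second half of the table is derived from the first, only the original n points need to be stored; the trade-off is one extra field multiplication per fetch from the φ half.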

AztecBot added the claudebox label ("Owned by claudebox. it can push to this PR.") May 30, 2026
AztecBot added 3 commits May 30, 2026 00:14
Add BN254 GLV scalar decomposition (cuzk/glv.ts): splits each 254-bit scalar
k into (k1, k2) with |ki| < 2^127 via Gauss-reduced lattice + Babai rounding,
so an n-pair 254-bit MSM becomes a 2n-pair 128-bit MSM (Sigma k1 P + k2 phiP).
Constants derived offline (Tonelli-Shanks cube roots, Gauss reduction),
verified against the group law, and re-asserted at module load.

MsmV2 gains a scalarBits config knob: numWindows = ceil(scalarBits / c), so
halving the scalar bit length under GLV halves the window count T (and thus
the bucket-reduction work and bucket_sums buffer). Default 254 preserves
existing behaviour.

Cross-check harness (dev/msm-webgpu/xcheck.*): WASM-free, network-free WebGPU
MSM correctness oracle vs noble; runs under SwiftShader on a GPU-less host.
GLV mode (?glv=1) validated PASS at logn=8,10.

Strictly additive: routes the WebGPU path through GLV decomposition (2n
pairs, scalarBits=128) when ?glv=1 is set. The WASM/noble cross-check still
validates the result (GLV output == original MSM), so a BrowserStack run
yields on-device GLV correctness plus timing. Default runs unaffected.
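
A minimal sketch of how a page-level ?glv=1 switch might map to run parameters. The MsmRunConfig shape and configFromQuery name are hypothetical and are not the bench harness's actual code; only the values (scalarBits 128 with 2n pairs under GLV, 254 by default) come from the commit messages above.

```typescript
interface MsmRunConfig {
  glv: boolean;
  scalarBits: number;     // bit length fed to the window-count computation
  pairMultiplier: number; // 2 under GLV (n pairs become 2n), otherwise 1
}

// Derive the run configuration from a URL query string such as "?n=17&glv=1".
function configFromQuery(search: string): MsmRunConfig {
  const glv = new URLSearchParams(search).get("glv") === "1";
  return glv
    ? { glv, scalarBits: 128, pairMultiplier: 2 }
    : { glv, scalarBits: 254, pairMultiplier: 1 };
}

console.log(configFromQuery("?n=17&glv=1").scalarBits); // 128
console.log(configFromQuery("?n=17").scalarBits); // 254
```

Keeping the default branch identical to the pre-GLV values is what makes the switch strictly additive in this sketch.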