fix(bb/msm): correct MsmV2 multi-batch (numBatches>1) + honest peak-memory map by AztecBot · Pull Request #23733 · AztecProtocol/aztec-packages

AztecBot · 2026-05-30T01:37:09Z

Axis B — MsmV2 peak GPU memory: the budget lever, now correct

Headline: the multi-batch path is now correct (was returning the wrong MSM)

Self-reviewing the numBatches memory lever uncovered that every nb>1
returned the wrong MSM — and the same multi-batch code is the wgFits-forced
default at logn19 (nb=2) / logn20 (nb=3), so MsmV2 was very likely
incorrect by default at logn≥19. This PR roots out and fixes both bugs.

Cross-check vs @noble/curves (headless SwiftShader — correctness is
hardware-independent; nb forced via the new MsmConfig.forceNumBatches):

| logn | nb=1 | nb=2 | nb=3 | nb=4 |
|------|------|------|------|------|
| 10 | ✅ | ✅ | ✅ | ✅ |
| 14 | ✅ | ✅ | ✅ | — |
| 16 | ✅ | ✅ | — | — |

Previously every nb>1 disagreed (the prior revision's failure table was at
logn16, where nb=2 was ❌). Now all pass, and logn16 nb=2 passes at the same
statsBytes peak as before the fix (39.1 MB, down from 47.4 MB at nb=1,
−17.5 %), confirming the fix changes no buffer size. At nb=1, bucket_base=0
and the batch loop runs once, so the default path is byte-identical to before.

Root cause & fix

Two bugs, both nb=1-invisible (at nb=1 batchBuckets == bTotal, local == global):

1. Missing global bucket offset. Each batch builds its CSR in the LOCAL
   window space [0, batchWindows) → LOCAL bucket space [0, batchBuckets), but
   the three kernels writing the full-bTotal bucketResult (ba_size1,
   ba_stream_walker, ba_walker_combine) indexed it by the local bucket, so
   every batch overwrote [0, batchBuckets) instead of its disjoint global slice.
   Fix: thread bucket_base = bi*batchBuckets into those kernels' destination
   index (ba_size1 .y; ba_stream_walker .w, with M_partials now derived
   in-shader to free the slot, so no new binding; ba_walker_combine .z), with
   one bind group per batch (mirroring the existing per-batch decomposeBinds).
   The CSR / partials / linked-list spaces stay LOCAL; only the final write
   adds the base.
2. Stale per-batch walker scratch. bucketHead (atomic linked-list heads),
   walkerNodeCounter (atomic node allocator), taskCuts, and threadCuts were
   cleared once before the loop, so batch bi walked batch bi-1's stale
   heads or overflowed max_nodes and silently dropped its partials. Fix: clear
   them per batch; bucketResult and the 8 MB walkerPartials are still cleared
   once (disjoint accumulation / only fresh linked slots are read).
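A minimal CPU-side sketch of the two failure modes, assuming plain arrays in place of GPU buffers. The function names (`writeBucketResult`, `allocateNodes`) and the marker values are hypothetical and are not taken from msm_v2.ts or the WGSL kernels; the sketch only models the index arithmetic and the clear placement described above.

```typescript
// CPU-side model only: plain arrays stand in for GPU buffers, and every name
// below is illustrative rather than taken from msm_v2.ts or the WGSL kernels.

/** Writes one marker value per (batch, local bucket) into a shared result array. */
function writeBucketResult(
  numBatches: number,
  batchBuckets: number,
  addBucketBase: boolean,
): number[] {
  const bTotal = numBatches * batchBuckets;
  const bucketResult: number[] = new Array<number>(bTotal).fill(0); // cleared once
  for (let bi = 0; bi < numBatches; bi++) {
    // With the fix, each batch targets its own disjoint global slice.
    const bucketBase = addBucketBase ? bi * batchBuckets : 0;
    for (let local = 0; local < batchBuckets; local++) {
      bucketResult[bucketBase + local] = (bi + 1) * 100 + local;
    }
  }
  return bucketResult;
}

interface BatchNodes {
  linked: number;
  dropped: number;
}

/** Models a bump allocator of capacity maxNodes that is shared across batches. */
function allocateNodes(
  numBatches: number,
  nodesPerBatch: number,
  maxNodes: number,
  clearPerBatch: boolean,
): BatchNodes[] {
  let nodeCounter = 0; // cleared once before the loop
  const out: BatchNodes[] = [];
  for (let bi = 0; bi < numBatches; bi++) {
    if (clearPerBatch) nodeCounter = 0; // with the fix, reset inside the loop
    let linked = 0;
    let dropped = 0;
    for (let i = 0; i < nodesPerBatch; i++) {
      if (nodeCounter < maxNodes) {
        nodeCounter++;
        linked++;
      } else {
        dropped++; // allocator exhausted: this partial is silently lost
      }
    }
    out.push({ linked, dropped });
  }
  return out;
}
```

With `numBatches = 1` both variants of each function coincide, which is the sense in which the bugs are invisible at nb=1.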

Memory: the lever is now a correct memory/time trade

statsBytes() is a pure sum of GPUBuffer.size, a closed form of
(n, c, numBatches) — the fix changes no buffer size, so the prior
real-hardware A/B (macOS Sequoia · Chrome 148, logn16, median-of-5 GPU wall)
carries over verbatim, now on a correct path:

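| nb | peak (`statsBytes`) | macOS wall | Δ time |
|----|---------------------|------------|--------|
| 1 | 47.4 MB | 50.7 ms | — |
| 2 | 39.1 MB | 60.1 ms | +18.5 % |
| 3 | 36.6 MB | 64.7 ms | +27.6 % |
| 5 | 34.1 MB | 90.0 ms | +77.5 % |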
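As a quick arithmetic check, the Δ time column (and the matching memory deltas) can be recomputed from the peak and wall values in the table. This is a small sketch; `TradeRow`, `pct`, and `relativeToBaseline` are illustrative names, not part of the PR.

```typescript
interface TradeRow {
  nb: number;
  peakMB: number;
  wallMs: number;
}

// Values copied from the logn16 table above.
const measured: TradeRow[] = [
  { nb: 1, peakMB: 47.4, wallMs: 50.7 },
  { nb: 2, peakMB: 39.1, wallMs: 60.1 },
  { nb: 3, peakMB: 36.6, wallMs: 64.7 },
  { nb: 5, peakMB: 34.1, wallMs: 90.0 },
];

/** Percentage change of value against baseline, rounded to one decimal place. */
const pct = (value: number, baseline: number): number =>
  Math.round((value / baseline - 1) * 1000) / 10;

/** Expresses every row relative to the first row (the nb=1 baseline). */
function relativeToBaseline(
  rows: TradeRow[],
): { nb: number; memDeltaPct: number; timeDeltaPct: number }[] {
  const base = rows[0];
  if (base === undefined) return [];
  return rows.map((r) => ({
    nb: r.nb,
    memDeltaPct: pct(r.peakMB, base.peakMB),
    timeDeltaPct: pct(r.wallMs, base.wallMs),
  }));
}

console.table(relativeToBaseline(measured));
```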

Peak falls monotonically while wall time rises at an accelerating rate, with
diminishing returns past nb≈2–3. So tightening the default budget to force
batching would still regress the common path: the no-op default (MEM_BUDGET
248 MB) is right. Deterministic peak across sizes (script-verified by
mem-accounting.mjs): 68.8 MB total at logn17 and 229.7 MB at logn20, with a
fully-batched floor of 38.7 MB and 168.1 MB respectively. Batching cannot
reach below that fixed floor (scalarsRaw 32 MB @logn20, bucketResult, redBuf,
walkerPartials, SRS).

(Per-nb real-hardware re-time queued on BrowserStack — both seats were saturated
at write time. nb=1 wall is unchanged by construction since that path is
byte-identical; the SwiftShader cross-check proves correctness independent of
hardware.)

Conclusion

The host-buffer-management lever is a correct but steep memory/time trade.
Pushing peak below the batching floor needs WGSL-level levers (in-place
bucket reduction −17 MB; on-GPU SRS y-recovery −32 MB via the existing
decompress_g1_bn254; per-batch scalar byte-slicing) — each a separate verified
change, flagged in MSM_V2_MEMORY.md.

Files

…/wgsl/cuzk/ba_size1, ba_stream_walker, ba_walker_combine .template.wgsl (+ regenerated _generated/shaders.ts) — per-batch bucket_base.
…/msm_webgpu/msm_v2.ts — per-batch writer bind groups; per-batch scratch clears; forceNumBatches knob; honest estimateMem; batchCount getter.
…/msm_webgpu/MSM_V2_MEMORY.md — corrected map + correctness fix + cross-check.
…/dev/msm-webgpu/main.ts — ?nb= / ?nbs= hooks; msm-membudget-sweep per-nb cross-check.
…/dev/msm-webgpu/scripts/mem-accounting.mjs — deterministic accounting.

Created by claudebox · group: aztec

…trade harness

Adds the instrumentation the self-review needs to turn the budget knob from an unmeasured no-op into a measured memory/time trade:
- mem-accounting.mjs: replicates ensureScratch sizing exactly (no GPU) to regenerate the per-buffer peak map + numBatches lever curve. Corrects the prior table: it omitted reducePrefScratch (~5.6MB @logn20) and undercounted the bucket lists, so the fully-batched floor is ~104MB scratch / ~168MB total @logn20, not the previously-claimed 92.5/156.
- MsmV2.batchCount getter + dev-page ?membudget knob + msm-membudget-sweep autorun: rebuilds MsmV2 across budgets at fixed logN on a real device, times GPU wall ms per numBatches, and cross-checks correctness vs noble.

…er floor

Script-verified (mem-accounting.mjs) corrections to the peak-memory map:
- adds reducePrefScratch (~5.6MB @logn20) the prior table omitted
- fully-batched floor @logn20 is ~104MB scratch / ~168MB total, not 92.5/156
- scalarsRawBuf (32MB) is the largest scratch buffer and is NOT batch-dependent
- documents that the ≤100MB-to-2^20 goal is unreachable by batching alone, and that the lever is a memory/time trade (decompose re-reads all n scalars per batch); time curve pending the real-hardware sweep

estimateMem omitted reducePrefScratch (~5.6MB @logn20), planMeta, and streamPlannerMeta, so a caller's memBudgetBytes silently under-bounded the true peak by ~6MB. Hoist the (data-independent) reduction MAXC above the budget fit-check and fold those buffers into fixedScratch. Default MEM_BUDGET (248MB) still dwarfs the ~172MB logn20 estimate, so the default numBatches is unchanged (the fix only affects callers who set a tight budget). Also documents on MsmConfig.memBudgetBytes that the lever is a measured memory/time trade.
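To illustrate why an under-counted estimate matters only under a tight budget, here is a hedged sketch of a budget fit-check. It assumes the peak splits into a batch-independent part plus a part that shrinks as 1/numBatches; that split, the names `MemModel`, `estimatePeak`, and `pickNumBatches`, and the MB figures are all illustrative assumptions, not the closed form used by estimateMem.

```typescript
// Hypothetical model: peak = fixed scratch + (batch-dependent scratch / nb).
// All sizes are in MB and are made-up round numbers, not measured values.
interface MemModel {
  fixedMB: number;
  batchedMB: number;
}

function estimatePeak(model: MemModel, nb: number): number {
  return model.fixedMB + Math.ceil(model.batchedMB / nb);
}

/** Returns the smallest nb whose estimated peak fits the budget, or null if none does. */
function pickNumBatches(
  model: MemModel,
  budgetMB: number,
  maxBatches: number,
): number | null {
  for (let nb = 1; nb <= maxBatches; nb++) {
    if (estimatePeak(model, nb) <= budgetMB) return nb;
  }
  return null;
}

// An estimate that forgets 6 MB of fixed scratch versus one that counts it.
const undercounted: MemModel = { fixedMB: 100, batchedMB: 60 };
const honest: MemModel = { fixedMB: 106, batchedMB: 60 };

const tightBudgetMB = 130;
const nbFromUndercount = pickNumBatches(undercounted, tightBudgetMB, 8);
const nbFromHonest = pickNumBatches(honest, tightBudgetMB, 8);
console.log({ nbFromUndercount, nbFromHonest });
```

In this toy setup the under-counted model accepts a batch count whose honest peak exceeds the budget, while a generous budget yields nb=1 for both models, which matches the observation that the default is unaffected.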

…e HW)

macOS Sequoia / Chrome 148, logn16, median-of-5 GPU wall, noble cross-check passed at every nb: nb1 47.4MB/50.7ms -> nb2 39.1MB/+18.5% -> nb5 34.1MB/+77.5%. Monotonic memory drop, steep+accelerating time cost, diminishing returns past nb2-3.

Conclusion: the host-buffer memory lever is exhausted (no free over-provisioning; batching is a deliberate trade), default no-op is correct, remaining time-neutral cuts need WGSL changes.

The membudget sweep only cross-checked the first (nb=1) result against noble, so multi-batch (nb>1) correctness was never actually verified — the whole point of the lever is that nb>1 stays correct. Record a per-row crossOk for every budget and fail the run if any nb disagrees.
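The per-row check described here could look roughly like the sketch below. `crossCheckSweep`, `SweepRow`, and the callback signatures are hypothetical stand-ins for the dev-page sweep, and the MSM result is modelled as an opaque value compared through a caller-supplied equality function.

```typescript
interface SweepRow {
  nb: number;
  crossOk: boolean;
}

/**
 * Runs computeMsm once per requested nb, records a per-row crossOk against
 * the reference result, and fails the whole run if any row disagrees.
 */
function crossCheckSweep<T>(
  nbs: number[],
  computeMsm: (nb: number) => T,
  reference: T,
  equal: (a: T, b: T) => boolean,
): SweepRow[] {
  const rows: SweepRow[] = nbs.map((nb) => ({
    nb,
    crossOk: equal(computeMsm(nb), reference),
  }));
  const failed = rows.filter((r) => !r.crossOk).map((r) => r.nb);
  if (failed.length > 0) {
    throw new Error(`MSM cross-check failed for nb=${failed.join(",")}`);
  }
  return rows;
}
```

Checking only the first row would let a sweep in which nb=1 agrees and every other nb disagrees pass silently, which is the gap this commit closes.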

- macOS cross-check is at nb=1 only (correct the earlier 'every nb' wording); nb>1 is the same kernels over disjoint window slices, all-nb sweep queued.
- Add Galaxy S25 Ultra (Adreno) trade row: same monotonic time/memory shape.
- Flag that the Android nb=1 MSM disagreed with noble (bases self-verified OK, so it's an Adreno MSM-compute issue, not this PR's buffer sizing); pre-existing, independent of this change.

…orrect

Per-nb cross-check on real macOS hardware (Chrome 148, logn16): nb=1 matches @noble/curves but nb=2,3,4,5,10 ALL disagree. This refutes the lever's premise ('no correctness change, same path as the logn20 default').

Root cause is host side: planner/walker bind groups + params are built once with no per-batch bucket offset, so the batchBuckets-vs-bTotal index spaces only coincide at nb=1.

Blast radius: the same multi-batch code is the wgFits-forced default at logn19 (nb=2) and logn20 (nb=3), so MsmV2 is very likely incorrect by default at logn>=19 (never cross-checked there; noble too slow). Distinct from the invisible bucket-0 issue in PR #23741.

Warns on MsmConfig.memBudgetBytes and documents the evidence + root-cause direction in MSM_V2_MEMORY.md. Does not change runtime behavior (default budget stays a no-op); the fix to the multi-batch path is required follow-up.

decompose applies batch_window_base only to the scalar-bit read (w_global), but writes local-window-indexed output (idx = w*input_size+p); the bucketResult-writing kernels (ba_size1/ba_stream_walker/ba_walker_combine) never re-apply a batch_window_base*BW offset, so nb>1 batches overwrite the low batchBuckets region of bucketResult instead of filling disjoint slices. Records the fix direction.
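The missing offset can be written out as index arithmetic. The sketch below assumes that every window owns BW buckets and that batch bi covers windows [bi*batchWindows, (bi+1)*batchWindows); neither assumption is spelled out in this commit message, and the type and function names are illustrative. Under those assumptions, offsetting by batch_window_base*BW equals offsetting by bi*batchBuckets.

```typescript
// Assumed layout (not confirmed by the source): each window owns
// bucketsPerWindow (BW) buckets, and batches cover consecutive window ranges.
interface BatchLayout {
  batchWindows: number;
  bucketsPerWindow: number;
}

/** Global bucket index via the window-level offset batch_window_base * BW. */
function globalIndexViaWindowBase(
  layout: BatchLayout,
  bi: number,
  localWindow: number,
  bucketInWindow: number,
): number {
  const batchWindowBase = bi * layout.batchWindows;
  const localBucket = localWindow * layout.bucketsPerWindow + bucketInWindow;
  return batchWindowBase * layout.bucketsPerWindow + localBucket;
}

/** Same index via the bucket-level offset bi * batchBuckets. */
function globalIndexViaBucketBase(
  layout: BatchLayout,
  bi: number,
  localWindow: number,
  bucketInWindow: number,
): number {
  const batchBuckets = layout.batchWindows * layout.bucketsPerWindow;
  const localBucket = localWindow * layout.bucketsPerWindow + bucketInWindow;
  return bi * batchBuckets + localBucket;
}
```

For batch 0 both offsets are zero, so the local and global indices agree and the omission goes unnoticed at nb=1.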

The numBatches budget lever (and the wgFits-forced default at logn19 nb=2 / logn20 nb=3) returned the wrong MSM: every batch's LOCAL CSR was written to the same [0, batchBuckets) region of bucketResult instead of its disjoint global slice, and the per-batch walker scratch (bucketHead/walkerNodeCounter/taskCuts/threadCuts) was reset only once before the loop, so batch>0 either walked batch0's stale linked-list heads or overflowed max_nodes and dropped its partials.

Fix:
- Thread a per-batch bucket_base = bi*batchBuckets into the three bucket-writing kernels (ba_size1 .y, ba_stream_walker .w, ba_walker_combine .z). The CSR / partials / linked-list index spaces stay LOCAL; only the final bucket_sums write adds the base. ba_stream_walker derives M_partials in-shader (2*NUM_THREADS*S) to free params.w with no extra binding.
- Build one size1/walker/combine bind group per batch (bucket_base uniform), matching the existing per-batch decomposeBinds.
- Move the per-batch scratch clears inside the batch loop; keep bucketResult and the 8 MB walkerPartials cleared once (disjoint accumulation / only fresh linked slots are read).
- Add MsmConfig.forceNumBatches (+ ?nb= / ?nbs= dev-page hooks) to pin nb for deterministic cross-checks.

At nb=1 bucket_base=0 and the loop runs once, so the default path is byte-identical to before. Cross-checked vs @noble/curves (headless SwiftShader): nb=1..4 @logn10 and nb=1..3 @logn14 all agree (previously every nb>1 disagreed). Updates the MsmConfig docs; MSM_V2_MEMORY.md updated separately.

perf(bb/msm): MsmV2 memory-budget batch lever + corrected peak-memory map

80b1059

AztecBot added the ci-barretenberg (Run all barretenberg/cpp checks.) and claudebox (Owned by claudebox. it can push to this PR.) labels May 30, 2026

AztecBot added 4 commits May 30, 2026 05:27

AztecBot changed the title ~~perf(bb/msm): MsmV2 memory-budget batch lever + corrected peak-memory map~~ perf(bb/msm): MsmV2 peak-memory map (corrected) + measured numBatches memory/time trade May 30, 2026

AztecBot added 3 commits May 30, 2026 06:08

AztecBot changed the title ~~perf(bb/msm): MsmV2 peak-memory map (corrected) + measured numBatches memory/time trade~~ perf(bb/msm): MsmV2 memory-budget review — corrected map + multi-batch path is incorrect (nb>1) May 30, 2026

AztecBot added 3 commits May 30, 2026 06:53

docs(bb/msm): record multi-batch correctness fix + SwiftShader cross-check

e997c5b

AztecBot changed the title ~~perf(bb/msm): MsmV2 memory-budget review — corrected map + multi-batch path is incorrect (nb>1)~~ fix(bb/msm): correct MsmV2 multi-batch (numBatches>1) + honest peak-memory map May 30, 2026

docs(bb/msm): add logn16 cross-check — nb=2 now correct at the old failure size

db42da1

AztecBot wants to merge 12 commits into stream-walker-impl from cb/msm-v2-mem-batchbudget-9f2e
