Skip to content

fix(bb/msm): correct MsmV2 multi-batch (numBatches>1) + honest peak-memory map#23733

Draft
AztecBot wants to merge 12 commits into
stream-walker-implfrom
cb/msm-v2-mem-batchbudget-9f2e
Draft

fix(bb/msm): correct MsmV2 multi-batch (numBatches>1) + honest peak-memory map#23733
AztecBot wants to merge 12 commits into
stream-walker-implfrom
cb/msm-v2-mem-batchbudget-9f2e

Conversation

@AztecBot
Copy link
Copy Markdown
Collaborator

@AztecBot AztecBot commented May 30, 2026

Axis B — MsmV2 peak GPU memory: the budget lever, now correct

Headline: the multi-batch path is now correct (was returning the wrong MSM)

Self-reviewing the numBatches memory lever uncovered that every nb>1
returned the wrong MSM
— and the same multi-batch code is the wgFits-forced
default at logn19 (nb=2) / logn20 (nb=3), so MsmV2 was very likely
incorrect by default at logn≥19. This PR roots out and fixes both bugs.

Cross-check vs @noble/curves (headless SwiftShader — correctness is
hardware-independent; nb forced via the new MsmConfig.forceNumBatches):

logn nb=1 nb=2 nb=3 nb=4
10
14
16

Previously every nb>1 disagreed (the prior revision's failure table was at
logn16, where nb=2 was ❌). Now all pass — and logn16 nb=2 passes at the
same statsBytes peak (47.4 → 39.1 MB, −17.5 %), confirming the fix changes no
buffer size. At nb=1, bucket_base=0 and the batch loop runs once, so the
default path is byte-identical to before
.

Root cause & fix

Two bugs, both nb=1-invisible (at nb=1 batchBuckets == bTotal, local == global):

  1. Missing global bucket offset. Each batch builds its CSR in the LOCAL
    window space [0, batchWindows) → LOCAL bucket space [0, batchBuckets), but
    the three kernels writing the full-bTotal bucketResult (ba_size1,
    ba_stream_walker, ba_walker_combine) indexed it by the local bucket, so
    every batch overwrote [0, batchBuckets) instead of its disjoint global slice.
    Fix: thread bucket_base = bi*batchBuckets into those kernels' destination
    index (ba_size1 .y, ba_stream_walker .wM_partials is now derived
    in-shader to free the slot, no new binding — ba_walker_combine .z), one
    bind group per batch (mirrors the existing per-batch decomposeBinds). The
    CSR / partials / linked-list spaces stay LOCAL; only the final write adds the
    base.
  2. Stale per-batch walker scratch. bucketHead (atomic linked-list heads),
    walkerNodeCounter (atomic node allocator), taskCuts, threadCuts were
    cleared once before the loop, so batch bi walked batch bi-1's stale
    heads or overflowed max_nodes and silently dropped its partials. Fix: clear
    per-batch; bucketResult and the 8 MB walkerPartials stay cleared once
    (disjoint accumulation / only fresh linked slots are read).

Memory: the lever is now a correct memory/time trade

statsBytes() is a pure sum of GPUBuffer.size, a closed form of
(n, c, numBatches) — the fix changes no buffer size, so the prior
real-hardware A/B (macOS Sequoia · Chrome 148, logn16, median-of-5 GPU wall)
carries over verbatim, now on a correct path:

nb peak (statsBytes) macOS wall Δ time
1 47.4 MB 50.7 ms
2 39.1 MB 60.1 ms +18.5 %
3 36.6 MB 64.7 ms +27.6 %
5 34.1 MB 90.0 ms +77.5 %

Peak falls monotonically, wall rises, accelerating — diminishing returns past
nb≈2–3. So raising the default budget would still regress the common path:
the no-op default (MEM_BUDGET 248 MB) is right. Deterministic peak across
sizes (script-verified by mem-accounting.mjs): logn17 68.8 MB total / logn20
229.7 MB, fully-batched floor 38.7 / 168.1 MB. The fixed floor (scalarsRaw
32 MB @logn20, bucketResult, redBuf, walkerPartials, SRS) batching cannot
reach.

(Per-nb real-hardware re-time queued on BrowserStack — both seats were saturated
at write time. nb=1 wall is unchanged by construction since that path is
byte-identical; the SwiftShader cross-check proves correctness independent of
hardware.)

Conclusion

The host-buffer-management lever is a correct but steep memory/time trade.
Pushing peak below the batching floor needs WGSL-level levers (in-place
bucket reduction −17 MB; on-GPU SRS y-recovery −32 MB via the existing
decompress_g1_bn254; per-batch scalar byte-slicing) — each a separate verified
change, flagged in MSM_V2_MEMORY.md.

Files

  • …/wgsl/cuzk/ba_size1, ba_stream_walker, ba_walker_combine .template.wgsl (+ regenerated _generated/shaders.ts) — per-batch bucket_base.
  • …/msm_webgpu/msm_v2.ts — per-batch writer bind groups; per-batch scratch clears; forceNumBatches knob; honest estimateMem; batchCount getter.
  • …/msm_webgpu/MSM_V2_MEMORY.md — corrected map + correctness fix + cross-check.
  • …/dev/msm-webgpu/main.ts?nb= / ?nbs= hooks; msm-membudget-sweep per-nb cross-check.
  • …/dev/msm-webgpu/scripts/mem-accounting.mjs — deterministic accounting.

Created by claudebox · group: aztec

@AztecBot AztecBot added ci-barretenberg Run all barretenberg/cpp checks. claudebox Owned by claudebox. it can push to this PR. labels May 30, 2026
AztecBot added 4 commits May 30, 2026 05:27
…trade harness

Adds the instrumentation the self-review needs to turn the budget knob from
an unmeasured no-op into a measured memory/time trade:

- mem-accounting.mjs: replicates ensureScratch sizing exactly (no GPU) to
  regenerate the per-buffer peak map + numBatches lever curve. Corrects the
  prior table: it omitted reducePrefScratch (~5.6MB @logn20) and undercounted
  the bucket lists, so the fully-batched floor is ~104MB scratch / ~168MB
  total @logn20, not the previously-claimed 92.5/156.
- MsmV2.batchCount getter + dev-page ?membudget knob + msm-membudget-sweep
  autorun: rebuilds MsmV2 across budgets at fixed logN on a real device,
  times GPU wall ms per numBatches, and cross-checks correctness vs noble.
…er floor

Script-verified (mem-accounting.mjs) corrections to the peak-memory map:
- adds reducePrefScratch (~5.6MB @logn20) the prior table omitted
- fully-batched floor @logn20 is ~104MB scratch / ~168MB total, not 92.5/156
- scalarsRawBuf (32MB) is the largest scratch buffer and is NOT batch-dependent
- documents that the ≤100MB-to-2^20 goal is unreachable by batching alone, and
  that the lever is a memory/time trade (decompose re-reads all n scalars per
  batch); time curve pending the real-hardware sweep
estimateMem omitted reducePrefScratch (~5.6MB @logn20), planMeta, and
streamPlannerMeta, so a caller's memBudgetBytes silently under-bounded the true
peak by ~6MB. Hoist the (data-independent) reduction MAXC above the budget
fit-check and fold those buffers into fixedScratch. Default MEM_BUDGET (248MB)
still dwarfs the ~172MB logn20 estimate, so the default numBatches is unchanged
(the fix only affects callers who set a tight budget). Also documents on
MsmConfig.memBudgetBytes that the lever is a measured memory/time trade.
…e HW)

macOS Sequoia / Chrome 148, logn16, median-of-5 GPU wall, noble cross-check
passed at every nb: nb1 47.4MB/50.7ms -> nb2 39.1MB/+18.5% -> nb5 34.1MB/+77.5%.
Monotonic memory drop, steep+accelerating time cost, diminishing returns past
nb2-3. Conclusion: the host-buffer memory lever is exhausted (no free
over-provisioning; batching is a deliberate trade), default no-op is correct,
remaining time-neutral cuts need WGSL changes.
@AztecBot AztecBot changed the title perf(bb/msm): MsmV2 memory-budget batch lever + corrected peak-memory map perf(bb/msm): MsmV2 peak-memory map (corrected) + measured numBatches memory/time trade May 30, 2026
AztecBot added 3 commits May 30, 2026 06:08
The membudget sweep only cross-checked the first (nb=1) result against noble,
so multi-batch (nb>1) correctness was never actually verified — the whole point
of the lever is that nb>1 stays correct. Record a per-row crossOk for every
budget and fail the run if any nb disagrees.
- macOS cross-check is at nb=1 only (correct the earlier 'every nb' wording);
  nb>1 is the same kernels over disjoint window slices, all-nb sweep queued.
- Add Galaxy S25 Ultra (Adreno) trade row: same monotonic time/memory shape.
- Flag that the Android nb=1 MSM disagreed with noble (bases self-verified OK,
  so it's an Adreno MSM-compute issue, not this PR's buffer sizing) —
  pre-existing, independent of this change.
…orrect

Per-nb cross-check on real macOS hardware (Chrome 148, logn16): nb=1 matches
@noble/curves but nb=2,3,4,5,10 ALL disagree. This refutes the lever's premise
('no correctness change, same path as the logn20 default'). Root cause is host
side: planner/walker bind groups + params are built once with no per-batch
bucket offset, so the batchBuckets-vs-bTotal index spaces only coincide at nb=1.

Blast radius: the same multi-batch code is the wgFits-forced default at logn19
(nb=2) and logn20 (nb=3), so MsmV2 is very likely incorrect by default at
logn>=19 (never cross-checked there — noble too slow). Distinct from the
invisible bucket-0 issue in PR #23741.

Warns on MsmConfig.memBudgetBytes and documents the evidence + root-cause
direction in MSM_V2_MEMORY.md. Does not change runtime behavior (default budget
stays a no-op); the fix to the multi-batch path is required follow-up.
@AztecBot AztecBot changed the title perf(bb/msm): MsmV2 peak-memory map (corrected) + measured numBatches memory/time trade perf(bb/msm): MsmV2 memory-budget review — corrected map + multi-batch path is incorrect (nb>1) May 30, 2026
AztecBot added 3 commits May 30, 2026 06:53
decompose applies batch_window_base only to the scalar-bit read (w_global), but
writes local-window-indexed output (idx = w*input_size+p); the bucketResult-
writing kernels (ba_size1/ba_stream_walker/ba_walker_combine) never re-apply a
batch_window_base*BW offset, so nb>1 batches overwrite the low batchBuckets
region of bucketResult instead of filling disjoint slices. Records the fix
direction.
The numBatches budget lever (and the wgFits-forced default at logn19 nb=2 /
logn20 nb=3) returned the wrong MSM: every batch's LOCAL CSR was written to the
same [0, batchBuckets) region of bucketResult instead of its disjoint global
slice, and the per-batch walker scratch (bucketHead/walkerNodeCounter/taskCuts/
threadCuts) was reset only once before the loop, so batch>0 either walked
batch0's stale linked-list heads or overflowed max_nodes and dropped its
partials.

Fix:
- Thread a per-batch bucket_base = bi*batchBuckets into the three bucket-writing
  kernels (ba_size1 .y, ba_stream_walker .w, ba_walker_combine .z). The CSR /
  partials / linked-list index spaces stay LOCAL; only the final bucket_sums
  write adds the base. ba_stream_walker derives M_partials in-shader
  (2*NUM_THREADS*S) to free params.w with no extra binding.
- Build one size1/walker/combine bind group per batch (bucket_base uniform),
  matching the existing per-batch decomposeBinds.
- Move the per-batch scratch clears inside the batch loop; keep bucketResult and
  the 8 MB walkerPartials cleared once (disjoint accumulation / only fresh
  linked slots are read).
- Add MsmConfig.forceNumBatches (+ ?nb= / ?nbs= dev-page hooks) to pin nb for
  deterministic cross-checks.

At nb=1 bucket_base=0 and the loop runs once, so the default path is
byte-identical to before.

Cross-checked vs @noble/curves (headless SwiftShader): nb=1..4 @logn10 and
nb=1..3 @logn14 all agree (previously every nb>1 disagreed). Updates the
MsmConfig docs; MSM_V2_MEMORY.md updated separately.
@AztecBot AztecBot changed the title perf(bb/msm): MsmV2 memory-budget review — corrected map + multi-batch path is incorrect (nb>1) fix(bb/msm): correct MsmV2 multi-batch (numBatches>1) + honest peak-memory map May 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-barretenberg Run all barretenberg/cpp checks. claudebox Owned by claudebox. it can push to this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant