feat(prover): Build the LogUp aux trace on the GPU by ColoCarletti · Pull Request #762 · yetanotherco/lambda_vm

ColoCarletti · 2026-07-01T20:34:11Z

Description:
Moves the LogUp auxiliary trace build (Round 1) from the CPU to the GPU, and keeps the data on the GPU instead of copying it back and forth.

What changed:

The aux trace columns are now built on the GPU and stay there. The aux LDE reads them in place, so there's no download + re-upload.
The aux build reuses the main trace that's already on the GPU (from the main LDE) instead of uploading it again.

Result (ethrex 5-tx, RTX 5090): aux build drops from ~1.74s to ~1.33s. The proof is byte-for-byte identical to the CPU version and all tests pass.

Note: this is a building block. On its own the win is small, because the stages around it (trace generation, constraint evaluation) are still on the CPU and force data across the bus. The bigger speedup comes when those move to the GPU too and the whole chain stays on the device — this PR adds the plumbing (resident aux trace, main-trace reuse, device-input LDE) they build on.

Known limit: preprocessed tables still upload their main trace, since they have no GPU copy to reuse.

…nto gpu_integration

MauroToscano · 2026-07-01T20:34:54Z

/bench-gpu

MauroToscano · 2026-07-01T20:34:58Z

/review-ai

github-actions · 2026-07-01T20:35:02Z

GPU Benchmark (ABBA) — `7ae0382314` vs `main` (14 pairs)

_{RTX 5090 · Vast.ai datacenter @ $0.8680555555555556/hr · prover/cuda · drift-free A/B/B/A}

=== ABBA paired result  (improvement: + = PR faster) ===
  pairs: 14   mean A (PR): 21.853s   mean B (base): 22.640s

  [parametric] paired-t   mean +3.46%   sd 2.06%   se 0.55%
               95% CI: [+2.27%, +4.66%]   (t df=13 = 2.16)
  [robust]     median +2.91%   Wilcoxon W+=105 W-=0  p(exact)=0.0001  (z=+3.26)

  --- server stability (this run; compare across servers) ---
  run-to-run jitter:    A CV 1.85%   B CV 1.12%        (lower = steadier)
  within-session drift: +0.09% over the run, 1st->2nd half +0.15%
    (jitter -> Tier-1 cached gate floor; drift -> whether the cached baseline can be trusted)

  VERDICT: REAL IMPROVEMENT - PR faster by ~3.46% (t-CI and Wilcoxon agree)

  raw pairs: /tmp/abba_run/pairs.csv

_{+ = PR faster. Trust the verdict when paired-t and Wilcoxon agree.}

github-actions · 2026-07-01T20:37:48Z

Codex Code Review

Found one issue.

High - Resident LogUp buffers bypass the VRAM budget and can exhaust the GPU.
The PR adds GpuLdeBase::trace_dev, an extra n * main_cols * 8 device buffer retained from each main LDE, but estimate_table_vram_bytes() still budgets only LDE columns plus tree/scratch. Those handles are collected for all tables in main_gpu_handles and kept into later phases, so the chunk planner can admit work that fits the old estimate while exceeding real VRAM. The aux build also runs over air_trace_pairs.par_iter_mut() and may build resident LogUp buffers for all tables concurrently, outside plan_table_chunks(). This is a concrete CUDA OOM risk on large multi-table proofs, especially because the resident aux path then hard-requires the device aux LDE. Account for trace_dev and resident aux build buffers in the budget, chunk the aux build GPU work, and drop/clear trace_dev immediately after aux build no longer needs it. References: prover.rs, prover.rs, prover.rs, lde.rs.

claude · 2026-07-01T20:40:42Z

        blowup_factor,
        weights,
        "coset_lde_row_major lde_size",
+        true,


Medium (device memory / hot-path work): retain_trace_col_major is hard-coded true here, so every GPU main-trace LDE now allocates an extra column-major snapshot (n * total_cols u64s, ~half of handle.buf) and runs an extra launch_row_to_col_major transpose — on every proof, regardless of whether the resident aux build will ever consume it.

The snapshot lives inside the GpuLdeBase handle held in main_gpu_handles for the whole aux-build window (the peak-memory phase), but it is only read by the resident aux path, which is disabled when:

disk-spill + StorageMode::Disk (resident_aux_ok = false) — and this is precisely the memory-constrained mode where the extra resident buffer hurts most;

debug-checks (the whole resident/main_trace_dev threading is cfg'd out, yet the buffer is still allocated and retained here);

non-Goldilocks / preprocessed tables that fall back.

Consider threading the retain flag through so it's only true when the resident aux build can actually use it (i.e. gate on the same not(debug-checks) / non-disk conditions), rather than paying the transpose + extra residency unconditionally.

claude · 2026-07-01T20:41:01Z

Review: Build the LogUp aux trace on the GPU

Reviewed the full diff (the PR files aren't checked out locally, so this is diff-based plus the base helpers it builds on). Overall this is a careful, well-structured change: the CUDA kernels mirror the CPU evaluator closely, the parity/--ignored GPU tests pin them together, and the fall-back logic (non-Goldilocks, below GPU_LOGUP_MIN_ROWS, LAMBDA_VM_NO_GPU_LOGUP, disk-spill, debug-checks) is consistent. I verified the pieces most likely to break:

Fingerprint / alpha indexing — alpha_powers_len = max(num_bus_elements()) where num_bus_elements() includes the +1 for the α⁰ bus-id term, so elem_alpha_idx (offsets 1..B) never over-indexes alpha_powers. ✔
Accumulated column — logup_row_sum_ext3 sums all num_out columns (incl. the virtual one), the additive Hillis-Steele scan is correctly double-buffered, and finalize_accum computes acc[i] = S[i] - (i+1)·(L/N), matching the CPU build_accumulated_column. The virtual column is correctly excluded from the assembled aux buffer while still feeding the accumulator. ✔
Signed multiplicity / i64_to_canonical — negation edge cases are safe because |c| < 2^63 < p, so unsigned_abs() % p is never 0 for negative c. ✔
Resident-main reuse — guarded by rows == trace_len && buf.len() == num_cols * trace_len; the snapshot and main_segment_cols both derive from the same table's main columns in the same index order, so length equality is a sound guard (mismatched-count tables fall back). ✔
Resident aux → commit hand-off — when the resident build fires, the host aux trace is intentionally left empty and a device-LDE failure is a hard ProvingError (not a silent fall-through committing a zero aux trace). ✔

Findings

1. Medium — unconditional trace snapshot retention (device memory + hot-path transpose).
coset_lde_row_major_with_merkle_tree_keep passes retain_trace_col_major: true unconditionally, so every GPU main LDE allocates an extra n * total_cols column-major buffer and runs an extra transpose, even when the resident aux path can't consume it (disk-spill StorageMode::Disk, debug-checks, preprocessed/non-Goldilocks fallback). This is most costly under disk-spill — the mode chosen specifically to save memory. Consider gating retention on the conditions where the resident build is actually enabled. (Inline comment on lde.rs.)

2. Low — dropped safety debug_assert in the new transmute.
try_expand_leaf_and_tree_ext3_row_major_keep_dev reinterprets Vec<u64> → Vec<FieldElement<E>> via Vec::from_raw_parts(.., len/3, capacity/3). The sibling try_expand_leaf_and_tree_ext3_row_major_keep guards this with a debug_assert!(v.len() % 3 == 0 && v.capacity() % 3 == 0, ...); the new function omits it. A non-multiple-of-3 capacity would be UB on drop. Worth mirroring the assert for parity.

Nothing blocking. Nice work on the parity tests and the resident/fallback separation.

JuArce · 2026-07-01T20:51:00Z

/bench-gpu

github-actions · 2026-07-01T21:06:41Z

AI Review

PR #762 · 17 changed files

Findings

Status	Sev	Location	Finding	Found by
candidate	low	crypto/math-cuda/src/logup.rs:410	Underflow in row_sum slice for num_rows == 0	minimax minimax/MiniMax-M3
candidate	low	prover/src/instruments.rs	Instrument labels read "(CPU)" but the GPU-resident path also runs these stages	minimax minimax/MiniMax-M3

Status column reflects the verdict from the verifier: deepseek-verifier (openrouter/deepseek/deepseek-v4-pro).

AI-001: Underflow in row_sum slice for num_rows == 0

Status: candidate
Severity: low
Location: crypto/math-cuda/src/logup.rs:410
Found by: minimax:minimax/MiniMax-M3
Verified by: -
Rejected by: -

Claim

When logup_aux_resident is invoked with num_rows == 0, row_sum.slice((num_rows - 1) * 3..num_rows * 3) underflows on num_rows - 1 and panics. Public callers (try_build_aux_resident_gpu) guard with GPU_LOGUP_MIN_ROWS = 1024 so the path is safe in practice, but a direct caller of logup_aux_resident with num_rows == 0 would panic.

Evidence

Line 410: let l_host: Vec<u64> = stream.clone_dtoh(&row_sum.slice((num_rows - 1) * 3..num_rows * 3))?; — num_rows - 1 is usize, and the previous (num_rows - 1) * 3 would underflow at num_rows == 0. The kernels above it early-return on empty grids and the scan_add_inplace short-circuits on n <= 1, but the slice expression is not guarded.

Suggested fix

Add an early return before the scan/finalize when num_rows == 0 (similar to logup_term_columns returning Ok(Vec::new()) at line 146): allocate an empty ResidentAux { num_aux_cols: 1, num_rows: 0, table_contribution: [0;3], buf: ... } or guard the slice with if num_rows == 0 { return Ok(ResidentAux { table_contribution: [0;3], .. }) }.

AI-002: Instrument labels read "(CPU)" but the GPU-resident path also runs these stages

Status: candidate
Severity: low
Location: prover/src/instruments.rs
Found by: minimax:minimax/MiniMax-M3
Verified by: -
Rejected by: -

Claim

The new report rows are labelled "LogUp fingerprint (CPU)", "LogUp batch invert (CPU)", "LogUp term combine (CPU)", "LogUp accumulate scan (CPU)", but when the GPU-resident aux build fires these all show 0 (the host counter only accumulates from the CPU process_chunk and build_accumulated_column_from_terms). A reader could be confused that GPU time is missing.

Evidence

accum_aux_term / accum_aux_accumulate are only called from the CPU paths in lookup.rs (lines 1615, 1677). The resident aux build in logup_gpu.rs / lookup.rs returns early and never calls these accumulators.

Suggested fix

Either drop the "(CPU)" suffix, or add a parallel timer on the GPU path that records host wall-time per phase and accumulates into the same fields. Lower-priority cosmetic; the comment near the resident build (timing flag + eprintln) already gives per-phase ms when LAMBDA_VM_LOGUP_TIMING is set.

Reviewer Lanes

Lane	Model	Prompt	Status	Findings
glm	openrouter/z-ai/glm-5.2	general	error: opencode failed (provider/auth/runtime error) and no findings were submitted	0
kimi	openrouter/moonshotai/kimi-k2.7-code	general	error: opencode failed (provider/auth/runtime error) and no findings were submitted	0
minimax	minimax/MiniMax-M3	general	success	2
moonmath	zro/minimax-m3	general	error: agentic lane timed out after 1800s	0
nemotron	openrouter/nvidia/nemotron-3-ultra-550b-a55b	general	error: opencode failed (provider/auth/runtime error) and no findings were submitted	0

Verification Lanes

Lane	Model	Status	Confirmed	Rejected	Uncertain
deepseek-verifier	openrouter/deepseek/deepseek-v4-pro	error: opencode failed (provider/auth/runtime error) and no verifications were submitted	0	0	0

Native Codex and Claude reviews run separately and post their own comments. They are not included in this structured provenance report.

Raw lane outputs, candidates, final issues, and model metrics are uploaded as workflow artifacts.

…o gpu_logup_aux

ColoCarletti and others added 25 commits June 29, 2026 15:08

spike

5f369f0

resident-data scaffolding

0fa7f32

part 2

bdea00e

composition parts fold

e9dcda8

fix

b08ef81

fix_doc

8ea8a0a

move merkle tree to gpu

caa7efd

merkle tree

b811337

finish merkle

819ad01

fix

0fb8df6

rm unused functions

8ab633f

cleanup

fd31fc9

Merge branch 'main' into gpu_integration

0f53e65

fix

b5fbdef

fix

b37b649

Merge branch 'main' into gpu_integration

b81e63f

fix clippy

873491e

fix comments

e4ada64

Merge branch 'main' into gpu_integration

31d5524

fix

4ca240c

Merge branch 'gpu_integration' of github.com:yetanotherco/lambda_vm i…

45b2cfe

…nto gpu_integration

rm data transfer

ea411be

fmt

f0e0d38

refactor

b2596f1

Merge branch 'main' into gpu_logup_aux

7ae0382

ColoCarletti changed the title ~~feat(proveR): Build the LogUp aux trace on the GPU~~ feat(prover): Build the LogUp aux trace on the GPU Jul 1, 2026

claude Bot reviewed Jul 1, 2026

View reviewed changes

ColoCarletti and others added 4 commits July 2, 2026 15:36

Merge branch 'main' into gpu_logup_aux

0861b71

fmt

66c8d45

Merge branch 'gpu_logup_aux' of github.com:yetanotherco/lambda_vm int…

ae56284

…o gpu_logup_aux

Merge branch 'main' into gpu_logup_aux

91d7816

MauroToscano mentioned this pull request Jul 3, 2026

Review fixes for the GPU LogUp aux build (#762) #779

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(prover): Build the LogUp aux trace on the GPU#762

feat(prover): Build the LogUp aux trace on the GPU#762
ColoCarletti wants to merge 29 commits into
mainfrom
gpu_logup_aux

ColoCarletti commented Jul 1, 2026

Uh oh!

MauroToscano commented Jul 1, 2026

Uh oh!

MauroToscano commented Jul 1, 2026

Uh oh!

github-actions Bot commented Jul 1, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jul 1, 2026

Uh oh!

claude Bot Jul 1, 2026

Uh oh!

claude Bot commented Jul 1, 2026

Uh oh!

JuArce commented Jul 1, 2026

Uh oh!

github-actions Bot commented Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

ColoCarletti commented Jul 1, 2026

Uh oh!

MauroToscano commented Jul 1, 2026

Uh oh!

MauroToscano commented Jul 1, 2026

Uh oh!

github-actions Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

GPU Benchmark (ABBA) — 7ae0382314 vs main (14 pairs)

Uh oh!

github-actions Bot commented Jul 1, 2026

Codex Code Review

Uh oh!

claude Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot commented Jul 1, 2026

Review: Build the LogUp aux trace on the GPU

Findings

Uh oh!

JuArce commented Jul 1, 2026

Uh oh!

github-actions Bot commented Jul 1, 2026

AI Review

Findings

Reviewer Lanes

Verification Lanes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

github-actions Bot commented Jul 1, 2026 •

edited

Loading

GPU Benchmark (ABBA) — `7ae0382314` vs `main` (14 pairs)