Skip to content

feat(prover): Build the LogUp aux trace on the GPU#762

Open
ColoCarletti wants to merge 29 commits into
mainfrom
gpu_logup_aux
Open

feat(prover): Build the LogUp aux trace on the GPU#762
ColoCarletti wants to merge 29 commits into
mainfrom
gpu_logup_aux

Conversation

@ColoCarletti

Copy link
Copy Markdown
Collaborator

Description:
Moves the LogUp auxiliary trace build (Round 1) from the CPU to the GPU, and keeps the data on the GPU instead of copying it back and forth.

What changed:

  • The aux trace columns are now built on the GPU and stay there. The aux LDE reads them in place, so there's no download + re-upload.
  • The aux build reuses the main trace that's already on the GPU (from the main LDE) instead of uploading it again.

Result (ethrex 5-tx, RTX 5090): aux build drops from ~1.74s to ~1.33s. The proof is byte-for-byte identical to the CPU version and all tests pass.

Note: this is a building block. On its own the win is small, because the stages around it (trace generation, constraint evaluation) are still on the CPU and force data across the bus. The bigger speedup comes when those move to the GPU too and the whole chain stays on the device — this PR adds the plumbing (resident aux trace, main-trace reuse, device-input LDE) they build on.

Known limit: preprocessed tables still upload their main trace, since they have no GPU copy to reuse.

@ColoCarletti ColoCarletti changed the title feat(proveR): Build the LogUp aux trace on the GPU feat(prover): Build the LogUp aux trace on the GPU Jul 1, 2026
@MauroToscano

Copy link
Copy Markdown
Contributor

/bench-gpu

@MauroToscano

Copy link
Copy Markdown
Contributor

/review-ai

@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown

GPU Benchmark (ABBA) — 7ae0382314 vs main (14 pairs)

RTX 5090 · Vast.ai datacenter @ $0.8680555555555556/hr · prover/cuda · drift-free A/B/B/A

=== ABBA paired result  (improvement: + = PR faster) ===
  pairs: 14   mean A (PR): 21.853s   mean B (base): 22.640s

  [parametric] paired-t   mean +3.46%   sd 2.06%   se 0.55%
               95% CI: [+2.27%, +4.66%]   (t df=13 = 2.16)
  [robust]     median +2.91%   Wilcoxon W+=105 W-=0  p(exact)=0.0001  (z=+3.26)

  --- server stability (this run; compare across servers) ---
  run-to-run jitter:    A CV 1.85%   B CV 1.12%        (lower = steadier)
  within-session drift: +0.09% over the run, 1st->2nd half +0.15%
    (jitter -> Tier-1 cached gate floor; drift -> whether the cached baseline can be trusted)

  VERDICT: REAL IMPROVEMENT - PR faster by ~3.46% (t-CI and Wilcoxon agree)

  raw pairs: /tmp/abba_run/pairs.csv

+ = PR faster. Trust the verdict when paired-t and Wilcoxon agree.

@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown

Codex Code Review

Found one issue.

High - Resident LogUp buffers bypass the VRAM budget and can exhaust the GPU.
The PR adds GpuLdeBase::trace_dev, an extra n * main_cols * 8 device buffer retained from each main LDE, but estimate_table_vram_bytes() still budgets only LDE columns plus tree/scratch. Those handles are collected for all tables in main_gpu_handles and kept into later phases, so the chunk planner can admit work that fits the old estimate while exceeding real VRAM. The aux build also runs over air_trace_pairs.par_iter_mut() and may build resident LogUp buffers for all tables concurrently, outside plan_table_chunks(). This is a concrete CUDA OOM risk on large multi-table proofs, especially because the resident aux path then hard-requires the device aux LDE. Account for trace_dev and resident aux build buffers in the budget, chunk the aux build GPU work, and drop/clear trace_dev immediately after aux build no longer needs it. References: prover.rs, prover.rs, prover.rs, lde.rs.

blowup_factor,
weights,
"coset_lde_row_major lde_size",
true,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium (device memory / hot-path work): retain_trace_col_major is hard-coded true here, so every GPU main-trace LDE now allocates an extra column-major snapshot (n * total_cols u64s, ~half of handle.buf) and runs an extra launch_row_to_col_major transpose — on every proof, regardless of whether the resident aux build will ever consume it.

The snapshot lives inside the GpuLdeBase handle held in main_gpu_handles for the whole aux-build window (the peak-memory phase), but it is only read by the resident aux path, which is disabled when:

  • disk-spill + StorageMode::Disk (resident_aux_ok = false) — and this is precisely the memory-constrained mode where the extra resident buffer hurts most;
  • debug-checks (the whole resident/main_trace_dev threading is cfg'd out, yet the buffer is still allocated and retained here);
  • non-Goldilocks / preprocessed tables that fall back.

Consider threading the retain flag through so it's only true when the resident aux build can actually use it (i.e. gate on the same not(debug-checks) / non-disk conditions), rather than paying the transpose + extra residency unconditionally.

@claude

claude Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Review: Build the LogUp aux trace on the GPU

Reviewed the full diff (the PR files aren't checked out locally, so this is diff-based plus the base helpers it builds on). Overall this is a careful, well-structured change: the CUDA kernels mirror the CPU evaluator closely, the parity/--ignored GPU tests pin them together, and the fall-back logic (non-Goldilocks, below GPU_LOGUP_MIN_ROWS, LAMBDA_VM_NO_GPU_LOGUP, disk-spill, debug-checks) is consistent. I verified the pieces most likely to break:

  • Fingerprint / alpha indexingalpha_powers_len = max(num_bus_elements()) where num_bus_elements() includes the +1 for the α⁰ bus-id term, so elem_alpha_idx (offsets 1..B) never over-indexes alpha_powers. ✔
  • Accumulated columnlogup_row_sum_ext3 sums all num_out columns (incl. the virtual one), the additive Hillis-Steele scan is correctly double-buffered, and finalize_accum computes acc[i] = S[i] - (i+1)·(L/N), matching the CPU build_accumulated_column. The virtual column is correctly excluded from the assembled aux buffer while still feeding the accumulator. ✔
  • Signed multiplicity / i64_to_canonical — negation edge cases are safe because |c| < 2^63 < p, so unsigned_abs() % p is never 0 for negative c. ✔
  • Resident-main reuse — guarded by rows == trace_len && buf.len() == num_cols * trace_len; the snapshot and main_segment_cols both derive from the same table's main columns in the same index order, so length equality is a sound guard (mismatched-count tables fall back). ✔
  • Resident aux → commit hand-off — when the resident build fires, the host aux trace is intentionally left empty and a device-LDE failure is a hard ProvingError (not a silent fall-through committing a zero aux trace). ✔

Findings

1. Medium — unconditional trace snapshot retention (device memory + hot-path transpose).
coset_lde_row_major_with_merkle_tree_keep passes retain_trace_col_major: true unconditionally, so every GPU main LDE allocates an extra n * total_cols column-major buffer and runs an extra transpose, even when the resident aux path can't consume it (disk-spill StorageMode::Disk, debug-checks, preprocessed/non-Goldilocks fallback). This is most costly under disk-spill — the mode chosen specifically to save memory. Consider gating retention on the conditions where the resident build is actually enabled. (Inline comment on lde.rs.)

2. Low — dropped safety debug_assert in the new transmute.
try_expand_leaf_and_tree_ext3_row_major_keep_dev reinterprets Vec<u64>Vec<FieldElement<E>> via Vec::from_raw_parts(.., len/3, capacity/3). The sibling try_expand_leaf_and_tree_ext3_row_major_keep guards this with a debug_assert!(v.len() % 3 == 0 && v.capacity() % 3 == 0, ...); the new function omits it. A non-multiple-of-3 capacity would be UB on drop. Worth mirroring the assert for parity.

Nothing blocking. Nice work on the parity tests and the resident/fallback separation.

@JuArce

JuArce commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

/bench-gpu

@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown

AI Review

PR #762 · 17 changed files

Findings

Status Sev Location Finding Found by
candidate low crypto/math-cuda/src/logup.rs:410 Underflow in row_sum slice for num_rows == 0 minimax
minimax/MiniMax-M3
candidate low prover/src/instruments.rs Instrument labels read "(CPU)" but the GPU-resident path also runs these stages minimax
minimax/MiniMax-M3

Status column reflects the verdict from the verifier: deepseek-verifier (openrouter/deepseek/deepseek-v4-pro).

AI-001: Underflow in row_sum slice for num_rows == 0
  • Status: candidate
  • Severity: low
  • Location: crypto/math-cuda/src/logup.rs:410
  • Found by: minimax:minimax/MiniMax-M3
  • Verified by: -
  • Rejected by: -

Claim

When logup_aux_resident is invoked with num_rows == 0, row_sum.slice((num_rows - 1) * 3..num_rows * 3) underflows on num_rows - 1 and panics. Public callers (try_build_aux_resident_gpu) guard with GPU_LOGUP_MIN_ROWS = 1024 so the path is safe in practice, but a direct caller of logup_aux_resident with num_rows == 0 would panic.

Evidence

Line 410: let l_host: Vec&lt;u64&gt; = stream.clone_dtoh(&amp;row_sum.slice((num_rows - 1) * 3..num_rows * 3))?;num_rows - 1 is usize, and the previous (num_rows - 1) * 3 would underflow at num_rows == 0. The kernels above it early-return on empty grids and the scan_add_inplace short-circuits on n &lt;= 1, but the slice expression is not guarded.

Suggested fix

Add an early return before the scan/finalize when num_rows == 0 (similar to logup_term_columns returning Ok(Vec::new()) at line 146): allocate an empty ResidentAux { num_aux_cols: 1, num_rows: 0, table_contribution: [0;3], buf: ... } or guard the slice with if num_rows == 0 { return Ok(ResidentAux { table_contribution: [0;3], .. }) }.

AI-002: Instrument labels read "(CPU)" but the GPU-resident path also runs these stages
  • Status: candidate
  • Severity: low
  • Location: prover/src/instruments.rs
  • Found by: minimax:minimax/MiniMax-M3
  • Verified by: -
  • Rejected by: -

Claim

The new report rows are labelled "LogUp fingerprint (CPU)", "LogUp batch invert (CPU)", "LogUp term combine (CPU)", "LogUp accumulate scan (CPU)", but when the GPU-resident aux build fires these all show 0 (the host counter only accumulates from the CPU process_chunk and build_accumulated_column_from_terms). A reader could be confused that GPU time is missing.

Evidence

accum_aux_term / accum_aux_accumulate are only called from the CPU paths in lookup.rs (lines 1615, 1677). The resident aux build in logup_gpu.rs / lookup.rs returns early and never calls these accumulators.

Suggested fix

Either drop the "(CPU)" suffix, or add a parallel timer on the GPU path that records host wall-time per phase and accumulates into the same fields. Lower-priority cosmetic; the comment near the resident build (timing flag + eprintln) already gives per-phase ms when LAMBDA_VM_LOGUP_TIMING is set.

Reviewer Lanes

Lane Model Prompt Status Findings
glm openrouter/z-ai/glm-5.2 general error: opencode failed (provider/auth/runtime error) and no findings were submitted 0
kimi openrouter/moonshotai/kimi-k2.7-code general error: opencode failed (provider/auth/runtime error) and no findings were submitted 0
minimax minimax/MiniMax-M3 general success 2
moonmath zro/minimax-m3 general error: agentic lane timed out after 1800s 0
nemotron openrouter/nvidia/nemotron-3-ultra-550b-a55b general error: opencode failed (provider/auth/runtime error) and no findings were submitted 0

Verification Lanes

Lane Model Status Confirmed Rejected Uncertain
deepseek-verifier openrouter/deepseek/deepseek-v4-pro error: opencode failed (provider/auth/runtime error) and no verifications were submitted 0 0 0

Native Codex and Claude reviews run separately and post their own comments. They are not included in this structured provenance report.

Raw lane outputs, candidates, final issues, and model metrics are uploaded as workflow artifacts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants