Skip to content

ci(gpu-tests): review fixes — rc capture, ssh keepalive, log artifact, sed guard, doc nits#777

Merged
MauroToscano merged 1 commit into
ci_run_tests_gpufrom
fix/pr747-review-nits
Jul 3, 2026
Merged

ci(gpu-tests): review fixes — rc capture, ssh keepalive, log artifact, sed guard, doc nits#777
MauroToscano merged 1 commit into
ci_run_tests_gpufrom
fix/pr747-review-nits

Conversation

@MauroToscano

Copy link
Copy Markdown
Contributor

Review follow-ups for #747, targeting its branch. All small, behavior-preserving fixes; shellcheck and actionlint pass.

Fixes

.github/workflows/gpu-tests.yml

  • rc=$? after if wait_ready ...; fi was always 0 — bash gives a false if with no else exit status 0, so the rc=1 timeout / rc=2 image failure diagnostic in the provisioning warning never worked. Now captured in an else branch (verified empirically).
  • ssh keepalive on the test sessionConnectTimeout only covers connection setup; a box that wedged or dropped off the network mid-suite would hang the step until the 240-minute job timeout, stalling the merge queue. ServerAliveInterval=60 + ServerAliveCountMax=10 fails ~10 minutes after the box goes dark.
  • upload the full test log as an artifact (gpu-test-log, 14-day retention) — the workflow's own comment notes the UI truncates/rotates huge step logs; the artifact keeps the complete multi-hour output retrievable. Summary footer now points to it.

scripts/gpu_test.sh

  • Guard the cudarc-pin sed anchors — if crypto/math-cuda/Cargo.toml's cudarc features are ever renamed or reformatted, the sed would silently no-op, fallback-latest would stay active, and the driver-symbol panic the pin exists to prevent (cuDevSmResourceSplit etc.) would come back with a confusing signature. Now fails loudly.
  • Restore the mutated Cargo.toml on exit — CI re-checks-out before every run so it never noticed, but a manual run on a dev GPU box was left with a silently dirty tracked file that could be committed by accident.

Makefile

  • Add test-prover-debug to .PHONY (pre-existing omission; every other test target is listed).
  • test-cuda-integration comment said the test asserts R1/R2/R3 counters — it also asserts the R4 DEEP-composition and FRI-commit counters.
  • Scope the test-prover-comprehensive-cuda parity comment: CPU CI's merge-queue comprehensive job also runs test_recursion_execute, which has no GPU leg (see 'Not fixed here' below).

Docs

  • docs/roadmap.md: GPU FFT, GPU Merkle tree, and GPU FRI were still listed as Planned — all three are implemented (crypto/math-cuda/kernels/), dispatch-counted (crypto/stark/src/gpu_lde.rs), and CI-tested by this very PR. Now Done.
  • prove_elfs_tests.rs: cargo test --ignoredcargo test -- --ignored in the ignore comment.

Deliberately NOT fixed here (bigger than one-liners)

  1. Vast box leak window: teardown is if: always() in the same job and the cleanup label embeds run_attempt — if the GitHub runner itself dies (infra failure, 6h hard kill), the box leaks and nothing ever sweeps it. Needs a small scheduled sweeper workflow (also covers benchmark-gpu.yml, which has the same gap).
  2. Merge-queue status-check timeout: repo setting, not code — if it's still the 60-min default when gpu-tests becomes a required check, worst-case provisioning (~75 min) fails every queue entry. Check when flipping the required check on.
  3. Recursion on the GPU path: test_recursion_execute is never run with cuda enabled anywhere; adding it needs recursion ELF builds on the box and a test-time budget decision.

…, sed guard, doc nits

- gpu-tests.yml: capture wait_ready's exit code in an else branch (rc=$? after
  'if ...; fi' is always 0, so the timeout-vs-image-failure diagnostic never worked)
- gpu-tests.yml: add ServerAliveInterval/CountMax to the test ssh session so a box
  that goes dark mid-suite fails in ~10 min instead of eating the 240-min job budget
- gpu-tests.yml: upload the full test log as an artifact (the step log gets
  truncated in the UI for multi-hour runs, as the workflow itself notes)
- gpu_test.sh: guard the cudarc-pin sed anchors (a silent no-op after a math-cuda
  Cargo.toml refactor would resurrect the fallback-latest driver-symbol panic) and
  restore the mutated Cargo.toml on exit so manual runs don't leave the tree dirty
- Makefile: add test-prover-debug to .PHONY; correct the test-cuda-integration
  comment (the test asserts R1-R4 counters, not R1-R3); scope the comprehensive-cuda
  parity comment (CPU's merge-queue job also runs test_recursion_execute)
- docs/roadmap.md: GPU FFT / Merkle tree / FRI are implemented and CI-tested, not
  'Planned'
- prove_elfs_tests.rs: fix 'cargo test --ignored' -> 'cargo test -- --ignored' in
  the ignore comment
@MauroToscano MauroToscano merged commit 828000e into ci_run_tests_gpu Jul 3, 2026
6 checks passed
@MauroToscano MauroToscano deleted the fix/pr747-review-nits branch July 3, 2026 21:01
MauroToscano added a commit that referenced this pull request Jul 3, 2026
* ci: run tests on GPU server

* keep instance running for debug

* ci: require CUDA 13.1

* ci: test prover on CUDA

* ci: improve gpu failed tests summaries

* ci: test-threads=1

* remove temporary code

* fix: set cuda_max_good>=12.8

* comments

* apply code review

* dummy comment

* remove tmp code

* reduce to 48gb

* add retries

* rent server with cuda 13.1

* print nvidia-smi

* print machine info in a separated step

* remove temp code

* ci(gpu-tests): review fixes — rc capture, ssh keepalive, log artifact, sed guard, doc nits (#777)

- gpu-tests.yml: capture wait_ready's exit code in an else branch (rc=$? after
  'if ...; fi' is always 0, so the timeout-vs-image-failure diagnostic never worked)
- gpu-tests.yml: add ServerAliveInterval/CountMax to the test ssh session so a box
  that goes dark mid-suite fails in ~10 min instead of eating the 240-min job budget
- gpu-tests.yml: upload the full test log as an artifact (the step log gets
  truncated in the UI for multi-hour runs, as the workflow itself notes)
- gpu_test.sh: guard the cudarc-pin sed anchors (a silent no-op after a math-cuda
  Cargo.toml refactor would resurrect the fallback-latest driver-symbol panic) and
  restore the mutated Cargo.toml on exit so manual runs don't leave the tree dirty
- Makefile: add test-prover-debug to .PHONY; correct the test-cuda-integration
  comment (the test asserts R1-R4 counters, not R1-R3); scope the comprehensive-cuda
  parity comment (CPU's merge-queue job also runs test_recursion_execute)
- docs/roadmap.md: GPU FFT / Merkle tree / FRI are implemented and CI-tested, not
  'Planned'
- prove_elfs_tests.rs: fix 'cargo test --ignored' -> 'cargo test -- --ignored' in
  the ignore comment

---------

Co-authored-by: MauroFab <maurotoscano2@gmail.com>
Co-authored-by: Mauro Toscano <12560266+MauroToscano@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant