chore: fix test flakes #247

Merged
sjmiller609 merged 3 commits into main from codex/eliminate-linux-test-flakes
May 30, 2026

Conversation

@sjmiller609
Collaborator

@sjmiller609 sjmiller609 commented May 29, 2026

Summary

  • evict pooled guest gRPC connections after retryable no-wait exec failures
  • harden build result vsock dialing against nil dialers
  • prevent restart-policy stable-window resets from racing in-flight health-check restarts
  • make Firecracker fork disk-usage assertion measure workspace usage instead of global free space
  • treat nginx startup log waits as diagnostics in basic end-to-end tests; keep ingress HTTP probes as the authoritative behavior check
  • wait for restored network-disabled running-fork sources to regain guest-agent readiness, and widen the warm-fork-chain readiness budget
  • verify preserved Cloud Hypervisor binary caches match the host architecture before embedding them
  • trim the test-agent notes to keep only the useful summary and validation signal

Validation

  • targeted Linux loops passed for guest exec readiness, build result handling, restart policy, Firecracker fork isolation, basic end-to-end tests, Cloud Hypervisor warm fork chain, QEMU running-network fork, and lib/vmm TestMultipleVersions
  • full Linux suite passed after the original flake fixes
  • fresh PR CI on 4a5cc98 passed 4 times: Linux test, test-darwin, and e2e-install all green on the push run plus three reruns
  • latest Test workflow attempt: Linux test 4m17s, test-darwin 22s, e2e-install 30s

Deft notes

  • disk and RAM looked healthy in the available evidence; CPU load was high during testing
  • swap was full on deft-kernel-dev during investigation, but the observed failures had concrete readiness/cache signatures
  • no manual cleanup was performed after your notes-only push; the branch now hardens the CH cache check in CI instead

Note

Medium Risk
Changes guest connection pooling, fork restore readiness, and restart-policy reset timing in production paths; scope is defensive and test-focused but affects instance lifecycle behavior.

Overview
This PR hardens test and CI reliability around guest connectivity, fork lifecycles, restart policy, and embedded Cloud Hypervisor binaries.

Guest exec now drops pooled gRPC connections after retryable no-wait failures, matching the wait-and-retry path so stale vsock connections are not reused.
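
The eviction behavior described above can be sketched as follows. This is a minimal illustration, not the PR's code: the `conn` and `pool` types, the `errRetryable` sentinel, and `execNoWait` are hypothetical names, and a real pool would hold gRPC client connections dialed over vsock.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// conn stands in for a pooled guest gRPC connection (hypothetical type).
type conn struct {
	id     int
	closed bool
}

func (c *conn) Close() { c.closed = true }

// errRetryable marks failures after which the connection may be stale
// (assumed sentinel; the real classification is not shown in the PR text).
var errRetryable = errors.New("retryable exec failure")

// pool caches one connection per instance key.
type pool struct {
	mu     sync.Mutex
	conns  map[string]*conn
	nextID int
}

func newPool() *pool { return &pool{conns: make(map[string]*conn)} }

// get returns the cached connection for key, creating one if absent.
func (p *pool) get(key string) *conn {
	p.mu.Lock()
	defer p.mu.Unlock()
	if c, ok := p.conns[key]; ok {
		return c
	}
	p.nextID++
	c := &conn{id: p.nextID}
	p.conns[key] = c
	return c
}

// evict closes and forgets the cached connection for key, if any.
func (p *pool) evict(key string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if c, ok := p.conns[key]; ok {
		c.Close()
		delete(p.conns, key)
	}
}

// execNoWait runs fn once on the pooled connection without retrying.
// On a retryable error it evicts the connection so that the caller's
// next attempt gets a fresh one instead of the possibly stale one.
func (p *pool) execNoWait(key string, fn func(*conn) error) error {
	err := fn(p.get(key))
	if errors.Is(err, errRetryable) {
		p.evict(key)
	}
	return err
}

func main() {
	p := newPool()
	first := p.get("vm-1")
	_ = p.execNoWait("vm-1", func(*conn) error { return errRetryable })
	second := p.get("vm-1")
	fmt.Println("first closed:", first.closed, "fresh conn:", first.id != second.id)
}
```

Non-retryable errors leave the pooled connection in place in this sketch; whether the real code does the same is not stated in the PR.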

Builds treat a nil vsock dialer as an explicit error while waiting for the builder agent. Running forks wait for the restored source guest agent even when networking is disabled, and fork readiness checks no longer skip network-disabled instances. Warm fork chain tests use a 90s readiness budget.
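
A minimal sketch of the nil-dialer guard, under stated assumptions: the `dialer` interface, `errNilDialer`, `dialBuilder`, and the `pipeDialer` test double are hypothetical names for illustration, and the real dialer would open a vsock connection rather than an in-memory pipe.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// dialer is a hypothetical stand-in for the vsock dialer interface.
type dialer interface {
	Dial() (net.Conn, error)
}

// errNilDialer is returned when the lookup yields no dialer and no error.
var errNilDialer = errors.New("vsock dialer is nil")

// dialBuilder looks up a dialer and connects to the builder agent.
// A nil dialer with a nil error is reported as an explicit error
// instead of being dereferenced.
func dialBuilder(lookup func() (dialer, error)) (net.Conn, error) {
	d, err := lookup()
	if err != nil {
		return nil, fmt.Errorf("get vsock dialer: %w", err)
	}
	if d == nil {
		return nil, errNilDialer
	}
	return d.Dial()
}

// pipeDialer is a test double that returns one end of an in-memory pipe.
type pipeDialer struct{}

func (pipeDialer) Dial() (net.Conn, error) {
	c, _ := net.Pipe()
	return c, nil
}

func main() {
	_, err := dialBuilder(func() (dialer, error) { return nil, nil })
	fmt.Println("nil dialer:", err)

	c, err := dialBuilder(func() (dialer, error) { return pipeDialer{}, nil })
	if err == nil {
		defer c.Close()
	}
	fmt.Println("working dialer error:", err)
}
```

Note that `d == nil` only catches a nil interface value; a typed nil pointer stored in the interface would need a separate check.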

Restart policy only resets attempt counters after stability measured from the later of instance start and the latest restart attempt, avoiding races with in-flight health-check restarts.
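
The anchoring rule can be illustrated with a small sketch. The function names, the reference time `t0`, and the 5-minute `stableWindow` are assumptions for the example; the PR text does not state the actual window or identifiers.

```go
package main

import (
	"fmt"
	"time"
)

// stableWindow is an assumed value; the PR does not state the real one.
const stableWindow = 5 * time.Minute

// t0 is an arbitrary reference time used by the example.
var t0 = time.Unix(0, 0)

// stableSince returns the later of the instance start time and the
// latest restart attempt, which is where the stable window begins.
func stableSince(startedAt, lastAttemptAt time.Time) time.Time {
	if lastAttemptAt.After(startedAt) {
		return lastAttemptAt
	}
	return startedAt
}

// shouldResetAttempts reports whether the instance has been stable for
// the full window, so the restart-attempt counter may be cleared.
func shouldResetAttempts(now, startedAt, lastAttemptAt time.Time) bool {
	return now.Sub(stableSince(startedAt, lastAttemptAt)) >= stableWindow
}

func main() {
	started := t0
	lastAttempt := t0.Add(4 * time.Minute) // a health-check restart in flight
	now := t0.Add(6 * time.Minute)

	// Measured from start alone this would be 6m (>= 5m) and reset the
	// counter; measured from the latest attempt it is only 2m.
	fmt.Println(shouldResetAttempts(now, started, lastAttempt))
	fmt.Println(shouldResetAttempts(now.Add(3*time.Minute), started, lastAttempt))
}
```

Anchoring to the later timestamp means a restart attempt always restarts the stability clock, which is why the counter cannot be cleared while a restart is still in progress.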

Firecracker fork disk assertions use workspace disk utilization totals instead of host-wide free-space deltas. Basic e2e tests treat nginx startup log waits as diagnostics; ingress HTTP probes remain the behavior check.
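
To show why a workspace-scoped measurement is more stable than a host-wide free-space delta, here is a rough sketch that sums file sizes under one directory. The PR does not say how workspace utilization is computed; `dirUsage` and `measureGrowth` are hypothetical helpers, and summing apparent sizes is a simplification (sparse or copy-on-write disk images would need allocated-block counts instead).

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// dirUsage sums the apparent sizes of regular files under root.
// Unlike a host-wide free-space reading, the result is unaffected by
// other processes writing elsewhere on the same filesystem.
func dirUsage(root string) (int64, error) {
	var total int64
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.Type().IsRegular() {
			return nil
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		total += info.Size()
		return nil
	})
	return total, err
}

// measureGrowth creates a scratch workspace, writes n bytes into it,
// and returns the workspace usage before and after the write.
func measureGrowth(n int) (before, after int64, err error) {
	root, err := os.MkdirTemp("", "workspace-*")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(root)

	if before, err = dirUsage(root); err != nil {
		return 0, 0, err
	}
	img := filepath.Join(root, "fork.img")
	if err = os.WriteFile(img, make([]byte, n), 0o600); err != nil {
		return 0, 0, err
	}
	after, err = dirUsage(root)
	return before, after, err
}

func main() {
	before, after, err := measureGrowth(4096)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("workspace grew by %d bytes\n", after-before)
}
```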

ensure-ch-binaries verifies cached Cloud Hypervisor binaries with the file command and refreshes wrong-architecture or corrupt caches preserved across CI runs.
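
The PR performs this check with the file command. As an alternative illustration only, the same idea can be sketched in Go by reading the ELF header with the standard `debug/elf` package. `machineForGOARCH` and `checkBinaryArch` are hypothetical names, and only the two architectures listed are mapped in this sketch.

```go
package main

import (
	"debug/elf"
	"fmt"
	"os"
	"runtime"
)

// machineForGOARCH maps a Go architecture name to its ELF machine type.
// Only amd64 and arm64 are covered here.
func machineForGOARCH(goarch string) (elf.Machine, bool) {
	switch goarch {
	case "amd64":
		return elf.EM_X86_64, true
	case "arm64":
		return elf.EM_AARCH64, true
	}
	return elf.EM_NONE, false
}

// checkBinaryArch returns nil only if path is a readable ELF binary
// built for goarch. Any error means the cache entry should be refreshed.
func checkBinaryArch(path, goarch string) error {
	want, ok := machineForGOARCH(goarch)
	if !ok {
		return fmt.Errorf("unsupported architecture %q", goarch)
	}
	f, err := elf.Open(path)
	if err != nil {
		// Missing, truncated, or non-ELF files all land here.
		return fmt.Errorf("cache entry %s is not a valid ELF file: %w", path, err)
	}
	defer f.Close()
	if f.Machine != want {
		return fmt.Errorf("cache entry %s is %v, want %v", path, f.Machine, want)
	}
	return nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: checkarch <path-to-binary>")
		return
	}
	if err := checkBinaryArch(os.Args[1], runtime.GOARCH); err != nil {
		fmt.Println("refresh needed:", err)
		return
	}
	fmt.Println("cache ok")
}
```

A header check like this catches wrong-architecture and obviously corrupt files, but it does not verify the binary's contents; a checksum comparison would be needed for that.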

Reviewed by Cursor Bugbot for commit 4a5cc98. Bugbot is set up for automated code reviews on this repo.

@sjmiller609 sjmiller609 force-pushed the codex/eliminate-linux-test-flakes branch from fac58c9 to f17ed89 May 29, 2026 21:49
@sjmiller609 sjmiller609 force-pushed the codex/eliminate-linux-test-flakes branch from f17ed89 to 773ad49 May 30, 2026 18:43
@sjmiller609 sjmiller609 marked this pull request as ready for review May 30, 2026 18:50
@sjmiller609 sjmiller609 requested review from hiroTamada and rgarcia May 30, 2026 18:50
@sjmiller609 sjmiller609 changed the title from "Fix Linux test flakes" to "Fix test flakes" May 30, 2026
@sjmiller609 sjmiller609 changed the title from "Fix test flakes" to "Address test flakes" May 30, 2026
@sjmiller609 sjmiller609 changed the title from "Address test flakes" to "chore: fix test flakes" May 30, 2026
@firetiger-agent

Monitoring Plan: Vsock nil-guard, connection eviction, restart-policy fix

What this PR does: Fixes three latent runtime bugs in the hypervisor communication layer — prevents a potential nil-pointer panic when the vsock dialer is unavailable, ensures stale gRPC connections are evicted after a failed non-retrying exec, and corrects a race in the restart-policy controller that could prematurely clear the restart-attempt counter while a restart was still in flight.

Intended effect:

  • Nil-vsock panic prevention: baseline: potential nil-pointer crash when the dialer is nil and the error is also nil; confirmed if no "recovered from panic" logs appear post-deploy.
  • Stale connection eviction: baseline: retryable exec errors could persist across calls that reuse poisoned connections; confirmed if retryable exec-failure series resolve within one retry.
  • Restart-policy counter: baseline: kernel_hypeman_uptime_missing_total ~60/hr; confirmed if it stays at ~60/hr (no runaway restart-loop spike).

Risks:

  • Build wait errors increase: "build failed" log rate; alert if > 10/hr (baseline ~0–1/hr). The nil-dialer path now surfaces an error instead of retrying silently.
  • API 5xx rate regression: attributes.res.status >= 500 on API request logs; alert if the error rate is > 0.1% for 2+ consecutive hours (baseline 0.01–0.02%).
  • Restart attempts stall longer: "restart policy stable window reached" log; alert if its frequency drops to near zero when it was previously non-zero (the stable window is now anchored to LastAttemptAt, which makes it stricter).
  • Instance creation error spike: "failed to create instance" error log; alert if > 50K/hr sustained (baseline ~26K/hr average).

Status updates will be posted automatically on this PR as monitoring progresses.

@sjmiller609 sjmiller609 force-pushed the codex/eliminate-linux-test-flakes branch from abc499d to b62f483 May 30, 2026 19:31
@sjmiller609 sjmiller609 force-pushed the codex/eliminate-linux-test-flakes branch from b62f483 to 4a5cc98 May 30, 2026 19:57
@sjmiller609
Collaborator Author

4 re-runs worked

@sjmiller609 sjmiller609 merged commit 934d96c into main May 30, 2026
20 checks passed
@sjmiller609 sjmiller609 deleted the codex/eliminate-linux-test-flakes branch May 30, 2026 20:40