
Add mailbox resume network handoff #260

Merged
sjmiller609 merged 18 commits into main from hypeship/network-handoff-v2
Jun 1, 2026

Conversation

@sjmiller609
Collaborator

@sjmiller609 sjmiller609 commented Jun 1, 2026

Summary

  • adds a guest-agent ReconfigureNetwork RPC that applies restored network settings with netlink, with the existing shell-based path kept as a compatibility fallback
  • patches a resume-network mailbox payload into Firecracker snapshot memory before resume so the guest can reconfigure itself after VMGenID without requiring first post-resume host-to-guest RPC contact
  • keeps wait_for_network enabled by default for running forks; callers can set it to false to return immediately after resume while guest network apply continues asynchronously
  • waits for the default path with a small guest UDP stage=applied ack instead of a post-resume guest RPC
  • documents the behavior in lib/forkvm/README.md
  • isolates CI test network lock/lease files per run so linux tests do not collide on stale shared /tmp state

Tests

  • git diff --check
  • go test ./lib/guest ./lib/system/guest_agent ./lib/oapi -count=1
  • CI: linux test, test-darwin, e2e-install, semgrep, stainless preview, and socket all passed on the latest branch head

Notes

  • local ./lib/instances tests could not run in this checkout because the embedded VMM and guest-agent binaries are not present

Note

High Risk
Changes snapshot restore, fork readiness, guest networking, and guest-agent protocol on critical VM lifecycle paths; failures could leave forks unreachable or mis-addressed until fallback runs.

Overview
Adds a guest-initiated resume network handoff for networked Firecracker standby/running forks and restores: the host patches a JSON payload into a fixed mailbox in snapshot memory before resume, the guest-agent applies the new MAC/IP/route via netlink after a VMGenID resume signal, and the host waits for a UDP stage=applied ack before treating the fork as ready—falling back to host vsock ReconfigureNetwork (or the legacy shell ip path) when the mailbox cannot be armed, patched, or acknowledged in time.

Introduces a ReconfigureNetwork guest gRPC API, a shared lib/mailbox format, guest-agent resume watcher on Linux, and arms mailbox env on instance create/start. Fork/restore wiring uses prepareResumeNetworkHandoff / AfterResume, and fork return readiness skips redundant guest-agent polling when the standby→running path already completed guest network apply. Default kernel moves to Kernel_202605291 for VMGenID support.

CI/Makefile: per-run HYPEMAN_TEST_NETWORK_TMPDIR for Linux network test locks/leases, apt install timeouts, and small API/fork tests plus docs in lib/forkvm and lib/mailbox.

Reviewed by Cursor Bugbot for commit 02bda12.

@github-actions

github-actions Bot commented Jun 1, 2026

✱ Stainless preview builds for hypeman

No changes were made to the SDKs.


This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
If you push custom code to the preview branch, re-run this workflow to update the comment.
Last updated: 2026-06-01 18:07:22 UTC

@sjmiller609 sjmiller609 force-pushed the hypeship/network-handoff-v2 branch from dffc792 to 05bc363 Compare June 1, 2026 13:52
Comment thread .github/workflows/test.yml
Comment thread cmd/api/api/instances.go Outdated
Comment thread lib/forkvm/README.md Outdated
@sjmiller609 sjmiller609 marked this pull request as ready for review June 1, 2026 14:11
Comment thread lib/system/guest_agent/resume_network.go
Comment thread lib/instances/restore.go
@firetiger-agent

Monitoring Plan: Guest-Initiated VM Fork Network Handoff

What this PR does: Speeds up VM fork networking by having the guest apply its own new MAC/IP/gateway immediately on resume — using a pre-patched mailbox in snapshot memory — rather than waiting for a host-to-guest vsock call. Also bumps the default Firecracker kernel to Kernel_202605291 which adds VMGenID support.

Intended effect:

  • Guest network reconfigure errors: baseline 0/hr; confirmed if it remains 0 post-deploy — any occurrence of "failed to configure guest network after restore" is anomalous.
  • Fork/spawn completion rate: baseline 4–162 completions/hr (variable by active hour); confirmed if no sustained drop to 0 during active hours.
  • Mailbox activation log ("guest resume network mailbox applied"): baseline 0 (feature is new); confirmed if it starts appearing on forks of mailbox-enabled instances.

Risks:

  • UDP ACK timeout – if the guest never sends the ACK (e.g. old guest agent, UDP loss), fork returns an error after 2s; alert if any "failed to configure guest network after restore" errors appear.
  • New kernel regression – DefaultKernelVersion bumped to ch-6.12.8-kernel-3.0-202605291; alert if "failed to create instance" exceeds 16,000/hr for 2+ consecutive hours (pre-existing baseline: ~12,600–13,100/hr).
  • Mailbox patch corrupts snapshot memory – direct write to snapshot memory file before VM resume; alert if any "resume network mailbox" WARN logs appear at >10/hr sustained.
  • Fallback path regression – old exec-based network reconfigure now requires codes.Unimplemented gRPC response to trigger; alert if "failed to configure guest network after restore" rises from 0.

Status updates will be posted automatically on this PR as monitoring progresses.


Comment thread lib/instances/restore.go
Comment thread lib/instances/guest_resume_network.go
Comment thread lib/instances/restore.go Outdated
Comment thread lib/instances/restore.go Outdated
Comment thread lib/instances/restore.go Outdated
Comment thread lib/guest/guest.proto
Comment thread lib/guest/guest.proto
@sjmiller609 sjmiller609 requested review from hiroTamada and rgarcia June 1, 2026 15:37
Contributor

@rgarcia rgarcia left a comment


ran this as a stricter maintainability pass. i think the behavior is directionally right, but i’d consider tightening the structure before merging because this adds a new host/guest protocol to a sensitive restore path.

  • consider moving the mailbox protocol into a shared contract/codec package. right now the host constants/payload live in lib/instances/guest_resume_network.go, while the guest constants/payload live in lib/system/guest_agent/resume_network.go. a small shared package with the magic, offsets, payload type, marshal/patch helpers, and round-trip tests would make offset/payload drift much harder.

  • consider extracting the restore-side handoff into a dedicated object instead of keeping the mailbox patching, UDP waiter, fallback path, and async mode inline in restoreInstance. something like resumeNetworkHandoff.Prepare(...), AfterResume(...), and Close() would let restore read as orchestration again, and would keep the special cases out of the broader fork/restore flow.

  • the wait_for_network=false path might need a clearer ownership model. the README says the guest finishes the handoff asynchronously, but restore currently returns after logging that the mailbox was patched. if the guest never applies the payload, there is no bounded observer or fallback. maybe document this as explicitly best-effort, or keep a background/observable completion path so failures are visible.

  • consider extracting the guest-agent retry/wait loop in lib/guest/client.go. ReconfigureNetworkInInstance now duplicates most of the ExecIntoInstance retry logic, and it already differs in one important way: the no-wait reconfigure path does not close a bad pooled connection like exec does. a shared retry wrapper for guest RPCs would reduce drift.

  • worth thinking about whether the mailbox location can be made more direct. findGuestResumeNetworkMailbox mmaps and scans the whole snapshot memory file, then relies on a process-local token cache. that may be okay as a first cut, but for a hot restore path a persisted/discoverable offset would be simpler to reason about and avoids whole-memory scanning when the cache misses.
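To make the first and last suggestions concrete, here is a rough sketch of what a shared mailbox codec could look like: one place that owns the magic, the slot size, the payload type, and the patch/read helpers, so host and guest cannot drift apart. Everything here is hypothetical (the magic string, the 256-byte slot, the length-prefixed JSON layout, and the `Payload` fields); the PR's actual format is not shown in this thread. The `bytes.Index` scan mirrors the whole-memory search the review mentions, and a persisted offset would replace it.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/json"
	"errors"
	"fmt"
)

// Hypothetical layout: magic | 4-byte little-endian length | JSON payload.
const (
	magic    = "RESUMEMBOX1"
	slotSize = 256 // total bytes reserved for the mailbox
	maxBody  = slotSize - len(magic) - 4
)

// Payload is an assumed shape for the restored network settings.
type Payload struct {
	MAC     string `json:"mac"`
	IP      string `json:"ip"`
	Gateway string `json:"gateway"`
}

// locate scans mem for the magic and returns the offset just past it.
func locate(mem []byte) (int, error) {
	off := bytes.Index(mem, []byte(magic))
	if off < 0 || off+slotSize > len(mem) {
		return 0, errors.New("mailbox not found")
	}
	return off + len(magic), nil
}

// Patch writes p into the mailbox in place (host side).
func Patch(mem []byte, p Payload) error {
	start, err := locate(mem)
	if err != nil {
		return err
	}
	body, err := json.Marshal(p)
	if err != nil {
		return err
	}
	if len(body) > maxBody {
		return errors.New("payload too large for mailbox")
	}
	binary.LittleEndian.PutUint32(mem[start:], uint32(len(body)))
	copy(mem[start+4:], body)
	return nil
}

// Read decodes the payload from the mailbox (guest side).
func Read(mem []byte) (Payload, error) {
	var p Payload
	start, err := locate(mem)
	if err != nil {
		return p, err
	}
	n := int(binary.LittleEndian.Uint32(mem[start:]))
	if n == 0 || n > maxBody {
		return p, errors.New("mailbox empty or corrupt")
	}
	err = json.Unmarshal(mem[start+4:start+4+n], &p)
	return p, err
}

func main() {
	mem := make([]byte, 4096) // stand-in for snapshot memory
	copy(mem[1024:], magic)   // the guest reserved a mailbox here

	want := Payload{MAC: "02:00:00:00:00:01", IP: "10.0.0.2/24", Gateway: "10.0.0.1"}
	if err := Patch(mem, want); err != nil {
		panic(err)
	}
	got, err := Read(mem)
	fmt.Println(got == want, err)
}
```

With both sides importing the same package, a round-trip test like the one in `main` catches offset or payload drift at build time rather than on a restore.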

not trying to push for a large rewrite of the feature, but i do think the protocol/restore orchestration boundaries are worth cleaning up now while the shape is still fresh.

Comment thread lib/instances/resume_network_handoff.go
@sjmiller609 sjmiller609 requested a review from rgarcia June 1, 2026 17:54
Comment thread lib/instances/resume_network_handoff.go
@sjmiller609
Collaborator Author

addressed the structure pass:

  • moved the shared mailbox contract into lib/mailbox with shared constants/payload codec/tests
  • added lib/mailbox/README.md explaining the behavior
  • extracted restore-side mailbox/wait/fallback handling into a dedicated handoff helper
  • removed wait_for_network from the API; running forks now always wait for guest network readiness (yagni for now, especially since upcoming PR will have a way to add more data to mailbox)
  • aligned ReconfigureNetworkInInstance with exec behavior by closing stale pooled guest connections on retryable errors
  • added a guest.resume_network.mailbox_patch span so we can see mailbox patch/scan cost, instead of addressing possible performance issues now

@rgarcia
Contributor

rgarcia commented Jun 1, 2026

thanks for the cleanup here. the shared mailbox package and extracted resumeNetworkHandoff address the main structure concerns from my earlier pass.

one remaining behavior concern:

  • guestInitiatedResumeNetworkMailbox(...) is now used as a generic readiness skip in applyForkTargetState / ForkInstance, but mailbox eligibility is broader than “this restore path already waited for the resume-network ack.” ensureGuestInitiatedResumeNetworkMailbox runs during create/start for Firecracker networked guests, so a stopped -> running fork can go through startInstance, come back Initializing, and then skip ensureGuestAgentReadyForForkPhase purely because the mailbox env/token are present. No VMGenID/mailbox handoff runs on that fresh boot path, so this can return before the guest agent is actually ready. Consider tying the skip to the specific restore handoff completing, or just keeping the guest-agent readiness wait in the generic fork return path.

minor doc cleanup: lib/forkvm/README.md still documents wait_for_network=false, but the current diff no longer exposes wait_for_network in the API/schema. worth removing that paragraph or re-adding the API field if it is still intended.

focused tests I ran locally:

go test ./lib/mailbox ./lib/guest ./lib/system/guest_agent ./lib/oapi -count=1


@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 431328a.

Comment thread lib/instances/fork.go Outdated
@sjmiller609
Collaborator Author

addressed the remaining behavior concern in 5967319: mailbox env/token is no longer treated as fork return readiness. fresh start/current-return paths still wait for guest-agent readiness; standby network restore can skip the extra probe only after restore succeeds via mailbox UDP ack or host reconfigure fallback. the stale wait_for_network docs were removed in 431328a.
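One way to picture the rule described here is to key the readiness skip on the outcome of the restore handoff rather than on whether a mailbox is armed. The sketch below is illustrative only; `handoffResult`, its values, and `needsAgentProbe` are hypothetical names, not identifiers from the PR.

```go
package main

import "fmt"

// handoffResult records how a restore completed guest network apply, if it
// did at all. The type and its values are hypothetical.
type handoffResult int

const (
	handoffNotRun       handoffResult = iota // fresh start or no restore handoff
	handoffMailboxAck                        // guest sent the UDP applied ack
	handoffHostFallback                      // host reconfigure fallback succeeded
)

// needsAgentProbe reports whether the fork return path must still wait for
// guest-agent readiness. Note that mailbox eligibility is deliberately not
// an input: only a completed handoff allows skipping the probe.
func needsAgentProbe(r handoffResult) bool {
	return r != handoffMailboxAck && r != handoffHostFallback
}

func main() {
	fmt.Println("fresh start:", needsAgentProbe(handoffNotRun))
	fmt.Println("mailbox ack:", needsAgentProbe(handoffMailboxAck))
	fmt.Println("host fallback:", needsAgentProbe(handoffHostFallback))
}
```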

@sjmiller609 sjmiller609 merged commit 6da67d7 into main Jun 1, 2026
11 checks passed
@sjmiller609 sjmiller609 deleted the hypeship/network-handoff-v2 branch June 1, 2026 19:18
sjmiller609 added a commit that referenced this pull request Jun 3, 2026
* Revert "Add mailbox resume network handoff (#260)"

This reverts commit 6da67d7.

* Preserve RPC network reconfigure path

* Preserve fork readiness de-duplication

* Preserve CI network de-flake setup