Add mailbox resume network handoff by sjmiller609 · Pull Request #260 · kernel/hypeman

sjmiller609 · 2026-06-01T11:49:33Z

Summary

adds a guest-agent ReconfigureNetwork RPC that applies restored network settings with netlink, with the existing shell-based path kept as a compatibility fallback
patches a resume-network mailbox payload into Firecracker snapshot memory before resume so the guest can reconfigure itself after the VMGenID signal without waiting for the first post-resume host-to-guest RPC
keeps wait_for_network enabled by default for running forks; callers can set it to false to return immediately after resume while guest network apply continues asynchronously
on the default path, waits for a small guest UDP stage=applied ack instead of a post-resume guest RPC
documents the behavior in lib/forkvm/README.md
isolates CI test network lock/lease files per run so linux tests do not collide on stale shared /tmp state
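
The RPC-first, shell-fallback ordering from the first item can be sketched as below. This is a minimal illustration, not the PR's code: `errUnimplemented` is a stand-in for a gRPC `codes.Unimplemented` status so the example needs only the standard library, and `NetworkConfig`, `reconfigure`, `viaRPC`, and `viaShell` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnimplemented stands in for a gRPC codes.Unimplemented status returned
// by an old guest agent that lacks the new RPC (assumption for this sketch).
var errUnimplemented = errors.New("rpc unimplemented")

// errGuestBusy is an example of an error that must NOT trigger the fallback.
var errGuestBusy = errors.New("guest busy")

// NetworkConfig is a hypothetical shape for the restored network settings.
type NetworkConfig struct {
	MAC, IP, Gateway string
}

// reconfigure tries the RPC path first and uses the shell path only when the
// guest reports that the RPC is not implemented. It returns the path taken.
func reconfigure(cfg NetworkConfig, viaRPC, viaShell func(NetworkConfig) error) (string, error) {
	err := viaRPC(cfg)
	if err == nil {
		return "rpc", nil
	}
	if !errors.Is(err, errUnimplemented) {
		// Any other failure is surfaced rather than masked by the fallback.
		return "", fmt.Errorf("reconfigure network: %w", err)
	}
	if err := viaShell(cfg); err != nil {
		return "", fmt.Errorf("shell fallback: %w", err)
	}
	return "shell", nil
}

func main() {
	cfg := NetworkConfig{MAC: "02:00:00:00:00:01", IP: "10.0.0.2/24", Gateway: "10.0.0.1"}
	oldAgent := func(NetworkConfig) error { return errUnimplemented }
	shell := func(NetworkConfig) error { return nil }
	path, err := reconfigure(cfg, oldAgent, shell)
	fmt.Println(path, err)
}
```

Restricting the fallback to the unimplemented case keeps real RPC failures visible instead of silently retrying them through the legacy path.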

Tests

git diff --check
go test ./lib/guest ./lib/system/guest_agent ./lib/oapi -count=1
CI: linux test, test-darwin, e2e-install, semgrep, stainless preview, and socket all passed on the latest branch head

Notes

local ./lib/instances tests could not run in this checkout because the embedded VMM and guest-agent binaries are not present

Note

High Risk
Changes snapshot restore, fork readiness, guest networking, and guest-agent protocol on critical VM lifecycle paths; failures could leave forks unreachable or mis-addressed until fallback runs.

Overview
Adds a guest-initiated resume network handoff for networked Firecracker standby/running forks and restores: the host patches a JSON payload into a fixed mailbox in snapshot memory before resume, the guest-agent applies the new MAC/IP/route via netlink after a VMGenID resume signal, and the host waits for a UDP stage=applied ack before treating the fork as ready—falling back to host vsock ReconfigureNetwork (or the legacy shell ip path) when the mailbox cannot be armed, patched, or acknowledged in time.
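
The host-side mailbox patch described above might look roughly like the following. Everything about the layout here is an assumption made for illustration (the magic value, the 4-byte little-endian length prefix, the 256-byte capacity, and the `Payload` field names); the actual lib/mailbox contract is not shown in this thread. The byte slice stands in for the mapped snapshot memory file.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/json"
	"errors"
	"fmt"
)

// Hypothetical mailbox layout: magic | uint32 LE length | JSON payload.
var magic = []byte("HYPEMAN-MAILBOX1")

const (
	lenSize  = 4   // size of the length prefix in bytes
	capacity = 256 // maximum payload size reserved after the prefix
)

// Payload carries the network settings the guest should apply on resume.
type Payload struct {
	MAC     string `json:"mac"`
	IP      string `json:"ip"`
	Gateway string `json:"gateway"`
}

// Patch locates the mailbox by its magic and writes the encoded payload.
func Patch(mem []byte, p Payload) error {
	off := bytes.Index(mem, magic)
	if off < 0 {
		return errors.New("mailbox magic not found")
	}
	body, err := json.Marshal(p)
	if err != nil {
		return err
	}
	start := off + len(magic)
	if len(body) > capacity || start+lenSize+capacity > len(mem) {
		return errors.New("payload does not fit in mailbox")
	}
	binary.LittleEndian.PutUint32(mem[start:], uint32(len(body)))
	copy(mem[start+lenSize:], body)
	return nil
}

// Read is the guest-side counterpart: it decodes whatever Patch wrote.
func Read(mem []byte) (Payload, error) {
	var p Payload
	off := bytes.Index(mem, magic)
	if off < 0 {
		return p, errors.New("mailbox magic not found")
	}
	start := off + len(magic)
	if start+lenSize > len(mem) {
		return p, errors.New("mailbox truncated")
	}
	n := int(binary.LittleEndian.Uint32(mem[start:]))
	if n == 0 || n > capacity || start+lenSize+n > len(mem) {
		return p, errors.New("mailbox empty or corrupt")
	}
	err := json.Unmarshal(mem[start+lenSize:start+lenSize+n], &p)
	return p, err
}

func main() {
	mem := make([]byte, 4096) // stand-in for snapshot memory
	copy(mem[1024:], magic)   // the guest reserved a mailbox here before the snapshot
	want := Payload{MAC: "02:00:00:00:00:01", IP: "10.0.0.2/24", Gateway: "10.0.0.1"}
	if err := Patch(mem, want); err != nil {
		panic(err)
	}
	got, err := Read(mem)
	fmt.Println(got == want, err)
}
```

Keeping the encoder and decoder next to each other makes a round-trip check cheap, which is the kind of drift protection a shared host/guest format benefits from.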

Introduces a ReconfigureNetwork guest gRPC API, a shared lib/mailbox format, guest-agent resume watcher on Linux, and arms mailbox env on instance create/start. Fork/restore wiring uses prepareResumeNetworkHandoff / AfterResume, and fork return readiness skips redundant guest-agent polling when the standby→running path already completed guest network apply. Default kernel moves to Kernel_202605291 for VMGenID support.

CI/Makefile: per-run HYPEMAN_TEST_NETWORK_TMPDIR for Linux network test locks/leases, apt install timeouts, and small API/fork tests plus docs in lib/forkvm and lib/mailbox.
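
As an illustration of the per-run isolation, a test helper could resolve its lock/lease directory as sketched below. Only the HYPEMAN_TEST_NETWORK_TMPDIR variable name comes from this PR; how the tests actually consume it is not shown here, so `networkTestDir`, `lockPath`, and the `network.lock` file name are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// tmpDirEnv is the variable named in the PR; its use below is an assumption.
const tmpDirEnv = "HYPEMAN_TEST_NETWORK_TMPDIR"

// networkTestDir returns the directory for network test locks and leases.
// It prefers the per-run directory from the environment and otherwise
// creates a fresh temp dir, so runs never share stale /tmp state.
// getenv is injected (normally os.Getenv) to keep the helper testable.
func networkTestDir(getenv func(string) string) (string, error) {
	if dir := getenv(tmpDirEnv); dir != "" {
		return dir, os.MkdirAll(dir, 0o700)
	}
	return os.MkdirTemp("", "hypeman-test-network-")
}

// lockPath builds the lock file path inside dir (hypothetical file name).
func lockPath(dir string) string {
	return filepath.Join(dir, "network.lock")
}

func main() {
	dir, err := networkTestDir(os.Getenv)
	if err != nil {
		panic(err)
	}
	fmt.Println(lockPath(dir))
}
```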

Reviewed by Cursor Bugbot for commit 02bda12. Bugbot is set up for automated code reviews on this repo.

github-actions · 2026-06-01T11:50:16Z

✱ Stainless preview builds for hypeman

No changes were made to the SDKs.

Last updated: 2026-06-01 18:07:22 UTC

firetiger-agent · 2026-06-01T14:17:29Z

Monitoring Plan: Guest-Initiated VM Fork Network Handoff

What this PR does: Speeds up VM fork networking by having the guest apply its own new MAC/IP/gateway immediately on resume — using a pre-patched mailbox in snapshot memory — rather than waiting for a host-to-guest vsock call. Also bumps the default Firecracker kernel to Kernel_202605291 which adds VMGenID support.

Intended effect:

Guest network reconfigure errors: baseline 0/hr; confirmed if it remains 0 post-deploy — any occurrence of "failed to configure guest network after restore" is anomalous.
Fork/spawn completion rate: baseline 4–162 completions/hr (variable by active hour); confirmed if no sustained drop to 0 during active hours.
Mailbox activation log ("guest resume network mailbox applied"): baseline 0 (feature is new); confirmed if it starts appearing on forks of mailbox-enabled instances.

Risks:

UDP ACK timeout – if the guest never sends the ACK (e.g. old guest agent, UDP loss), fork returns an error after 2s; alert if any "failed to configure guest network after restore" errors appear.
New kernel regression – DefaultKernelVersion bumped to ch-6.12.8-kernel-3.0-202605291; alert if "failed to create instance" exceeds 16,000/hr for 2+ consecutive hours (pre-existing baseline: ~12,600–13,100/hr).
Mailbox patch corrupts snapshot memory – direct write to snapshot memory file before VM resume; alert if any "resume network mailbox" WARN logs appear at >10/hr sustained.
Fallback path regression – old exec-based network reconfigure now requires codes.Unimplemented gRPC response to trigger; alert if "failed to configure guest network after restore" rises from 0.
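
The ack wait and fallback behind the first and last risks can be sketched as follows. This is a hedged illustration only: the 2s timeout is taken from the plan above, the ack is assumed to be a datagram containing the text stage=applied, and `waitForAppliedAck`, `handoff`, and `runDemo` are hypothetical names rather than the PR's functions.

```go
package main

import (
	"fmt"
	"net"
	"strings"
	"time"
)

// ackTimeout mirrors the 2s bound mentioned in the monitoring plan.
const ackTimeout = 2 * time.Second

// waitForAppliedAck blocks until a datagram containing "stage=applied"
// arrives on conn or the timeout elapses. The ack format is an assumption.
func waitForAppliedAck(conn net.PacketConn, timeout time.Duration) error {
	if err := conn.SetReadDeadline(time.Now().Add(timeout)); err != nil {
		return err
	}
	buf := make([]byte, 512)
	for {
		n, _, err := conn.ReadFrom(buf)
		if err != nil {
			return err // includes deadline expiry
		}
		if strings.Contains(string(buf[:n]), "stage=applied") {
			return nil
		}
	}
}

// handoff waits for the guest ack and runs fallback if none arrives in time.
// It reports which path made the network ready.
func handoff(conn net.PacketConn, timeout time.Duration, fallback func() error) (string, error) {
	if err := waitForAppliedAck(conn, timeout); err == nil {
		return "mailbox", nil
	}
	if err := fallback(); err != nil {
		return "", fmt.Errorf("failed to configure guest network after restore: %w", err)
	}
	return "fallback", nil
}

// runDemo wires the pieces together on loopback; sendAck simulates the guest.
func runDemo(sendAck bool, timeout time.Duration) (string, error) {
	conn, err := net.ListenPacket("udp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer conn.Close()
	if sendAck {
		guest, err := net.Dial("udp", conn.LocalAddr().String())
		if err != nil {
			return "", err
		}
		defer guest.Close()
		if _, err := guest.Write([]byte("stage=applied")); err != nil {
			return "", err
		}
	}
	return handoff(conn, timeout, func() error { return nil })
}

func main() {
	for _, sendAck := range []bool{true, false} {
		path, err := runDemo(sendAck, 200*time.Millisecond)
		fmt.Println(path, err)
	}
}
```

Opening the listener before the VM resumes matters in a design like this, since an ack sent before the socket is bound would be lost.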

Status updates will be posted automatically on this PR as monitoring progresses.

rgarcia

ran this as a stricter maintainability pass. i think the behavior is directionally right, but i’d consider tightening the structure before merging because this adds a new host/guest protocol to a sensitive restore path.

consider moving the mailbox protocol into a shared contract/codec package. right now the host constants/payload live in lib/instances/guest_resume_network.go, while the guest constants/payload live in lib/system/guest_agent/resume_network.go. a small shared package with the magic, offsets, payload type, marshal/patch helpers, and round-trip tests would make offset/payload drift much harder.
consider extracting the restore-side handoff into a dedicated object instead of keeping the mailbox patching, UDP waiter, fallback path, and async mode inline in restoreInstance. something like resumeNetworkHandoff.Prepare(...), AfterResume(...), and Close() would let restore read as orchestration again, and would keep the special cases out of the broader fork/restore flow.
the wait_for_network=false path might need a clearer ownership model. the README says the guest finishes the handoff asynchronously, but restore currently returns after logging that the mailbox was patched. if the guest never applies the payload, there is no bounded observer or fallback. maybe document this as explicitly best-effort, or keep a background/observable completion path so failures are visible.
consider extracting the guest-agent retry/wait loop in lib/guest/client.go. ReconfigureNetworkInInstance now duplicates most of the ExecIntoInstance retry logic, and it already differs in one important way: the no-wait reconfigure path does not close a bad pooled connection like exec does. a shared retry wrapper for guest RPCs would reduce drift.
worth thinking about whether the mailbox location can be made more direct. findGuestResumeNetworkMailbox mmaps and scans the whole snapshot memory file, then relies on a process-local token cache. that may be okay as a first cut, but for a hot restore path a persisted/discoverable offset would be simpler to reason about and avoids whole-memory scanning when the cache misses.

not trying to push for a large rewrite of the feature, but i do think the protocol/restore orchestration boundaries are worth cleaning up now while the shape is still fresh.
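
One way to picture the suggested Prepare / AfterResume / Close split is the sketch below. The method names come from the review comment, but the signatures, the `Handoff` interface, the `restore` function, and the `recorder` stub are assumptions for illustration, not the code that was merged.

```go
package main

import "fmt"

// Handoff groups the resume-network steps so restore only orchestrates.
// Signatures are hypothetical.
type Handoff interface {
	Prepare() error     // e.g. patch the mailbox and open the ack listener
	AfterResume() error // e.g. wait for the ack or run the fallback
	Close() error       // release the listener and any other resources
}

// restore shows the intended call order around the VM resume.
func restore(h Handoff, resume func() error) (err error) {
	if err := h.Prepare(); err != nil {
		return fmt.Errorf("prepare handoff: %w", err)
	}
	// Close always runs once Prepare succeeded; its error is reported only
	// if nothing else failed.
	defer func() {
		if cerr := h.Close(); err == nil {
			err = cerr
		}
	}()
	if err := resume(); err != nil {
		return fmt.Errorf("resume vm: %w", err)
	}
	return h.AfterResume()
}

// recorder is a stub Handoff that records the order of calls.
type recorder struct{ calls []string }

func (r *recorder) Prepare() error     { r.calls = append(r.calls, "prepare"); return nil }
func (r *recorder) AfterResume() error { r.calls = append(r.calls, "after-resume"); return nil }
func (r *recorder) Close() error       { r.calls = append(r.calls, "close"); return nil }

func main() {
	rec := &recorder{}
	err := restore(rec, func() error {
		rec.calls = append(rec.calls, "resume")
		return nil
	})
	fmt.Println(rec.calls, err)
}
```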

sjmiller609 · 2026-06-01T17:55:39Z

addressed the structure pass:

moved the shared mailbox contract into lib/mailbox with shared constants/payload codec/tests
added lib/mailbox/README.md explaining the behavior
extracted restore-side mailbox/wait/fallback handling into a dedicated handoff helper
removed wait_for_network from the API; running forks now always wait for guest network readiness (yagni for now, especially since upcoming PR will have a way to add more data to mailbox)
aligned ReconfigureNetworkInInstance with exec behavior by closing stale pooled guest connections on retryable errors
added a guest.resume_network.mailbox_patch span so we can see mailbox patch/scan cost, instead of addressing possible performance issues now
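
The stale-connection item could be pictured with a small retry wrapper like the one below. It is a sketch under stated assumptions: `pool`, `conn`, `withRetry`, and `errRetryable` are hypothetical stand-ins, and the real guest client in lib/guest/client.go is not shown in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// errRetryable stands in for a transport error that is worth retrying.
var errRetryable = errors.New("retryable transport error")

// conn is a stand-in for a pooled guest connection.
type conn struct {
	id     int
	closed bool
}

// pool caches a single connection and hands out a new one after a discard.
type pool struct {
	next   int
	cached *conn
}

func (p *pool) get() *conn {
	if p.cached == nil {
		p.next++
		p.cached = &conn{id: p.next}
	}
	return p.cached
}

// discard closes c and drops it from the cache so the next get dials afresh.
func (p *pool) discard(c *conn) {
	c.closed = true
	if p.cached == c {
		p.cached = nil
	}
}

// withRetry runs call up to attempts times. On a retryable error it discards
// the pooled connection before retrying, so a stale connection is not reused.
func withRetry(p *pool, attempts int, call func(*conn) error) error {
	var last error
	for i := 0; i < attempts; i++ {
		c := p.get()
		last = call(c)
		if last == nil {
			return nil
		}
		if !errors.Is(last, errRetryable) {
			return last
		}
		p.discard(c)
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, last)
}

func main() {
	p := &pool{}
	var ids []int
	err := withRetry(p, 3, func(c *conn) error {
		ids = append(ids, c.id)
		if len(ids) == 1 {
			return errRetryable // first attempt hits a stale connection
		}
		return nil
	})
	fmt.Println(ids, err)
}
```

Sharing one wrapper between exec and network RPCs would keep the two retry paths from diverging again.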

rgarcia · 2026-06-01T17:56:29Z

thanks for the cleanup here. the shared mailbox package and extracted resumeNetworkHandoff address the main structure concerns from my earlier pass.

one remaining behavior concern:

guestInitiatedResumeNetworkMailbox(...) is now used as a generic readiness skip in applyForkTargetState / ForkInstance, but mailbox eligibility is broader than “this restore path already waited for the resume-network ack.” ensureGuestInitiatedResumeNetworkMailbox runs during create/start for Firecracker networked guests, so a stopped -> running fork can go through startInstance, come back Initializing, and then skip ensureGuestAgentReadyForForkPhase purely because the mailbox env/token are present. No VMGenID/mailbox handoff runs on that fresh boot path, so this can return before the guest agent is actually ready. Consider tying the skip to the specific restore handoff completing, or just keeping the guest-agent readiness wait in the generic fork return path.

minor doc cleanup: lib/forkvm/README.md still documents wait_for_network=false, but the current diff no longer exposes wait_for_network in the API/schema. worth removing that paragraph or re-adding the API field if it is still intended.

focused tests I ran locally:

go test ./lib/mailbox ./lib/guest ./lib/system/guest_agent ./lib/oapi -count=1

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 431328a.

sjmiller609 · 2026-06-01T18:05:48Z

addressed the remaining behavior concern in 5967319: mailbox env/token is no longer treated as fork return readiness. fresh start/current-return paths still wait for guest-agent readiness; standby network restore can skip the extra probe only after restore succeeds via mailbox UDP ack or host reconfigure fallback. the stale wait_for_network docs were removed in 431328a.
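
The readiness rule described here can be reduced to a small decision function. This is an illustrative sketch only; `restoreOutcome`, `forkState`, and `needsGuestAgentProbe` are hypothetical names, and the actual change in 5967319 is not shown in this thread.

```go
package main

import "fmt"

// restoreOutcome records how (or whether) a restore made the network ready.
type restoreOutcome int

const (
	outcomeNone         restoreOutcome = iota // no restore handoff ran, e.g. a fresh boot
	outcomeMailboxAck                         // guest acked the mailbox payload over UDP
	outcomeHostFallback                       // host reconfigure fallback succeeded
)

// forkState is a hypothetical summary of what the fork path knows.
type forkState struct {
	MailboxArmed bool // mailbox env/token present; set at create/start
	Outcome      restoreOutcome
}

// needsGuestAgentProbe reports whether the fork return path must still wait
// for guest-agent readiness. MailboxArmed is deliberately ignored: an armed
// mailbox only means a handoff is possible, not that one completed.
func needsGuestAgentProbe(s forkState) bool {
	return s.Outcome == outcomeNone
}

func main() {
	freshBoot := forkState{MailboxArmed: true, Outcome: outcomeNone}
	restored := forkState{MailboxArmed: true, Outcome: outcomeMailboxAck}
	fmt.Println(needsGuestAgentProbe(freshBoot), needsGuestAgentProbe(restored))
}
```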

* Revert "Add mailbox resume network handoff (#260)". This reverts commit 6da67d7.
* Preserve RPC network reconfigure path
* Preserve fork readiness de-duplication
* Preserve CI network de-flake setup

Add mailbox resume network handoff

87821dd

This was referenced Jun 1, 2026

Add mailbox resume network handoff #253

Closed

Experiment with VMClock resume network handoff #254

Closed

Optimize Firecracker snapshot resume #256

Closed

sjmiller609 force-pushed the hypeship/network-handoff-v2 branch 2 times, most recently from 39dbf4c to dffc792 Compare June 1, 2026 13:35

Isolate linux test temp files

05bc363

sjmiller609 force-pushed the hypeship/network-handoff-v2 branch from dffc792 to 05bc363 Compare June 1, 2026 13:52

sjmiller609 commented Jun 1, 2026

Comment thread .github/workflows/test.yml

sjmiller609 commented Jun 1, 2026

Comment thread cmd/api/api/instances.go Outdated

sjmiller609 commented Jun 1, 2026

Comment thread lib/forkvm/README.md Outdated

sjmiller609 marked this pull request as ready for review June 1, 2026 14:11

cursor Bot reviewed Jun 1, 2026

Comment thread lib/system/guest_agent/resume_network.go

sjmiller609 commented Jun 1, 2026

Comment thread lib/instances/restore.go

sjmiller609 added 2 commits June 1, 2026 14:23

Merge remote-tracking branch 'origin/main' into hypeship/network-hand…

25286d1

…off-v2 # Conflicts: # lib/instances/test_network_config_test.go

Timeout missing resume network mailbox payload

a0e9b45

cursor Bot reviewed Jun 1, 2026

Comment thread lib/instances/restore.go

Comment thread lib/instances/guest_resume_network.go

Align resume network ack timeout

df8309e

cursor Bot reviewed Jun 1, 2026

Comment thread lib/instances/restore.go Outdated

Comment thread lib/instances/restore.go Outdated

sjmiller609 added 3 commits June 1, 2026 15:15

Merge remote-tracking branch 'origin/main' into hypeship/network-hand…

adb7deb

…off-v2

Arm resume network mailbox for Firecracker guests

4ad440a

Fall back after resume network ack timeout

8c9b79c

cursor Bot reviewed Jun 1, 2026

Comment thread lib/instances/restore.go Outdated

sjmiller609 commented Jun 1, 2026

Comment thread lib/guest/guest.proto

sjmiller609 commented Jun 1, 2026

Comment thread lib/guest/guest.proto

sjmiller609 requested review from hiroTamada and rgarcia June 1, 2026 15:37

rgarcia reviewed Jun 1, 2026

sjmiller609 added 5 commits June 1, 2026 17:36

Share resume network mailbox contract

571d01e

Always wait for fork network readiness

0838b80

Extract resume network handoff flow

209147e

Close stale guest connections after network RPC errors

d8e6a3b

Trace resume network mailbox patching

7d0b8c8

cursor Bot reviewed Jun 1, 2026

Comment thread lib/instances/resume_network_handoff.go

sjmiller609 added 2 commits June 1, 2026 17:45

Move mailbox contract package

b301e74

Fall back when resume network ack listener fails

362c357

sjmiller609 requested a review from rgarcia June 1, 2026 17:54

cursor Bot reviewed Jun 1, 2026

Comment thread lib/instances/resume_network_handoff.go

Update resume network handoff docs

431328a

cursor Bot reviewed Jun 1, 2026

Comment thread lib/instances/fork.go Outdated

Require guest readiness after fork starts

5967319

Fix fork readiness build

02bda12

rgarcia approved these changes Jun 1, 2026

sjmiller609 merged commit 6da67d7 into main Jun 1, 2026
11 checks passed

sjmiller609 deleted the hypeship/network-handoff-v2 branch June 1, 2026 19:18

sjmiller609 mentioned this pull request Jun 2, 2026

Remove mailbox for now #268

Merged
