Add mailbox resume network handoff#260
Conversation
✱ Stainless preview builds for hypemanNo changes were made to the SDKs. This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push. |
39dbf4c to
dffc792
Compare
dffc792 to
05bc363
Compare
Monitoring Plan: Guest-Initiated VM Fork Network HandoffWhat this PR does: Speeds up VM fork networking by having the guest apply its own new MAC/IP/gateway immediately on resume — using a pre-patched mailbox in snapshot memory — rather than waiting for a host-to-guest vsock call. Also bumps the default Firecracker kernel to Intended effect:
Risks:
Status updates will be posted automatically on this PR as monitoring progresses. |
…off-v2 # Conflicts: # lib/instances/test_network_config_test.go
rgarcia
left a comment
There was a problem hiding this comment.
ran this as a stricter maintainability pass. i think the behavior is directionally right, but i’d consider tightening the structure before merging because this adds a new host/guest protocol to a sensitive restore path.
-
consider moving the mailbox protocol into a shared contract/codec package. right now the host constants/payload live in
lib/instances/guest_resume_network.go, while the guest constants/payload live inlib/system/guest_agent/resume_network.go. a small shared package with the magic, offsets, payload type, marshal/patch helpers, and round-trip tests would make offset/payload drift much harder. -
consider extracting the restore-side handoff into a dedicated object instead of keeping the mailbox patching, UDP waiter, fallback path, and async mode inline in
restoreInstance. something likeresumeNetworkHandoff.Prepare(...),AfterResume(...), andClose()would let restore read as orchestration again, and would keep the special cases out of the broader fork/restore flow. -
the
wait_for_network=falsepath might need a clearer ownership model. the README says the guest finishes the handoff asynchronously, but restore currently returns after logging that the mailbox was patched. if the guest never applies the payload, there is no bounded observer or fallback. maybe document this as explicitly best-effort, or keep a background/observable completion path so failures are visible. -
consider extracting the guest-agent retry/wait loop in
lib/guest/client.go.ReconfigureNetworkInInstancenow duplicates most of theExecIntoInstanceretry logic, and it already differs in one important way: the no-wait reconfigure path does not close a bad pooled connection like exec does. a shared retry wrapper for guest RPCs would reduce drift. -
worth thinking about whether the mailbox location can be made more direct.
findGuestResumeNetworkMailboxmmaps and scans the whole snapshot memory file, then relies on a process-local token cache. that may be okay as a first cut, but for a hot restore path a persisted/discoverable offset would be simpler to reason about and avoids whole-memory scanning when the cache misses.
not trying to push for a large rewrite of the feature, but i do think the protocol/restore orchestration boundaries are worth cleaning up now while the shape is still fresh.
|
addressed the structure pass:
|
|
thanks for the cleanup here. the shared mailbox package and extracted one remaining behavior concern:
minor doc cleanup: focused tests I ran locally: |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 431328a. Configure here.
|
addressed the remaining behavior concern in 5967319: mailbox env/token is no longer treated as fork return readiness. fresh start/current-return paths still wait for guest-agent readiness; standby network restore can skip the extra probe only after restore succeeds via mailbox UDP ack or host reconfigure fallback. the stale wait_for_network docs were removed in 431328a. |

Summary
ReconfigureNetworkRPC that applies restored network settings with netlink, with the existing shell-based path kept as a compatibility fallbackwait_for_networkenabled by default for running forks; callers can set it tofalseto return immediately after resume while guest network apply continues asynchronouslystage=appliedack instead of a post-resume guest RPClib/forkvm/README.md/tmpstateTests
Notes
./lib/instancestests could not run in this checkout because the embedded VMM and guest-agent binaries are not presentNote
High Risk
Changes snapshot restore, fork readiness, guest networking, and guest-agent protocol on critical VM lifecycle paths; failures could leave forks unreachable or mis-addressed until fallback runs.
Overview
Adds a guest-initiated resume network handoff for networked Firecracker standby/running forks and restores: the host patches a JSON payload into a fixed mailbox in snapshot memory before resume, the guest-agent applies the new MAC/IP/route via netlink after a VMGenID resume signal, and the host waits for a UDP
stage=appliedack before treating the fork as ready—falling back to host vsockReconfigureNetwork(or the legacy shellippath) when the mailbox cannot be armed, patched, or acknowledged in time.Introduces a
ReconfigureNetworkguest gRPC API, a sharedlib/mailboxformat, guest-agent resume watcher on Linux, and arms mailbox env on instance create/start. Fork/restore wiring usesprepareResumeNetworkHandoff/AfterResume, and fork return readiness skips redundant guest-agent polling when the standby→running path already completed guest network apply. Default kernel moves toKernel_202605291for VMGenID support.CI/Makefile: per-run
HYPEMAN_TEST_NETWORK_TMPDIRfor Linux network test locks/leases, apt install timeouts, and small API/fork tests plus docs inlib/forkvmandlib/mailbox.Reviewed by Cursor Bugbot for commit 02bda12. Bugbot is set up for automated code reviews on this repo. Configure here.