Add UFFD snapshot pager #262

Merged. sjmiller609 merged 12 commits into main from hypeship/uffd-pager-v2 on Jun 4, 2026.

Conversation

@sjmiller609
Collaborator

@sjmiller609 sjmiller609 commented Jun 1, 2026

Summary

  • adds a config-gated Firecracker UFFD memory backend for snapshot restore
  • starts a per-restore UFFD pager session backed by the snapshot memory file and an optional shared page cache
  • applies resume-network mailbox updates through UFFD overlay pages so restore does not mutate the backing memory file
  • shards the pager cache and tracks pager timing counters for page-fault, lookup, backing-read, and copy latency
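
The lookup order implied by the bullets above (overlay page, then cache, then backing file) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `pager`, `shard`, and `memFile` types, the shard-selection rule, and the two counters are hypothetical and not taken from the PR, and the cache here is per-pager and keyed by offset alone, whereas a cache shared across snapshots would also need the snapshot identity in its key.

```go
package main

import (
	"fmt"
	"io"
	"sync"
	"sync/atomic"
)

const pageSize = 4096

// memFile is an in-memory stand-in for the snapshot memory file.
type memFile []byte

func (m memFile) ReadAt(b []byte, off int64) (int, error) {
	if off >= int64(len(m)) {
		return 0, io.EOF
	}
	n := copy(b, m[off:])
	if n < len(b) {
		return n, io.EOF
	}
	return n, nil
}

// shard is one independently locked part of the page cache.
type shard struct {
	mu    sync.Mutex
	pages map[uint64][]byte // keyed by page-aligned offset
}

// pager resolves page faults for one restore session (hypothetical type).
type pager struct {
	backing io.ReaderAt       // snapshot memory file; only ever read
	overlay map[uint64][]byte // session-local pages; populate before serving faults
	shards  []shard
	hits    atomic.Uint64 // faults served from the cache
	reads   atomic.Uint64 // faults that needed a backing-file read
}

func newPager(backing io.ReaderAt, nShards int) *pager {
	p := &pager{backing: backing, overlay: map[uint64][]byte{}, shards: make([]shard, nShards)}
	for i := range p.shards {
		p.shards[i].pages = map[uint64][]byte{}
	}
	return p
}

// page returns the contents of the page at a page-aligned offset,
// checking the overlay first, then the cache, then the backing file.
func (p *pager) page(off uint64) ([]byte, error) {
	if pg, ok := p.overlay[off]; ok {
		return pg, nil
	}
	s := &p.shards[(off/pageSize)%uint64(len(p.shards))]
	s.mu.Lock()
	defer s.mu.Unlock()
	if pg, ok := s.pages[off]; ok {
		p.hits.Add(1)
		return pg, nil
	}
	pg := make([]byte, pageSize)
	if _, err := p.backing.ReadAt(pg, int64(off)); err != nil && err != io.EOF {
		return nil, err
	}
	p.reads.Add(1)
	s.pages[off] = pg
	return pg, nil
}

func main() {
	mem := make(memFile, 2*pageSize)
	mem[0], mem[pageSize] = 0xAA, 0xBB

	p := newPager(mem, 4)
	patched := make([]byte, pageSize)
	patched[0] = 0x01
	p.overlay[pageSize] = patched // e.g. an updated mailbox page

	first, _ := p.page(0)         // backing read, then cached
	again, _ := p.page(0)         // cache hit
	shadow, _ := p.page(pageSize) // overlay wins; backing file untouched
	fmt.Println(first[0], again[0], shadow[0], mem[pageSize], p.hits.Load(), p.reads.Load())
}
```

A real pager would hand each resolved page to the kernel (for example with a `UFFDIO_COPY` ioctl) rather than return a slice; that step is omitted here. Sharding by page number means faults on different pages mostly contend on different locks.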

Tests

  • go test ./lib/hypervisor/firecracker ./lib/uffdpager ./lib/mailbox ./lib/guest ./lib/system/guest_agent ./lib/oapi -count=1

Note

High Risk
Changes Firecracker snapshot restore and host VM memory handling, depends on a correctly installed/versioned systemd pager, and misconfiguration or pager failure can leave restored VMs unhealthy until recycled.

Overview
Adds an opt-in Firecracker UFFD snapshot memory backend (hypervisor.firecracker_snapshot_memory_backend=uffd, default file) with a bounded shared page cache (hypervisor.firecracker_uffd_cache_max_bytes).

On Linux, enabling UFFD starts a versioned hypeman-uffd-pager via the hypeman-uffd@<version>.service systemd template; Hypeman wires the Firecracker starter to create per-restore pager sessions and passes a mem_backend of type Uffd (per-session socket) instead of mmap’ing the snapshot memory file. Instance metadata tracks session/cache keys, restores go through new RestoreOptions, and stop/delete/standby/restore failure paths close UFFD sessions; state queries can mark instances unknown if the pager is unhealthy.
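
As a rough illustration of the payload change described above, the sketch below builds a snapshot-load request body for either backend. The JSON field names (`snapshot_path`, `mem_backend`, `backend_type`, `backend_path`) follow Firecracker's snapshot-load API; the `loadRequest` helper, the Go type names, and the example paths are hypothetical and are not the PR's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// memBackend mirrors the mem_backend object of Firecracker's snapshot-load request.
type memBackend struct {
	BackendType string `json:"backend_type"` // "File" or "Uffd"
	BackendPath string `json:"backend_path"` // memory file path, or UFFD handler socket path
}

type snapshotLoad struct {
	SnapshotPath string     `json:"snapshot_path"`
	MemBackend   memBackend `json:"mem_backend"`
}

// loadRequest is a hypothetical helper that maps the configured backend
// ("file" by default, or "uffd") onto a request body.
func loadRequest(backend, snapshotPath, memFilePath, uffdSocket string) (snapshotLoad, error) {
	req := snapshotLoad{SnapshotPath: snapshotPath}
	switch backend {
	case "", "file":
		req.MemBackend = memBackend{BackendType: "File", BackendPath: memFilePath}
	case "uffd":
		if uffdSocket == "" {
			return snapshotLoad{}, fmt.Errorf("uffd backend requires a per-session socket path")
		}
		req.MemBackend = memBackend{BackendType: "Uffd", BackendPath: uffdSocket}
	default:
		return snapshotLoad{}, fmt.Errorf("unknown snapshot memory backend %q", backend)
	}
	return req, nil
}

func main() {
	// All paths below are made up for the example.
	req, err := loadRequest("uffd", "/snap/vmstate", "/snap/memory", "/run/uffd/session.sock")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	body, _ := json.Marshal(req)
	fmt.Println(string(body))
}
```

Rejecting unknown backend values up front keeps a typo in the config from silently falling back to the file backend.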

Also ships the pager in build/release/install (cmd/uffd-pager, Makefile, GoReleaser, install scripts) and adds CI scripts/check-uffd-version.sh so runtime pager changes require bumping lib/uffdpager/VERSION.
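
The rule that the CI check enforces could look something like the sketch below, written in Go for consistency with the rest of the repository rather than as the shell script the PR adds. The set of paths treated as runtime pager code and the exclusion of tests and docs are assumptions made for illustration; only `lib/uffdpager/VERSION` and `cmd/uffd-pager` come from the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// runtimePrefixes lists directories assumed to affect the running pager.
var runtimePrefixes = []string{"cmd/uffd-pager/", "lib/uffdpager/"}

const versionFile = "lib/uffdpager/VERSION"

// needsVersionBump reports whether the changed files touch pager runtime
// code without also touching the VERSION file.
func needsVersionBump(changed []string) bool {
	runtimeChanged, versionChanged := false, false
	for _, path := range changed {
		switch {
		case path == versionFile:
			versionChanged = true
		case strings.HasSuffix(path, "_test.go"), strings.HasSuffix(path, ".md"):
			// Assumed not to change the behavior of the running pager.
		default:
			for _, prefix := range runtimePrefixes {
				if strings.HasPrefix(path, prefix) {
					runtimeChanged = true
				}
			}
		}
	}
	return runtimeChanged && !versionChanged
}

func main() {
	fmt.Println(needsVersionBump([]string{"lib/uffdpager/cache.go"}))
	fmt.Println(needsVersionBump([]string{"lib/uffdpager/cache.go", "lib/uffdpager/VERSION"}))
}
```

Because the systemd unit is templated on the version, a bump is what causes a new pager instance to be started for new pager code.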

Reviewed by Cursor Bugbot for commit 6fb4e46. Bugbot is set up for automated code reviews on this repo.

@sjmiller609 sjmiller609 force-pushed the hypeship/uffd-pager-v2 branch from 9c706a1 to fb5341c Compare June 1, 2026 19:53
@sjmiller609 sjmiller609 changed the base branch from hypeship/fc-resume-on-load-v2 to main June 1, 2026 19:53
Comment thread cmd/api/config/config.go
Comment thread cmd/api/config/config.go
Comment thread lib/instances/firecracker_uffd.go
Comment thread lib/instances/guest_resume_network.go Outdated
Comment thread lib/uffdpager/cache.go
Comment thread lib/uffdpager/server_linux.go
Comment thread lib/uffdpager/server_linux.go Outdated
Comment thread lib/uffdpager/supervisor_linux.go
Comment thread cmd/api/config/config.go
Comment thread scripts/install.sh
@sjmiller609 sjmiller609 force-pushed the hypeship/uffd-pager-v2 branch from 38b3cf8 to 6c8c898 Compare June 3, 2026 15:06
@sjmiller609 sjmiller609 marked this pull request as ready for review June 3, 2026 15:32
Comment thread lib/instances/restore.go
@firetiger-agent

Created a monitoring plan for this PR.

What this PR does: Adds opt-in lazy-memory paging for Firecracker snapshot restores using Linux UFFD. The default backend remains file — no behavior changes until an operator explicitly sets FIRECRACKER_SNAPSHOT_MEMORY_BACKEND=uffd on a hypeman node.

Intended effect:

  • Snapshot restore success rate: baseline 0 WARN-level restore failures/hr; confirmed if "failed to restore from snapshot" and "configure snapshot memory backend" logs remain at 0 after deploy
  • Instance spawn rate: baseline 10K–31K/hr; confirmed if rate stays within range (no regression from the RestoreVM interface or mem_backend payload change)
  • API 5xx error rate: baseline 0.006–0.026%; confirmed if no sustained increase post-deploy

Risks:

  • mem_backend Firecracker API incompatibility: snapshotLoadParams now sends mem_backend struct instead of mem_file_path; alert if any "load firecracker snapshot" ERROR appears post-deploy (baseline: 0/hr)
  • Hypeman node crash at startup: NewManagerWithConfig now panics if the UFFD pager fails to start; alert if any hypeman process restart occurs within 1h of deploy (only applies when uffd backend is set, but worth confirming)
  • Stale UFFD sessions on failure: on restore failure, UFFD session cleanup must succeed within 2s; alert if "failed to close firecracker uffd session" WARN appears (baseline: 0/hr; relevant when UFFD mode is first activated)
  • StateUnknown proliferation: UFFD session health check during instance query can return StateUnknown if pager dies; alert if any "firecracker uffd session is unhealthy" WARN log appears (baseline: 0/hr; relevant when UFFD mode is active)
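
The bounded cleanup mentioned in the stale-session risk can be sketched as below. This is a minimal illustration, not the PR's code: the `sessionCloser` interface, `restoreWithCleanup`, `fakeSession`, and `simulate` are hypothetical names, and only the 2s bound and the "failed to close firecracker uffd session" message come from the monitoring plan above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// sessionCloser is a hypothetical view of a per-restore UFFD session.
type sessionCloser interface {
	Close(ctx context.Context) error
}

const closeTimeout = 2 * time.Second

// restoreWithCleanup runs restore and, if it fails, closes the session under
// a fresh bounded context so a cancelled request context cannot skip cleanup.
func restoreWithCleanup(ctx context.Context, s sessionCloser, restore func(context.Context) error) error {
	err := restore(ctx)
	if err == nil {
		return nil
	}
	cctx, cancel := context.WithTimeout(context.Background(), closeTimeout)
	defer cancel()
	if cerr := s.Close(cctx); cerr != nil {
		return fmt.Errorf("%w (failed to close firecracker uffd session: %v)", err, cerr)
	}
	return err
}

// fakeSession records whether Close was called; it stands in for a real session.
type fakeSession struct{ closed bool }

func (f *fakeSession) Close(ctx context.Context) error {
	f.closed = true
	return ctx.Err() // nil while the cleanup context is still live
}

var errDemo = errors.New("failed to restore from snapshot")

// simulate runs one restore attempt with an already-cancelled request context.
func simulate(restoreErr error) (closed bool, err error) {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // the caller has gone away before cleanup starts
	s := &fakeSession{}
	err = restoreWithCleanup(ctx, s, func(context.Context) error { return restoreErr })
	return s.closed, err
}

func main() {
	closed, err := simulate(errDemo)
	fmt.Println("closed:", closed, "err:", err)
}
```

Using a fresh context for cleanup, rather than the request context, is a design choice of this sketch; it keeps the session from leaking when the restore failed precisely because the request was cancelled.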

Status updates will be posted automatically on this PR as monitoring progresses.

@sjmiller609 sjmiller609 requested a review from rgarcia June 3, 2026 15:43
Contributor

@hiroTamada hiroTamada left a comment

reviewed the uffd pager slice — architecture looks solid (separate pager process, opt-in file backend, versioned systemd + drain). left a few nits on docs, restore cleanup, cache key wording, and pager vs session health. nice work.

Comment thread lib/uffdpager/README.md
Comment thread lib/uffdpager/README.md Outdated
Comment thread lib/instances/restore.go
Comment thread lib/instances/firecracker_uffd.go
Comment thread lib/uffdpager/supervisor_linux.go Outdated
Comment thread lib/uffdpager/server_faults_linux.go

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 6fb4e46.

Comment thread lib/instances/fork.go
@sjmiller609 sjmiller609 merged commit 6458cf3 into main Jun 4, 2026
11 checks passed
@sjmiller609 sjmiller609 deleted the hypeship/uffd-pager-v2 branch June 4, 2026 15:43