[v1.14] Expose memory mapping & dirty pages; Make memfile dump optional #8

Draft
bchalios wants to merge 42 commits into firecracker-v1.14 from firecracker-v1.14-direct-mem

Conversation

@bchalios
Collaborator

@bchalios bchalios commented Feb 5, 2026

WIP

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • I have read and understand CONTRIBUTING.md.
  • I have run tools/devtool checkbuild --all to verify that the PR passes
    build checks on all supported architectures.
  • I have run tools/devtool checkstyle to verify that the PR passes the
    automated style checks.
  • I have described what is done in these changes, why they are needed, and
    how they are solving the problem in a clear and encompassing way.
  • I have updated any relevant documentation (both in code and in the docs)
    in the PR.
  • I have mentioned all user-facing changes in CHANGELOG.md.
  • If a specific issue led to this PR, this PR closes the issue.
  • When making API changes, I have followed the
    Runbook for Firecracker API changes.
  • I have tested all new and changed functionalities in unit tests and/or
    integration tests.
  • I have linked an issue to every new TODO.

  • This functionality cannot be added in rust-vmm.

@bchalios bchalios marked this pull request as draft February 5, 2026 19:16
@ValentaTomas ValentaTomas added the wontfix This will not be worked on label Feb 5, 2026
@bchalios bchalios force-pushed the firecracker-v1.14-direct-mem branch from 88151c3 to 89538bc Compare February 5, 2026 20:20
@bchalios bchalios force-pushed the firecracker-v1.14-direct-mem branch from a041675 to 61fdd9d Compare February 12, 2026 23:10
Comment thread src/vmm/src/utils/pagemap.rs
@bchalios bchalios force-pushed the firecracker-v1.14-direct-mem branch from d91bdb1 to 61fdd9d Compare February 13, 2026 23:58
@ValentaTomas ValentaTomas requested review from ValentaTomas and removed request for ValentaTomas March 12, 2026 19:56
@ValentaTomas ValentaTomas self-assigned this Mar 12, 2026
@ValentaTomas ValentaTomas self-requested a review March 12, 2026 19:57
@ValentaTomas ValentaTomas removed their request for review March 13, 2026 00:03
@bchalios bchalios force-pushed the firecracker-v1.14-direct-mem branch from f01905f to af9c995 Compare March 23, 2026 15:33
@ValentaTomas ValentaTomas removed their assignment Apr 8, 2026
bchalios added 11 commits April 14, 2026 17:46
Add a few APIs to get information about guest memory:

* An endpoint for guest memory mappings (guest physical to host
  virtual).
* An endpoint for resident and empty pages.
* An endpoint for dirty pages.

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
There are cases where a user might want to snapshot the memory of a VM
externally. In these cases, we can ask Firecracker to avoid serializing
the memory file to disk when we create a snapshot.

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Implement API /memory/mappings which returns the memory mappings of
guest physical to host virtual memory.

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Implement API /memory which returns two bitmaps: resident and empty.
`resident` tracks whether a guest page is in the resident set and `empty`
tracks whether it's actually all 0s.

Both bitmaps are structured as vectors of u64, so their length is:
total_number_of_pages.div_ceil(64).

Pages are ordered in the order of pages as reported by /memory/mappings.
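
Editor's sketch of the bitmap layout described above. The helper names are hypothetical, and the bit order within each u64 (bit `page % 64` of word `page / 64`) is an assumption: the commit message only states the word count.

```rust
/// Number of u64 words needed to hold one bit per guest page.
fn bitmap_len(total_pages: usize) -> usize {
    total_pages.div_ceil(64)
}

/// Returns true if the bit for `page` is set.
/// ASSUMPTION: page N lives in word N / 64 at bit N % 64 (least significant
/// bit first). The source does not state the bit order.
fn page_bit(bitmap: &[u64], page: usize) -> bool {
    let word = page / 64;
    let bit = page % 64;
    // Out-of-range pages are reported as "not set" rather than panicking.
    bitmap.get(word).map_or(false, |w| (*w >> bit) & 1 == 1)
}
```

With 65 pages the bitmap needs two words, and page 64 is the lowest bit of the second word under the assumed ordering.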

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Implement API /memory/dirty which returns a bitmap tracking dirty guest
memory. The bitmap is structured as a vector of u64, so its length is:
total_number_of_pages.div_ceil(64).

Pages are ordered in the order of pages as reported by /memory/mappings.
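
Editor's sketch of how a bit position in the dirty bitmap could be translated back to a guest physical address by walking the mappings in the order /memory/mappings reports them. The `Mapping` struct and its field names are hypothetical; the actual response schema is not shown in this conversation.

```rust
/// HYPOTHETICAL shape of one /memory/mappings entry (field names invented
/// for illustration; the real schema is not shown in this PR conversation).
struct Mapping {
    guest_phys_base: u64,
    size: u64,
    page_size: u64, // assumed nonzero
}

/// Translate a global page index (a bit position in the bitmap) into a guest
/// physical address by consuming the mappings in their reported order.
fn page_to_guest_addr(mappings: &[Mapping], mut page: u64) -> Option<u64> {
    for m in mappings {
        let pages = m.size / m.page_size;
        if page < pages {
            return Some(m.guest_phys_base + page * m.page_size);
        }
        // The index falls beyond this region: skip its pages and continue.
        page -= pages;
    }
    None // index past the end of guest memory
}
```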

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
UFFD provides an API to enable write-protection for memory ranges
tracked by a userfault file descriptor. Detailed information can be
found here: https://docs.kernel.org/admin-guide/mm/userfaultfd.html.

To use the feature, users need to register the memory region with
UFFDIO_REGISTER_MODE_WP. Then, users need to explicitly enable
write-protection for sub-ranges of the registered region.

Writes to pages within write-protected memory ranges can be handled in
one of two ways. In synchronous mode, a write to a protected page will
cause the kernel to send a write protection event over the userfaultfd.
In asynchronous mode, the kernel will automatically handle writes to
protected pages by clearing the write-protection bit. Userspace can
later observe the write protection bit by looking into the corresponding
entry of /proc/<pid>/pagemap.
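
Editor's sketch of decoding a 64-bit pagemap entry for this purpose. The bit positions (63 for "present", 57 for "uffd-wp write-protected") follow the kernel's pagemap documentation; the function name is hypothetical and this is not the PR's PagemapEntry implementation.

```rust
/// Bit 63 of a pagemap entry: page is present in RAM.
const PM_PRESENT: u64 = 1 << 63;
/// Bit 57 of a pagemap entry: PTE is uffd-wp write-protected.
const PM_UFFD_WP: u64 = 1 << 57;

/// With asynchronous write protection the kernel clears the uffd-wp bit on
/// the first write, so "present and not write-protected" reads as "written
/// since protection was last armed".
fn is_dirty(entry: u64) -> bool {
    (entry & PM_PRESENT) != 0 && (entry & PM_UFFD_WP) == 0
}
```

This interpretation only holds for ranges where write protection was actually armed; a present page that was never protected also has the bit clear.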

This commit unconditionally enables write protection for guest memory
using the asynchronous mode.

!NOTE!: asynchronous write protection requires (host) kernel version 6.7
or later.

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
This is an optional test on the Firecracker side and most of the time
it's ignored (when valid dependency changes happen). Having this fail
blocks our fc-versions releases.

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
TODO

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Add descriptions for MicrovmState from previous Firecracker versions.
Moreover, add methods to translate a snapshot file from previous
versions to the current one.

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Now that we have logic for translating snapshot formats, we can allow
the /snapshot/load API to parse v1.10 and v1.12 snapshots. We change the
logic that parses the snapshot file to first read the version from the
file and then (if needed) translate it to the expected v1.14 version.

Currently older versions supported are v1.10 and v1.12.

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
The changes we made to support older snapshot formats did not
compile on ARM systems. Fix the compilation issues, which were
mainly bad re-exports.

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
@bchalios bchalios force-pushed the firecracker-v1.14-direct-mem branch from 76f16f0 to 458ca91 Compare April 14, 2026 15:51
@cursor

cursor Bot commented Apr 14, 2026

PR Summary

High Risk
Touches guest memory exposure APIs, snapshot/UFFD restore (write-protect, memfd FD passing), dependency fork, and block I/O paths that use fallocate—errors could affect data integrity or compatibility.

Overview
Adds GET /memory, /memory/mappings, and /memory/dirty so operators can inspect guest RAM (resident/empty bitmaps, host mapping metadata, and dirty pages via mincore + /proc/pagemap) without dumping a memfile.

Snapshots: mem_file_path on create is now optional—VM state can be snapshotted while memory is captured externally. UFFD restore gains use_memfd (validated only for Uffd backends) and sends an optional memfd over the UDS handshake. Restore loads snapshot format 8.0 natively and upgrades 6.0 / 4.0 via new v1_10 / v1_12 persist shims; device save order is fixed so virtio interrupts aren’t lost. GICv3 ITS restore is skipped when missing from older snapshots.

Virtio block: writable drives always advertise DISCARD and WRITE_ZEROES (fallocate / io_uring), with EOPNOTSUPP caching and seccomp fallocate / x86 pread64. Balloon advertises HINT_WAIT_ON_ACK with free-page hinting. UFFD switches to a forked userfaultfd-rs with MISSING + WRITE_PROTECT (and hugetlb WP where applicable). Removes the CI workflow that blocked Cargo.lock changes.

Also documents block TRIM/write-zeroes and balloon hint-ACK behavior; tightens virtio prepare_save ordering; makes several persist structs pub for migration.

Reviewed by Cursor Bugbot for commit 431f1fc. Bugbot is set up for automated code reviews on this repo. Configure here.

Add docs/api_requests/block-write-zeroes.md describing:
  - automatic advertisement on writable devices
  - UNMAP=0 → FALLOC_FL_ZERO_RANGE (zeros in place)
  - UNMAP=1 → FALLOC_FL_PUNCH_HOLE (zeros + deallocate)
  - host filesystem requirements
  - EOPNOTSUPP fallback (silent VIRTIO_BLK_S_UNSUPP, shared cache)
  - known limitations
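
Editor's sketch of the UNMAP-to-fallocate mapping listed above. The flag values come from Linux's include/uapi/linux/falloc.h; the function name is hypothetical, and OR-ing FALLOC_FL_KEEP_SIZE into the zero-range case is an assumption, not something this commit message confirms.

```rust
// Values from include/uapi/linux/falloc.h, hard-coded so the sketch does not
// need the `libc` crate.
const FALLOC_FL_KEEP_SIZE: i32 = 0x01;
const FALLOC_FL_PUNCH_HOLE: i32 = 0x02;
const FALLOC_FL_ZERO_RANGE: i32 = 0x10;

/// Pick the fallocate(2) mode for a VIRTIO_BLK_T_WRITE_ZEROES request.
fn write_zeroes_mode(unmap: bool) -> i32 {
    if unmap {
        // The kernel requires PUNCH_HOLE to be combined with KEEP_SIZE.
        FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
    } else {
        // ASSUMPTION: KEEP_SIZE is also set here so the backing file never
        // grows; the commit message only names FALLOC_FL_ZERO_RANGE.
        FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE
    }
}
```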

Remove the "write_zeroes is not supported" line from block-discard.md
now that the feature is implemented.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
@cla-bot

cla-bot Bot commented May 12, 2026

We require contributors to sign our Contributor License Agreement, and we don't have @ilstam, @ShadowCurse, @JackThomson2, @Manciukic, @zulinx86 on file. You can sign our CLA at https://e2b.dev/docs/cla . Once you've signed, post a comment here that says '@cla-bot check'

let is_eopnotsupp = matches!(
    &cqe_result,
    Err(e) if e.raw_os_error() == Some(-libc::EOPNOTSUPP)
);

Async CQE EOPNOTSUPP check uses wrong errno sign

High Severity

The async completion path checks e.raw_os_error() == Some(-libc::EOPNOTSUPP) (negative), but the Cqe wrapper almost certainly converts the negative CQE result to a positive errno when constructing io::Error (the standard convention for from_raw_os_error). The sync path in BlockIoError::is_eopnotsupp() correctly checks for positive libc::EOPNOTSUPP. If the sign is wrong, discard/write-zeroes EOPNOTSUPP caching never triggers for the async io_uring engine, and every unsupported request hits the generic error path instead.
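
Editor's sketch of the sign convention this comment relies on. `cqe_to_result` is a hypothetical stand-in for the Cqe wrapper; whether Firecracker's wrapper really negates the result is the reviewer's inference and is not confirmed on this page. The constant 95 is the Linux value of EOPNOTSUPP, hard-coded to avoid the `libc` crate.

```rust
use std::io;

/// Linux errno value for EOPNOTSUPP.
const EOPNOTSUPP: i32 = 95;

/// HYPOTHETICAL conversion of a raw io_uring CQE result: negative values are
/// negated errnos, so the sign is flipped before building the io::Error.
fn cqe_to_result(res: i32) -> io::Result<u32> {
    if res < 0 {
        Err(io::Error::from_raw_os_error(-res))
    } else {
        Ok(res as u32)
    }
}

/// raw_os_error() hands back exactly what from_raw_os_error() was given,
/// so the comparison must use the positive errno.
fn is_eopnotsupp(r: &io::Result<u32>) -> bool {
    matches!(r, Err(e) if e.raw_os_error() == Some(EOPNOTSUPP))
}
```

Under this convention a comparison against `Some(-EOPNOTSUPP)` can never match, which is the failure mode the comment describes.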


Reviewed by Cursor Bugbot for commit ab2399e. Configure here.

kalyazin and others added 3 commits May 12, 2026 10:02
Whenever free-page hinting is enabled, also advertise the new
VIRTIO_BALLOON_F_HINT_WAIT_ON_ACK feature bit (6). When negotiated,
the guest driver waits for the device to signal-used each hint buffer
before pushing the just-hinted page onto vb->free_page_list, closing
a stale-hint data-loss race where the shrinker could recycle a page
back to the buddy allocator before discard_range completed on the host.

Guests without kernel support for bit 6 simply do not negotiate it
(the driver self-clears the bit if VIRTIO_BALLOON_F_FREE_PAGE_HINT is
not also negotiated), so this is forward-compatible with stock guests.
No host-side protocol change is required: process_free_page_hinting_queue
already calls signal_used_queue once per drain, which serves as the
ACK the guest waits on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Adds a guest-side check that the negotiated balloon features in
/sys/bus/virtio/devices/virtioN/features include bit 3 (FREE_PAGE_HINT)
and bit 6 (HINT_WAIT_ON_ACK) when free_page_hinting is enabled.
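
Editor's sketch of the check this test performs, written in Rust rather than the test suite's Python. It assumes the sysfs `features` attribute is a string of '0'/'1' characters where the character at index i is feature bit i; the helper name is hypothetical.

```rust
/// Feature bit numbers as given in the commit message.
const VIRTIO_BALLOON_F_FREE_PAGE_HINT: usize = 3;
const VIRTIO_BALLOON_F_HINT_WAIT_ON_ACK: usize = 6;

/// ASSUMPTION: `features` is the content of
/// /sys/bus/virtio/devices/virtioN/features, one '0' or '1' per bit,
/// with bit 0 first.
fn has_feature(features: &str, bit: usize) -> bool {
    features.trim().as_bytes().get(bit) == Some(&b'1')
}

/// True when both balloon hinting bits were negotiated.
fn wait_on_ack_negotiated(features: &str) -> bool {
    has_feature(features, VIRTIO_BALLOON_F_FREE_PAGE_HINT)
        && has_feature(features, VIRTIO_BALLOON_F_HINT_WAIT_ON_ACK)
}
```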

The test is gated on a new dedicated marker, requires_patched_kernel,
which is registered in tests/pytest.ini and added to the default -m
exclusion filter so the test is auto-skipped by every CI run (regular
and nightly). To run it, replace the 6.1 artifact vmlinux with a build
that carries Jack Thomson's wait-on-ACK patch and invoke:

    tools/devtool -y test -- -m requires_patched_kernel \
        tests/integration_tests/functional/test_balloon_wait_on_ack.py

If the kernel is not patched, the bit-6 assertion fails with a clear
"did you replace the kernel?" message.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Add a subsection under free_page_hinting describing the behaviour of
VIRTIO_BALLOON_F_HINT_WAIT_ON_ACK: always advertised alongside FPH,
self-cleared by guests without the supporting kernel patch, no
separate config knob, and a note on the per-buffer round-trip cost on
supported guests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
@cla-bot

cla-bot Bot commented May 12, 2026

We require contributors to sign our Contributor License Agreement, and we don't have @ilstam, @ShadowCurse, @JackThomson2, @Manciukic, @zulinx86 on file. You can sign our CLA at https://e2b.dev/docs/cla . Once you've signed, post a comment here that says '@cla-bot check'

@djeebus

djeebus commented May 12, 2026

@cla-bot check

@cla-bot

cla-bot Bot commented May 12, 2026

We require contributors to sign our Contributor License Agreement, and we don't have @ilstam, @ShadowCurse, @JackThomson2, @Manciukic, @zulinx86 on file. You can sign our CLA at https://e2b.dev/docs/cla . Once you've signed, post a comment here that says '@cla-bot check'

@cla-bot cla-bot Bot removed the cla-signed label May 12, 2026
@cla-bot

cla-bot Bot commented May 12, 2026

The cla-bot has been summoned, and re-checked this pull request!

When saving the state of a microVM with one or more block devices backed
by the async IO engine, we need to take a few extra steps before
serializing the state to the disk, as we need to make sure that there
aren't any pending io_uring requests that have not been handled by the
kernel yet. For devices that need this, we call a
prepare_save() hook before serializing the device state.

If there are indeed pending requests, once we handle them we need to let
the guest know, by adding the corresponding VirtIO descriptors to the
used ring. Moreover, since we use notification suppression, this might
or might not require us to send an interrupt to the guest.

Now, when we save the state of a VirtIO device, we save the device
specific state **and** the transport (MMIO or PCI) state along with it.

There were a few issues with how we were doing the serialization:

1. We were saving the transport state before we ran the prepare_save()
   hook. The transport state includes information such as the
   `interrupt_status` in MMIO or `MSI-X config` in PCI. prepare_save()
   in the case of async IO might change this state, so running it
   after saving the transport state essentially loses information.
2. We were saving the device states after saving the KVM state. This is
   problematic because, if prepare_save() sends an interrupt to the
   guest, we don't save that "pending interrupt" bit of information in
   the snapshot.

These two issues were making microVMs with block devices backed by
async IO freeze in some cases post snapshot resume, since the guest is
stuck in the kernel waiting for a notification from the device
emulation which never arrives.

Currently, this is only a problem with virtio-block with the async IO
engine. The only other device using the prepare_save() hook is currently
virtio-net, but this one neither modifies any VirtIO state nor sends
interrupts.

Fix this by ensuring the correct ordering of operations during the
snapshot phase.
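
Editor's toy model of the ordering described above: run prepare_save() first, then capture the device and transport state, and capture the KVM state last. Every type and field here is hypothetical and greatly simplified; it only illustrates why the order matters, not how Firecracker implements it.

```rust
/// Stand-in for KVM state: only tracks whether an interrupt is pending.
#[derive(Default)]
struct Vm {
    irq_pending: bool,
}

/// Stand-in for a block device plus its transport state.
#[derive(Default)]
struct BlockDev {
    inflight: u32,
    interrupt_status: u32,
}

impl BlockDev {
    /// Draining in-flight requests may raise an interrupt, which changes
    /// both the transport state and the VM's interrupt state.
    fn prepare_save(&mut self, vm: &mut Vm) {
        if self.inflight > 0 {
            self.inflight = 0;
            self.interrupt_status |= 1;
            vm.irq_pending = true;
        }
    }
}

struct Snapshot {
    interrupt_status: u32,
    irq_pending: bool,
}

fn save(dev: &mut BlockDev, vm: &mut Vm) -> Snapshot {
    dev.prepare_save(vm); // 1. drain pending requests first
    let interrupt_status = dev.interrupt_status; // 2. device + transport state
    let irq_pending = vm.irq_pending; // 3. KVM state last
    Snapshot { interrupt_status, irq_pending }
}
```

Reading either field before step 1 would record the pre-drain value and drop the interrupt from the snapshot.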

Signed-off-by: Babis Chalios <bchalios@amazon.es>
(cherry picked from commit 67ba7a2)
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
@cla-bot

cla-bot Bot commented May 18, 2026

We require contributors to sign our Contributor License Agreement, and we don't have @ilstam, @ShadowCurse, @JackThomson2, @Manciukic, @zulinx86 on file. You can sign our CLA at https://e2b.dev/docs/cla . Once you've signed, post a comment here that says '@cla-bot check'

@kalyazin kalyazin force-pushed the firecracker-v1.14-direct-mem branch from c3d2d61 to 639196c Compare May 18, 2026 09:44
@cla-bot cla-bot Bot added the cla-signed label May 18, 2026
Comment thread src/vmm/src/rpc_interface.rs
ValentaTomas added a commit that referenced this pull request May 18, 2026
PR #8 picked up the upstream ordering fix as 639196c (cherry-pick of
67ba7a2), which closes:

- Bug 1 (Vmm::save_state KVM-state-before-device-state)
- Bug 2 (MMIO transport_state captured before prepare_save)
- Bug 3 (PCI transport_state captured before prepare_save)

Remove those sections entirely. Findings 4-10 keep their numbers
unchanged so external references stay stable. Re-pin all source links
from f0a35a1 to 639196c (the new HEAD). Refresh line numbers for
the items that shifted (block kick 219-228 -> 212-222, net kick
1062-1071 -> 1042-1052). Update cross-references that previously read
"Bug 1" / "Bugs 1-3" to refer to the upstream-fixed ordering bugs
instead. Per-branch backport table simplified to two columns
(ordering fix vs vsock companion); PR #8 row shows the ordering fix
applied and the vsock companion still missing.

The vsock companion 48a5ae3 is still not on PR #8, so Bug 9 remains
open. Findings 4-10 and P2-1..P2-8 are unchanged in substance.
bchalios added a commit that referenced this pull request May 19, 2026
Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
bchalios and others added 3 commits May 21, 2026 14:52
The previous commit (cd3fe9a) changed the signature of
ArchVm::get_dirty_bitmap() to take a page_size argument, but the
corresponding integration test was not updated to match this change.

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
GuestRegionMmapExt::discard_range() is used to deallocate guest memory
that we don't use any more, for example when we use balloon inflation or
free page reporting/hinting. There is the implicit requirement that the
range we are discarding is aligned (both starting address and length) to
the page size used to back the guest memory.

If this alignment is not respected by the caller, we can end up with
undefined behaviour. For example, if we use huge pages to back memory
but we receive from the guest regions to discard that are 4K pages
aligned, we might end up removing memory that we are not meant to.

This currently doesn't happen but the requirement is not explicitly
encoded in the type system. Add a check for these requirements and
return an error when they are not met. This way, we can't shoot
ourselves in the foot in the future.
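
Editor's sketch of such an alignment check. The error type and function name are hypothetical and do not reproduce the PR's actual signature; the page size is assumed to be a nonzero power of two.

```rust
#[derive(Debug, PartialEq)]
enum DiscardError {
    /// Start address or length is not a multiple of the backing page size.
    Unaligned { addr: u64, len: u64, page_size: u64 },
}

/// Reject ranges whose start or length is not aligned to `page_size`.
fn check_discard_alignment(addr: u64, len: u64, page_size: u64) -> Result<(), DiscardError> {
    // ASSUMPTION: page_size is a nonzero power of two (4 KiB, 2 MiB, ...).
    debug_assert!(page_size.is_power_of_two());
    let mask = page_size - 1;
    if (addr & mask) != 0 || (len & mask) != 0 {
        return Err(DiscardError::Unaligned { addr, len, page_size });
    }
    Ok(())
}
```

For example, a 4 KiB-aligned range is rejected when the guest memory is backed by 2 MiB huge pages, which is the case the commit message warns about.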

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Use fallocate(PUNCH_HOLE|KEEP_SIZE) for MAP_SHARED file-backed guest
memory so memfd-backed balloon hinting/reporting clears the shared
backing instead of only dropping PTEs.

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
{
    "syscall": "fallocate",
    "comment": "Used by the block device for VIRTIO_BLK_F_DISCARD (FALLOC_FL_PUNCH_HOLE)"
},

Missing pread64 syscall in aarch64 seccomp filter

High Severity

The pread64 syscall is added to the x86_64 seccomp filter but not to the aarch64 filter. The new PagemapReader (used by get_dirty_memory) reads from /proc/self/pagemap at specific offsets, which requires pread64 on both architectures. On aarch64, calling this new API endpoint will trigger a seccomp violation and kill the VMM process.
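
Editor's sketch of why a positioned read is involved. Each pagemap entry is 8 bytes, indexed by virtual page number, so a reader seeks to `(vaddr / page_size) * 8`. Rust's `FileExt::read_exact_at` performs a pread-style read that leaves the file cursor untouched. The function names are hypothetical and this is not the PR's PagemapReader.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;

/// Size of one pagemap entry in bytes.
const PAGEMAP_ENTRY_SIZE: u64 = 8;

/// Byte offset of the entry describing the page that contains `vaddr`.
fn pagemap_offset(vaddr: u64, page_size: u64) -> u64 {
    (vaddr / page_size) * PAGEMAP_ENTRY_SIZE
}

/// Read one 64-bit entry with a positioned read (no seek, no shared cursor).
fn read_pagemap_entry(pagemap: &File, vaddr: u64, page_size: u64) -> io::Result<u64> {
    let mut buf = [0u8; 8];
    pagemap.read_exact_at(&mut buf, pagemap_offset(vaddr, page_size))?;
    Ok(u64::from_ne_bytes(buf))
}
```

In practice `pagemap` would be an open handle to /proc/self/pagemap; any seccomp filter in force has to allow the underlying positioned-read syscall on every architecture that serves the endpoint.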


Reviewed by Cursor Bugbot for commit bd85e43. Configure here.

bchalios added 2 commits May 26, 2026 17:01
io_uring_enter() might return EINTR when called with
IORING_ENTER_GETEVENTS. Make the submit() call a bit more robust by
retrying when we observe this error.

Retry 3 times. This is a semi-arbitrary choice. The assumption is that
if an interrupt arrives, a subsequent call to the system call will most
likely succeed. If we keep receiving interrupts, something is more
severely broken, so propagate the error to the caller.
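
Editor's sketch of a bounded EINTR retry loop. The helper is hypothetical and generic over any fallible operation; it reads "retry 3 times" as three retries after the first attempt, which is an interpretation of the commit message rather than a confirmed detail.

```rust
use std::io;

/// ASSUMPTION: "retry 3 times" means up to 3 retries after the first call.
const MAX_RETRIES: usize = 3;

/// Call `op`, retrying while it fails with EINTR, and give up after
/// MAX_RETRIES retries by returning the last error to the caller.
fn retry_on_eintr<T>(mut op: impl FnMut() -> io::Result<T>) -> io::Result<T> {
    let mut retries = 0;
    loop {
        match op() {
            Err(e) if e.kind() == io::ErrorKind::Interrupted && retries < MAX_RETRIES => {
                retries += 1;
            }
            // Success, a non-EINTR error, or EINTR after the retry budget.
            other => return other,
        }
    }
}
```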

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
If prepare_save() fails to drain the io_uring queues (when used) and
sync the host filesystem we might end up with a corrupted disk snapshot.
Currently, Firecracker ignores that, only emitting an error message.

Be more strict and expect no errors, so that we can have a better
post-mortem analysis of what happened.

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Comment thread src/vmm/src/lib.rs
Comment thread resources/seccomp/x86_64-unknown-linux-musl.json
kalyazin added 2 commits May 26, 2026 16:04
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Replace fallocate(PUNCH_HOLE) with madvise(MADV_REMOVE) for the
memfd-backed (MAP_SHARED) memory discard path.

The critical difference is that madvise(MADV_REMOVE) calls
userfaultfd_remove() on the VMA before issuing the fallocate, which
delivers a UFFD_EVENT_REMOVE to any userfaultfd registered on that VMA.
fallocate(PUNCH_HOLE) called directly on the file descriptor does not go
through this path and produces no uffd event. Without the event, a uffd
handler cannot learn that the pages have been freed and may serve stale
data on subsequent faults in the discarded range.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 5 total unresolved issues (including 3 from previous reviews).

Reviewed by Cursor Bugbot for commit 431f1fc. Configure here.

    let entry = PagemapEntry::from_bytes(entry_bytes);

    // Page must be present and the write_protected bit cleared (indicating it was written to)
    Ok(entry.is_present() && !entry.is_write_protected())

Dirty API marks all resident pages

Medium Severity

GET /memory/dirty treats a page as dirty when pagemap shows it present and the UFFD write-protected bit is clear. Without UFFD write protection, that bit is typically unset for normal RAM, so every resident page is reported dirty instead of only pages written since the last snapshot.


Reviewed by Cursor Bugbot for commit 431f1fc. Configure here.

    self.disk
        .file_engine
        .drain_and_flush(discard)
        .expect("virtio-block: failed to drain ops and flush block data");

Snapshot flush failure panics VMM

Medium Severity

During prepare_save, virtio-block now calls drain_and_flush with .expect(...). Any drain or flush error aborts the whole Firecracker process instead of returning a snapshot error, so a transient I/O failure while creating a snapshot becomes a hard crash.
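
Editor's sketch of the alternative this comment suggests: propagate the flush failure as a snapshot error instead of panicking. All names are hypothetical stand-ins. Note that the commit under review chose `.expect(...)` deliberately to aid post-mortem analysis, so this shows the reviewer's proposal, not the PR's behavior.

```rust
use std::io;

/// HYPOTHETICAL error type for snapshot creation.
#[derive(Debug)]
enum SnapshotError {
    BlockFlush(io::Error),
}

/// Stand-in for the file engine's drain_and_flush(); the flag only exists
/// so the sketch can exercise both outcomes.
fn drain_and_flush(simulate_failure: bool) -> io::Result<()> {
    if simulate_failure {
        Err(io::Error::new(io::ErrorKind::Other, "simulated flush failure"))
    } else {
        Ok(())
    }
}

/// Return the failure to the caller so the snapshot request can fail
/// cleanly while the VMM process keeps running.
fn prepare_save(simulate_failure: bool) -> Result<(), SnapshotError> {
    drain_and_flush(simulate_failure).map_err(SnapshotError::BlockFlush)
}
```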


Reviewed by Cursor Bugbot for commit 431f1fc. Configure here.


Labels

cla-signed wontfix This will not be worked on

4 participants