[v1.14] Expose memory mapping & dirty pages; Make memfile dump optional #8
bchalios wants to merge 42 commits into
Add a few APIs to get information about guest memory:
* An endpoint for guest memory mappings (guest physical to host virtual).
* An endpoint for resident and empty pages.
* An endpoint for dirty pages.
Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
There are cases where a user might want to snapshot the memory of a VM externally. In these cases, we can ask Firecracker to avoid serializing the memory file to disk when we create a snapshot. Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Implement API /memory/mappings which returns the memory mappings of guest physical to host virtual memory. Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Implement API /memory which returns two bitmaps: resident and empty. `resident` tracks whether a guest page is in the resident set and `empty` tracks whether it's actually all 0s. Both bitmaps are structured as vectors of u64, so their length is: total_number_of_pages.div_ceil(64). Pages are ordered in the order of pages as reported by /memory/mappings. Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
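A minimal sketch of how a client might size and query such a bitmap. The bit order inside each u64 (least-significant bit first) is an assumption, since the commit message does not state it, and `bitmap_len` and `page_is_set` are hypothetical helper names.

```rust
/// Number of u64 words needed to hold one bit per guest page.
fn bitmap_len(total_number_of_pages: usize) -> usize {
    total_number_of_pages.div_ceil(64)
}

/// Returns whether the bit for `page` is set in `bitmap`.
/// ASSUMPTION: page `i` lives in word `i / 64` at bit `i % 64`, LSB first.
/// Pages beyond the end of the bitmap are reported as not set.
fn page_is_set(bitmap: &[u64], page: usize) -> bool {
    bitmap
        .get(page / 64)
        .map_or(false, |word| (word >> (page % 64)) & 1 == 1)
}
```

The page index here is the position of the page in the order reported by /memory/mappings, so a client would need the mappings to translate an index back to a guest physical address.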
Implement API /memory/dirty which returns a bitmap tracking dirty guest memory. The bitmap is structured as a vector of u64, so its length is: total_number_of_pages.div_ceil(64). Pages are ordered in the order of pages as reported by /memory/mappings. Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
UFFD provides an API to enable write protection for memory ranges tracked by a userfault file descriptor. Detailed information can be found here: https://docs.kernel.org/admin-guide/mm/userfaultfd.html. To use the feature, users need to register the memory region with UFFDIO_REGISTER_MODE_WP. Then, users need to explicitly enable write protection for sub-ranges of the registered region. Writes to pages within write-protected memory ranges can be handled in one of two ways. In synchronous mode, a write to a protected page causes the kernel to send a write-protection event over the userfaultfd. In asynchronous mode, the kernel automatically handles writes to protected pages by clearing the write-protection bit. Userspace can later observe the write-protection bit by looking at the corresponding entry of /proc/<pid>/pagemap. This commit unconditionally enables write protection for guest memory using the asynchronous mode. !NOTE!: asynchronous write protection requires (host) kernel version 6.7 or later. Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
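As a rough illustration of the pagemap lookup described above, the sketch below decodes a single 64-bit pagemap entry. The bit positions (63 for "present", 57 for "uffd-wp write-protected") follow the kernel's pagemap documentation; the function names are hypothetical and this is not the PR's `PagemapReader`.

```rust
/// Bit 63 of a /proc/<pid>/pagemap entry: the page is present in RAM.
const PM_PRESENT: u64 = 1 << 63;
/// Bit 57 of a pagemap entry: the PTE is uffd-wp write-protected.
const PM_UFFD_WP: u64 = 1 << 57;

/// Byte offset inside the pagemap file of the entry for `vaddr`.
/// Each virtual page has one 8-byte entry. Assumes `page_size` is non-zero.
fn pagemap_offset(vaddr: u64, page_size: u64) -> u64 {
    (vaddr / page_size) * 8
}

/// With asynchronous write protection, the kernel clears the WP bit on the
/// first write, so "present and no longer write-protected" means "written to".
fn is_dirty(entry: u64) -> bool {
    entry & PM_PRESENT != 0 && entry & PM_UFFD_WP == 0
}
```

A reader would `pread` eight bytes at `pagemap_offset(..)`, interpret them as a native-endian u64, and pass the value to `is_dirty`. Note that this rule is only meaningful for ranges where write protection was actually armed beforehand.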
This is an optional test on the Firecracker side and most of the time it's ignored (when valid dependency changes happen). Having this fail blocks our fc-versions releases. Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
TODO Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Add descriptions for MicrovmState from previous Firecracker versions. Moreover, add methods to translate a snapshot file from previous versions to the current one. Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Now that we have logic for translating snapshot formats, we can allow the /snapshot/load API to parse v1.10 and v1.12 snapshots. We change the logic that parses the snapshot file to first read the version from the file and then (if needed) translate it to the expected v1.14 version. The older versions currently supported are v1.10 and v1.12. Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
The changes we made to support older snapshot formats did not compile on ARM systems. Fix the compilation issues, which were mainly bad re-exports. Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
PR Summary
High Risk
Overview
Snapshots:
Virtio block: writable drives always advertise DISCARD and WRITE_ZEROES (
Also documents block TRIM/write-zeroes and balloon hint-ACK behavior; tightens virtio
Reviewed by Cursor Bugbot for commit 431f1fc.
Add docs/api_requests/block-write-zeroes.md describing:
- automatic advertisement on writable devices
- UNMAP=0 → FALLOC_FL_ZERO_RANGE (zeros in place)
- UNMAP=1 → FALLOC_FL_PUNCH_HOLE (zeros + deallocate)
- host filesystem requirements
- EOPNOTSUPP fallback (silent VIRTIO_BLK_S_UNSUPP, shared cache)
- known limitations

Remove the "write_zeroes is not supported" line from block-discard.md now that the feature is implemented.
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
We require contributors to sign our Contributor License Agreement, and we don't have @ilstam, @ShadowCurse, @JackThomson2, @Manciukic, @zulinx86 on file. You can sign our CLA at https://e2b.dev/docs/cla . Once you've signed, post a comment here that says '@cla-bot check'
let is_eopnotsupp = matches!(
    &cqe_result,
    Err(e) if e.raw_os_error() == Some(-libc::EOPNOTSUPP)
);
Async CQE EOPNOTSUPP check uses wrong errno sign
High Severity
The async completion path checks e.raw_os_error() == Some(-libc::EOPNOTSUPP) (negative), but the Cqe wrapper almost certainly converts the negative CQE result to a positive errno when constructing io::Error (the standard convention for from_raw_os_error). The sync path in BlockIoError::is_eopnotsupp() correctly checks for positive libc::EOPNOTSUPP. If the sign is wrong, discard/write-zeroes EOPNOTSUPP caching never triggers for the async io_uring engine, and every unsupported request hits the generic error path instead.
Reviewed by Cursor Bugbot for commit ab2399e.
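The sign convention at issue can be checked in isolation. The sketch below assumes a hypothetical `cqe_result` helper that turns a raw io_uring completion result (negative errno on failure) into an `io::Result`; the real `Cqe` wrapper may differ. `EOPNOTSUPP` is hard-coded to its Linux value so the example does not need the `libc` crate.

```rust
use std::io;

/// EOPNOTSUPP on Linux, hard-coded to avoid a `libc` dependency in this sketch.
const EOPNOTSUPP: i32 = 95;

/// Hypothetical conversion of a raw CQE result into an io::Result.
/// The kernel reports failures as a negative errno, while
/// `io::Error::from_raw_os_error` expects the positive value.
fn cqe_result(res: i32) -> io::Result<u32> {
    if res < 0 {
        Err(io::Error::from_raw_os_error(-res))
    } else {
        Ok(res as u32)
    }
}

/// Matches against the positive errno, as the sync path is described as doing.
fn is_eopnotsupp(result: &io::Result<u32>) -> bool {
    matches!(result, Err(e) if e.raw_os_error() == Some(EOPNOTSUPP))
}
```

If the wrapper negates the result as above, a comparison against `Some(-EOPNOTSUPP)` can never match, which is the failure mode the review comment describes.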
Whenever free-page hinting is enabled, also advertise the new VIRTIO_BALLOON_F_HINT_WAIT_ON_ACK feature bit (6). When negotiated, the guest driver waits for the device to signal-used each hint buffer before pushing the just-hinted page onto vb->free_page_list, closing a stale-hint data-loss race where the shrinker could recycle a page back to the buddy allocator before discard_range completed on the host. Guests without kernel support for bit 6 simply do not negotiate it (the driver self-clears the bit if VIRTIO_BALLOON_F_FREE_PAGE_HINT is not also negotiated), so this is forward-compatible with stock guests. No host-side protocol change is required: process_free_page_hinting_queue already calls signal_used_queue once per drain, which serves as the ACK the guest waits on. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Adds a guest-side check that the negotiated balloon features in
/sys/bus/virtio/devices/virtioN/features include bit 3 (FREE_PAGE_HINT)
and bit 6 (HINT_WAIT_ON_ACK) when free_page_hinting is enabled.
The test is gated on a new dedicated marker, requires_patched_kernel,
which is registered in tests/pytest.ini and added to the default -m
exclusion filter so the test is auto-skipped by every CI run (regular
and nightly). To run it, replace the 6.1 artifact vmlinux with a build
that carries Jack Thomson's wait-on-ACK patch and invoke:
tools/devtool -y test -- -m requires_patched_kernel \
tests/integration_tests/functional/test_balloon_wait_on_ack.py
If the kernel is not patched, the bit-6 assertion fails with a clear
"did you replace the kernel?" message.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
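The guest-side check described above boils down to reading two positions of the sysfs `features` string. A minimal sketch follows, assuming the string holds one '0'/'1' character per feature bit with bit 0 first, as the Linux virtio core prints it; the helper names are hypothetical and this is not the PR's integration test, which lives in the Python test suite.

```rust
/// Feature bit numbers as stated in the commit messages above.
const VIRTIO_BALLOON_F_FREE_PAGE_HINT: usize = 3;
const VIRTIO_BALLOON_F_HINT_WAIT_ON_ACK: usize = 6;

/// Returns whether `bit` is set in a sysfs features string.
/// ASSUMPTION: character `i` of the string corresponds to feature bit `i`.
fn has_feature(features: &str, bit: usize) -> bool {
    features.trim().as_bytes().get(bit) == Some(&b'1')
}

/// True when both free-page hinting and wait-on-ACK were negotiated.
fn hinting_with_ack_negotiated(features: &str) -> bool {
    has_feature(features, VIRTIO_BALLOON_F_FREE_PAGE_HINT)
        && has_feature(features, VIRTIO_BALLOON_F_HINT_WAIT_ON_ACK)
}
```

On an unpatched guest kernel, bit 3 would be set and bit 6 clear, so `hinting_with_ack_negotiated` returns false.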
Add a subsection under free_page_hinting describing the behaviour of VIRTIO_BALLOON_F_HINT_WAIT_ON_ACK: always advertised alongside FPH, self-cleared by guests without the supporting kernel patch, no separate config knob, and a note on the per-buffer round-trip cost on supported guests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
@cla-bot check
The cla-bot has been summoned, and re-checked this pull request!
When saving the state of a microVM with one or more block devices backed by the async IO engine, we need to take a few extra steps before serializing the state to disk, because we need to make sure that there aren't any pending io_uring requests that have not been handled by the kernel yet. For devices that need this, we call a prepare_save() hook before serializing the device state. If there are indeed pending requests, once we handle them we need to let the guest know by adding the corresponding VirtIO descriptors to the used ring. Moreover, since we use notification suppression, this might or might not require us to send an interrupt to the guest.

Now, when we save the state of a VirtIO device, we save the device-specific state **and** the transport (MMIO or PCI) state along with it. There were a few issues with how we were doing the serialization:

1. We were saving the transport state before running the prepare_save() hook. The transport state includes information such as the `interrupt_status` in MMIO or `MSI-X config` in PCI. prepare_save() in the case of async IO might change this state, so running it after saving the transport state essentially loses information.
2. We were saving the device states after saving the KVM state. This is problematic because, if prepare_save() sends an interrupt to the guest, we don't save that "pending interrupt" bit of information in the snapshot.

These two issues were making microVMs with block devices backed by async IO freeze in some cases after snapshot resume, since the guest is stuck in the kernel waiting for a notification from the device emulation which never arrives.

Currently, this is only a problem with virtio-block with the async IO engine. The only other device using the prepare_save() hook is currently virtio-net, but this one neither modifies any VirtIO state nor sends interrupts.

Fix this by ensuring the correct ordering of operations during the snapshot phase.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
(cherry picked from commit 67ba7a2)
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
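The first ordering issue can be illustrated with a toy model. Everything below is hypothetical (the `Device` and `DeviceSnapshot` types, the field names, and the use of bit 0 as the "used buffer" interrupt flag are stand-ins, not Firecracker's types); it only shows why the transport state must be captured after the hook runs.

```rust
/// Toy VirtIO device behind an MMIO-like transport (all names hypothetical).
struct Device {
    pending_requests: u32,
    used_ring_len: u32,
    /// Transport state: bit 0 stands in for a pending used-buffer interrupt.
    interrupt_status: u32,
}

struct DeviceSnapshot {
    used_ring_len: u32,
    interrupt_status: u32,
}

impl Device {
    /// Drains in-flight requests. Completing them grows the used ring and
    /// raises an interrupt, so both device and transport state change.
    fn prepare_save(&mut self) {
        if self.pending_requests > 0 {
            self.used_ring_len += self.pending_requests;
            self.pending_requests = 0;
            self.interrupt_status |= 0x1;
        }
    }

    /// Buggy ordering: the transport state is read before the hook runs,
    /// so the interrupt raised by `prepare_save` is lost.
    fn save_transport_first(&mut self) -> DeviceSnapshot {
        let interrupt_status = self.interrupt_status;
        self.prepare_save();
        DeviceSnapshot { used_ring_len: self.used_ring_len, interrupt_status }
    }

    /// Fixed ordering: run the hook, then capture device and transport state.
    fn save(&mut self) -> DeviceSnapshot {
        self.prepare_save();
        DeviceSnapshot {
            used_ring_len: self.used_ring_len,
            interrupt_status: self.interrupt_status,
        }
    }
}
```

With the buggy ordering, the snapshot records completed descriptors in the used ring but no pending interrupt, which is the combination that leaves a resumed guest waiting forever. The second issue follows the same logic one level up: device state, including anything `prepare_save` triggers, has to be settled before the KVM state is captured.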
PR #8 picked up the upstream ordering fix as 639196c (cherry-pick of 67ba7a2), which closes:
- Bug 1 (Vmm::save_state KVM-state-before-device-state)
- Bug 2 (MMIO transport_state captured before prepare_save)
- Bug 3 (PCI transport_state captured before prepare_save)

Remove those sections entirely. Findings 4-10 keep their numbers unchanged so external references stay stable.

Re-pin all source links from f0a35a1 to 639196c (the new HEAD). Refresh line numbers for the items that shifted (block kick 219-228 -> 212-222, net kick 1062-1071 -> 1042-1052). Update cross-references that previously read "Bug 1" / "Bugs 1-3" to refer to the upstream-fixed ordering bugs instead.

Per-branch backport table simplified to two columns (ordering fix vs vsock companion); PR #8 row shows the ordering fix applied and the vsock companion still missing.

The vsock companion 48a5ae3 is still not on PR #8, so Bug 9 remains open. Findings 4-10 and P2-1..P2-8 are unchanged in substance.
A previous commit (cd3fe9a) changed the signature of ArchVm::get_dirty_bitmap() to take a page_size argument, but the corresponding integration test was not updated to match this change. Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
GuestRegionMmapExt::discard_range() is used to deallocate guest memory that we don't use any more, for example when we use balloon inflation or free page reporting/hinting. There is an implicit requirement that the range we are discarding is aligned (both starting address and length) to the page size used to back the guest memory. If this alignment is not respected by the caller, we can end up with undefined behaviour. For example, if we use huge pages to back memory but the guest sends us regions to discard that are only 4K-aligned, we might end up removing memory that we are not meant to. This currently doesn't happen, but the requirement is not explicitly encoded in the type system. Add a check for these requirements and return an error when they are not met. This way, we can't shoot ourselves in the foot in the future. Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
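A minimal sketch of the kind of alignment check described above. The `DiscardError` type and the `check_discard_range` function are hypothetical names for illustration, not the PR's actual code.

```rust
/// Hypothetical error type for a rejected discard request.
#[derive(Debug, PartialEq)]
enum DiscardError {
    Unaligned { addr: u64, len: u64, page_size: u64 },
}

/// Rejects ranges whose start or length is not a multiple of the page size
/// backing guest memory. Assumes `page_size` is non-zero.
fn check_discard_range(addr: u64, len: u64, page_size: u64) -> Result<(), DiscardError> {
    if addr % page_size != 0 || len % page_size != 0 {
        return Err(DiscardError::Unaligned { addr, len, page_size });
    }
    Ok(())
}
```

For example, a 4 KiB-aligned request against 2 MiB huge pages (`check_discard_range(0x1000, 0x1000, 0x20_0000)`) is rejected instead of discarding a whole huge page the guest never offered.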
Use fallocate(PUNCH_HOLE|KEEP_SIZE) for MAP_SHARED file-backed guest memory so memfd-backed balloon hinting/reporting clears the shared backing instead of only dropping PTEs. Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
{
    "syscall": "fallocate",
    "comment": "Used by the block device for VIRTIO_BLK_F_DISCARD (FALLOC_FL_PUNCH_HOLE)"
},
Missing pread64 syscall in aarch64 seccomp filter
High Severity
The pread64 syscall is added to the x86_64 seccomp filter but not to the aarch64 filter. The new PagemapReader (used by get_dirty_memory) reads from /proc/self/pagemap at specific offsets, which requires pread64 on both architectures. On aarch64, calling this new API endpoint will trigger a seccomp violation and kill the VMM process.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit bd85e43.
io_uring_enter() might return with an EINTR when called with IORING_ENTER_GETEVENTS. Make the submit() call a bit more robust by retrying when we observe this error. Retry 3 times. This is a semi-arbitrary choice. The assumption is that if an interrupt arrives, a subsequent call to the system call should most likely succeed. If we keep receiving interrupts, something is more severely broken, so propagate the error to the caller. Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
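The retry policy can be sketched as a small generic helper. This is an illustration under assumptions, not the PR's `submit()` code: `retry_on_eintr` is a hypothetical name, and "retry 3 times" is read here as one initial attempt plus up to three retries.

```rust
use std::io;

/// Maximum number of retries after the initial attempt (assumed reading of
/// "Retry 3 times").
const MAX_EINTR_RETRIES: usize = 3;

/// Runs `op`, retrying while it fails with EINTR, up to MAX_EINTR_RETRIES
/// times. Any other outcome, or EINTR after the last retry, is returned
/// to the caller unchanged.
fn retry_on_eintr<T>(mut op: impl FnMut() -> io::Result<T>) -> io::Result<T> {
    let mut retries = 0;
    loop {
        match op() {
            Err(e) if e.kind() == io::ErrorKind::Interrupted && retries < MAX_EINTR_RETRIES => {
                retries += 1;
            }
            other => return other,
        }
    }
}
```

A caller would wrap the syscall in a closure, for example `retry_on_eintr(|| ring.submit())` for some hypothetical `ring` object whose `submit` returns an `io::Result`.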
If prepare_save() fails to drain the io_uring queues (when used) and sync the host filesystem we might end up with a corrupted disk snapshot. Currently, Firecracker ignores that, only emitting an error message. Be more strict and expect no errors, so that we can have a better post-mortem analysis of what happened. Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Replace fallocate(PUNCH_HOLE) with madvise(MADV_REMOVE) for the memfd-backed (MAP_SHARED) memory discard path. The critical difference is that madvise(MADV_REMOVE) calls userfaultfd_remove() on the VMA before issuing the fallocate, which delivers a UFFD_EVENT_REMOVE to any userfaultfd registered on that VMA. fallocate(PUNCH_HOLE) called directly on the file descriptor does not go through this path and produces no uffd event. Without the event, a uffd handler cannot learn that the pages have been freed and may serve stale data on subsequent faults in the discarded range. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Cursor Bugbot has reviewed your changes and found 2 potential issues.
There are 5 total unresolved issues (including 3 from previous reviews).
Reviewed by Cursor Bugbot for commit 431f1fc.
let entry = PagemapEntry::from_bytes(entry_bytes);

// Page must be present and the write_protected bit cleared (indicating it was written to)
Ok(entry.is_present() && !entry.is_write_protected())
Dirty API marks all resident pages
Medium Severity
GET /memory/dirty treats a page as dirty when pagemap shows it present and the UFFD write-protected bit is clear. Without UFFD write protection, that bit is typically unset for normal RAM, so every resident page is reported dirty instead of only pages written since the last snapshot.
Reviewed by Cursor Bugbot for commit 431f1fc.
self.disk
    .file_engine
    .drain_and_flush(discard)
    .expect("virtio-block: failed to drain ops and flush block data");
Snapshot flush failure panics VMM
Medium Severity
During prepare_save, virtio-block now calls drain_and_flush with .expect(...). Any drain or flush error aborts the whole Firecracker process instead of returning a snapshot error, so a transient I/O failure while creating a snapshot becomes a hard crash.
Reviewed by Cursor Bugbot for commit 431f1fc.


WIP
License Acceptance
By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.
PR Checklist
tools/devtool checkbuild --all to verify that the PR passes build checks on all supported architectures.
tools/devtool checkstyle to verify that the PR passes the automated style checks.
how they are solving the problem in a clear and encompassing way.
in the PR.
CHANGELOG.md.
Runbook for Firecracker API changes.
integration tests.
TODO.
rust-vmm.