Honor MAP_SHARED file-backed semantics #13

Merged
jserv merged 1 commit into main from mmap
May 7, 2026

@jserv jserv commented May 7, 2026

sys_mmap previously pread-snapshotted file contents into private guest pages and emulated write-back via msync's pwrite-the-diff path. The guest never saw concurrent host writes through its mapping, and its own writes only landed on the file at msync time. Database WAL, lock files, and any cross-process file-sharing protocol were silently broken.

Install a real host mmap(MAP_FIXED|MAP_SHARED, fd) overlay on top of the guest slab so the kernel page cache keeps the mapping coherent with the file (and with peer overlays of the same inode). HVF requires hv_vm_unmap to target an exactly-previously-mapped range and rejects sub-range unmap with HV_BAD_ARGUMENT, so the slab is now tracked as a sorted list of 2 MiB-aligned hvf_segment_t entries (guest.h). Each overlay request splits the containing segment, hv_vm_unmaps it, lays down the host file mmap at the exact host VA, and hv_vm_maps the segment back so HVF re-walks the host page tables. Sibling vCPUs are quiesced via thread_quiesce_siblings during the brief stage-2 window so concurrent guest accesses cannot fault on the temporarily-unmapped IPA. HVF resolves stage-2 at sub-2 MiB granularity within a mapped segment, so a 4 KiB lock-file overlay inside a 2 MiB segment is honored without per-page hv_vm_map calls (empirically verified during implementation).

Apple Silicon enforces 16 KiB host pages: mmap MAP_FIXED requires the addr and offset to be 16 KiB-aligned. Two paths handle the gap: the gap-finder hint advances to the next host-page boundary after each allocation so sequential mmaps stay overlay-eligible, and find_free_gap now aligns to host_page_size_cached() rather than the guest 4 KiB page when stepping past a region. Misaligned MAP_FIXED requests fall back to the snapshot pread path, so mremap-style guest-supplied addresses that the host kernel cannot honor still produce correct behaviour through the legacy emulation.
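A minimal sketch of the alignment checks described above, assuming a power-of-two host page size. `host_page_size_sketch`, `align_up_to`, and `overlay_eligible` are hypothetical names chosen for illustration; only host_page_size_cached() is named in this change, and its implementation is not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical stand-in for host_page_size_cached(). */
static uint64_t host_page_size_sketch(void)
{
    static uint64_t cached;
    if (cached == 0) {
        long ps = sysconf(_SC_PAGESIZE);
        cached = ps > 0 ? (uint64_t)ps : 4096;
    }
    return cached;
}

/* Round x up to the next multiple of align (align must be a power of two). */
static uint64_t align_up_to(uint64_t x, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (x + align - 1) & ~(align - 1);
}

/*
 * A request can use the host overlay only if both the target address and
 * the file offset sit on a host-page boundary; otherwise the caller is
 * expected to fall back to the snapshot path.
 */
static bool overlay_eligible(uint64_t host_va, uint64_t file_off,
                             uint64_t host_page)
{
    return ((host_va | file_off) & (host_page - 1)) == 0;
}
```

With a 16 KiB host page, an allocation ending at 0x11000 would move the next hint to `align_up_to(0x11000, 16384)`, which is 0x14000, so the following mapping starts on its own host page.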

guest_region_t carries new overlay_active / overlay_start / overlay_end fields so msync collapses to a plain fsync for overlay regions (the kernel page cache already keeps them coherent), the snapshot-style refresh-from-file alias pass is skipped for overlay peers, and MADV_DONTNEED is a no-op for overlay regions (the existing memset+pread reset would have written zeros straight into the file via the overlay). The metadata is clipped through every region split / trim site in src/core/guest.c so the overlay bounds always match the host-page-aligned region bounds. cleanup_overlays_in_range now returns int and propagates -EIO; metadata is cleared only after per-overlay host-VA tear-down succeeds so a partial failure does not leave the runtime believing an overlay is gone while the host mmap is still live (which would otherwise let a later memset write zeros into the user file).
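The ordering rule for tear-down (clear metadata only after the host mapping is really gone) can be sketched as follows. The struct is a hypothetical reduction of guest_region_t to the three overlay fields, and the tear-down callback plus `stub_teardown` are assumptions standing in for hvf_remove_file_overlay, whose signature is not shown in this change.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduction of guest_region_t to its overlay metadata. */
typedef struct {
    bool overlay_active;
    uint64_t overlay_start;
    uint64_t overlay_end;
} region_sketch_t;

typedef int (*teardown_fn)(uint64_t start, uint64_t end);

/*
 * Tear down every overlay intersecting [start, end). Metadata is cleared
 * only when the host tear-down succeeded; on failure the region keeps
 * overlay_active so later code still knows the file mapping is live.
 */
static int cleanup_overlays_sketch(region_sketch_t *regions, size_t n,
                                   uint64_t start, uint64_t end,
                                   teardown_fn teardown)
{
    int rc = 0;
    for (size_t i = 0; i < n; i++) {
        region_sketch_t *r = &regions[i];
        if (!r->overlay_active ||
            r->overlay_end <= start || r->overlay_start >= end)
            continue;
        if (teardown(r->overlay_start, r->overlay_end) != 0) {
            rc = -EIO; /* keep metadata: the host mmap is still in place */
            continue;
        }
        r->overlay_active = false;
        r->overlay_start = r->overlay_end = 0;
    }
    return rc;
}

/* Demo stub: pretends tear-down fails for overlays at or above 2 MiB. */
static int stub_teardown(uint64_t start, uint64_t end)
{
    (void)end;
    return start >= 0x200000 ? -1 : 0;
}
```

Clearing the flag first, as the old code did, would let a later memset treat the range as slab memory while the file mapping is still installed.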

The CoW fork path syncs each live overlay region back into shm_fd before sending shm_fd over SCM_RIGHTS so the child's MAP_PRIVATE snapshot reflects the parent's view at fork time, and the child's fork-state restore demotes every inherited overlay flag to the snapshot path (live cross-fork MAP_SHARED coherence is the next P1 TODO item, deliberately deferred). The sync-back loop now treats a pwrite that returns 0 while bytes remain as -EIO instead of spinning.
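A write-all loop with the zero-return guard might look like the sketch below. `pwrite_all` is a hypothetical helper name, and the EINTR retry is an assumption about reasonable behaviour rather than something stated in this change.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write exactly len bytes at offset off. Returns 0 on success or a
 * negative errno value. A return of 0 from pwrite while bytes remain is
 * reported as -EIO so the loop cannot spin without making progress.
 */
static int pwrite_all(int fd, const void *buf, size_t len, off_t off)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n < 0) {
            if (errno == EINTR)
                continue; /* interrupted before any byte was written */
            return -errno;
        }
        if (n == 0)
            return -EIO;
        p += n;
        len -= (size_t)n;
        off += n;
    }
    return 0;
}
```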

Multi-model review (Gemini, Codex) closed six issues in the same change:

1. sys_mmap rollback when guest_region_add_ex_owned failed after the host overlay succeeded left the file mmap'd at host_base+ipa with no region tracking, so a later operation in that range would memset zeros directly into the user's file; the failure path now calls hvf_remove_file_overlay before returning -ENOMEM.
2. The final hv_vm_map failure path in hvf_apply_file_overlay restored slab backing but never re-issued hv_vm_map for the segment, so sibling vCPUs would page-fault on that IPA after thread_resume_siblings; the rollback now re-establishes the segment before returning.
3. cleanup_overlays_in_range cleared overlay_active before calling hvf_remove_file_overlay and ignored its failure (described above).
4. fork_ipc_recv_memory_regions inherited overlay_active=true on every shared region but the child never re-established the host-VA overlay, so guest writes went to private CoW slab while msync silently skipped writeback; the child now demotes every inherited overlay to the snapshot path.
5. find_free_gap_inner advanced gap_start with PAGE_ALIGN_UP (4 KiB), not host_page_size_cached() (16 KiB on Apple Silicon); after hint rewind, a new mapping could start mid-host-page and silently share an already-overlay-mapped 16 KiB host page, exposing live file content (or causing zero writes back into the file) through the wrong VMA.
6. The CoW fork sync-back pwrite=0-with-len>0 spin described above.

Sync handling needed a parallel fix: sc_sync was forwarding to host sync(2), which flushes every dirty page system-wide. The slab is mmap'd MAP_SHARED to an internal tempfile (g->shm_fd) for the CoW fork fast path, so a global flush had to walk multi-GB of demand-paged dirty pages from that tempfile plus the same from any other elfuse process on the host. In practice this stalled make check for hundreds of seconds when prior killed test runs had left stuck-in-uninterruptible busybox/elfuse processes holding mmap'd tempfiles. sc_sync now iterates the guest fd_table plus the live overlay regions, dups each target host fd under the matching lock, and fsyncs them outside the lock so a slow disk does not stall concurrent FD operations on other threads. The slab tempfile is an implementation detail and is no longer touched. Full make check went from timing out to ~13 seconds end to end.
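The dup-under-lock, fsync-outside-lock pattern can be sketched as below. `fd_table_sketch_t` and `sync_open_files_sketch` are hypothetical; the real fd_table layout and lock names are not shown in this change, and the overlay-region half of the walk is left out. The inline fallback on allocation failure is included as one plausible reading of the behaviour described for low-memory conditions.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical stand-in for the guest fd table. */
typedef struct {
    pthread_mutex_t lock;
    const int *host_fds; /* host fd per guest slot, -1 when the slot is free */
    size_t count;
} fd_table_sketch_t;

/* Returns how many host fds were fsynced successfully. */
static int sync_open_files_sketch(fd_table_sketch_t *t)
{
    int synced = 0;
    size_t ndups = 0;

    /* Snapshot phase: hold the lock only long enough to dup each fd. */
    pthread_mutex_lock(&t->lock);
    int *dups = t->count ? malloc(t->count * sizeof *dups) : NULL;
    for (size_t i = 0; i < t->count; i++) {
        int fd = t->host_fds[i];
        if (fd < 0)
            continue;
        if (!dups) {
            /* Allocation failed: sync inline rather than doing nothing. */
            if (fsync(fd) == 0)
                synced++;
            continue;
        }
        int d = dup(fd);
        if (d >= 0)
            dups[ndups++] = d;
    }
    pthread_mutex_unlock(&t->lock);

    /* Flush phase: slow disk I/O happens with no lock held. */
    for (size_t i = 0; i < ndups; i++) {
        if (fsync(dups[i]) == 0)
            synced++;
        close(dups[i]);
    }
    free(dups);
    return synced;
}
```

Because each dup refers to the same open file description, a concurrent close of the original fd on another thread cannot invalidate the descriptor being flushed.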

Locked in by tests/test-msync.c with three new cases on top of the existing four: host pwrite is visible through the mapping without msync, guest writes through the mapping reach the file without msync, and an adjacent MAP_FIXED|MAP_PRIVATE|MAP_ANONYMOUS allocation does not inherit the shared overlay through host-page sharing (regression lock for the gap-finder host-page alignment fix).
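The first two cases correspond to a property that can be checked on the host with plain POSIX calls. The sketch below is not taken from tests/test-msync.c; `map_shared_is_coherent` is a hypothetical helper that exercises both directions on an ordinary MAP_SHARED mapping, and the third case (the adjacent private mapping) is not covered.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if a MAP_SHARED mapping of a scratch file at path is coherent
 * with the file in both directions without msync, 0 if it is not, and -1
 * if the setup failed. The scratch file is removed before returning.
 */
static int map_shared_is_coherent(const char *path)
{
    char buf[5];
    char *map = MAP_FAILED;
    int rc = -1, seen = 0;
    long ps = sysconf(_SC_PAGESIZE);
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    if (ps <= 0 || ftruncate(fd, ps) != 0)
        goto out;
    map = mmap(NULL, (size_t)ps, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        goto out;

    /* Direction 1: a pwrite through the fd is visible in the mapping. */
    if (pwrite(fd, "host", 4, 0) != 4)
        goto out;
    seen = memcmp(map, "host", 4) == 0;

    /* Direction 2: a store through the mapping is visible to pread. */
    memcpy(map + 8, "guest", 5);
    if (pread(fd, buf, sizeof buf, 8) != (ssize_t)sizeof buf)
        goto out;
    rc = seen && memcmp(buf, "guest", 5) == 0;

out:
    if (map != MAP_FAILED)
        munmap(map, (size_t)ps);
    close(fd);
    unlink(path);
    return rc;
}
```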


Summary by cubic

Implements real MAP_SHARED by overlaying host mmap(MAP_FIXED|MAP_SHARED) on the guest slab so file-backed mappings stay coherent with the file and peers. Fixes broken WAL/lock-file behavior and makes sync fast.

  • New Features

    • Host overlay for MAP_SHARED files with HVF segment splitting (2 MiB) and vCPU quiesce; HVF is re-mapped so stage-2 walks the updated host PTEs.
    • Gap finder and hints align to host page size (16 KiB on Apple Silicon); misaligned requests fall back to snapshotting.
    • guest_region_t tracks overlay bounds; msync collapses to fsync for overlays, and MADV_DONTNEED skips zero+reload; CoW fork copies overlay bytes into shm_fd and the child demotes overlays to snapshot mode.
    • sync(2) now fsyncs open files and live overlay fds instead of calling host sync(), cutting test runtime to ~13s.
  • Bug Fixes

    • Robust rollback: undo leaked overlays on sys_mmap/HVF failures and always re-establish HVF segments; overlay teardown returns errors and clears metadata only after successful host-VA cleanup.
    • Fix host-page aliasing: gap scan advances by host page size to avoid inadvertently sharing an already-overlaid host page.
    • mremap is overlay-aware: reads from the file when the source is overlaid; error paths restore the source overlay, invalidate dest PTEs, and rebuild page tables for MREMAP_FIXED. Snapshot/rollback buffers moved to heap to avoid macOS thread stack limits.
    • HVF segment-split recovery unmaps partial pieces before remapping the original segment; hard failures are logged.
    • Tests added: host pwrite visible via mapping without msync, guest writes reach the file without msync, and adjacent MAP_FIXED|MAP_PRIVATE|MAP_ANONYMOUS does not inherit a shared overlay.

Written for commit 92c13c1. Summary will update on new commits.

cubic-dev-ai[bot]

This comment was marked as resolved.

sys_mmap previously pread-snapshotted file contents into private guest
pages and emulated write-back via msync's pwrite-the-diff path. The
guest never saw concurrent host writes through its mapping, and its own
writes only landed on the file at msync time. Database WAL, lock files,
and any cross-process file-sharing protocol were silently broken.

Install a real host mmap(MAP_FIXED|MAP_SHARED, fd) overlay on top of the
guest slab so the kernel page cache keeps the mapping coherent with the
file (and with peer overlays of the same inode). HVF requires hv_vm_unmap
to target an exactly-previously-mapped range and rejects sub-range unmap
with HV_BAD_ARGUMENT, so the slab is now tracked as a sorted list of
2 MiB-aligned hvf_segment_t entries (guest.h). Each overlay request
splits the containing segment, hv_vm_unmaps it, lays down the host file
mmap at the exact host VA, and hv_vm_maps the segment back so HVF
re-walks the host page tables. Sibling vCPUs are quiesced via
thread_quiesce_siblings during the brief stage-2 window so concurrent
guest accesses cannot fault on the temporarily-unmapped IPA. HVF
resolves stage-2 at sub-2 MiB granularity within a mapped segment, so
a 4 KiB lock-file overlay inside a 2 MiB segment is honored without
per-page hv_vm_map calls (empirically verified during implementation).
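One way to picture the segment split is the arithmetic sketch below. `seg_sketch_t` is a hypothetical stand-in for hvf_segment_t (whose fields are not shown), and carving out a 2 MiB-aligned window around the overlay is an assumption about the split policy, not a description of the actual hvf_segment_split.

```c
#include <stddef.h>
#include <stdint.h>

#define SEG_ALIGN (2ULL << 20) /* 2 MiB segment granularity */

/* Hypothetical stand-in for hvf_segment_t. */
typedef struct {
    uint64_t ipa; /* guest-physical start, 2 MiB-aligned */
    uint64_t len; /* length in bytes, a multiple of 2 MiB */
} seg_sketch_t;

static uint64_t seg_align_down(uint64_t x) { return x & ~(SEG_ALIGN - 1); }
static uint64_t seg_align_up(uint64_t x)
{
    return (x + SEG_ALIGN - 1) & ~(SEG_ALIGN - 1);
}

/*
 * Split seg into up to three pieces so that one piece is the smallest
 * 2 MiB-aligned window covering [start, end). Pieces are written to out[]
 * in ascending order and *mid receives the index of the covering piece.
 * Returns the piece count, or 0 if the range is empty or not inside seg.
 */
static size_t seg_split_around(seg_sketch_t seg, uint64_t start, uint64_t end,
                               seg_sketch_t out[3], size_t *mid)
{
    uint64_t seg_end = seg.ipa + seg.len;
    if (start >= end || start < seg.ipa || end > seg_end)
        return 0;

    uint64_t lo = seg_align_down(start);
    uint64_t hi = seg_align_up(end);
    size_t n = 0;

    if (lo > seg.ipa)
        out[n++] = (seg_sketch_t){ seg.ipa, lo - seg.ipa };
    *mid = n;
    out[n++] = (seg_sketch_t){ lo, hi - lo };
    if (hi < seg_end)
        out[n++] = (seg_sketch_t){ hi, seg_end - hi };
    return n;
}
```

Only the covering piece would then need to be unmapped and remapped around the host mmap, which keeps the window in which sibling vCPUs must stay quiesced small.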

Apple Silicon enforces 16 KiB host pages: mmap MAP_FIXED requires the
addr and offset to be 16 KiB-aligned. Two paths handle the gap: the
gap-finder hint advances to the next host-page boundary after each
allocation so sequential mmaps stay overlay-eligible, and find_free_gap_inner
aligns gap_start (and every advance past a walked region) to
host_page_size_cached() rather than the guest 4 KiB page, so an unaligned
addr-hint cannot return a result that lands inside a host page already
covered by another region's overlay tail. Misaligned MAP_FIXED requests
fall back to the snapshot pread path, so guest-supplied addresses that
the host kernel cannot honor still produce correct behaviour through the
legacy emulation.

guest_region_t carries new overlay_active / overlay_start / overlay_end
fields so msync collapses to a plain fsync for overlay regions (the
kernel page cache already keeps them coherent), the snapshot-style
refresh-from-file alias pass is skipped for overlay peers, and MADV_DONTNEED
is a no-op for overlay regions (the existing memset+pread reset would
have written zeros straight into the file via the overlay). The metadata
is clipped through every region split / trim site in src/core/guest.c so
the overlay bounds always match the host-page-aligned region bounds.
cleanup_overlays_in_range now returns int and propagates -EIO; metadata
is cleared only after per-overlay host-VA tear-down succeeds so a
partial failure does not leave the runtime believing an overlay is gone
while the host mmap is still live (which would otherwise let a later
memset write zeros into the user file).

The CoW fork path syncs each live overlay region back into shm_fd before
sending shm_fd over SCM_RIGHTS so the child's MAP_PRIVATE snapshot
reflects the parent's view at fork time, and the child's fork-state
restore demotes every inherited overlay flag to the snapshot path. The
sync-back loop now treats a pwrite that returns 0 while bytes remain as
-EIO instead of spinning.

Snapshot buffers used by the FIXED replacement / mremap rollback paths
are heap-allocated. region_snapshot_t * GUEST_MAX_REGIONS on the stack
is on the order of half a megabyte, and macOS secondary thread stacks
default to ~512 KiB; the stack-allocated original would crash any worker
pthread that hit the path. A new dispose_region_snapshots helper closes
any dup'd backing fds, frees the heap buffer, and zeros the caller's
pointer so a follow-on call is a no-op.
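
A dispose helper with those three properties could be sketched like this. The struct is a hypothetical reduction of region_snapshot_t to a single fd field, and the `_sketch` function names and signatures are assumptions; only the behaviour (close dup'd fds, free, zero the caller's pointer) comes from the description above.

```c
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical reduction of region_snapshot_t to its dup'd backing fd. */
typedef struct {
    int backing_fd; /* dup'd host fd, or -1 when the region has none */
} region_snapshot_sketch_t;

/* Heap-allocate count snapshots with every fd marked as absent. */
static region_snapshot_sketch_t *alloc_region_snapshots_sketch(size_t count)
{
    region_snapshot_sketch_t *s = calloc(count, sizeof *s);
    if (!s)
        return NULL;
    for (size_t i = 0; i < count; i++)
        s[i].backing_fd = -1;
    return s;
}

/*
 * Close any dup'd fds, free the buffer, and zero the caller's pointer so
 * that calling this again on the same pointer does nothing.
 */
static void dispose_region_snapshots_sketch(region_snapshot_sketch_t **snaps,
                                            size_t count)
{
    if (!snaps || !*snaps)
        return;
    for (size_t i = 0; i < count; i++) {
        if ((*snaps)[i].backing_fd >= 0)
            close((*snaps)[i].backing_fd);
    }
    free(*snaps);
    *snaps = NULL;
}
```

Zeroing the pointer lets every error path call the helper unconditionally without tracking whether an earlier path already released the buffer.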

Notes:
1. sys_mmap rollback when guest_region_add_ex_owned failed after the
   host overlay succeeded left the file mmap'd at host_base+ipa with no
   region tracking, so a later operation in that range would memset
   zeros directly into the user's file; the failure path now calls
   hvf_remove_file_overlay before returning -ENOMEM.
2. The final hv_vm_map failure path in hvf_apply_file_overlay restored
   slab backing but never re-issued hv_vm_map for the segment, so
   sibling vCPUs would page-fault on that IPA after thread_resume_siblings;
   the rollback now re-establishes the segment before returning.
3. cleanup_overlays_in_range cleared overlay_active before calling
   hvf_remove_file_overlay and ignored its failure; the helper is now
   fallible and clears metadata only on per-overlay tear-down success.
4. fork_ipc_recv_memory_regions inherited overlay_active=true on every
   shared region but the child never re-established the host-VA overlay;
   the child now demotes every inherited overlay to the snapshot path.
5. find_free_gap_inner advanced gap_start with PAGE_ALIGN_UP (4 KiB),
   not host_page_size_cached() (16 KiB on Apple Silicon), and the
   initial gap_start = min_addr did not round up either; both now align
   to the host page so a new mapping cannot land in a host page already
   covered by an existing overlay.
6. The CoW fork sync-back loop never treated pwrite returning 0 with
   len > 0 as failure and could spin forever; it is now treated as -EIO.
7. hvf_segment_split partial-failure recovery re-issued hv_vm_map for
   the original segment while pieces[0..i-1] were still mapped, which
   HVF rejects with HV_BAD_ARGUMENT; the recovery now unmaps those
   pieces first, and a hard failure of the recovery hv_vm_map is logged
   so post-mortem points at the right culprit.
8. sc_sync was wrapped in SC_LOCKED so the fsync loop ran while holding
   mmap_lock, blocking concurrent guest mmap on every other thread for
   the duration of the global flush; the locks are now taken inline only
   for the brief snapshot phase and released before fsync. Under malloc
   failure the bulk-dup path falls back to inline per-fd fsync instead
   of silently no-opping, since Linux sync(2) is best-effort but is
   still expected to initiate writeback.
9. sys_mremap MAYMOVE-grow path: read_file_range_to_guest failure after
   cleanup_overlays_in_range tore down the source overlay used to leave
   the source on slab backing (silent demotion of MAP_SHARED) and the
   dest with phantom PTEs from the just-completed mremap_extend_range;
   the failure path now restores the source overlay and invalidates the
   dest PTEs.
10. sys_mremap MREMAP_FIXED path: read_file_range_to_guest failure
    restored dest region metadata but never restored the destination
    page tables; the rollback now also calls restore_snapshot_page_tables.

Sync handling: sc_sync was forwarding to host sync(2), which flushes
every dirty page system-wide. The slab is mmap'd MAP_SHARED to an
internal tempfile (g->shm_fd) for the CoW fork fast path, so a global
flush had to walk multi-GB of demand-paged dirty pages from that
tempfile plus the same from any other elfuse process on the host. In
practice this stalled make check for hundreds of seconds when prior
killed test runs had left stuck-in-uninterruptible busybox/elfuse
processes holding mmap'd tempfiles. sc_sync now iterates the guest
fd_table plus the live overlay regions, dups each target host fd under
the matching lock, and fsyncs them outside both locks so a slow disk
does not stall concurrent FD or memory operations on other threads. The
slab tempfile is an implementation detail and is no longer touched. make
check completes in ~13 seconds.

Locked in by tests/test-msync.c with three new cases on top of the
existing four: host pwrite is visible through the mapping without
msync, guest writes through the mapping reach the file without msync,
and an adjacent MAP_FIXED|MAP_PRIVATE|MAP_ANONYMOUS allocation does
not inherit the shared overlay through host-page sharing (regression
lock for the gap-finder host-page alignment fix).
@jserv jserv merged commit 5ce01d9 into main May 7, 2026
4 checks passed
@jserv jserv deleted the mmap branch May 7, 2026 04:07