seccomp: cap dirent/BPF size, seal injected memfds (#13)#25
Open
congwang-mk wants to merge 1 commit intomainfrom
Open
seccomp: cap dirent/BPF size, seal injected memfds (#13)#25congwang-mk wants to merge 1 commit intomainfrom
congwang-mk wants to merge 1 commit intomainfrom
Conversation
Three defensive hardenings from the audit in #13: * `build_dirent64` now rejects names longer than 255 bytes (NAME_MAX). Such a name would push `d_reclen` past u16 and produce a malformed record. Callers in procfs and cowfs propagate the rejection. * `assemble_filter` now returns `Result` and rejects programs that exceed BPF_MAXINSNS (4096). The kernel rejects them anyway via `seccomp(2)`; surfacing the error here gives a clearer message and guards the u8 jump-offset arithmetic against silent truncation. * The memfds injected for synthesised /proc files and /dev/{u,}random are created with MFD_ALLOW_SEALING | MFD_CLOEXEC and sealed against write/grow/shrink/seal after population. The child receives a fd to the same RW description, so without sealing it could overwrite the synthesised content. Other fixes proposed in the patch attached to #13 are not applied here: they either misdiagnose serialised access through the supervisor's tokio Mutex as a race, break intentional behaviour (raw fork(57) is the COW-fork escape hatch documented in fork.rs; AF_UNIX is needed for socketpair and runtime sockets), or change defaults in ways that need to be policy-driven rather than unconditional. Signed-off-by: Cong Wang <cwang@multikernel.io>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Three defensive hardenings cherry-picked from the audit patch attached to #13.
build_dirent64now rejects names longer than 255 bytes (NAME_MAX). Such names overflowd_reclen(u16). ReturnsOption<Vec<u8>>; callers inprocfsandcow/dispatchpropagate the rejection.assemble_filternow returnsResultand rejects programs overBPF_MAXINSNS(4096). The kernel rejects them viaseccomp(2)anyway; surfacing the error here gives a clearer message and guards the u8 jump-offset arithmetic against silent truncation./proc/*and/dev/{u,}random) are now created withMFD_ALLOW_SEALING | MFD_CLOEXECand sealed (F_SEAL_SEAL | F_SEAL_WRITE | F_SEAL_GROW | F_SEAL_SHRINK) after population. The child receives an fd to the same RW description, so without seals it could overwrite the synthesised content.What is not in this PR
The bundled patch in #13 also proposed changes I declined:
proc_count/mem_used/disk_used— those fields already live insideArc<Mutex<…>>and notifications serialise through one supervisor task, so the claimed "races" can't actually occur. The atomic version isload → check → store, which is less defensive if the mutex is later removed.fork(2)(NR 57) tonotif_syscalls— this would break the COW-fork escape hatch deliberately documented incrates/sandlock-core/src/fork.rs:1-43. The hardcoded57u32in the proposal is also non-portable.EAFNOSUPPORTinconnect_on_behalf— breakssocketpair,\$XDG_RUNTIME_DIRsockets, etc.; also doesn't actually close the abstract-socket bypass (which goes throughsocket(), notconnect()). A path-based filter on thesun_pathwould be the correct fix.DispatchTabledefault-deny — fragile; the BPF filter is the gatekeeper, and forcing EPERM on missing handlers makes it easy to silently break a new notif syscall before its handler is wired.getrandomfallback removal — only fires whenwrite_child_memfails (rare); the rationale ("breaks determinism") doesn't hold in the failure path.Test plan
cargo buildcleancargo test -p sandlock-core --lib— 219 passedcargo test -p sandlock-core -- --test-threads=2— 183 passed (parallel runs occasionally hit "Cannot fork" / temp-file flakes due to local resource pressure, not this diff)test_build_dirent64_rejects_oversize_name,test_oversized_filter_is_rejected