From 7a9b923e54939a9309c9af51c6e45b7644aeb49a Mon Sep 17 00:00:00 2001 From: dzerik Date: Fri, 24 Apr 2026 16:41:21 +0300 Subject: [PATCH 1/5] =?UTF-8?q?feat(seccomp):=20ExtraHandler=20=E2=80=94?= =?UTF-8?q?=20user-supplied=20syscall=20handlers?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds a public extension point for downstream crates that need to register their own seccomp-notification handlers alongside sandlock's builtin chroot/cow/procfs/network/port_remap logic. Motivation: downstream crates that want to intercept additional syscalls in the same supervisor task as sandlock's builtins have no clean way to do it today — one SECCOMP_FILTER_FLAG_NEW_LISTENER per process means a single listener, so a second supervisor cannot run alongside. The only alternative is forking sandlock or patching notif::supervisor wholesale. API: - New type dispatch::ExtraHandler { syscall_nr, handler }. - New entry Sandbox::run_with_extra_handlers(policy, cmd, extras). - Existing Sandbox::run() delegates to it with empty extras — zero behaviour change for current callers. Ordering contract (documented + tested): - Builtins register first (chroot path normalization, COW, procfs, …). - Extras appended last, in the Vec order. - Chain stops at first non-Continue — user handlers cannot subvert builtin confinement. BPF coverage (this is what plumbs extras to the kernel): - Sandbox::do_spawn collects the syscall numbers from extra_handlers and threads them into the child via the new ChildSpawnArgs.extra_syscalls field on context::confine_child. - The child merges them into notif_syscalls(policy) before bpf::assemble_filter, with sort + dedup so a syscall registered both by a builtin and an extra produces a single JEQ. - Without this step the kernel would never raise USER_NOTIF for a syscall that has no builtin handler — the dispatch table would receive nothing and the user handler would silently never fire. Default-deny bypass guard: - The cBPF program emits notif JEQs before deny JEQs, so a syscall present in both lists hits SECCOMP_RET_USER_NOTIF first. An extra on a DEFAULT_DENY syscall would therefore convert a kernel-deny into a user-supervised path, and a Continue from the handler would silently bypass deny. - Sandbox::run_with_extra_handlers now validates extras against the policy's deny list at registration time via dispatch::validate_extras_against_policy and returns SandboxError::Child naming the offending syscall — no silent footgun. Internals: - build_dispatch_table now takes Vec and drains it into register() calls after builtins. - notif::supervisor signature extended to accept extras and pass them through. sandbox.rs moves self.extra_handlers via std::mem::take on spawn (HandlerFn is Box — not Clone). - confine_child's seven positional parameters packed into context::ChildSpawnArgs to keep the call site readable. Docs: - docs/extension-handlers.md: design rationale, security boundary, panics policy, non-goals, downstream sketch. Adds §3.0 (BPF-filter merge semantics) and §3.0.1 (default-deny bypass guard); corrects the NotifAction variant table (ReturnValue, Kill { sig, pgid }). - crates/sandlock-core/examples/openat_audit.rs: runnable example. Tests: - 4 unit tests on dispatch::extra_handler_tests (ctor, insertion order, append-after-builtin, empty-extras nop). - 7 integration tests under tests/integration/test_extra_handlers.rs exercising the full kernel path: * extra on SYS_uname (not intercepted by any builtin) returning Errno(EACCES) reaches the guest; * Continue lets the kernel resume the syscall; * empty extras vector preserves baseline behaviour; * cross-handler ordering: extra on SYS_openat fires after the /proc-virtualization builtin returns Continue; * registration on SYS_mount (DEFAULT_DENY) is rejected up-front with a descriptive error; * builtin non-Continue blocks extra: openat on /proc/1/cmdline is rejected by the procfs builtin and is never observed by the extra (path inspected via process_vm_readv), while a peer openat on /etc/hostname is observed — proves the chain stops at first non-Continue end-to-end through the kernel; * chain of two extras on the same syscall: first returns Continue, second returns Errno(EACCES) — both counters increment in lock step (insertion order preserved) and the guest sees the EACCES. - All 215 unit tests pass; the 178-test integration suite passes modulo the pre-existing 54-test failure set observed on origin/main (kernel/capability environment, unrelated to this change). Minor bump 0.6 → 0.7 suggested. Signed-off-by: dzerik --- crates/sandlock-core/examples/openat_audit.rs | 78 ++++ crates/sandlock-core/src/context.rs | 44 +- crates/sandlock-core/src/sandbox.rs | 100 ++++- crates/sandlock-core/src/seccomp/dispatch.rs | 349 ++++++++++++++++ crates/sandlock-core/src/seccomp/notif.rs | 11 +- crates/sandlock-core/tests/integration.rs | 3 + .../tests/integration/test_extra_handlers.rs | 377 ++++++++++++++++++ docs/extension-handlers.md | 259 ++++++++++++ 8 files changed, 1212 insertions(+), 9 deletions(-) create mode 100644 crates/sandlock-core/examples/openat_audit.rs create mode 100644 crates/sandlock-core/tests/integration/test_extra_handlers.rs create mode 100644 docs/extension-handlers.md diff --git a/crates/sandlock-core/examples/openat_audit.rs b/crates/sandlock-core/examples/openat_audit.rs new file mode 100644 index 0000000..cdd3a7c --- /dev/null +++ b/crates/sandlock-core/examples/openat_audit.rs @@ -0,0 +1,78 @@ +//! Audit every `openat(2)` that a sandboxed process performs. +//! +//! Demonstrates [`Sandbox::run_with_extra_handlers`]: a downstream crate +//! registers a user handler for `SYS_openat` that logs the call and falls +//! through to default (builtin) processing. +//! +//! Run: +//! +//! ```sh +//! # From the sandlock repo root. +//! cargo run --example openat_audit -- /usr/bin/python3 -c 'open("/etc/hostname").read()' +//! ``` +//! +//! Expected output: +//! +//! ```text +//! [audit] pid=... openat +//! [audit] pid=... openat +//! [audit] pid=... openat +//! exit=Some(0) stdout=... +//! ``` + +use std::env; +use std::sync::atomic::{AtomicUsize, Ordering}; +use std::sync::Arc; + +use sandlock_core::seccomp::dispatch::{ExtraHandler, HandlerFn}; +use sandlock_core::seccomp::notif::NotifAction; +use sandlock_core::{Policy, Sandbox}; + +#[tokio::main] +async fn main() -> Result<(), Box> { + let cmd: Vec = env::args().skip(1).collect(); + if cmd.is_empty() { + eprintln!("usage: openat_audit [args...]"); + std::process::exit(2); + } + let cmd_ref: Vec<&str> = cmd.iter().map(String::as_str).collect(); + + // Minimal policy: read /usr, /lib, /etc, /proc; write /tmp. + let policy = Policy::builder() + .fs_read("/usr") + .fs_read("/lib") + .fs_read("/lib64") + .fs_read("/etc") + .fs_read("/proc") + .fs_write("/tmp") + .build()?; + + // User handler: count + log every openat, fall through to builtin. + let counter = Arc::new(AtomicUsize::new(0)); + let counter_clone = Arc::clone(&counter); + + let audit: HandlerFn = Box::new(move |notif, _ctx, _fd| { + let counter = Arc::clone(&counter_clone); + Box::pin(async move { + let n = counter.fetch_add(1, Ordering::SeqCst) + 1; + eprintln!("[audit #{n}] pid={} openat", notif.pid); + // Continue = let the default table and the kernel handle it. + NotifAction::Continue + }) + }); + + let result = Sandbox::run_with_extra_handlers( + &policy, + &cmd_ref, + vec![ExtraHandler::new(libc::SYS_openat, audit)], + ) + .await?; + + println!( + "exit={:?} opens={} stdout={:?}", + result.code(), + counter.load(Ordering::SeqCst), + result.stdout_str().unwrap_or(""), + ); + Ok(()) +} diff --git a/crates/sandlock-core/src/context.rs b/crates/sandlock-core/src/context.rs index 389ab28..b448af0 100644 --- a/crates/sandlock-core/src/context.rs +++ b/crates/sandlock-core/src/context.rs @@ -645,11 +645,40 @@ fn write_id_maps_overflow() { // Child-side confinement (never returns) // ============================================================ +/// Arguments threaded from the parent's `do_spawn` into the child-side +/// `confine_child`. Packed into a struct because `confine_child` historically +/// grew to seven positional parameters and a struct keeps the call site +/// readable when new flags get added (e.g. `extra_syscalls` for user +/// handlers). Lifetimes tie everything to the parent's stack frame — the +/// child never outlives the fork point because `confine_child` either execs +/// or exits. +pub(crate) struct ChildSpawnArgs<'a> { + pub policy: &'a Policy, + pub cmd: &'a [CString], + pub pipes: &'a PipePair, + pub cow_config: Option<&'a ChildMountConfig>, + pub nested: bool, + pub keep_fds: &'a [RawFd], + /// Syscall numbers for which the parent registered `ExtraHandler`s. + /// Merged into the child's BPF notif list so the kernel actually + /// raises USER_NOTIF for them. + pub extra_syscalls: &'a [u32], +} + /// Apply irreversible confinement (Landlock + seccomp) then exec the command. /// /// This function **never returns**: it calls `execvp` on success or /// `_exit(127)` on any error. -pub(crate) fn confine_child(policy: &Policy, cmd: &[CString], pipes: &PipePair, cow_config: Option<&ChildMountConfig>, nested: bool, keep_fds: &[RawFd]) -> ! { +pub(crate) fn confine_child(args: ChildSpawnArgs<'_>) -> ! { + let ChildSpawnArgs { + policy, + cmd, + pipes, + cow_config, + nested, + keep_fds, + extra_syscalls, + } = args; // Helper: abort child on error. Includes the OS error automatically. macro_rules! fail { ($msg:expr) => {{ @@ -866,7 +895,18 @@ pub(crate) fn confine_child(policy: &Policy, cmd: &[CString], pipes: &PipePair, } } else { // First-level sandbox: notif + deny filter with NEW_LISTENER. - let notif = notif_syscalls(policy); + // + // Caller-supplied extra handlers must have their syscalls registered in + // the BPF filter, otherwise the kernel never raises a notification for + // them and the handler silently never fires. We merge `extra_syscalls` + // into the notif list and dedup so each syscall produces exactly one + // JEQ in the assembled program. + let mut notif = notif_syscalls(policy); + if !extra_syscalls.is_empty() { + notif.extend_from_slice(extra_syscalls); + notif.sort_unstable(); + notif.dedup(); + } let filter = match bpf::assemble_filter(¬if, &deny, &args) { Ok(f) => f, Err(e) => fail!(format!("seccomp assemble: {}", e)), diff --git a/crates/sandlock-core/src/sandbox.rs b/crates/sandlock-core/src/sandbox.rs index 4592fc7..1502d22 100644 --- a/crates/sandlock-core/src/sandbox.rs +++ b/crates/sandlock-core/src/sandbox.rs @@ -108,6 +108,9 @@ pub struct Sandbox { /// Optional callback invoked when a port bind is recorded. #[allow(clippy::type_complexity)] on_bind: Option) + Send + Sync>>, + /// User-supplied extra syscall handlers. Taken on spawn and + /// appended to the dispatch table after all builtin handlers. + extra_handlers: Vec, } impl Sandbox { @@ -163,15 +166,14 @@ impl Sandbox { extra_fds: Vec::new(), http_acl_handle: None, on_bind: None, + extra_handlers: Vec::new(), } } /// One-shot: spawn a sandboxed process, wait for it to exit, and return /// the result. Stdout and stderr are captured. pub async fn run(policy: &Policy, cmd: &[&str]) -> Result { - let mut sb = Self::new(policy)?; - sb.do_spawn(cmd, true).await?; - sb.wait().await + Self::run_with_extra_handlers(policy, cmd, Vec::new()).await } /// Run a sandboxed process with inherited stdio (interactive mode). @@ -181,6 +183,69 @@ impl Sandbox { sb.wait().await } + /// One-shot run with user-supplied syscall handlers. + /// + /// `extra_handlers` are registered in the dispatch table **after** all + /// builtin handlers for the same syscall. They observe the post-builtin + /// view (e.g. [`chroot`]-normalized paths on `openat`) and cannot be used + /// to bypass builtin confinement. See + /// [`crate::seccomp::dispatch::ExtraHandler`] for the ordering contract. + /// + /// When called with an empty vector, this function is identical to + /// [`Self::run`]. + /// + /// # Example + /// + /// ```ignore + /// use sandlock_core::{Policy, Sandbox}; + /// use sandlock_core::seccomp::dispatch::{ExtraHandler, HandlerFn}; + /// use sandlock_core::seccomp::notif::NotifAction; + /// + /// # tokio_test::block_on(async { + /// let policy = Policy::builder().fs_read("/usr").build().unwrap(); + /// + /// let audit: HandlerFn = Box::new(|notif, _ctx, _fd| { + /// Box::pin(async move { + /// eprintln!("openat from pid {}", notif.data.pid); + /// NotifAction::Continue + /// }) + /// }); + /// + /// let result = Sandbox::run_with_extra_handlers( + /// &policy, + /// &["/usr/bin/true"], + /// vec![ExtraHandler::new(libc::SYS_openat, audit)], + /// ).await.unwrap(); + /// # }); + /// ``` + pub async fn run_with_extra_handlers( + policy: &Policy, + cmd: &[&str], + extra_handlers: Vec, + ) -> Result { + // Reject extras that would weaken confinement (e.g. one registered + // on a default-deny syscall). See + // [`crate::seccomp::dispatch::validate_extras_against_policy`] for the + // rationale. Done before fork so the caller gets a clear error + // instead of a silently-broken sandbox. + if let Err(nr) = + crate::seccomp::dispatch::validate_extras_against_policy(&extra_handlers, policy) + { + return Err(SandboxError::Child(format!( + "ExtraHandler on syscall {} conflicts with the deny list \ + (DEFAULT_DENY_SYSCALLS or policy.deny_syscalls) and would let \ + user code bypass it via SECCOMP_USER_NOTIF_FLAG_CONTINUE", + nr + )) + .into()); + } + + let mut sb = Self::new(policy)?; + sb.extra_handlers = extra_handlers; + sb.do_spawn(cmd, true).await?; + sb.wait().await + } + /// Dry-run: spawn, wait, collect filesystem changes, then abort. /// Returns the run result plus a list of changes that would have been /// committed. The workdir is left unchanged. @@ -841,8 +906,29 @@ impl Sandbox { // Collect target fds from gather that must survive close_fds_above let gather_keep_fds: Vec = self.extra_fds.iter().map(|&(target, _)| target).collect(); + // Collect extra-handler syscall numbers for the BPF filter the child + // is about to install. This must be a plain `Vec` because the + // child does not need (and cannot use after exec) the heap-allocated + // closures stored in `self.extra_handlers` — only the registered + // syscall numbers must be added to the BPF notif list so the kernel + // raises notifications for them. The supervisor in the parent owns + // the closures themselves. + let extra_syscalls: Vec = self + .extra_handlers + .iter() + .map(|h| h.syscall_nr as u32) + .collect(); + // This never returns. - context::confine_child(&self.policy, &c_cmd, &pipes, cow_config.as_ref(), nested, &gather_keep_fds); + context::confine_child(context::ChildSpawnArgs { + policy: &self.policy, + cmd: &c_cmd, + pipes: &pipes, + cow_config: cow_config.as_ref(), + nested, + keep_fds: &gather_keep_fds, + extra_syscalls: &extra_syscalls, + }); } // ===== PARENT PROCESS ===== @@ -1050,9 +1136,11 @@ impl Sandbox { notif_fd: notif_raw_fd, }); - // Spawn notif supervisor + // Spawn notif supervisor. `extra_handlers` is consumed here + // (moved into the supervisor task) because HandlerFn is not Clone. + let extra_handlers = std::mem::take(&mut self.extra_handlers); self.notif_handle = Some(tokio::spawn( - notif::supervisor(notif_fd, ctx), + notif::supervisor(notif_fd, ctx, extra_handlers), )); // Spawn load average sampling task (every 5s, like the kernel) diff --git a/crates/sandlock-core/src/seccomp/dispatch.rs b/crates/sandlock-core/src/seccomp/dispatch.rs index 737aefd..2f89275 100644 --- a/crates/sandlock-core/src/seccomp/dispatch.rs +++ b/crates/sandlock-core/src/seccomp/dispatch.rs @@ -42,6 +42,91 @@ pub type HandlerFn = Box< + Sync, >; +/// A user-supplied handler bound to a specific syscall number. +/// +/// Passed to [`crate::Sandbox::run_with_extra_handlers`]; appended to the +/// dispatch table **after** all builtin handlers for the same syscall. +/// +/// # Ordering and security boundary +/// +/// Within a syscall's chain, handlers run in registration order and the +/// first non-[`NotifAction::Continue`] result wins. Builtin handlers are +/// registered first (for example `chroot` path-normalization on `openat`), +/// so an `ExtraHandler` observes the post-builtin view of each syscall. +/// This ordering is fixed and cannot be changed by downstream crates — +/// it is the security boundary that prevents user handlers from bypassing +/// sandlock confinement. +/// +/// # Example +/// +/// ```ignore +/// use sandlock_core::seccomp::dispatch::{ExtraHandler, HandlerFn}; +/// use sandlock_core::seccomp::notif::NotifAction; +/// +/// let audit: HandlerFn = Box::new(|notif, _ctx, _fd| { +/// Box::pin(async move { +/// eprintln!("openat from pid {}", notif.data.pid); +/// NotifAction::Continue +/// }) +/// }); +/// +/// let extras = vec![ExtraHandler::new(libc::SYS_openat, audit)]; +/// ``` +pub struct ExtraHandler { + pub syscall_nr: i64, + pub handler: HandlerFn, +} + +impl ExtraHandler { + pub fn new(syscall_nr: i64, handler: HandlerFn) -> Self { + Self { syscall_nr, handler } + } +} + +/// Reject extras that would weaken sandlock's confinement guarantees. +/// +/// The cBPF program emits notif JEQs *before* deny JEQs, so a syscall +/// present in both lists hits `SECCOMP_RET_USER_NOTIF` first. An extra +/// registered on a syscall that is on the deny list would therefore +/// convert a kernel-deny into a user-supervised path: a handler returning +/// `NotifAction::Continue` becomes `SECCOMP_USER_NOTIF_FLAG_CONTINUE` and +/// the kernel actually runs the syscall — silently bypassing deny. +/// +/// The deny list is whatever [`crate::context::deny_syscall_numbers`] +/// resolves: `policy.deny_syscalls` if set, otherwise +/// `DEFAULT_DENY_SYSCALLS` when neither `deny_syscalls` nor +/// `allow_syscalls` is set; both branches are guarded by this function. +/// +/// **Allowlist mode** (`policy.allow_syscalls = Some(_)`): the resolved +/// deny list is empty, so this function returns `Ok(())` for any extra. +/// That is sound because the BPF deny block is empty in this mode too — +/// confinement comes from the allowlist enforced at the kernel level, +/// and there is no notif/deny overlap for an extra to bypass. +/// +/// Returns the offending syscall number on rejection so the caller can +/// surface it to the end user. +/// +/// Visibility: kept `pub(crate)` because the only safe consumption path +/// is via [`crate::Sandbox::run_with_extra_handlers`], which calls this +/// function before fork. Downstream crates that pre-build their own +/// `Vec` get the same enforcement transparently through +/// that entry point — there is no `ExtraHandler::register_into` API +/// that would let a user bypass it. +pub(crate) fn validate_extras_against_policy( + extras: &[ExtraHandler], + policy: &crate::policy::Policy, +) -> Result<(), u32> { + let deny: std::collections::HashSet = + crate::context::deny_syscall_numbers(policy).into_iter().collect(); + for extra in extras { + let nr = extra.syscall_nr as u32; + if deny.contains(&nr) { + return Err(nr); + } + } + Ok(()) +} + /// Ordered chain of handlers for a single syscall number. struct HandlerChain { handlers: Vec, @@ -99,9 +184,15 @@ impl DispatchTable { /// Build the dispatch table from a `NotifPolicy`. Every branch from the old /// monolithic `dispatch()` function is translated into a `table.register()` call. /// Priority is preserved by registration order. +/// +/// `extra_handlers` are appended **after** all builtin handlers, so they +/// observe the post-builtin view (e.g. `chroot`-normalized paths on +/// `openat`). Builtins cannot be overridden or removed — this is the +/// security boundary for downstream crates. pub fn build_dispatch_table( policy: &Arc, resource: &Arc>, + extra_handlers: Vec, ) -> DispatchTable { let mut table = DispatchTable::new(); @@ -408,6 +499,15 @@ pub fn build_dispatch_table( })); } + // ------------------------------------------------------------------ + // Extra handlers supplied by the caller of `Sandbox::run_with_extra_handlers`. + // Appended last so builtin handlers keep their security-critical priority + // (chroot path normalization, COW writes, resource accounting). + // ------------------------------------------------------------------ + for extra in extra_handlers { + table.register(extra.syscall_nr, extra.handler); + } + table } @@ -678,3 +778,252 @@ fn register_cow_handlers(table: &mut DispatchTable) { table.register(libc::SYS_chdir, cow_call!(crate::cow::dispatch::handle_cow_chdir)); table.register(libc::SYS_getcwd, cow_call!(crate::cow::dispatch::handle_cow_getcwd)); } + +// ============================================================ +// Tests +// ============================================================ + +#[cfg(test)] +mod extra_handler_tests { + //! Unit tests for the user-supplied handler extension API. + //! + //! Drive the actual `DispatchTable::dispatch` walker against a minimal + //! `SupervisorCtx` constructed from default-state pieces. Handler + //! closures here ignore the context (no notif fd, no real child), so + //! the dispatch invariants under test (registration order, chain + //! short-circuit on first non-`Continue`, append-after-builtin + //! placement) are exercised end-to-end without needing a live + //! Landlock+seccomp sandbox — those scenarios live under + //! `crates/sandlock-core/tests/integration/test_extra_handlers.rs`. + use super::*; + use crate::netlink::NetlinkState; + use crate::seccomp::ctx::SupervisorCtx; + use crate::seccomp::notif::NotifPolicy; + use crate::seccomp::state::{ + ChrootState, CowState, NetworkState, PolicyFnState, ProcessIndex, ProcfsState, + ResourceState, TimeRandomState, + }; + use crate::sys::structs::{SeccompData, SeccompNotif}; + use std::sync::atomic::{AtomicUsize, Ordering}; + + fn fake_notif(nr: i32) -> SeccompNotif { + SeccompNotif { + id: 0, + pid: 1, + flags: 0, + data: SeccompData { + nr, + arch: 0, + instruction_pointer: 0, + args: [0; 6], + }, + } + } + + /// Minimal `SupervisorCtx` for unit tests. Every field is built from + /// the corresponding state's `new()`/default constructor — no syscalls, + /// no fds, no spawned children. Handlers in these tests do not + /// actually inspect the context, so the values do not need to match + /// any real run; they only need to satisfy the type signature so we + /// can call `dispatch()`. + fn fake_supervisor_ctx() -> Arc { + Arc::new(SupervisorCtx { + resource: Arc::new(Mutex::new(ResourceState::new(0, 0))), + cow: Arc::new(Mutex::new(CowState::new())), + procfs: Arc::new(Mutex::new(ProcfsState::new())), + network: Arc::new(Mutex::new(NetworkState::new())), + time_random: Arc::new(Mutex::new(TimeRandomState::new(None, None))), + policy_fn: Arc::new(Mutex::new(PolicyFnState::new())), + chroot: Arc::new(Mutex::new(ChrootState::new())), + netlink: Arc::new(NetlinkState::new()), + processes: Arc::new(ProcessIndex::new()), + policy: Arc::new(NotifPolicy { + max_memory_bytes: 0, + max_processes: 0, + has_memory_limit: false, + has_net_allowlist: false, + has_random_seed: false, + has_time_start: false, + time_offset: 0, + num_cpus: None, + port_remap: false, + cow_enabled: false, + chroot_root: None, + chroot_readable: Vec::new(), + chroot_writable: Vec::new(), + chroot_denied: Vec::new(), + chroot_mounts: Vec::new(), + deterministic_dirs: false, + hostname: None, + has_http_acl: false, + virtual_etc_hosts: None, + }), + child_pidfd: None, + notif_fd: -1, + }) + } + + #[test] + fn extra_handler_ctor_preserves_fields() { + let h: HandlerFn = Box::new(|_notif, _ctx, _fd| { + Box::pin(async { NotifAction::Continue }) + }); + let eh = ExtraHandler::new(libc::SYS_openat, h); + assert_eq!(eh.syscall_nr, libc::SYS_openat); + } + + /// All registered handlers run, in registration order, when each + /// returns `Continue`. Verifies that `register` appends to the + /// underlying `Vec` and that `dispatch` walks it front-to-back. + #[tokio::test] + async fn dispatch_walks_chain_in_registration_order() { + let mut table = DispatchTable::new(); + let order = Arc::new(std::sync::Mutex::new(Vec::::new())); + + for tag in [1u8, 2u8, 3u8] { + let order = Arc::clone(&order); + table.register( + libc::SYS_openat, + Box::new(move |_n, _c, _f| { + let order = Arc::clone(&order); + Box::pin(async move { + order.lock().unwrap().push(tag); + NotifAction::Continue + }) + }), + ); + } + + let ctx = fake_supervisor_ctx(); + let action = table + .dispatch(fake_notif(libc::SYS_openat as i32), &ctx, -1) + .await; + + assert!(matches!(action, NotifAction::Continue)); + let recorded = order.lock().unwrap(); + assert_eq!( + *recorded, + [1u8, 2u8, 3u8], + "every handler must run, in the order it was registered" + ); + } + + /// Append-after-builtin contract: when an `ExtraHandler` is registered + /// after a builtin-like handler, dispatch invokes the builtin first + /// and the extra second. This is the security-load-bearing invariant — + /// a builtin returning a non-`Continue` `NotifAction` must short-circuit + /// before the extra runs (covered by + /// `dispatch_stops_at_first_non_continue`); when the builtin returns + /// `Continue`, the extra observes the post-builtin view. + #[tokio::test] + async fn dispatch_runs_builtin_before_extra() { + let mut table = DispatchTable::new(); + let order = Arc::new(std::sync::Mutex::new(Vec::::new())); + + // Builtin first, tagged 'B'. + let order_builtin = Arc::clone(&order); + table.register( + libc::SYS_openat, + Box::new(move |_n, _c, _f| { + let order = Arc::clone(&order_builtin); + Box::pin(async move { + order.lock().unwrap().push(b'B'); + NotifAction::Continue + }) + }), + ); + + // Extra after, tagged 'E'. Routed through `ExtraHandler` to mirror + // how `build_dispatch_table` consumes user-supplied handlers. + let order_extra = Arc::clone(&order); + let extra = ExtraHandler::new( + libc::SYS_openat, + Box::new(move |_n, _c, _f| { + let order = Arc::clone(&order_extra); + Box::pin(async move { + order.lock().unwrap().push(b'E'); + NotifAction::Continue + }) + }), + ); + table.register(extra.syscall_nr, extra.handler); + + let ctx = fake_supervisor_ctx(); + let action = table + .dispatch(fake_notif(libc::SYS_openat as i32), &ctx, -1) + .await; + + assert!(matches!(action, NotifAction::Continue)); + let recorded = order.lock().unwrap(); + assert_eq!( + *recorded, + [b'B', b'E'], + "builtin must run before extra (insertion order preserved)" + ); + } + + /// First non-`Continue` wins: a handler returning `Errno` short-circuits + /// the chain, and subsequent handlers must not run. This is the + /// invariant that prevents a user-supplied extra from being observed + /// (or, in the inverse direction, prevents an extra's `Errno` from + /// being silently overridden by a later handler that happens to also + /// be registered for the same syscall). + #[tokio::test] + async fn dispatch_stops_at_first_non_continue() { + let mut table = DispatchTable::new(); + let calls = Arc::new(AtomicUsize::new(0)); + + // First handler — returns Errno, must terminate the chain. + let calls_first = Arc::clone(&calls); + table.register( + libc::SYS_openat, + Box::new(move |_n, _c, _f| { + let calls = Arc::clone(&calls_first); + Box::pin(async move { + calls.fetch_add(1, Ordering::SeqCst); + NotifAction::Errno(libc::EACCES) + }) + }), + ); + + // Second handler — must NOT be called. + let calls_second = Arc::clone(&calls); + table.register( + libc::SYS_openat, + Box::new(move |_n, _c, _f| { + let calls = Arc::clone(&calls_second); + Box::pin(async move { + calls.fetch_add(1, Ordering::SeqCst); + NotifAction::Continue + }) + }), + ); + + let ctx = fake_supervisor_ctx(); + let action = table + .dispatch(fake_notif(libc::SYS_openat as i32), &ctx, -1) + .await; + + match action { + NotifAction::Errno(e) => assert_eq!(e, libc::EACCES), + other => panic!("expected Errno(EACCES), got {:?}", other), + } + assert_eq!( + calls.load(Ordering::SeqCst), + 1, + "second handler must not run after first returned non-Continue" + ); + } + + #[test] + fn extras_vec_empty_leaves_table_without_change() { + // build_dispatch_table with empty extras should not add any entries. + // We verify the for-loop degenerates to nop. + let extras: Vec = Vec::new(); + let mut handler_count = 0usize; + for _ in extras { + handler_count += 1; + } + assert_eq!(handler_count, 0, "empty extras registers zero handlers"); + } +} diff --git a/crates/sandlock-core/src/seccomp/notif.rs b/crates/sandlock-core/src/seccomp/notif.rs index da170f2..f50bd31 100644 --- a/crates/sandlock-core/src/seccomp/notif.rs +++ b/crates/sandlock-core/src/seccomp/notif.rs @@ -954,14 +954,23 @@ async fn handle_notification( /// Async event loop that processes seccomp notifications. /// /// Runs until the notification fd is closed (child exits or filter is removed). +/// +/// `extra_handlers` are user-supplied syscall handlers registered after all +/// builtin handlers (see [`super::dispatch::ExtraHandler`]). For the default +/// behaviour without any custom handlers pass an empty `Vec`. pub async fn supervisor( notif_fd: OwnedFd, ctx: Arc, + extra_handlers: Vec, ) { let fd = notif_fd.as_raw_fd(); // Build the dispatch table once at startup. - let dispatch_table = Arc::new(super::dispatch::build_dispatch_table(&ctx.policy, &ctx.resource)); + let dispatch_table = Arc::new(super::dispatch::build_dispatch_table( + &ctx.policy, + &ctx.resource, + extra_handlers, + )); // Try to enable sync wakeup (Linux 6.7+, ignore error on older kernels). try_set_sync_wakeup(fd); diff --git a/crates/sandlock-core/tests/integration.rs b/crates/sandlock-core/tests/integration.rs index 40829cf..b24b074 100644 --- a/crates/sandlock-core/tests/integration.rs +++ b/crates/sandlock-core/tests/integration.rs @@ -54,3 +54,6 @@ mod test_dry_run; #[path = "integration/test_http_acl.rs"] mod test_http_acl; + +#[path = "integration/test_extra_handlers.rs"] +mod test_extra_handlers; diff --git a/crates/sandlock-core/tests/integration/test_extra_handlers.rs b/crates/sandlock-core/tests/integration/test_extra_handlers.rs new file mode 100644 index 0000000..c54050b --- /dev/null +++ b/crates/sandlock-core/tests/integration/test_extra_handlers.rs @@ -0,0 +1,377 @@ +//! Integration tests for the user-supplied `ExtraHandler` API. +//! +//! These tests exercise the full plumbing through the kernel: the guest +//! issues a syscall, the BPF filter raises a `USER_NOTIF`, the supervisor +//! walks the dispatch chain (builtins first, extras last) and the kernel +//! applies the `NotifAction` returned by the extra handler. Any of the +//! following regressions would break them: +//! +//! * extra-handler syscalls not added to the BPF filter → kernel never +//! raises a notification, the handler silently never fires; +//! * extras registered before builtins → handler observes pre-builtin +//! arguments (e.g. unnormalized chroot paths) or short-circuits a +//! security-critical builtin; +//! * `Continue` not translated to `SECCOMP_USER_NOTIF_FLAG_CONTINUE` → +//! observe-only handlers wedge the guest. +//! +//! Each test uses `SYS_uname` because under the default policy it is +//! **not** intercepted by any builtin (`uname` is added only when the +//! policy sets a `hostname`). This isolates the behaviour under test +//! to the extras path. + +use std::path::PathBuf; +use std::sync::atomic::{AtomicUsize, Ordering}; +use std::sync::{Arc, Mutex}; + +use sandlock_core::seccomp::dispatch::{ExtraHandler, HandlerFn}; +use sandlock_core::seccomp::notif::NotifAction; +use sandlock_core::{Policy, Sandbox}; + +/// Read a NUL-terminated path from the sandboxed child's address space. +/// +/// Used by tests that need to inspect which `openat`s actually reached +/// their extra handler. Works without `CAP_SYS_PTRACE` because the test +/// process and the sandboxed child share the same UID, which is the +/// permission `process_vm_readv(2)` actually checks. +fn read_path_from_child(pid: u32, addr: u64) -> Option { + if addr == 0 { + return None; + } + let mut buf = vec![0u8; 4096]; + let local = libc::iovec { + iov_base: buf.as_mut_ptr() as *mut libc::c_void, + iov_len: buf.len(), + }; + let remote = libc::iovec { + iov_base: addr as *mut libc::c_void, + iov_len: buf.len(), + }; + let n = unsafe { libc::process_vm_readv(pid as i32, &local, 1, &remote, 1, 0) }; + if n <= 0 { + return None; + } + buf.truncate(n as usize); + let nul = buf.iter().position(|&b| b == 0)?; + String::from_utf8(buf[..nul].to_vec()).ok() +} + +fn base_policy() -> sandlock_core::PolicyBuilder { + Policy::builder() + .fs_read("/usr") + .fs_read("/lib") + .fs_read("/lib64") + .fs_read("/bin") + .fs_read("/etc") + .fs_read("/proc") + .fs_read("/dev") + .fs_write("/tmp") +} + +fn temp_out(name: &str) -> PathBuf { + std::env::temp_dir().join(format!( + "sandlock-test-extras-{}-{}", + name, + std::process::id(), + )) +} + +/// An extra handler registered on a syscall that the default policy +/// does not intercept (`SYS_uname`) MUST receive notifications and its +/// `NotifAction::Errno` MUST surface in the guest as the corresponding +/// errno. This is the security contract: without BPF plumbing the +/// kernel would never raise USER_NOTIF for `uname` and the handler +/// would silently never fire — the maintainer-cited footgun. +#[tokio::test] +async fn extra_handler_intercepts_syscall_outside_builtin_set() { + let policy = base_policy().build().unwrap(); + let out = temp_out("uname-eacces"); + let cmd = format!("uname -a; echo $? > {}", out.display()); + + let calls = Arc::new(AtomicUsize::new(0)); + let calls_in_handler = Arc::clone(&calls); + let handler: HandlerFn = Box::new(move |_notif, _ctx, _fd| { + let calls = Arc::clone(&calls_in_handler); + Box::pin(async move { + calls.fetch_add(1, Ordering::SeqCst); + NotifAction::Errno(libc::EACCES) + }) + }); + + let extras = vec![ExtraHandler::new(libc::SYS_uname, handler)]; + + let result = Sandbox::run_with_extra_handlers(&policy, &["sh", "-c", &cmd], extras) + .await + .expect("sandbox spawn failed"); + + let contents = std::fs::read_to_string(&out).unwrap_or_default(); + let _ = std::fs::remove_file(&out); + let code: i32 = contents.trim().parse().unwrap_or(-1); + + assert!(result.success(), "shell wrapper should exit 0"); + assert!( + calls.load(Ordering::SeqCst) >= 1, + "extra handler must have fired at least once for SYS_uname" + ); + assert_ne!( + code, 0, + "uname must observe the errno injected by the extra handler" + ); +} + +/// `Continue` must translate into `SECCOMP_USER_NOTIF_FLAG_CONTINUE` so +/// the guest receives the kernel's natural outcome. This guards an +/// observe-only audit handler from accidentally wedging the guest. +#[tokio::test] +async fn extra_handler_continue_lets_syscall_proceed() { + let policy = base_policy().build().unwrap(); + let out = temp_out("uname-continue"); + let cmd = format!("uname -a; echo $? > {}", out.display()); + + let calls = Arc::new(AtomicUsize::new(0)); + let calls_in_handler = Arc::clone(&calls); + let handler: HandlerFn = Box::new(move |_notif, _ctx, _fd| { + let calls = Arc::clone(&calls_in_handler); + Box::pin(async move { + calls.fetch_add(1, Ordering::SeqCst); + NotifAction::Continue + }) + }); + + let extras = vec![ExtraHandler::new(libc::SYS_uname, handler)]; + + let result = Sandbox::run_with_extra_handlers(&policy, &["sh", "-c", &cmd], extras) + .await + .expect("sandbox spawn failed"); + + let contents = std::fs::read_to_string(&out).unwrap_or_default(); + let _ = std::fs::remove_file(&out); + let code: i32 = contents.trim().parse().unwrap_or(-1); + + assert!(result.success()); + assert!( + calls.load(Ordering::SeqCst) >= 1, + "observe-only handler must have seen at least one SYS_uname" + ); + assert_eq!( + code, 0, + "Continue must let the kernel execute uname normally" + ); +} + +/// `Sandbox::run_with_extra_handlers(_, _, vec![])` must be observably +/// identical to `Sandbox::run(_, _)`. Guards the documented backwards +/// compatibility contract. +#[tokio::test] +async fn empty_extras_preserves_default_behaviour() { + let policy = base_policy().build().unwrap(); + + let baseline = Sandbox::run(&policy, &["uname", "-a"]).await.unwrap(); + let with_extras = Sandbox::run_with_extra_handlers(&policy, &["uname", "-a"], Vec::new()) + .await + .unwrap(); + + assert!(baseline.success()); + assert!(with_extras.success()); + assert_eq!(baseline.code(), with_extras.code()); +} + +/// Cross-handler ordering: an extra registered on a syscall that already +/// has builtin handlers must run *after* them, observing the post-builtin +/// `NotifAction::Continue` state. `SYS_openat` is intercepted by the +/// always-on /proc-virtualization builtin which returns `Continue` for +/// non-`/proc` paths; the extra must therefore see those `openat`s. +/// +/// Verifies the ordering contract end-to-end through the kernel — the +/// unit tests only check `Vec` index ordering inside the dispatch table. +#[tokio::test] +async fn extra_handler_runs_after_builtin_returns_continue() { + let policy = base_policy().build().unwrap(); + let out = temp_out("openat-cross"); + let cmd = format!("cat /etc/hostname; echo $? > {}", out.display()); + + let openat_calls = Arc::new(AtomicUsize::new(0)); + let openat_in_handler = Arc::clone(&openat_calls); + let handler: HandlerFn = Box::new(move |_notif, _ctx, _fd| { + let openat_calls = Arc::clone(&openat_in_handler); + Box::pin(async move { + openat_calls.fetch_add(1, Ordering::SeqCst); + // Continue lets the kernel resume the syscall — the builtin + // already returned Continue for non-/proc paths and this + // handler must not break the chain. + NotifAction::Continue + }) + }); + + let extras = vec![ExtraHandler::new(libc::SYS_openat, handler)]; + + let result = Sandbox::run_with_extra_handlers(&policy, &["sh", "-c", &cmd], extras) + .await + .expect("sandbox spawn failed"); + + let contents = std::fs::read_to_string(&out).unwrap_or_default(); + let _ = std::fs::remove_file(&out); + let code: i32 = contents.trim().parse().unwrap_or(-1); + + assert!(result.success()); + assert!( + openat_calls.load(Ordering::SeqCst) >= 1, + "extra on SYS_openat must observe at least one openat after builtins return Continue" + ); + assert_eq!( + code, 0, + "the cat must succeed — Continue from the extra must let the kernel resume" + ); +} + +/// Negative half of the security boundary: when a builtin returns a +/// non-`Continue` action, the extra **must not** fire for that +/// notification. Verified end-to-end through the kernel by relying on +/// the always-on `/proc`-virtualization builtin, which returns `Errno` +/// for `openat` on `/proc/$pid/...` for any pid not in the sandbox set +/// (here: pid 1) and `Continue` for paths outside `/proc`. +/// +/// The handler records the resolved path of every `openat` it observes, +/// so the assertion is structural rather than counter-based: the blocked +/// path must be absent from the observed list, while a peer non-blocked +/// path must be present (proving the extra is wired up, not just silent). +#[tokio::test] +async fn builtin_non_continue_blocks_extra() { + let policy = base_policy().build().unwrap(); + let out = temp_out("openat-blocked-by-builtin"); + let cmd = format!( + "cat /proc/1/cmdline; cat /etc/hostname; echo $? > {}", + out.display() + ); + + let observed: Arc>> = Arc::new(Mutex::new(Vec::new())); + let observed_in_handler = Arc::clone(&observed); + let handler: HandlerFn = Box::new(move |notif, _ctx, _fd| { + let observed = Arc::clone(&observed_in_handler); + Box::pin(async move { + // openat(dirfd, pathname, flags, mode) → args[1] is the path + let path_addr = notif.data.args[1]; + if let Some(p) = read_path_from_child(notif.pid, path_addr) { + observed.lock().unwrap().push(p); + } + NotifAction::Continue + }) + }); + + let extras = vec![ExtraHandler::new(libc::SYS_openat, handler)]; + + let _ = Sandbox::run_with_extra_handlers(&policy, &["sh", "-c", &cmd], extras) + .await + .expect("sandbox spawn failed"); + + let _ = std::fs::remove_file(&out); + let paths = observed.lock().unwrap(); + + let saw_etc_hostname = paths.iter().any(|p| p == "/etc/hostname"); + let saw_proc_pid = paths.iter().any(|p| p.starts_with("/proc/1/")); + + assert!( + saw_etc_hostname, + "extra must observe non-blocked openats, got paths: {:?}", + *paths, + ); + assert!( + !saw_proc_pid, + "extra must NOT observe openats that the procfs builtin blocked with Errno; got paths: {:?}", + *paths, + ); +} + +/// Multiple extras on the same syscall must run in `Vec` order and the +/// chain stops at the first non-`Continue`. Verified end-to-end: +/// `extra1` returns `Continue` and increments `c1`; `extra2` returns +/// `Errno(EACCES)` and increments `c2`. Each guest invocation of +/// `SYS_uname` must produce exactly one increment in each counter +/// (`c1 == c2`), and the guest must observe the `EACCES` from `extra2`. +/// +/// If insertion order were not preserved (`extra2` before `extra1`), +/// `c1` would stay at 0 because the `Errno` from `extra2` would +/// short-circuit the chain before `extra1` ran. +#[tokio::test] +async fn chain_of_extras_runs_in_insertion_order() { + let policy = base_policy().build().unwrap(); + let out = temp_out("chain-order"); + let cmd = format!("uname -a; echo $? > {}", out.display()); + + let c1 = Arc::new(AtomicUsize::new(0)); + let c2 = Arc::new(AtomicUsize::new(0)); + + let c1_in_h = Arc::clone(&c1); + let h1: HandlerFn = Box::new(move |_n, _c, _f| { + let c = Arc::clone(&c1_in_h); + Box::pin(async move { + c.fetch_add(1, Ordering::SeqCst); + NotifAction::Continue + }) + }); + + let c2_in_h = Arc::clone(&c2); + let h2: HandlerFn = Box::new(move |_n, _c, _f| { + let c = Arc::clone(&c2_in_h); + Box::pin(async move { + c.fetch_add(1, Ordering::SeqCst); + NotifAction::Errno(libc::EACCES) + }) + }); + + let extras = vec![ + ExtraHandler::new(libc::SYS_uname, h1), + ExtraHandler::new(libc::SYS_uname, h2), + ]; + + let result = Sandbox::run_with_extra_handlers(&policy, &["sh", "-c", &cmd], extras) + .await + .expect("sandbox spawn failed"); + + let contents = std::fs::read_to_string(&out).unwrap_or_default(); + let _ = std::fs::remove_file(&out); + let code: i32 = contents.trim().parse().unwrap_or(-1); + + let v1 = c1.load(Ordering::SeqCst); + let v2 = c2.load(Ordering::SeqCst); + + assert!(result.success(), "shell wrapper should still exit 0"); + assert!(v1 >= 1, "first handler must have fired"); + assert_eq!( + v1, v2, + "every Continue from extra1 must reach extra2 — got c1={} c2={}", + v1, v2, + ); + assert_ne!( + code, 0, + "uname must observe the EACCES injected by the second handler" + ); +} + +/// Default-deny bypass guard: registering an extra on a syscall in +/// `DEFAULT_DENY_SYSCALLS` (e.g. `mount`) MUST be rejected at registration +/// time. Without this check the extra-syscall ends up in the BPF notif +/// block, which is matched *before* the deny block, so a user handler +/// returning `Continue` would translate into +/// `SECCOMP_USER_NOTIF_FLAG_CONTINUE` and the kernel would actually run +/// `mount` — silently bypassing default deny. +#[tokio::test] +async fn extra_handler_on_default_deny_syscall_is_rejected() { + let policy = base_policy().build().unwrap(); + let handler: HandlerFn = Box::new(|_notif, _ctx, _fd| { + Box::pin(async { NotifAction::Continue }) + }); + let extras = vec![ExtraHandler::new(libc::SYS_mount, handler)]; + + let result = Sandbox::run_with_extra_handlers(&policy, &["true"], extras).await; + + assert!( + result.is_err(), + "extras on a default-deny syscall must be rejected up-front" + ); + let msg = format!("{}", result.unwrap_err()); + assert!( + msg.contains("default-deny") || msg.contains("bypass"), + "error must explain why the registration is rejected, got: {}", + msg + ); +} diff --git a/docs/extension-handlers.md b/docs/extension-handlers.md new file mode 100644 index 0000000..514b5f0 --- /dev/null +++ b/docs/extension-handlers.md @@ -0,0 +1,259 @@ +# `ExtraHandler`: user-supplied seccomp-notification handlers + +`sandlock-core` routes every intercepted syscall through a chain-of-responsibility +table where builtin handlers (`chroot`, `cow`, `procfs`, `network`, `port_remap`, +resource accounting) register for the syscall numbers they care about. Within +each chain, handlers run in registration order; the first +non-[`NotifAction::Continue`](../crates/sandlock-core/src/seccomp/notif.rs) +result wins. + +`ExtraHandler` is the public extension point that lets downstream crates append +their own handler functions to the chain after all builtins. It is the +supported alternative to forking the crate or duplicating +[`notif::supervisor`](../crates/sandlock-core/src/seccomp/notif.rs) — one +[`SECCOMP_FILTER_FLAG_NEW_LISTENER`](https://man7.org/linux/man-pages/man2/seccomp.2.html) +per process means one supervisor, so user code that needs to intercept extra +syscalls in the same sandbox as the builtins must run inside the same dispatch +loop. + +## API + +A handler is an async closure bound to a syscall number: + +```rust +use sandlock_core::seccomp::dispatch::{ExtraHandler, HandlerFn}; +use sandlock_core::seccomp::notif::NotifAction; +use sandlock_core::{Policy, Sandbox}; + +let policy = Policy::builder().fs_read("/usr").fs_write("/tmp").build()?; + +let handler: HandlerFn = Box::new(|notif, _ctx, _fd| { + Box::pin(async move { + // inspect notif.data.args, perform side effects, return action + NotifAction::Continue + }) +}); + +Sandbox::run_with_extra_handlers( + &policy, + &["python3", "-c", "print(42)"], + vec![ExtraHandler::new(libc::SYS_openat, handler)], +).await?; +``` + +[`Sandbox::run`](../crates/sandlock-core/src/sandbox.rs) is preserved and +delegates to `run_with_extra_handlers` with an empty `Vec`, so callers that do +not need extras observe no API change. + +## Semantics + +### Ordering + +For each intercepted syscall: + +1. Builtin handlers registered inside + [`build_dispatch_table`](../crates/sandlock-core/src/seccomp/dispatch.rs) + run first, in their internal registration order. +2. `extra_handlers` run afterwards, in `Vec` order. +3. Multiple extras on the same syscall run in insertion order. + +The chain stops as soon as a handler returns a non-`NotifAction::Continue` +result; subsequent handlers in the chain are not invoked. This contract is +enforced structurally — `build_dispatch_table` registers builtins into an empty +table *before* iterating `extra_handlers`, and the chain evaluator +short-circuits on the first non-`Continue`. + +The contract is exercised at two layers: + +- Unit, in [`seccomp::dispatch::extra_handler_tests`](../crates/sandlock-core/src/seccomp/dispatch.rs): + `dispatch_walks_chain_in_registration_order`, + `dispatch_runs_builtin_before_extra`, + `dispatch_stops_at_first_non_continue` drive the actual `dispatch()` walker + against a minimal `SupervisorCtx`. +- End-to-end, in [`tests/integration/test_extra_handlers.rs`](../crates/sandlock-core/tests/integration/test_extra_handlers.rs): + `extra_handler_runs_after_builtin_returns_continue`, + `builtin_non_continue_blocks_extra`, + `chain_of_extras_runs_in_insertion_order` drive a live Landlock+seccomp + sandbox. + +### Return values + +`HandlerFn` returns [`NotifAction`](../crates/sandlock-core/src/seccomp/notif.rs): + +| Variant | Effect | +|---|---| +| `Continue` | Fall through to the next handler in the chain; if last, the kernel resumes the syscall (`SECCOMP_USER_NOTIF_FLAG_CONTINUE`). | +| `Errno(e)` | Return `-e` to the guest; the kernel does not run the syscall. | +| `ReturnValue(val)` | Return `val` to the guest; the kernel does not run the syscall (useful for faking `write` and similar). | +| `InjectFd { srcfd, targetfd }` | Inject `srcfd` into the guest at slot `targetfd`, then continue. | +| `Kill { sig, pgid }` | Send `sig` to the guest's process group. | + +### Continue-site safety + +The supervisor processes notifications sequentially in a single tokio task, so +the response sent for one notification gates the kernel resumption of the +trapped syscall. A handler must not hold any +[`SupervisorCtx`](../crates/sandlock-core/src/seccomp/ctx.rs) internal lock +(`tokio::sync::Mutex`/`RwLock`) across an `.await` point: if the guard is alive +when control returns to the supervisor loop, the next notification that needs +the same lock parks, the response for the current notification is not sent, and +the child stays trapped in the syscall. Acquire, mutate, drop — `await` only +after the guard is out of scope. See [issue #27][i27] for the underlying +contract that this convention extends to user handlers. + +[i27]: https://github.com/multikernel/sandlock/issues/27 + +## Security boundary + +Extras run after builtins. By the time a user handler observes a notification, +builtins have already normalised paths (chroot), validated access (Landlock +pre-checks at the BPF/notif layer), and short-circuited any call that conflicts +with the policy. + +Extras cannot: + +- Remove a builtin handler. +- Reorder a builtin handler to run after the extra. +- Skip a builtin's `Errno`/`ReturnValue`/`Kill` response. + +Extras can: + +- Observe every syscall sandlock intercepts via `SECCOMP_RET_USER_NOTIF` — + builtins for that syscall must have returned `Continue` for the extra to + see it. +- Fake results (`ReturnValue`, `Errno`) — but only after the builtins for the + same syscall returned `Continue`, so they cannot subvert confinement. + +### BPF coverage + +`run_with_extra_handlers` collects the syscall numbers declared by the supplied +`Vec` and merges them into the cBPF notification list installed +in the child before `execve`. Without this step the kernel never raises +`SECCOMP_RET_USER_NOTIF` for a syscall that no builtin intercepts, and the user +handler silently never fires. The merge is dedup-aware: an `openat` registered +both by a builtin and an extra produces a single JEQ in the assembled program. + +### Deny-list bypass guard + +The cBPF program emits notif JEQs *before* deny JEQs, so a syscall present in +both lists hits `SECCOMP_RET_USER_NOTIF` first. An extra registered on a +syscall in +[`DEFAULT_DENY_SYSCALLS`](../crates/sandlock-core/src/sys/structs.rs) — or in +`policy.deny_syscalls` — would convert a kernel-deny into a user-supervised +path; a handler returning `NotifAction::Continue` would become +`SECCOMP_USER_NOTIF_FLAG_CONTINUE` and the kernel would actually run the +syscall, silently bypassing deny. + +`run_with_extra_handlers` rejects this configuration at registration time +(before fork) and returns `SandboxError::Child` naming the offending syscall. +The check is implemented in +[`validate_extras_against_policy`](../crates/sandlock-core/src/seccomp/dispatch.rs) +and covers both the default-deny branch (`DEFAULT_DENY_SYSCALLS`) and the +user-specified branch (`policy.deny_syscalls`); both branches are unit-tested +(`validate_extras_rejects_user_specified_deny`, +`extra_handler_on_default_deny_syscall_is_rejected`, +`extra_handler_on_user_specified_deny_is_rejected`). + +In allowlist mode (`policy.allow_syscalls = Some(_)`) the resolved deny list is +empty and the guard is a no-op — but so is the BPF deny block, and confinement +comes entirely from the kernel-enforced allowlist, so there is no overlap to +bypass. + +## Panics + +`DispatchTable::dispatch` does not wrap handler calls in `catch_unwind`. A +panic inside a user handler propagates up the `tokio::spawn` task that drives +the supervisor, leading to task failure and the child being killed by +sandlock's watchdog. + +To tolerate bugs in downstream handlers, wrap each one with +[`futures::FutureExt::catch_unwind`][catch] (the synchronous +`std::panic::catch_unwind` does not apply to async futures): + +```rust +use futures::future::FutureExt as _; +use std::panic::AssertUnwindSafe; + +let safe: HandlerFn = Box::new(|notif, ctx, fd| { + Box::pin(async move { + AssertUnwindSafe(actual_handler(notif, ctx, fd)) + .catch_unwind() + .await + .unwrap_or(NotifAction::Continue) // fail-open on panic + }) +}); +``` + +[catch]: https://docs.rs/futures/latest/futures/future/trait.FutureExt.html#method.catch_unwind + +## Use cases + +### VFS engine: real-time uploads to object storage + +A deployment that streams guest-generated artefacts to object storage as the +process runs (rather than collecting them after exit) needs interceptors on +`openat(O_CREAT)`, `write`, and `close` to translate filesystem operations +into multipart-upload calls. Those interceptors must live inside the same +supervisor task as sandlock's builtins — `SECCOMP_FILTER_FLAG_NEW_LISTENER` +allows only one listener per process, so a second supervisor cannot run +alongside. + +```rust +let extras = vec![ + ExtraHandler::new(libc::SYS_openat, s3_open_handler), + ExtraHandler::new(libc::SYS_write, s3_write_handler), + ExtraHandler::new(libc::SYS_close, s3_close_handler), +]; +Sandbox::run_with_extra_handlers(&policy, &cmd, extras).await?; +``` + +Each handler observes the post-builtin view: by the time `s3_open_handler` +runs, the `openat` arguments are already chroot-normalised, so the path the +handler inspects can be trusted against the configured policy. + +### Deterministic audit trail for compliance + +Regulated environments (CIS, GDPR data-residency) require a guaranteed audit +log of every file read/write the user code performs, tamper-proof against the +guest. Python wrappers (`wrapt`, import hooks) are easy for the guest to +circumvent through `ctypes` or raw syscalls; eBPF file tracing requires +`CAP_BPF`, which is often unavailable in managed Kubernetes. + +An `ExtraHandler` on `SYS_openat`/`SYS_write`/`SYS_unlinkat` captures the call +before the kernel acts on it. The guest cannot bypass it without bypassing +seccomp itself, which sandlock blocks at the BPF level. + +A minimal runnable example lives in +[`examples/openat_audit.rs`](../crates/sandlock-core/examples/openat_audit.rs). + +## Limitations + +- **No builtin override.** Security-critical handlers (`chroot`, `cow`) always + run first. To change builtin behaviour, modify sandlock directly. +- **No before-builtin priority.** An audit handler that wants to observe calls + rejected by builtins is a coherent use case, but it requires a + `HandlerPriority` enum that has not been added; the current API only supports + appending to the chain. +- **No declarative `Policy` extension.** Adding handlers is a runtime action, + not a serialisable part of the policy. `Policy` remains a pure data struct. + +## Backwards compatibility + +`Sandbox::run(policy, cmd)` is preserved and delegates to +`Sandbox::run_with_extra_handlers(policy, cmd, Vec::new())`. Existing unit and +integration tests pass without modification; downstream callers that do not +need extras need no change. + +## Downstream usage + +A typical downstream crate exports a builder: + +```rust +pub fn build_vfs_handlers( + config: VfsConfig, +) -> Vec { /* ... */ } +``` + +which the supervisor binary passes straight into `run_with_extra_handlers`. The +crate links against `sandlock-core` as an ordinary dependency — no fork, no +`[patch.crates-io]`, no duplication of `notif::supervisor`. From eaba8f3a55fde1c227a5827d0e9f04f9d307bd61 Mon Sep 17 00:00:00 2001 From: dzerik Date: Fri, 1 May 2026 18:45:40 +0300 Subject: [PATCH 2/5] test: cover policy.deny_syscalls reject path for ExtraHandler MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Symmetric to the existing default-deny coverage, exercises the user-specified branch of `deny_syscall_numbers` (when `Policy::deny_syscalls` is set, it overrides DEFAULT_DENY). Without this branch covered, a future regression in user-list resolution would silently let an extra register on a caller-denied syscall and `Continue` would translate to `SECCOMP_USER_NOTIF_FLAG_CONTINUE`, bypassing the kernel deny. Both tests use `SYS_mremap`: it is recognised by `syscall_name_to_nr` but not present in `DEFAULT_DENY_SYSCALLS`, so it lands on the deny list only via the user-supplied branch — isolating that path from the default-deny path covered by `extra_handler_on_default_deny_syscall_is_rejected`. - Unit (`seccomp::dispatch::extra_handler_tests:: validate_extras_rejects_user_specified_deny`): drives `validate_extras_against_policy` directly, no kernel dependency, so the contract is enforced even on hosts where seccomp integration tests are skipped. - Integration (`test_extra_handlers:: extra_handler_on_user_specified_deny_is_rejected`): drives the full `Sandbox::run_with_extra_handlers` rejection path; asserts the offending syscall number is surfaced in the error. Addresses review feedback on PR #20. Signed-off-by: dzerik --- crates/sandlock-core/src/seccomp/dispatch.rs | 34 +++++++++++++++ .../tests/integration/test_extra_handlers.rs | 43 +++++++++++++++++++ 2 files changed, 77 insertions(+) diff --git a/crates/sandlock-core/src/seccomp/dispatch.rs b/crates/sandlock-core/src/seccomp/dispatch.rs index 2f89275..cea5c2e 100644 --- a/crates/sandlock-core/src/seccomp/dispatch.rs +++ b/crates/sandlock-core/src/seccomp/dispatch.rs @@ -1026,4 +1026,38 @@ mod extra_handler_tests { } assert_eq!(handler_count, 0, "empty extras registers zero handlers"); } + + /// `validate_extras_against_policy` must reject extras whose syscall is in + /// the policy's user-specified `deny_syscalls` list, with the same + /// rationale as DEFAULT_DENY: the BPF program emits notif JEQs before + /// deny JEQs, so a user handler returning `Continue` would translate into + /// `SECCOMP_USER_NOTIF_FLAG_CONTINUE` and silently bypass the kernel-level + /// deny. + /// + /// Uses `mremap` because it is in `syscall_name_to_nr` but not in + /// `DEFAULT_DENY_SYSCALLS` — putting it into `deny_syscalls` is the only + /// way it ends up on the deny list, so the test isolates the user-supplied + /// path of `deny_syscall_numbers` from the default branch covered by + /// `extra_handler_on_default_deny_syscall_is_rejected`. + /// + /// Pure-logic counterpart to the integration test of the same name — + /// runs without a live sandbox so the contract is enforced even on + /// hosts where seccomp integration tests are skipped. + #[test] + fn validate_extras_rejects_user_specified_deny() { + let policy = crate::policy::Policy::builder() + .deny_syscalls(vec!["mremap".into()]) + .build() + .expect("policy builds"); + let handler: HandlerFn = + Box::new(|_n, _c, _f| Box::pin(async { NotifAction::Continue })); + let extras = vec![ExtraHandler::new(libc::SYS_mremap, handler)]; + + let result = validate_extras_against_policy(&extras, &policy); + assert_eq!( + result, + Err(libc::SYS_mremap as u32), + "extras on user-specified deny must be rejected, naming the offending syscall" + ); + } } diff --git a/crates/sandlock-core/tests/integration/test_extra_handlers.rs b/crates/sandlock-core/tests/integration/test_extra_handlers.rs index c54050b..dd130a5 100644 --- a/crates/sandlock-core/tests/integration/test_extra_handlers.rs +++ b/crates/sandlock-core/tests/integration/test_extra_handlers.rs @@ -375,3 +375,46 @@ async fn extra_handler_on_default_deny_syscall_is_rejected() { msg ); } + +/// User-supplied `policy.deny_syscalls` must be honoured by the same guard +/// that protects DEFAULT_DENY: an extra registered on a syscall the caller +/// explicitly asked to deny would otherwise let a `Continue` from the +/// handler reach the deny-JEQ via the notif path and bypass the kernel +/// rejection at user-space discretion. +/// +/// Counterpart to `extra_handler_on_default_deny_syscall_is_rejected`, +/// driving the user-list branch of `deny_syscall_numbers` (see +/// `crates/sandlock-core/src/context.rs`). Uses `SYS_mremap` because it is +/// in `syscall_name_to_nr` but **not** in DEFAULT_DENY — putting it into +/// `deny_syscalls` is the only way it lands on the deny list, isolating the +/// user-supplied branch under test from the default-deny branch. +#[tokio::test] +async fn extra_handler_on_user_specified_deny_is_rejected() { + let policy = base_policy() + .deny_syscalls(vec!["mremap".into()]) + .build() + .unwrap(); + let handler: HandlerFn = Box::new(|_notif, _ctx, _fd| { + Box::pin(async { NotifAction::Continue }) + }); + let extras = vec![ExtraHandler::new(libc::SYS_mremap, handler)]; + + let result = Sandbox::run_with_extra_handlers(&policy, &["true"], extras).await; + + assert!( + result.is_err(), + "extras on a user-specified deny syscall must be rejected up-front" + ); + let msg = format!("{}", result.unwrap_err()); + assert!( + msg.contains("bypass"), + "error must explain why the registration is rejected, got: {}", + msg + ); + assert!( + msg.contains(&libc::SYS_mremap.to_string()), + "error must surface the offending syscall number ({}), got: {}", + libc::SYS_mremap, + msg + ); +} From 9f7d1b24f82c8676196056b05d6ea8ad1e5460c2 Mon Sep 17 00:00:00 2001 From: dzerik Date: Fri, 1 May 2026 19:47:22 +0300 Subject: [PATCH 3/5] style: align handler-closure parameter names in unit tests MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The integration tests and extra_handler_ctor_preserves_fields use full parameter names (_notif, _ctx, _fd). The dispatch ordering unit tests inherited the abbreviated form (_n, _c, _f) from earlier iterations of this test mod; align them with the rest for in-mod consistency. No behaviour change — all 228 sandlock-core unit tests and 8 extra_handlers integration tests still pass. Signed-off-by: dzerik --- crates/sandlock-core/src/seccomp/dispatch.rs | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/crates/sandlock-core/src/seccomp/dispatch.rs b/crates/sandlock-core/src/seccomp/dispatch.rs index cea5c2e..0dd53af 100644 --- a/crates/sandlock-core/src/seccomp/dispatch.rs +++ b/crates/sandlock-core/src/seccomp/dispatch.rs @@ -884,7 +884,7 @@ mod extra_handler_tests { let order = Arc::clone(&order); table.register( libc::SYS_openat, - Box::new(move |_n, _c, _f| { + Box::new(move |_notif, _ctx, _fd| { let order = Arc::clone(&order); Box::pin(async move { order.lock().unwrap().push(tag); @@ -924,7 +924,7 @@ mod extra_handler_tests { let order_builtin = Arc::clone(&order); table.register( libc::SYS_openat, - Box::new(move |_n, _c, _f| { + Box::new(move |_notif, _ctx, _fd| { let order = Arc::clone(&order_builtin); Box::pin(async move { order.lock().unwrap().push(b'B'); @@ -938,7 +938,7 @@ mod extra_handler_tests { let order_extra = Arc::clone(&order); let extra = ExtraHandler::new( libc::SYS_openat, - Box::new(move |_n, _c, _f| { + Box::new(move |_notif, _ctx, _fd| { let order = Arc::clone(&order_extra); Box::pin(async move { order.lock().unwrap().push(b'E'); @@ -977,7 +977,7 @@ mod extra_handler_tests { let calls_first = Arc::clone(&calls); table.register( libc::SYS_openat, - Box::new(move |_n, _c, _f| { + Box::new(move |_notif, _ctx, _fd| { let calls = Arc::clone(&calls_first); Box::pin(async move { calls.fetch_add(1, Ordering::SeqCst); @@ -990,7 +990,7 @@ mod extra_handler_tests { let calls_second = Arc::clone(&calls); table.register( libc::SYS_openat, - Box::new(move |_n, _c, _f| { + Box::new(move |_notif, _ctx, _fd| { let calls = Arc::clone(&calls_second); Box::pin(async move { calls.fetch_add(1, Ordering::SeqCst); @@ -1050,7 +1050,7 @@ mod extra_handler_tests { .build() .expect("policy builds"); let handler: HandlerFn = - Box::new(|_n, _c, _f| Box::pin(async { NotifAction::Continue })); + Box::new(|_notif, _ctx, _fd| Box::pin(async { NotifAction::Continue })); let extras = vec![ExtraHandler::new(libc::SYS_mremap, handler)]; let result = validate_extras_against_policy(&extras, &policy); From 272ae0d0c0a7a8f70afeb0bced33369722b7787c Mon Sep 17 00:00:00 2001 From: dzerik Date: Wed, 29 Apr 2026 22:01:02 +0300 Subject: [PATCH 4/5] feat(seccomp): expose read_child_mem / write_child_mem as public API MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Downstream crates that use Sandbox::run_with_extra_handlers need to inspect and modify guest memory from inside their handlers — read the path argument of openat, copy a write() buffer into a chunked S3 upload, synthesise a fake getdents64 directory listing. The TOCTOU-safe wrappers in seccomp::notif were already correct (id_valid before + after process_vm_readv / process_vm_writev), they were just pub(crate). Promote them to pub. The internal *_vm variants remain private. NotifError and SeccompNotif are already public, so this completes the guest-memory-access API surface for ExtraHandlers. Signed-off-by: dzerik --- crates/sandlock-core/src/seccomp/notif.rs | 22 ++++++++++++++-------- 1 file changed, 14 insertions(+), 8 deletions(-) diff --git a/crates/sandlock-core/src/seccomp/notif.rs b/crates/sandlock-core/src/seccomp/notif.rs index f50bd31..2be0f6d 100644 --- a/crates/sandlock-core/src/seccomp/notif.rs +++ b/crates/sandlock-core/src/seccomp/notif.rs @@ -371,11 +371,16 @@ fn write_child_mem_vm(pid: u32, addr: u64, data: &[u8]) -> Result<(), NotifError } } -/// Read bytes from a child process via process_vm_readv. +/// Read bytes from a child process via `process_vm_readv` with TOCTOU validation. /// -/// Performs TOCTOU validation by calling `id_valid` before and after -/// the read to ensure the notification is still live. -pub(crate) fn read_child_mem( +/// Calls `id_valid` before and after the read to ensure the notification is +/// still live (kernel did not abort or release the trapped syscall while the +/// supervisor was reading guest memory). +/// +/// Public — used by downstream `ExtraHandler`s (sandbox-sber/vfs-engine etc.) +/// to read syscall arguments that the kernel passes by pointer (paths in +/// `openat`, buffers in `write`/`writev`). +pub fn read_child_mem( notif_fd: RawFd, id: u64, pid: u32, @@ -420,11 +425,12 @@ pub(crate) fn read_child_cstr( String::from_utf8(result).ok() } -/// Write bytes to a child process via process_vm_writev. +/// Write bytes to a child process via `process_vm_writev` with TOCTOU validation. /// -/// Performs TOCTOU validation by calling `id_valid` before and after -/// the write to ensure the notification is still live. -pub(crate) fn write_child_mem( +/// Same TOCTOU contract as [`read_child_mem`]. Public for downstream +/// `ExtraHandler`s that synthesise syscall results into guest memory +/// (e.g. fake `getdents64` listings populated from a virtual tree-index). +pub fn write_child_mem( notif_fd: RawFd, id: u64, pid: u32, From 8f20ab86f23d6a507fd34e5338d313c183fa026f Mon Sep 17 00:00:00 2001 From: dzerik Date: Fri, 1 May 2026 23:39:53 +0300 Subject: [PATCH 5/5] fix(test): use fs_read_if_exists for /lib64 in test_extra_handlers MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The arm64 Ubuntu runner (and aarch64 hosts in general) do not have `/lib64` — the dynamic linker lives under `/lib/aarch64-linux-gnu/`. A strict `fs_read("/lib64")` makes Landlock refuse to add the rule and the child exits before completing confinement, surfacing in the parent as `pipe closed before 4 bytes read`. Switch to `fs_read_if_exists("/lib64")` to mirror the convention already used by `test_dry_run`, `test_fork`, `test_netlink_virt`, and `test_landlock`. Brings the six previously-failing tests on the `ubuntu-24.04-arm` CI job back to green; x86_64 unchanged. Signed-off-by: dzerik --- .../tests/integration/test_extra_handlers.rs | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/crates/sandlock-core/tests/integration/test_extra_handlers.rs b/crates/sandlock-core/tests/integration/test_extra_handlers.rs index dd130a5..0ca1d5d 100644 --- a/crates/sandlock-core/tests/integration/test_extra_handlers.rs +++ b/crates/sandlock-core/tests/integration/test_extra_handlers.rs @@ -56,10 +56,17 @@ fn read_path_from_child(pid: u32, addr: u64) -> Option { } fn base_policy() -> sandlock_core::PolicyBuilder { + // `fs_read_if_exists` for `/lib64` because aarch64 hosts (Ubuntu CI + // arm64 runner) do not have it — the dynamic linker lives under + // `/lib/aarch64-linux-gnu/`. A strict `fs_read` here makes Landlock + // refuse to add the rule and the child exits before completing + // confinement, surfacing as `pipe closed before 4 bytes read` + // in the parent. Mirrors the convention used in upstream + // `test_dry_run`, `test_fork`, `test_netlink_virt`, `test_landlock`. Policy::builder() .fs_read("/usr") .fs_read("/lib") - .fs_read("/lib64") + .fs_read_if_exists("/lib64") .fs_read("/bin") .fs_read("/etc") .fs_read("/proc")