fix(net): empty net_allow deny-all bypass + strengthen false-passing security tests by congwang-mk · Pull Request #103 · multikernel/sandlock

congwang-mk · 2026-06-13T05:28:35Z

Summary

Everything here traces to one failure mode: security tests that pass without the security actually happening. Fixing those tests exposed real bugs in the enforcement they were supposed to cover.

1. `net_allow=[]` silently allowed all outbound TCP (Landlock bypass)

Once any fs grant turns on the named-AF_UNIX connect gate (has_unix_fs_gate — i.e. essentially always), connect() is trapped for every address family. The on-behalf handler then performed IP connects in the unconfined supervisor, bypassing the child's Landlock CONNECT_TCP deny-all. So an empty net_allow (deny-all) permitted all outbound TCP.

Fix: bring connect() in line with the invariant the sendto/sendmsg handlers already follow — no network destination policy ⇒ nothing to enforce ⇒ return the syscall to the kernel so Landlock governs it. Also renames the misnamed has_net_allowlist → has_net_destination_policy and corrects the NetworkPolicy::Unrestricted doc.

Regression introduced in 4a8a7bf (post-v0.8.3 tag). The published v0.8.3 release is NOT affected — verified against the PyPI wheel; only source builds from main between 4a8a7bf and this PR are.

2. False-passing security tests

A cluster of tests exercised a deny/restrict API but only asserted "the process still ran", so they passed even when enforcement was a complete no-op:

network deny tests connecting to a dead port (failure was incidental, not enforced)
restrict_network([]) — an empty list is a no-op
restrict_pid_network — ran echo, never made a network call
restrict_max_memory / restrict_max_processes — restricted, then ran echo, asserted nothing

All now use live loopback listeners (so an allowed op would actually succeed) or real allocations/forks, and assert the enforcement-specific outcome (EACCES/EPERM/EAGAIN, killed process, refused connect). Equivalently weak tests in test_mcp_integration.py / test_pipeline.py are fixed too.

3. Three `policy_fn` product bugs the strengthening uncovered

Making those tests real revealed the underlying dynamic-policy features were broken:

restrict_max_memory was a no-op — handle_memory read the static NotifPolicy limit, never the LivePolicy that restrict_max_memory writes. Now reads the live (tightened) limit.
restrict_max_processes was a no-op — same shape in handle_fork (read static rs.max_processes). Now reads the live limit.
policy_fn + fork() deadlocked the sandbox. The fork-event ptrace tracking that policy_fn enables hung on any fork. Root cause unfolded into three couplings, all fixed:
1. Pre-Continue ptrace-stop deadlock — prepare INTERRUPT'd and waited for a stop before the seccomp Continue, but a tracee parked in the notify wait can't reach a ptrace-stop. Now SEIZE-only before Continue; the fork event is awaited after.
2. Single-thread coordinator, no busy-poll — ptrace commands are per-tracer-thread (cross-thread = ESRCH; only waitpid may be cross-thread), so the whole SEIZE → wait → detach runs in one spawn_blocking worker driven over a channel. A successful fork is caught by a single blocking waitpid; a failed fork (no event) is bounded by an async tokio::time watchdog that wakes the worker via kill(SIGURG). No dedicated thread, no spin, no INTERRUPT race.
3. Competing waiter — Sandbox::wait() reaped the top-level child with a waitpid(pid,0) loop, and waitpid (any flags) reaps a tracee's ptrace-stops, so it raced the fork-tracking worker and hung repeated forks. wait() now detects exit via the child's pidfd + AsyncFd (readable on exit only, never consuming ptrace-stops), mirroring spawn_pid_watcher.

Fork tracking still uses ptrace (the namespace-less TOCTOU guarantee requires TRACEFORK); pidfd is only for exit detection.

Test plan

cargo test -p sandlock-core — 398 lib + 257 integration pass (incl. new test_empty_net_allow_denies_tcp_despite_fs_grants and the fork-tracking integration test exercising the new coordinator).
pytest python/tests — 354 pass.
Stress: 200 sequential forks under policy_fn, forced fork failures (EAGAIN), and repeated runs all complete with zero hangs.
Egress matrix: net_allow=[] → EACCES; matching rule → ok; non-matching → refused; *:* → ok.

`test_no_network_blocks_connect` connected to 127.0.0.1:80 (nothing listening) and treated any OSError as "blocked", so it passed with zero network enforcement. `test_xoa_executor_no_network` asserted `not result.success` on a malformed executor command, failing for plumbing reasons rather than because egress was denied. Both are negative tests that assert "the connection failed" without controlling why. Rebind each against a live loopback listener so an *allowed* connect would succeed, and assert the connect is denied with EACCES (errno 13) — Landlock's CONNECT_TCP hook. This makes them discriminate a real deny from an incidental failure. Signed-off-by: Cong Wang <cwang@multikernel.io>

Cover the contract that an empty `net_allow` denies all outbound TCP even when filesystem grants are present — the configuration in which any fs grant turns on the named-AF_UNIX connect gate (`has_unix_fs_gate`) and `connect()` starts being trapped for every address family. - Rust: crates/sandlock-core/tests/integration/test_network.rs - Python: python/tests/test_sandbox.py (TestNetAllowDenyAll) Both use a live loopback listener so an *allowed* connect would succeed, and assert the connect is denied with EACCES (errno 13). They fail on the pre-fix code (the supervisor performs the connect on-behalf, bypassing Landlock) and pass once the on-behalf path defers unconstrained IP connects to the kernel. Signed-off-by: Cong Wang <cwang@multikernel.io>

… invariant The sendto/sendmsg on-behalf handlers already follow a clear rule: with a network destination policy active the supervisor is the IP-level enforcer and must never Continue; without one there is nothing to enforce, so the syscall is returned to the kernel and Landlock/BPF govern it. connect() violated that invariant. Once any fs grant turns on the named- AF_UNIX connect gate, connect() is trapped for every address family, and the handler performed IP connects on-behalf unconditionally — in the (unconfined) supervisor, bypassing the child's Landlock CONNECT_TCP decision. With fs grants present (i.e. essentially always), an empty net_allow (deny-all) silently allowed all outbound TCP. Bring connect() in line with the datagram handlers: add `NotifPolicy::ip_connect_supervised(dest_is_loopback)` and return SECCOMP_USER_NOTIF_FLAG_CONTINUE for an IP connect that needs no supervision. The child then issues the connect directly and its own Landlock CONNECT_TCP rules govern it — deny-all for an empty net_allow, allow-all for `*:*` (CONNECT_TCP left unhandled). port-remap still needs on-behalf, but only for loopback, so it is the sole extra reason to supervise an otherwise-unpoliced IP connect. AllowList/DenyList checks, HTTP ACL, IP allowlists, and UDP/ICMP handling are unchanged. Also rename the flag that drives this invariant from the misleading `has_net_allowlist` to `has_net_destination_policy` — it is set from `network_destination_policy` (net_allow OR net_deny OR policy_fn OR http_acl), not just an allowlist — so the connect gate, the sendto/sendmsg fast path, and the dispatch registration all read honestly. This also makes the redundant `|| has_http_acl` (already subsumed) in the connect trap condition visible, and drops it. With the empty case now gated to Continue, `NetworkPolicy::Unrestricted` is only ever reached meaning "allow all", so its doc is corrected accordingly. Signed-off-by: Cong Wang <cwang@multikernel.io>

…p assert) test_deny_returns_true connected to a dead port (127.0.0.1:1) with connect_ex, ignored the result, and asserted only that the process ran — so it passed even if the deny did nothing (port 1 fails regardless). Same false-positive shape as the network deny-tests fixed in b7d800a. Rebind it against a live loopback listener on an allowlisted port (so the connect would succeed without the deny) and assert EPERM (errno 1) plus a post-deny marker proving the process survives. Adds a shared `_loopback_listener` helper for the deny-tests that follow. Signed-off-by: Cong Wang <cwang@multikernel.io>

test_deny_with_errno connected to 127.0.0.1:1 — outside the net_allow allowlist — so Landlock denied it with the same errno 13 the policy_fn returns. The assertion passed even if the policy_fn errno path were a no-op. Target a live listener on an allowlisted port instead: Landlock permits it and the listener would accept it, so errno 13 can only originate from the policy_fn verdict. Signed-off-by: Cong Wang <cwang@multikernel.io>

test_restrict_network_on_execve called restrict_network([]) — an empty list is a no-op (the override is skipped when it lists no IPs) — and asserted only that the program printed output. It verified nothing about network restriction. Use two live loopback listeners (127.0.0.1 and 127.0.0.2), both allowlisted so either would connect, then restrict to ["127.0.0.1"] and assert the first connects while the second is refused (errno 111). Without enforcement both would connect. Signed-off-by: Cong Wang <cwang@multikernel.io>

…ed IPs test_restrict_pid_network restricted the exec'd pid then ran `echo ok` — it made no network call, so it observed nothing. Mirror the restrict_network test: two live loopback listeners (both allowlisted), restrict the pid to ["127.0.0.1"], and assert the listed IP connects while the non-listed one is refused (errno 111). Signed-off-by: Cong Wang <cwang@multikernel.io>

…al test handle_memory read the limit from the static NotifPolicy snapshot (`policy.max_memory_bytes`) and never from the LivePolicy that `restrict_max_memory` writes, so dynamically tightening the limit had no effect — a sandbox could allocate far past the restricted size. Read the live limit when a policy_fn is active (it is seeded from the static `max_memory` ceiling and only lowered by `restrict_max_memory`), falling back to the static limit otherwise. The policy_fn lock is taken and released before the resource lock, preserving lock ordering. test_restrict_max_memory previously restricted memory, ran `echo`, and asserted nothing. It now sets a 256 MiB ceiling, restricts to 64 MiB, and asserts a 128 MiB allocation is killed — while the same allocation under the un-restricted ceiling (control) succeeds, proving the kill is due to the dynamic restriction. Signed-off-by: Cong Wang <cwang@multikernel.io>

With policy_fn active, every fork-like syscall is ptrace-traced so the new child is registered before it can run user code (argv-safety TOCTOU). That path deadlocked, and fixing it surfaced three couplings, all addressed here. 1. Pre-Continue ptrace-stop deadlock. prepare_process_creation_tracking PTRACE_INTERRUPT'd the caller and waited for a stop *before* the seccomp Continue was sent — but a task parked in the seccomp user-notification wait cannot reach a ptrace-stop, so waitpid blocked forever. Now the tracee is only PTRACE_SEIZE'd before Continue (SEIZE does not stop it); the fork event is awaited afterward. 2. Single-thread coordinator, no busy-poll. ptrace *commands* are per-tracer-thread (cross-thread issue fails with ESRCH; only waitpid may be cross-thread), so the whole SEIZE -> wait -> detach sequence runs in one spawn_blocking worker, driven over a channel by finish() after Continue. The worker uses a single blocking waitpid (deterministic for a successful fork — the event is synchronous). A *failed* fork emits no event, so finish() bounds the wait with an async tokio::time watchdog that wakes the worker via kill(SIGURG); the worker treats that signal as "no child". No dedicated thread, no spinning, no INTERRUPT race. 3. Competing waiter. Sandbox::wait() reaped the top-level child with a waitpid(pid, 0) loop, and waitpid (any flags) reaps a tracee's ptrace-stops — so it raced the fork-tracking worker for the child's fork events and hung repeated forks. wait() now detects exit via the child's pidfd + AsyncFd (readable on exit only, never consuming ptrace-stops), mirroring spawn_pid_watcher, then reaps the status. A blocking waitpid fallback remains for kernels without pidfd_open. Verified: 200 sequential forks under policy_fn, forced fork failures (EAGAIN), and repeated separate runs all complete with no hangs; full Rust (398 lib + 257 integration) and Python (354) suites pass. Signed-off-by: Cong Wang <cwang@multikernel.io>

handle_fork compared the concurrent process count against the static ResourceState.max_processes and never against the LivePolicy that restrict_max_processes writes, so dynamically tightening the process limit had no effect. Read the live limit when a policy_fn is active (it is seeded from the static max_processes and only lowered by restrict_max_processes), falling back to the static limit otherwise — mirroring the restrict_max_memory fix. handle_fork now takes the SupervisorCtx (for ctx.policy_fn and ctx.resource) instead of the resource handle alone; the policy_fn lock is taken and released before the resource lock to preserve lock ordering. Verified: restrict_max_processes(1) now denies the fork with EAGAIN while the unrestricted baseline allows it, deterministically and without hanging (the latter relying on the fork-tracking deadlock fix). Signed-off-by: Cong Wang <cwang@multikernel.io>

test_restrict_max_processes restricted the limit then ran `echo` and asserted nothing. Restrict to 1 and fork: assert the fork is denied with EAGAIN (errno 11), while the unrestricted control run forks successfully — proving the denial is due to the dynamic limit. Now meaningful (and no longer hangs) given the fork-tracking deadlock fix and the restrict_max_processes enforcement fix. Signed-off-by: Cong Wang <cwang@multikernel.io>

congwang-mk added 11 commits June 12, 2026 19:55

congwang-mk merged commit 9df56bd into main Jun 13, 2026
12 checks passed

congwang-mk deleted the fix/empty-net-allow-deny-all branch June 13, 2026 23:52

congwang-mk mentioned this pull request Jun 14, 2026

test: strengthen Rust policy_fn tests + fix c_smoke cdylib build (follow-ups to #103) #104

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(net): empty net_allow deny-all bypass + strengthen false-passing security tests#103

fix(net): empty net_allow deny-all bypass + strengthen false-passing security tests#103
congwang-mk merged 11 commits into
mainfrom
fix/empty-net-allow-deny-all

congwang-mk commented Jun 13, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

congwang-mk commented Jun 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

1. net_allow=[] silently allowed all outbound TCP (Landlock bypass)

2. False-passing security tests

3. Three policy_fn product bugs the strengthening uncovered

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

congwang-mk commented Jun 13, 2026 •

edited

Loading

1. `net_allow=[]` silently allowed all outbound TCP (Landlock bypass)

3. Three `policy_fn` product bugs the strengthening uncovered