Skip to content

fix(net): empty net_allow deny-all bypass + strengthen false-passing security tests#103

Merged
congwang-mk merged 11 commits into
mainfrom
fix/empty-net-allow-deny-all
Jun 13, 2026
Merged

fix(net): empty net_allow deny-all bypass + strengthen false-passing security tests#103
congwang-mk merged 11 commits into
mainfrom
fix/empty-net-allow-deny-all

Conversation

@congwang-mk

@congwang-mk congwang-mk commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Summary

Everything here traces to one failure mode: security tests that pass without the security actually happening. Fixing those tests exposed real bugs in the enforcement they were supposed to cover.

1. net_allow=[] silently allowed all outbound TCP (Landlock bypass)

Once any fs grant turns on the named-AF_UNIX connect gate (has_unix_fs_gate — i.e. essentially always), connect() is trapped for every address family. The on-behalf handler then performed IP connects in the unconfined supervisor, bypassing the child's Landlock CONNECT_TCP deny-all. So an empty net_allow (deny-all) permitted all outbound TCP.

Fix: bring connect() in line with the invariant the sendto/sendmsg handlers already follow — no network destination policy ⇒ nothing to enforce ⇒ return the syscall to the kernel so Landlock governs it. Also renames the misnamed has_net_allowlisthas_net_destination_policy and corrects the NetworkPolicy::Unrestricted doc.

Regression introduced in 4a8a7bf (post-v0.8.3 tag). The published v0.8.3 release is NOT affected — verified against the PyPI wheel; only source builds from main between 4a8a7bf and this PR are.

2. False-passing security tests

A cluster of tests exercised a deny/restrict API but only asserted "the process still ran", so they passed even when enforcement was a complete no-op:

  • network deny tests connecting to a dead port (failure was incidental, not enforced)
  • restrict_network([]) — an empty list is a no-op
  • restrict_pid_network — ran echo, never made a network call
  • restrict_max_memory / restrict_max_processes — restricted, then ran echo, asserted nothing

All now use live loopback listeners (so an allowed op would actually succeed) or real allocations/forks, and assert the enforcement-specific outcome (EACCES/EPERM/EAGAIN, killed process, refused connect). Equivalently weak tests in test_mcp_integration.py / test_pipeline.py are fixed too.

3. Three policy_fn product bugs the strengthening uncovered

Making those tests real revealed the underlying dynamic-policy features were broken:

  • restrict_max_memory was a no-ophandle_memory read the static NotifPolicy limit, never the LivePolicy that restrict_max_memory writes. Now reads the live (tightened) limit.
  • restrict_max_processes was a no-op — same shape in handle_fork (read static rs.max_processes). Now reads the live limit.
  • policy_fn + fork() deadlocked the sandbox. The fork-event ptrace tracking that policy_fn enables hung on any fork. Root cause unfolded into three couplings, all fixed:
    1. Pre-Continue ptrace-stop deadlockprepare INTERRUPT'd and waited for a stop before the seccomp Continue, but a tracee parked in the notify wait can't reach a ptrace-stop. Now SEIZE-only before Continue; the fork event is awaited after.
    2. Single-thread coordinator, no busy-poll — ptrace commands are per-tracer-thread (cross-thread = ESRCH; only waitpid may be cross-thread), so the whole SEIZE → wait → detach runs in one spawn_blocking worker driven over a channel. A successful fork is caught by a single blocking waitpid; a failed fork (no event) is bounded by an async tokio::time watchdog that wakes the worker via kill(SIGURG). No dedicated thread, no spin, no INTERRUPT race.
    3. Competing waiterSandbox::wait() reaped the top-level child with a waitpid(pid,0) loop, and waitpid (any flags) reaps a tracee's ptrace-stops, so it raced the fork-tracking worker and hung repeated forks. wait() now detects exit via the child's pidfd + AsyncFd (readable on exit only, never consuming ptrace-stops), mirroring spawn_pid_watcher.

Fork tracking still uses ptrace (the namespace-less TOCTOU guarantee requires TRACEFORK); pidfd is only for exit detection.

Test plan

  • cargo test -p sandlock-core398 lib + 257 integration pass (incl. new test_empty_net_allow_denies_tcp_despite_fs_grants and the fork-tracking integration test exercising the new coordinator).
  • pytest python/tests354 pass.
  • Stress: 200 sequential forks under policy_fn, forced fork failures (EAGAIN), and repeated runs all complete with zero hangs.
  • Egress matrix: net_allow=[] → EACCES; matching rule → ok; non-matching → refused; *:* → ok.

`test_no_network_blocks_connect` connected to 127.0.0.1:80 (nothing
listening) and treated any OSError as "blocked", so it passed with zero
network enforcement. `test_xoa_executor_no_network` asserted `not
result.success` on a malformed executor command, failing for plumbing
reasons rather than because egress was denied.

Both are negative tests that assert "the connection failed" without
controlling why. Rebind each against a live loopback listener so an
*allowed* connect would succeed, and assert the connect is denied with
EACCES (errno 13) — Landlock's CONNECT_TCP hook. This makes them
discriminate a real deny from an incidental failure.

Signed-off-by: Cong Wang <cwang@multikernel.io>
Cover the contract that an empty `net_allow` denies all outbound TCP even
when filesystem grants are present — the configuration in which any fs
grant turns on the named-AF_UNIX connect gate (`has_unix_fs_gate`) and
`connect()` starts being trapped for every address family.

- Rust: crates/sandlock-core/tests/integration/test_network.rs
- Python: python/tests/test_sandbox.py (TestNetAllowDenyAll)

Both use a live loopback listener so an *allowed* connect would succeed,
and assert the connect is denied with EACCES (errno 13). They fail on the
pre-fix code (the supervisor performs the connect on-behalf, bypassing
Landlock) and pass once the on-behalf path defers unconstrained IP
connects to the kernel.

Signed-off-by: Cong Wang <cwang@multikernel.io>
… invariant

The sendto/sendmsg on-behalf handlers already follow a clear rule: with a
network destination policy active the supervisor is the IP-level enforcer
and must never Continue; without one there is nothing to enforce, so the
syscall is returned to the kernel and Landlock/BPF govern it.

connect() violated that invariant. Once any fs grant turns on the named-
AF_UNIX connect gate, connect() is trapped for every address family, and
the handler performed IP connects on-behalf unconditionally — in the
(unconfined) supervisor, bypassing the child's Landlock CONNECT_TCP
decision. With fs grants present (i.e. essentially always), an empty
net_allow (deny-all) silently allowed all outbound TCP.

Bring connect() in line with the datagram handlers: add
`NotifPolicy::ip_connect_supervised(dest_is_loopback)` and return
SECCOMP_USER_NOTIF_FLAG_CONTINUE for an IP connect that needs no
supervision. The child then issues the connect directly and its own
Landlock CONNECT_TCP rules govern it — deny-all for an empty net_allow,
allow-all for `*:*` (CONNECT_TCP left unhandled). port-remap still needs
on-behalf, but only for loopback, so it is the sole extra reason to
supervise an otherwise-unpoliced IP connect. AllowList/DenyList checks,
HTTP ACL, IP allowlists, and UDP/ICMP handling are unchanged.

Also rename the flag that drives this invariant from the misleading
`has_net_allowlist` to `has_net_destination_policy` — it is set from
`network_destination_policy` (net_allow OR net_deny OR policy_fn OR
http_acl), not just an allowlist — so the connect gate, the sendto/sendmsg
fast path, and the dispatch registration all read honestly. This also
makes the redundant `|| has_http_acl` (already subsumed) in the connect
trap condition visible, and drops it. With the empty case now gated to
Continue, `NetworkPolicy::Unrestricted` is only ever reached meaning
"allow all", so its doc is corrected accordingly.

Signed-off-by: Cong Wang <cwang@multikernel.io>
…p assert)

test_deny_returns_true connected to a dead port (127.0.0.1:1) with
connect_ex, ignored the result, and asserted only that the process ran —
so it passed even if the deny did nothing (port 1 fails regardless). Same
false-positive shape as the network deny-tests fixed in b7d800a.

Rebind it against a live loopback listener on an allowlisted port (so the
connect would succeed without the deny) and assert EPERM (errno 1) plus a
post-deny marker proving the process survives. Adds a shared
`_loopback_listener` helper for the deny-tests that follow.

Signed-off-by: Cong Wang <cwang@multikernel.io>
test_deny_with_errno connected to 127.0.0.1:1 — outside the net_allow
allowlist — so Landlock denied it with the same errno 13 the policy_fn
returns. The assertion passed even if the policy_fn errno path were a
no-op.

Target a live listener on an allowlisted port instead: Landlock permits it
and the listener would accept it, so errno 13 can only originate from the
policy_fn verdict.

Signed-off-by: Cong Wang <cwang@multikernel.io>
test_restrict_network_on_execve called restrict_network([]) — an empty
list is a no-op (the override is skipped when it lists no IPs) — and
asserted only that the program printed output. It verified nothing about
network restriction.

Use two live loopback listeners (127.0.0.1 and 127.0.0.2), both
allowlisted so either would connect, then restrict to ["127.0.0.1"] and
assert the first connects while the second is refused (errno 111). Without
enforcement both would connect.

Signed-off-by: Cong Wang <cwang@multikernel.io>
…ed IPs

test_restrict_pid_network restricted the exec'd pid then ran `echo ok` —
it made no network call, so it observed nothing. Mirror the
restrict_network test: two live loopback listeners (both allowlisted),
restrict the pid to ["127.0.0.1"], and assert the listed IP connects while
the non-listed one is refused (errno 111).

Signed-off-by: Cong Wang <cwang@multikernel.io>
…al test

handle_memory read the limit from the static NotifPolicy snapshot
(`policy.max_memory_bytes`) and never from the LivePolicy that
`restrict_max_memory` writes, so dynamically tightening the limit had no
effect — a sandbox could allocate far past the restricted size.

Read the live limit when a policy_fn is active (it is seeded from the
static `max_memory` ceiling and only lowered by `restrict_max_memory`),
falling back to the static limit otherwise. The policy_fn lock is taken and
released before the resource lock, preserving lock ordering.

test_restrict_max_memory previously restricted memory, ran `echo`, and
asserted nothing. It now sets a 256 MiB ceiling, restricts to 64 MiB, and
asserts a 128 MiB allocation is killed — while the same allocation under
the un-restricted ceiling (control) succeeds, proving the kill is due to
the dynamic restriction.

Signed-off-by: Cong Wang <cwang@multikernel.io>
With policy_fn active, every fork-like syscall is ptrace-traced so the new
child is registered before it can run user code (argv-safety TOCTOU). That
path deadlocked, and fixing it surfaced three couplings, all addressed here.

1. Pre-Continue ptrace-stop deadlock. prepare_process_creation_tracking
   PTRACE_INTERRUPT'd the caller and waited for a stop *before* the seccomp
   Continue was sent — but a task parked in the seccomp user-notification
   wait cannot reach a ptrace-stop, so waitpid blocked forever. Now the
   tracee is only PTRACE_SEIZE'd before Continue (SEIZE does not stop it);
   the fork event is awaited afterward.

2. Single-thread coordinator, no busy-poll. ptrace *commands* are
   per-tracer-thread (cross-thread issue fails with ESRCH; only waitpid may
   be cross-thread), so the whole SEIZE -> wait -> detach sequence runs in
   one spawn_blocking worker, driven over a channel by finish() after
   Continue. The worker uses a single blocking waitpid (deterministic for a
   successful fork — the event is synchronous). A *failed* fork emits no
   event, so finish() bounds the wait with an async tokio::time watchdog
   that wakes the worker via kill(SIGURG); the worker treats that signal as
   "no child". No dedicated thread, no spinning, no INTERRUPT race.

3. Competing waiter. Sandbox::wait() reaped the top-level child with a
   waitpid(pid, 0) loop, and waitpid (any flags) reaps a tracee's
   ptrace-stops — so it raced the fork-tracking worker for the child's fork
   events and hung repeated forks. wait() now detects exit via the child's
   pidfd + AsyncFd (readable on exit only, never consuming ptrace-stops),
   mirroring spawn_pid_watcher, then reaps the status. A blocking waitpid
   fallback remains for kernels without pidfd_open.

Verified: 200 sequential forks under policy_fn, forced fork failures
(EAGAIN), and repeated separate runs all complete with no hangs; full
Rust (398 lib + 257 integration) and Python (354) suites pass.

Signed-off-by: Cong Wang <cwang@multikernel.io>
handle_fork compared the concurrent process count against the static
ResourceState.max_processes and never against the LivePolicy that
restrict_max_processes writes, so dynamically tightening the process limit
had no effect.

Read the live limit when a policy_fn is active (it is seeded from the
static max_processes and only lowered by restrict_max_processes), falling
back to the static limit otherwise — mirroring the restrict_max_memory fix.
handle_fork now takes the SupervisorCtx (for ctx.policy_fn and ctx.resource)
instead of the resource handle alone; the policy_fn lock is taken and
released before the resource lock to preserve lock ordering.

Verified: restrict_max_processes(1) now denies the fork with EAGAIN while
the unrestricted baseline allows it, deterministically and without hanging
(the latter relying on the fork-tracking deadlock fix).

Signed-off-by: Cong Wang <cwang@multikernel.io>
test_restrict_max_processes restricted the limit then ran `echo` and
asserted nothing. Restrict to 1 and fork: assert the fork is denied with
EAGAIN (errno 11), while the unrestricted control run forks successfully —
proving the denial is due to the dynamic limit. Now meaningful (and no
longer hangs) given the fork-tracking deadlock fix and the
restrict_max_processes enforcement fix.

Signed-off-by: Cong Wang <cwang@multikernel.io>
@congwang-mk congwang-mk merged commit 9df56bd into main Jun 13, 2026
12 checks passed
@congwang-mk congwang-mk deleted the fix/empty-net-allow-deny-all branch June 13, 2026 23:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant