
File descriptor leak in supervisor after SECCOMP_IOCTL_NOTIF_ADDFD #6

@dzerik

Context

We're evaluating sandlock for integration into our code execution platform and performed a security audit of the codebase. The project is very promising — the unprivileged Landlock + seccomp approach is exactly what we need, and the recent additions (confine(), dry-run, deterministic mode) are great. We'd like to contribute fixes and become active users.

Our audit found several issues; this is the most critical one.

Summary

Every NotifAction::InjectFdSend leaks one file descriptor in the supervisor process. The supervisor-side fd is never closed after the SECCOMP_IOCTL_NOTIF_ADDFD ioctl duplicates it into the child. Long-running supervisors will eventually hit EMFILE and fail.

Severity

Critical — leads to supervisor resource exhaustion (DoS) under normal workload, not just adversarial input.

Affected code

Six call sites create an fd, call mem::forget() to prevent OwnedFd from closing it, pass the raw integer via InjectFdSend, and never close it afterward:

| File | Function | Trigger |
|---|---|---|
| procfs.rs:147-176 | inject_memfd | Every /proc/cpuinfo, /proc/meminfo, /proc/net/* read |
| procfs.rs:160-173 | handle_hostname_open | Every /etc/hostname read (when hostname is set) |
| cow/dispatch.rs:107-115 | handle_cow_open | Every openat in COW workdir |
| random.rs:80-88 | handle_random_open | Every /dev/urandom, /dev/random open (when random_seed is set) |
| chroot/dispatch.rs:261 | handle_chroot_open | Every redirected open under chroot |
| chroot/dispatch.rs:274 | handle_chroot_open | Every redirected open under chroot |

Root cause

NotifAction::InjectFdSend stores a RawFd (plain i32), not an OwnedFd. The handler in send_response() (notif.rs:442-449) passes this integer to inject_fd_and_send() which calls SECCOMP_IOCTL_NOTIF_ADDFD. The kernel duplicates the fd into the child but does not close the supervisor's copy. After the ioctl returns, send_response returns Ok(()) without closing srcfd. The NotifAction enum has no Drop implementation, so when the action value is dropped, the i32 is simply discarded — no close() ever happens.

The call sites use std::mem::forget(memfd) specifically to prevent OwnedFd from closing the fd before the ioctl. This is correct — but the fd must be closed after the ioctl, and nothing does that.
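
The lifecycle described above can be reproduced outside sandlock with ordinary file descriptors. The following is a minimal Python sketch, assuming Linux with /proc mounted. The names `fake_addfd_send` and `leaky_handler` are hypothetical, `os.dup` only stands in for the kernel-side duplication performed by SECCOMP_IOCTL_NOTIF_ADDFD, and `/dev/null` stands in for the memfd. None of this is sandlock code.

```python
import os


def open_fds():
    """Count this process's open fds (Linux only: lists /proc/self/fd)."""
    return len(os.listdir("/proc/self/fd"))


def fake_addfd_send(srcfd):
    """Stand-in for the ADDFD ioctl: duplicate srcfd, leave srcfd itself open."""
    child_copy = os.dup(srcfd)
    # The real duplicate lands in the child's fd table, not the supervisor's,
    # so it is dropped here to keep the supervisor-side count honest.
    os.close(child_copy)


def leaky_handler():
    """Mimic a call site: create an fd, 'inject' it, never close the source."""
    srcfd = os.open(os.devnull, os.O_RDONLY)  # stands in for memfd_create
    fake_addfd_send(srcfd)
    return srcfd  # returned only so the demo can clean up afterwards


def leaked_after(calls):
    """Return how many fds the process gained after `calls` handler invocations."""
    before = open_fds()
    leaked = [leaky_handler() for _ in range(calls)]
    grown = open_fds() - before
    for fd in leaked:  # clean up so the demo itself does not leak
        os.close(fd)
    return grown


if __name__ == "__main__":
    print(leaked_after(5))  # one fd gained per handler call
```

The count grows by exactly one per handler call, which matches the one-fd-per-InjectFdSend behaviour reported here.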

Flow diagram

sequenceDiagram
    participant H as Handler<br/>(procfs/random/cow)
    participant SR as send_response<br/>(notif.rs:434)
    participant K as Kernel<br/>(ADDFD ioctl)
    participant C as Child process

    H->>H: memfd = memfd_create("sandlock-*")
    Note over H: supervisor owns fd 7

    H->>H: std::mem::forget(memfd)
    Note over H: OwnedFd destructor disabled<br/>fd 7 will NOT auto-close

    H->>SR: InjectFdSend { srcfd: 7 }

    SR->>K: ioctl(SECCOMP_IOCTL_NOTIF_ADDFD,<br/>srcfd=7, ADDFD_FLAG_SEND)
    K->>C: duplicates fd 7 → child gets fd 3
    K-->>SR: returns new_fd

    Note over SR: srcfd=7 still open in supervisor<br/>no close() called
    SR-->>H: Ok(())

    Note over H: fd 7 leaked forever<br/>process fd table grows by 1

Impact

A Python script importing NumPy or Pandas triggers ~10 /proc/* reads. With random_seed enabled, each /dev/urandom open adds another leak. In COW mode, every file open leaks. A conservative estimate for a typical data science script:

  • ~10 procfs reads + ~5 urandom opens + ~50 COW opens = ~65 leaked fds per execution
  • Default soft RLIMIT_NOFILE = 1024
  • ~15 executions → supervisor hits EMFILE → all sandboxes on this supervisor fail

With deterministic mode enabled (procfs + random + hostname + getdents), the leak rate is even higher.
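
The estimate above is a simple division. A small Python sketch of it follows, assuming the per-run leak counts quoted in the list (they are this issue's estimates, not measurements). The helper name `runs_until_emfile` is hypothetical.

```python
def runs_until_emfile(leak_per_run, nofile_limit=1024, baseline_fds=0):
    """Complete runs that fit before the supervisor's fd table is full."""
    if leak_per_run <= 0:
        raise ValueError("leak_per_run must be positive")
    return (nofile_limit - baseline_fds) // leak_per_run


# Per-run estimates taken from the list above.
ESTIMATED_LEAKS = {"procfs reads": 10, "urandom opens": 5, "COW opens": 50}

if __name__ == "__main__":
    per_run = sum(ESTIMATED_LEAKS.values())
    print(per_run, runs_until_emfile(per_run))  # prints: 65 15
```

Fds the supervisor already holds (the `baseline_fds` argument) shorten the runway further.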

Reproduction

Quick check: fd count grows after each run

"""Run this with Python SDK — shows fd leak accumulating per sandbox execution."""
import os
from sandlock import Sandbox, Policy

def count_fds():
    return len(os.listdir(f"/proc/{os.getpid()}/fd"))

policy = Policy(
    fs_readable=["/"],
    random_seed=42,          # enables /dev/urandom interception
    hostname="test",         # enables /etc/hostname interception
    num_cpus=2,              # enables /proc/cpuinfo interception
)

script = """
import os
for f in ['cpuinfo', 'meminfo', 'stat']:
    open(f'/proc/{f}').read()
os.urandom(16)
"""

print(f"before: {count_fds()} fds")

for i in range(20):
    Sandbox(policy).run(["python3", "-c", script], timeout=10)
    print(f"after run {i+1:>2}: {count_fds()} fds")

# Expected: fd count grows by ~10-15 per iteration, never decreases.
# At that rate a soft RLIMIT_NOFILE of 1024 lasts roughly 70-100 iterations,
# so raise the loop count or lower the limit (e.g. ulimit -n 256) to see runs
# fail with EMFILE ("Too many open files").

Crash scenario: EMFILE after ~15 runs

The leak is per-process, not system-wide — the OS is unaffected, but the supervisor
process (which stays alive across sandbox invocations) accumulates leaked fds until
it hits RLIMIT_NOFILE (typically 1024 soft). After that, all subsequent memfd_create,
pipe, open, and socket calls fail with EMFILE, and no new sandboxes can run.

Restarting the process clears all leaked fds. The leak does not persist across
process restarts — it only affects long-lived processes that run multiple sandboxes
(servers, daemons, worker pools, test loops via the SDK).
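
To see how close a long-lived process is to the limit, the soft RLIMIT_NOFILE can be compared with the number of fds currently open. A minimal sketch, assuming Linux with /proc mounted. `fd_headroom` is a hypothetical helper, not part of the sandlock SDK.

```python
import os
import resource


def fd_headroom():
    """Return (soft_limit, open_fds, remaining) for the current process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    used = len(os.listdir("/proc/self/fd"))  # may include the fd used for listing
    return soft, used, soft - used


if __name__ == "__main__":
    soft, used, remaining = fd_headroom()
    print(f"soft limit {soft}, open {used}, remaining {remaining}")
```

Calling this between sandbox runs in the process that hosts the Sandbox API should show `remaining` shrinking steadily while the leak is present.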

Single-shot CLI is not affected

sandlock run -- python3 script.py spawns a new process each time, so leaked fds
are reclaimed by the kernel on exit. The leak only matters when the same process
reuses the Sandbox API repeatedly.

Suggested fix

Change InjectFdSend to own the fd, so it is automatically closed after the ioctl:

// notif.rs — change the variant type:
pub enum NotifAction {
    // ...
    InjectFdSend { srcfd: OwnedFd },  // was: RawFd
    // ...
}

// notif.rs — send_response: OwnedFd is dropped after ioctl
NotifAction::InjectFdSend { srcfd } => {
    match inject_fd_and_send(fd, id, srcfd.as_raw_fd()) {
        Ok(_new_fd) => Ok(()),
        Err(_) => respond_continue(fd, id),
    }
    // srcfd: OwnedFd dropped here → close(fd) called automatically
}

Then every call site drops the mem::forget call and moves the OwnedFd into the action instead of passing a raw fd:

// procfs.rs — before:
std::mem::forget(memfd);
NotifAction::InjectFdSend { srcfd: raw }

// procfs.rs — after:
NotifAction::InjectFdSend { srcfd: memfd }  // OwnedFd moved, no forget needed

This fix:

  • Closes all 6 leak points at once (the handler does the close, not each call site)
  • Prevents future leaks — new InjectFdSend callers can't forget to close because OwnedFd does it automatically
  • Is backwards compatible — no API change outside the crate

An alternative minimal fix (without changing NotifAction) would be to add close(srcfd) after inject_fd_and_send() in send_response(). This is simpler but doesn't prevent future call sites from leaking.
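
If the minimal fix is chosen, the close has to happen on the error path as well as the success path. The following Python analog of that control flow is a sketch only: `send_response` mirrors the shape of the Rust function but is not sandlock code, the `inject` callable stands in for inject_fd_and_send(), and the returned strings are placeholders.

```python
import os


def is_open(fd):
    """Return True if fd is an open descriptor in this process."""
    try:
        os.fstat(fd)
    except OSError:
        return False
    return True


def failing_inject(srcfd):
    """Simulate the ioctl failing, e.g. because the child already exited."""
    raise OSError("simulated ADDFD failure")


def send_response(srcfd, inject):
    """Inject srcfd, then close the supervisor's copy whatever the outcome."""
    try:
        inject(srcfd)
        return "injected"
    except OSError:
        return "continued"  # analog of falling back to respond_continue()
    finally:
        os.close(srcfd)  # runs on both the success and the error path


if __name__ == "__main__":
    for inject in (lambda fd: None, failing_inject):
        fd = os.open(os.devnull, os.O_RDONLY)  # stands in for the memfd
        print(send_response(fd, inject), is_open(fd))
```

In the Rust version, the OwnedFd variant gives the same guarantee through Drop, without an explicit close on each path.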

Next steps

We're happy to submit a PR for whichever approach you prefer. We also found a few other issues during the audit (pipe created without O_CLOEXEC, unchecked read_exact return on the control pipe, fs_denied policy field not enforced) — we can file separate issues for those if that's helpful.

Thanks for building this project — the unprivileged sandbox space really needs a well-engineered solution.
