Commit 000015c
Make fd-table close() path O(1) with reverse map and refcount
Profiling bench_test under --syscall-mode=rewrite on aarch64 showed
kbox_fd_table_find_by_host_fd() consuming 35.47 % of total CPU time.
Every close() on a tracee-held host FD called forward_close(), which
in turn called find_by_host_fd() to locate the supervisor's shadow
entry for that FD. The old implementation walked all three backing
ranges linearly: 1024 low_fds + 31744 mid_fds + 4096 entries = 36864
slots per call. For bench_test's cached-shadow openat path, which
injects an ADDFD without ever creating an fd_table entry, every
close() walked the full table and found nothing before returning -1.
Replace the linear scan with two flat tables maintained alongside
the main fd_table:

- host_to_vfd[KBOX_HOST_FD_REVERSE_MAX]: O(1) host_fd -> virtual_fd
  reverse map, sized to cover the child's RLIMIT_NOFILE (65536).
  Three states per slot: KBOX_HOST_VFD_NONE (-1) means no entry
  claims this host_fd (authoritative miss, return -1 in O(1));
  KBOX_HOST_VFD_MULTI (-2) means two or more entries share this
  host_fd (fall through to the linear scan); any non-negative
  value is the single holder's vfd. A forward-check guards
  against stale single-holder entries.
- lkl_fd_refs[KBOX_LKL_FD_REFMAX]: O(1) refcount of how many
  virtual fds currently hold each lkl_fd, replacing the O(n)
  lkl_fd_has_other_ref scan and the still_ref loop inside
  forward_close()'s shadow-socket close path.
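
For illustration, the three-state lookup described above can be
sketched as follows. This is a minimal sketch, not the kbox source:
struct demo_table, DEMO_SLOTS and every demo_* function are
hypothetical names, the table is shrunk to 64 slots, and falling
back to the scan when the forward check fails is an assumption
about how a stale single-holder entry is handled.

```c
#include <assert.h>
#include <stdint.h>

#define HOST_FD_REVERSE_MAX 65536 /* size stated in the commit message */
#define HOST_VFD_NONE (-1)        /* no entry claims this host_fd */
#define HOST_VFD_MULTI (-2)       /* two or more entries share it */
#define DEMO_SLOTS 64             /* hypothetical; the real table is larger */

struct demo_table {
    int32_t host_fd[DEMO_SLOTS];              /* forward map: vfd -> host_fd */
    int32_t host_to_vfd[HOST_FD_REVERSE_MAX]; /* reverse map: host_fd -> vfd */
};

static void demo_init(struct demo_table *t)
{
    for (int i = 0; i < DEMO_SLOTS; i++)
        t->host_fd[i] = -1;
    for (int i = 0; i < HOST_FD_REVERSE_MAX; i++)
        t->host_to_vfd[i] = HOST_VFD_NONE;
}

/* Original behavior: linear scan, first matching vfd or -1. */
static int demo_scan(const struct demo_table *t, int host_fd)
{
    for (int v = 0; v < DEMO_SLOTS; v++)
        if (t->host_fd[v] == host_fd)
            return v;
    return -1;
}

static int demo_find_by_host_fd(const struct demo_table *t, int host_fd)
{
    assert(t);
    if (host_fd < 0)
        return -1;
    if (host_fd >= HOST_FD_REVERSE_MAX)
        return demo_scan(t, host_fd); /* out of range: safety net */

    int32_t v = t->host_to_vfd[host_fd];
    if (v == HOST_VFD_NONE)
        return -1; /* authoritative miss, no scan */
    /* Forward check: trust a single holder only if its slot agrees. */
    if (v >= 0 && v < DEMO_SLOTS && t->host_fd[v] == host_fd)
        return v;
    /* MULTI, or a stale single holder (assumed): use the scan. */
    return demo_scan(t, host_fd);
}
```

In this sketch a host_fd written directly into a slot, without
updating host_to_vfd, is not findable, which mirrors the
authoritative-NONE behavior described in the message.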
The three-state reverse map preserves the "authoritative miss"
optimization for the hot path (map[h] == NONE short-circuits to -1
in O(1)) while correctly handling dup2/dup3-style duplicate holders
via the MULTI sentinel. The invariant is that every positive host_fd
assignment must go through kbox_fd_table_set_host_fd(), which the
codebase already honors: the only direct writes to entry->host_fd
are the two negative sentinel writes in seccomp-dispatch.c, and
neither overwrites a positive value.
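
The state transitions a setter has to maintain for this invariant
to hold can be sketched as below. This is an illustrative sketch
under stated assumptions, not the body of
kbox_fd_table_set_host_fd(): the demo_* names and DEMO_SLOTS are
hypothetical, and leaving a slot at MULTI when one of several
holders goes away is an assumption (it stays correct because MULTI
always defers to the linear scan).

```c
#include <assert.h>
#include <stdint.h>

#define HOST_FD_REVERSE_MAX 65536
#define HOST_VFD_NONE (-1)
#define HOST_VFD_MULTI (-2)
#define DEMO_SLOTS 64 /* hypothetical table size */

struct demo_table {
    int32_t host_fd[DEMO_SLOTS];              /* vfd -> host_fd, -1 if unset */
    int32_t host_to_vfd[HOST_FD_REVERSE_MAX]; /* host_fd -> vfd or sentinel */
};

static void demo_init(struct demo_table *t)
{
    for (int i = 0; i < DEMO_SLOTS; i++)
        t->host_fd[i] = -1;
    for (int i = 0; i < HOST_FD_REVERSE_MAX; i++)
        t->host_to_vfd[i] = HOST_VFD_NONE;
}

/* Drop vfd's claim on whatever host_fd it currently holds. */
static void demo_reverse_release(struct demo_table *t, int vfd)
{
    int32_t h = t->host_fd[vfd];
    if (h >= 0 && h < HOST_FD_REVERSE_MAX && t->host_to_vfd[h] == vfd)
        t->host_to_vfd[h] = HOST_VFD_NONE; /* single holder -> NONE */
    /* A MULTI slot is left as MULTI (assumption): with no holder
     * count available, the scan remains the source of truth. */
}

/* The only sanctioned way to install a host_fd in this sketch. */
static void demo_set_host_fd(struct demo_table *t, int vfd, int32_t host_fd)
{
    assert(vfd >= 0 && vfd < DEMO_SLOTS);
    demo_reverse_release(t, vfd);
    t->host_fd[vfd] = host_fd;
    if (host_fd < 0 || host_fd >= HOST_FD_REVERSE_MAX)
        return; /* sentinel or out-of-range: nothing to index */

    int32_t cur = t->host_to_vfd[host_fd];
    if (cur == HOST_VFD_NONE)
        t->host_to_vfd[host_fd] = vfd;            /* NONE -> single */
    else if (cur != vfd)
        t->host_to_vfd[host_fd] = HOST_VFD_MULTI; /* second holder */
}
```

A direct write such as t.host_fd[7] = 20 bypasses the map and
leaves host_to_vfd[20] at NONE, which is why the fast path is only
sound when positive values are installed through the setter.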
Three related cleanups are included. kbox_fd_table_insert_at()
now releases the old refcount and reverse-map entry when reusing a
live slot (previously a latent refcount leak). close_cloexec_entry
gained a vfd argument so it can clear the reverse map correctly.
Six copies of 6-field slot-initialization boilerplate were
collapsed into two helpers, clear_fd_entry() and init_live_entry(),
without changing any behavior.
The forward_close() still_ref loop that walked all three ranges
looking for another holder of a shadow socket lkl_fd is replaced
with kbox_fd_table_lkl_ref_count() == 0, which reads the refcount
in O(1) after kbox_fd_table_remove() has already decremented it.
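
A rough sketch of the refcount-after-remove pattern follows. The
demo_* names, DEMO_SLOTS and the shape of struct demo_table are
hypothetical; only the 16384 bound and the uint16_t element type
come from this message. The out-of-range branch stands in for the
linear-scan fallback.

```c
#include <assert.h>
#include <stdint.h>

#define LKL_FD_REFMAX 16384 /* bound stated in the commit message */
#define DEMO_SLOTS 64       /* hypothetical table size */

struct demo_table {
    int32_t lkl_fd[DEMO_SLOTS];          /* vfd -> lkl_fd, -1 if free */
    uint16_t lkl_fd_refs[LKL_FD_REFMAX]; /* lkl_fd -> number of holders */
};

static void demo_init(struct demo_table *t)
{
    for (int i = 0; i < DEMO_SLOTS; i++)
        t->lkl_fd[i] = -1;
    for (int i = 0; i < LKL_FD_REFMAX; i++)
        t->lkl_fd_refs[i] = 0;
}

static void demo_insert(struct demo_table *t, int vfd, int32_t lkl_fd)
{
    assert(vfd >= 0 && vfd < DEMO_SLOTS && t->lkl_fd[vfd] < 0);
    t->lkl_fd[vfd] = lkl_fd;
    if (lkl_fd >= 0 && lkl_fd < LKL_FD_REFMAX)
        t->lkl_fd_refs[lkl_fd]++;
}

/* Free the slot, decrement the refcount, return the lkl_fd held. */
static int32_t demo_remove(struct demo_table *t, int vfd)
{
    int32_t l = t->lkl_fd[vfd];
    t->lkl_fd[vfd] = -1;
    if (l >= 0 && l < LKL_FD_REFMAX && t->lkl_fd_refs[l] > 0)
        t->lkl_fd_refs[l]--;
    return l;
}

static unsigned demo_lkl_ref_count(const struct demo_table *t, int32_t lkl_fd)
{
    if (lkl_fd < 0)
        return 0;
    if (lkl_fd < LKL_FD_REFMAX)
        return t->lkl_fd_refs[lkl_fd]; /* O(1) read */
    unsigned n = 0; /* out of range: count holders by scanning */
    for (int v = 0; v < DEMO_SLOTS; v++)
        if (t->lkl_fd[v] == lkl_fd)
            n++;
    return n;
}

/* Returns 1 when the caller should close the underlying lkl_fd. */
static int demo_close_vfd(struct demo_table *t, int vfd)
{
    int32_t l = demo_remove(t, vfd);
    return l >= 0 && demo_lkl_ref_count(t, l) == 0;
}
```

Because the count is read after the remove has decremented it, a
result of 0 means no other virtual fd still holds the lkl_fd.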
Memory cost: struct kbox_fd_table grows from ~1.13 MB to ~1.43 MB
(+288 KB: host_to_vfd adds 256 KB as int32_t, lkl_fd_refs adds 32
KB as uint16_t). The struct is stack-allocated in the supervisor
launch paths; the default 8 MB Linux stack has plenty of headroom.
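
The +288 KB figure can be checked at compile time. In this small
sketch the enum names and demo_* helpers are hypothetical stand-ins;
the element counts (65536 and 16384) and element types are the ones
given in this message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    HOST_FD_REVERSE_MAX = 65536, /* host_to_vfd element count */
    LKL_FD_REFMAX = 16384        /* lkl_fd_refs element count */
};

/* 65536 * 4 bytes = 256 KB for the reverse map. */
static_assert(sizeof(int32_t) * HOST_FD_REVERSE_MAX == 256 * 1024,
              "host_to_vfd should cost 256 KB");
/* 16384 * 2 bytes = 32 KB for the refcount table. */
static_assert(sizeof(uint16_t) * LKL_FD_REFMAX == 32 * 1024,
              "lkl_fd_refs should cost 32 KB");

static size_t demo_reverse_map_bytes(void)
{
    return sizeof(int32_t) * HOST_FD_REVERSE_MAX;
}

static size_t demo_refcount_bytes(void)
{
    return sizeof(uint16_t) * LKL_FD_REFMAX;
}

/* Total growth of the table from the two new arrays. */
static size_t demo_added_bytes(void)
{
    return demo_reverse_map_bytes() + demo_refcount_bytes();
}
```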
Out-of-range host_fds (>= 65536) and lkl_fds (>= 16384) fall
through to the original linear-scan implementations as a safety
net. In practice kbox raises RLIMIT_NOFILE to exactly 65536 and
LKL allocates small kernel fd numbers, so the fallback paths are
dead code on the hot path.
Performance on real aarch64 hardware (release build, bench_test
10000 iterations, mean of 5 runs):

  syscall      before (us)  after (us)  delta
  stat                 3.5         3.3  noise
  open+close        119.76       47.52  -60.3 % (2.52x faster)
  lseek+read          42.5        43.4  noise
  write                1.4         1.4  flat
  getpid               0.0         0.0  flat

Combined with the preceding cancel-wrapper fast-path commit, the
total improvement versus the pre-series baseline is:
  open+close   122.97 us -> 47.52 us   -61.4 % (2.59x faster)
perf record confirms kbox_fd_table_find_by_host_fd no longer appears
among the top entries after this change; the new hotspot is kernel-side
_raw_spin_unlock_irqrestore from futex wake-up in the supervisor
service thread (13.40 %), which is an orthogonal signaling cost
and a separate optimization target.
Tests:
- Two new fd-table unit tests document the hybrid semantics: the
  duplicate-holder test asserts either holder is a valid answer
  after a dup-style set (matching the scan-order tie-break the
  MULTI state exposes), and a positive assertion documents the
  load-bearing invariant that positive host_fd values must be
  installed via the API (direct writes are intentionally not
  findable so the authoritative-NONE fast path is sound).
- 273/273 unit tests pass on lima (ASAN+UBSAN debug build).
- 51/51 integration tests pass on lima.
- bench_test, clone3-test, dup-test, and /bin/ls all work
  correctly on arm under --syscall-mode=rewrite. The static-binary
  gate was re-verified: /bin/ls (dynamic) reports
  cancel_promote_allowed=0, bench_test (static) reports
  cancel_promote_allowed=1.
Change-Id: Ic96c0e862e1e984a0966651ee8beb38eb54e7a851
parent 000050c commit 000015c
4 files changed
Lines changed: 308 additions & 104 deletions