Commit 000015c

Make fd-table close() path O(1) with reverse map and refcount

Profiling bench_test under --syscall-mode=rewrite on aarch64 showed
kbox_fd_table_find_by_host_fd() consuming 35.47 % of total CPU time.
Every close() on a tracee-held host FD called forward_close(), which in
turn called find_by_host_fd() to locate the supervisor's shadow entry
for that FD. The old implementation walked all three backing ranges
linearly: 1024 low_fds + 31744 mid_fds + 4096 entries = 36864 slots per
call. For bench_test's cached-shadow openat path, which injects an
ADDFD without ever creating an fd_table entry, every close walked the
full table finding nothing before returning -1.

Replace the linear scan with two flat tables maintained alongside the
main fd_table:

- host_to_vfd[KBOX_HOST_FD_REVERSE_MAX]: O(1) host_fd -> virtual_fd
  reverse map, sized to cover the child's RLIMIT_NOFILE (65536). Three
  states per slot: KBOX_HOST_VFD_NONE (-1) means no entry claims this
  host_fd (authoritative miss, return -1 in O(1)); KBOX_HOST_VFD_MULTI
  (-2) means two or more entries share this host_fd (fall through to
  the linear scan); any non-negative value is the single holder's vfd.
  A forward-check guards against stale single-holder entries.

- lkl_fd_refs[KBOX_LKL_FD_REFMAX]: O(1) refcount of how many virtual
  fds currently hold each lkl_fd, replacing the O(n)
  lkl_fd_has_other_ref scan and the still_ref loop inside
  forward_close()'s shadow-socket close path.

The three-state reverse map preserves the "authoritative miss"
optimization for the hot path (map[h] == NONE short-circuits to -1 in
O(1)) while correctly handling dup2/dup3-style duplicate holders via
the MULTI sentinel. The invariant is that every positive host_fd
assignment must go through kbox_fd_table_set_host_fd(), which the
codebase already honors; the only direct writes to entry->host_fd are
the two negative sentinel writes in seccomp-dispatch.c, and they never
transition a positive value out.

Three existing helpers were also refactored.
kbox_fd_table_insert_at now releases the old refcount and reverse-map
entry when reusing a live slot (previously a latent refcount leak).
close_cloexec_entry gained a vfd argument so it can clear the reverse
map correctly. Six copies of 6-field slot-initialization boilerplate
were collapsed into two helpers, clear_fd_entry() and
init_live_entry(), without changing any behavior.

The forward_close() still_ref loop that walked all three ranges looking
for another holder of a shadow socket lkl_fd is replaced with
kbox_fd_table_lkl_ref_count() == 0, which reads the refcount in O(1)
after kbox_fd_table_remove() has already decremented it.

Memory cost: struct kbox_fd_table grows from ~1.13 MB to ~1.43 MB
(+288 KB: host_to_vfd adds 256 KB as int32_t, lkl_fd_refs adds 32 KB as
uint16_t). Stack-allocated in the supervisor launch paths; the default
8 MB Linux stack has plenty of headroom.

Out-of-range host_fds (>= 65536) and lkl_fds (>= 16384) fall through to
the original linear-scan implementations as a safety net. In practice
kbox raises RLIMIT_NOFILE to exactly 65536 and LKL allocates small
kernel fd numbers, so the fallback paths are dead code on the hot path.

Performance on real aarch64 hardware (release build, bench_test 10000
iterations, mean of 5 runs):

    syscall      before (us)  after (us)  delta
    stat                 3.5         3.3  noise
    open+close        119.76       47.52  -60.3 % (2.52x faster)
    lseek+read          42.5        43.4  noise
    write                1.4         1.4  flat
    getpid               0.0         0.0  flat

Combined with the preceding cancel-wrapper fast-path commit, the total
improvement versus the pre-series baseline is:

    open+close   122.97 us -> 47.52 us   -61.4 % (2.59x faster)

perf record confirms kbox_fd_table_find_by_host_fd is off the top chart
after this change; the new hotspot is kernel-side
_raw_spin_unlock_irqrestore from futex wake-up in the supervisor
service thread (13.40 %), which is an orthogonal signaling cost and a
separate optimization target.
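
For readers without the diff at hand, the three-state reverse map and
the lkl_fd refcount described above can be sketched roughly as follows.
This is an illustration under stated assumptions, not the code from
this commit: every sketch_* name, the struct layout, and the tiny table
sizes are hypothetical, and only the two sentinel values (-1 for NONE,
-2 for MULTI) come from the message. The choice to leave a MULTI slot
untouched on remove is a simplification of the sketch; the real
implementation may recompute it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, deliberately tiny sizes; the real tables are far larger. */
#define SKETCH_SLOTS       8   /* stands in for the 36864 fd_table slots */
#define SKETCH_HOST_FD_MAX 16  /* stands in for KBOX_HOST_FD_REVERSE_MAX */
#define SKETCH_LKL_FD_MAX  16  /* stands in for KBOX_LKL_FD_REFMAX */
#define HOST_VFD_NONE  (-1)    /* no entry claims this host_fd */
#define HOST_VFD_MULTI (-2)    /* two or more entries share this host_fd */

struct sketch_entry {
    int32_t lkl_fd;   /* -1 when the slot is free (assumed convention) */
    int32_t host_fd;  /* -1 when no host FD is attached */
};

struct sketch_table {
    struct sketch_entry entries[SKETCH_SLOTS];
    int32_t host_to_vfd[SKETCH_HOST_FD_MAX];  /* host_fd -> vfd or sentinel */
    uint16_t lkl_fd_refs[SKETCH_LKL_FD_MAX];  /* holders per lkl_fd */
};

static void sketch_init(struct sketch_table *t)
{
    for (int i = 0; i < SKETCH_SLOTS; i++) {
        t->entries[i].lkl_fd = -1;
        t->entries[i].host_fd = -1;
    }
    for (int i = 0; i < SKETCH_HOST_FD_MAX; i++)
        t->host_to_vfd[i] = HOST_VFD_NONE;
    for (int i = 0; i < SKETCH_LKL_FD_MAX; i++)
        t->lkl_fd_refs[i] = 0;
}

/* Linear fallback: first live slot holding host_fd, or -1. */
static int sketch_scan(const struct sketch_table *t, int host_fd)
{
    for (int i = 0; i < SKETCH_SLOTS; i++)
        if (t->entries[i].lkl_fd >= 0 && t->entries[i].host_fd == host_fd)
            return i;
    return -1;
}

/* Claim a free slot for lkl_fd and bump its refcount. */
static void sketch_insert(struct sketch_table *t, int vfd, int lkl_fd)
{
    assert(vfd >= 0 && vfd < SKETCH_SLOTS);
    assert(lkl_fd >= 0 && lkl_fd < SKETCH_LKL_FD_MAX);
    assert(t->entries[vfd].lkl_fd < 0);
    t->entries[vfd].lkl_fd = lkl_fd;
    t->entries[vfd].host_fd = -1;
    t->lkl_fd_refs[lkl_fd]++;
}

/* The only path that installs a positive host_fd, so the map stays sound. */
static void sketch_set_host_fd(struct sketch_table *t, int vfd, int host_fd)
{
    assert(vfd >= 0 && vfd < SKETCH_SLOTS);
    assert(host_fd >= 0 && host_fd < SKETCH_HOST_FD_MAX);
    assert(t->entries[vfd].lkl_fd >= 0 && t->entries[vfd].host_fd < 0);
    t->entries[vfd].host_fd = host_fd;
    int32_t *slot = &t->host_to_vfd[host_fd];
    if (*slot == HOST_VFD_NONE)
        *slot = vfd;              /* first holder */
    else if (*slot != vfd)
        *slot = HOST_VFD_MULTI;   /* second holder: defer to the scan */
}

static int sketch_find_by_host_fd(const struct sketch_table *t, int host_fd)
{
    if (host_fd < 0)
        return -1;
    if (host_fd >= SKETCH_HOST_FD_MAX)
        return sketch_scan(t, host_fd);  /* out of range: safety net */
    int32_t v = t->host_to_vfd[host_fd];
    if (v == HOST_VFD_NONE)
        return -1;                       /* authoritative miss, O(1) */
    /* Forward check: trust a single holder only if the entry agrees. */
    if (v >= 0 && t->entries[v].lkl_fd >= 0 && t->entries[v].host_fd == host_fd)
        return v;
    return sketch_scan(t, host_fd);      /* MULTI or stale entry */
}

/* Drop a live slot: decrement the refcount, clear a single-holder map slot. */
static void sketch_remove(struct sketch_table *t, int vfd)
{
    assert(vfd >= 0 && vfd < SKETCH_SLOTS);
    struct sketch_entry *e = &t->entries[vfd];
    assert(e->lkl_fd >= 0);
    t->lkl_fd_refs[e->lkl_fd]--;
    if (e->host_fd >= 0 && t->host_to_vfd[e->host_fd] == vfd)
        t->host_to_vfd[e->host_fd] = HOST_VFD_NONE;
    e->lkl_fd = -1;
    e->host_fd = -1;
}

/* O(1) replacement for a "does anyone else hold this lkl_fd" scan. */
static unsigned sketch_lkl_ref_count(const struct sketch_table *t, int lkl_fd)
{
    assert(lkl_fd >= 0 && lkl_fd < SKETCH_LKL_FD_MAX);
    return t->lkl_fd_refs[lkl_fd];
}
```

In this sketch a close path would call sketch_remove() and then test
sketch_lkl_ref_count() == 0 before releasing the underlying lkl_fd,
mirroring the ordering described for forward_close() above.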

Tests:

- Two new fd-table unit tests document the hybrid semantics: the
  duplicate-holder test asserts either holder is a valid answer after a
  dup-style set (matching the scan-order tie-break the MULTI state
  exposes), and a positive assertion documents the load-bearing
  invariant that positive host_fd values must be installed via the API
  (direct writes are intentionally not findable so the
  authoritative-NONE fast path is sound).
- 273/273 unit tests pass on lima (ASAN+UBSAN debug build).
- 51/51 integration tests pass on lima.
- bench_test, clone3-test, dup-test, and /bin/ls all work correctly on
  arm under --syscall-mode=rewrite.

The static-binary gate was re-verified: /bin/ls (dynamic) reports
cancel_promote_allowed=0, bench_test (static) reports
cancel_promote_allowed=1.

Change-Id: Ic96c0e862e1e984a0966651ee8beb38eb54e7a85
1 parent 000050c commit 000015c

4 files changed

Lines changed: 308 additions & 104 deletions
