io: use poll() instead of select() to avoid an FD_SETSIZE hang (fixes #231) by dr-who · Pull Request #1018 · RsyncProject/rsync

dr-who · 2026-06-30T20:03:46Z

Problem (#231)

rsync's I/O loops — safe_read, safe_write, and the main perform_io multiplexer — waited for readiness with select() and fd_set bitmaps. An fd_set can only represent descriptors below FD_SETSIZE (1024 with glibc).

When rsync is started with many descriptors already open — inherited from a parent that leaked fds, a high ulimit -n, or a busy daemon — its own socket/pipe fds get allocated at or above 1024. FD_SET()/FD_ISSET() then index past the end of the fixed-size fd_set, which is undefined behavior: select() reports the fd ready, but FD_ISSET() reads the out-of-bounds bit as 0, so the read/write never happens and rsync spins at 100% CPU forever with no progress.
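
The ceiling can be observed from a short script. The sketch below is illustrative Python, not rsync's C code; `high_fd_demo` and its default target descriptor are hypothetical. It duplicates a pipe's read end onto a descriptor above 1024 and asks both select() and poll() about it. One caveat: CPython's select.select() range-checks descriptors and raises ValueError, whereas C's FD_SET() performs no check and indexes out of bounds, which is the failure mode described above.

```python
import fcntl
import os
import resource
import select

FD_SETSIZE = 1024  # assumption: the glibc value quoted above, hard-coded here


def high_fd_demo(target_fd=FD_SETSIZE + 76):
    """Return (select_ok, poll_ok) for a readable fd numbered >= target_fd.

    Returns None when RLIMIT_NOFILE cannot be raised far enough to try.
    Note: the raised soft limit is left in place for the calling process.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    needed = target_fd + 16  # small margin above the target descriptor
    unlimited = resource.RLIM_INFINITY
    if soft != unlimited and soft < needed:
        if hard != unlimited and hard < needed:
            return None
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (needed, hard))
        except (ValueError, OSError):
            return None

    r, w = os.pipe()
    high = -1
    try:
        # F_DUPFD picks the lowest free descriptor >= target_fd.
        high = fcntl.fcntl(r, fcntl.F_DUPFD, target_fd)
        os.write(w, b"x")  # make the read end readable

        try:
            readable, _, _ = select.select([high], [], [], 0)
            select_ok = high in readable
        except ValueError:
            # CPython rejects descriptors that do not fit in an fd_set.
            select_ok = False

        poller = select.poll()
        poller.register(high, select.POLLIN)
        poll_ok = any(fd == high and (ev & select.POLLIN)
                      for fd, ev in poller.poll(0))
        return (select_ok, poll_ok)
    finally:
        for fd in (r, w, high):
            if fd >= 0:
                os.close(fd)
```

On a system where the limit can be raised, poll() should report the high descriptor as readable while select() cannot represent it at all.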

This is the long-standing "rsync hangs at 100% CPU" report. It explains the MemorySanitizer use-of-uninitialized-value in perform_io reported in the issue thread (the out-of-bounds FD_ISSET read), and it matches Wayne's own hypothesis there ("maybe iobuf.in is greater than the default FD_SETSIZE"). The -vvv correlation was a red herring: the message-backlog deadlock is separately mitigated by the dynamic iobuf.msg growth.

Deterministic reproduction

Pre-open ~1100 inheritable fds so rsync's descriptors land above FD_SETSIZE, then run any transfer:

< ~1015 fds: fine.
≥ ~1015 fds (socket fd crosses 1024): hangs. strace shows pselect6(1106, [1105], …) = 1 returning "ready" instantly in a tight loop; one process pegged at 100% CPU in state R; 0 files transferred.
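
The reproduction steps above could be scripted roughly as follows. This is a sketch under assumptions, not the PR's actual test: the helper names (`raise_nofile_limit`, `open_dummy_fds`, `run_transfer`), the use of /dev/null as the dummy file, and the 30-second timeout are all hypothetical, and `rsync -a` stands in for "any transfer".

```python
import os
import resource
import subprocess


def raise_nofile_limit(needed):
    """Try to raise the soft RLIMIT_NOFILE to at least `needed`."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY
    if soft == unlimited or soft >= needed:
        return True
    if hard != unlimited and hard < needed:
        return False
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (needed, hard))
    except (ValueError, OSError):
        return False
    return True


def open_dummy_fds(count):
    """Open `count` descriptors that a child process will inherit."""
    fds = []
    for _ in range(count):
        fd = os.open(os.devnull, os.O_RDONLY)
        # Python opens fds non-inheritable by default, so opt in explicitly.
        os.set_inheritable(fd, True)
        fds.append(fd)
    return fds


def run_transfer(src, dst, timeout=30):
    """Run rsync with the dummy fds inherited.

    Raises subprocess.TimeoutExpired if rsync never finishes.
    """
    return subprocess.run(["rsync", "-a", src, dst],
                          close_fds=False, timeout=timeout)
```

A caller would raise the limit to a bit over 1100, call `open_dummy_fds(1100)`, and then `run_transfer(...)`; on the select() code the timeout is expected to fire, and with poll() the call should return promptly.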

Fix

Convert the three loops to poll(), which identifies descriptors by value in a small array and has no FD_SETSIZE ceiling, so a high-numbered fd works fine.

Not slower here. rsync only ever waits on a handful of fds (at most three in perform_io: in_fd, out_fd, and the files-from forward fd). The "poll is slower than select" effect only appears when watching thousands of descriptors; at N≤3, poll() is as fast as or faster than select() (it skips the FD_ZERO bitmap clears). epoll/kqueue would need more code, would be less portable, and would actually be slower for such a tiny, changing fd set.
Behavior-preserving. The same max_fd bookkeeping decides when there's nothing to wait on; each FD_ISSET maps to the matching pollfd.revents; the timeout is unchanged (now in milliseconds). The lone remaining select(0, …) is a pure timed sleep with no fds and is left as-is.
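
The select()-to-poll() mapping described above can be illustrated with a small sketch. It is written in Python for brevity and is not the patch itself, which operates on C `struct pollfd` arrays; `wait_for_io` is a hypothetical name, and treating POLLHUP/POLLERR as "readable" is an assumption about how to mirror select() semantics, not something confirmed from the diff.

```python
import os
import select


def wait_for_io(in_fd=-1, out_fd=-1, timeout_sec=1.0):
    """Wait until in_fd is readable or out_fd is writable.

    Returns (readable, writable). A negative fd means "not watched",
    and with nothing to watch the call returns immediately.
    """
    masks = {}
    if in_fd >= 0:
        masks[in_fd] = masks.get(in_fd, 0) | select.POLLIN
    if out_fd >= 0:
        masks[out_fd] = masks.get(out_fd, 0) | select.POLLOUT
    if not masks:
        return (False, False)

    poller = select.poll()
    for fd, mask in masks.items():
        # Descriptors are identified by value, so any fd number is fine.
        poller.register(fd, mask)

    # select() takes seconds and microseconds; poll() takes milliseconds.
    revents = dict(poller.poll(int(timeout_sec * 1000)))

    # Each former FD_ISSET() check becomes a test of that fd's revents.
    in_ev = revents.get(in_fd, 0)
    out_ev = revents.get(out_fd, 0)
    readable = bool(in_ev & (select.POLLIN | select.POLLHUP | select.POLLERR))
    writable = bool(out_ev & (select.POLLOUT | select.POLLERR))
    return (readable, writable)
```

For example, with `r, w = os.pipe()`, calling `wait_for_io(in_fd=r, out_fd=w, timeout_sec=0)` reports the write end as ready and the read end as not ready until something has been written to the pipe.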

Verified: with the fix, transfers complete correctly with 1100 and 5000 pre-opened fds (and the data matches); a normal low-fd transfer is unchanged.

Test

testsuite/highfd-hang_test.py opens enough inheritable dummy fds to push rsync's descriptors past FD_SETSIZE, then runs an ordinary transfer with close_fds=False. The hang is an infinite spin, so the instant-pass / never-finish cross-over is binary rather than a timing race. It hangs (caught by a timeout) on the select() code and passes instantly with poll(), and also verifies the transferred files are correct. It skips cleanly if RLIMIT_NOFILE can't be raised above FD_SETSIZE.

Full suite: 107 passed, 6 skipped, 0 failed.

…syncProject#231)

rsync's I/O loops (safe_read, safe_write, and the main perform_io multiplexer) waited for readiness with select() and fd_set bitmaps. An fd_set can only represent descriptors below FD_SETSIZE (1024 with glibc).

When rsync is started with many descriptors already open -- e.g. inherited from a parent process that leaked fds, a high "ulimit -n", or a busy daemon -- its own socket and pipe fds get allocated at or above 1024. FD_SET() and FD_ISSET() then index past the end of the fixed-size fd_set, which is undefined behavior: select() reports the fd as ready, but FD_ISSET() reads the out-of-bounds bit as 0, so the read or write never happens and rsync spins at 100% CPU forever with no progress.

This is the long-standing "rsync hangs at 100% CPU on large systems" report, and it matches the MemorySanitizer use-of-uninitialized-value seen in perform_io.

Convert the three loops to poll(), which identifies descriptors by value in a small array and has no FD_SETSIZE ceiling, so a high-numbered fd works fine. rsync only ever waits on a handful of fds (at most three in perform_io: in_fd, out_fd, and the files-from forward fd), so poll() is as fast as -- or faster than -- select() here; the select()-vs-poll() cost gap only appears when watching thousands of descriptors, which rsync never does. The remaining select(0, ...) call is a pure timed sleep with no fds and is unaffected.

The conversion is behavior-preserving: the same max_fd bookkeeping decides when there is nothing to wait on, the per-fd readiness checks map to the matching pollfd revents, and the timeout is the same (now expressed in milliseconds).

testsuite/highfd-hang_test.py reproduces the hang deterministically by opening enough inheritable dummy fds to push rsync's descriptors past FD_SETSIZE before an ordinary transfer; it hangs (caught by a timeout) on the select() code and passes instantly with poll().

dr-who wants to merge 1 commit into RsyncProject:master from dr-who:fix-231-poll-highfd
