sockets_mgm: fix race in receive_fd causing infinite loop on reload #3820
+24
−0
Summary
Fixes #3789: a `sockets_mgm` reload causes all OpenSIPS worker processes to spin at 100% CPU in an infinite loop inside `push_sock2list()` → `sock_listadd()`.

Root Cause
During `sockets_reload`, every process receives an IPC RPC to run `rpc_socket_reload_proc()`. Worker (non-dynamic) processes close their copy of each dynamic socket, then call `receive_fd()` on the shared `sock_mgm_unix[0]` socketpair to obtain a fresh fd from the mgm process.

Since `sock_mgm_unix[0]` is a single `SOCK_STREAM` socket shared across all worker processes, concurrent `receive_fd()` calls race: worker A can consume the fd+metadata response intended for worker B. When this happens, worker B receives the wrong `struct sock_mgm *` pointer, one that references a socket already present in worker B's listener list. The `sock_listadd()` macro then corrupts the linked list into a circular loop (`si->next == si`), causing `push_sock2list()` to spin indefinitely.

GDB confirmed: `si->next` pointed to itself, and `protos[1].listeners` formed an infinite cycle through the same node.

Fix
Add a `sock_mgm_reload_lock` that serializes the entire send-IPC-to-mgm + receive-fd sequence across worker processes, ensuring only one worker at a time performs the fd-passing handshake on the shared socketpair.

Dynamic (mgm) processes are excluded from this lock because they create sockets directly via `sock_mgm_add_listener()` and never call `receive_fd()`. Including them would deadlock: the worker holding the lock blocks on `receive_fd()` waiting for the mgm process to handle `rpc_sockets_send()`, but the mgm process would be spinning on the same lock.

Changes
`sockets_mgm.c`: Add `sock_mgm_reload_lock` (shared memory lock), allocated and initialized in `mod_init()`. In `rpc_socket_reload_proc()`, non-dynamic processes acquire the lock before the reload sequence and release it after all `receive_fd()` calls complete.

Test plan
- `sockets_reload` (add 3rd socket): MI returns OK, 24 socket entries, no CPU spinning
- `sockets_reload` (add 4th socket): MI returns OK, 32 socket entries, no CPU spinning
- `ps` and `top`
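
For reviewers who want to see how a duplicate insert can yield the `si->next == si` state described under Root Cause, here is a minimal standalone sketch. It is not the OpenSIPS code: `struct node`, `list_append()` and `has_cycle()` are hypothetical stand-ins, and the append is a deliberately naive tail insert that assumes the element is not yet in the list, which may differ from how `sock_listadd()` is actually written.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a listener entry; not the OpenSIPS struct. */
struct node {
    int id;
    struct node *next;
    struct node *prev;
};

/*
 * Naive tail append. It assumes `el` is NOT already linked into the list,
 * which is the assumption the race in receive_fd() ends up violating.
 */
static void list_append(struct node **head, struct node *el)
{
    struct node *tail;

    assert(head != NULL && el != NULL);

    if (*head == NULL) {
        *head = el;
        return;
    }
    /* Walk to the current tail. */
    for (tail = *head; tail->next != NULL; tail = tail->next)
        ;
    /* If `el` is already the tail, this makes el->next point at el. */
    tail->next = el;
    el->prev = tail;
}

/*
 * Bounded walk: returns 1 if the list has not terminated after `limit`
 * steps (a cycle is suspected), 0 if it ended in NULL.
 */
static int has_cycle(const struct node *head, int limit)
{
    int steps = 0;

    for (; head != NULL; head = head->next) {
        if (++steps > limit)
            return 1;
    }
    return 0;
}
```

Appending a node that is already the tail turns `tail->next = el` into `el->next = el`, so any unbounded walk over the list never reaches `NULL`. The lock added in this PR avoids the situation at its source by making sure each worker only ever receives its own fd+metadata response.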