@NormB NormB commented Feb 11, 2026

Summary

Fixes #3789 ([BUG] sockets_mgm module causes high CPU usage and "freezes" OpenSIPS): a sockets_mgm reload causes all OpenSIPS worker processes to spin at 100% CPU in an infinite loop inside push_sock2list() / sock_listadd().

Root Cause

During sockets_reload, every process receives an IPC RPC to run rpc_socket_reload_proc(). Worker (non-dynamic) processes close their copy of each dynamic socket, then call receive_fd() on the shared sock_mgm_unix[0] socketpair to obtain a fresh fd from the mgm process.

Since sock_mgm_unix[0] is a single SOCK_STREAM socket shared across all worker processes — a byte stream with no per-message boundaries — concurrent receive_fd() calls race: worker A can consume the fd+metadata response intended for worker B. When this happens, worker B receives the wrong struct sock_mgm * pointer — one that references a socket already present in worker B's listener list. The sock_listadd() macro then corrupts the linked list into a circular loop (si->next == si), causing push_sock2list() to spin indefinitely.

GDB confirmed: si->next pointed to itself, and protos[1].listeners formed an infinite cycle through the same node.

Fix

Add a sock_mgm_reload_lock that serializes the entire send-IPC-to-mgm + receive-fd sequence across worker processes, ensuring only one worker at a time performs the fd-passing handshake on the shared socketpair.

Dynamic (mgm) processes are excluded from this lock because they create sockets directly via sock_mgm_add_listener() and never call receive_fd(). Including them would deadlock: the worker holding the lock blocks on receive_fd() waiting for the mgm process to handle rpc_sockets_send(), but the mgm process would be spinning on the same lock.

Changes

  • sockets_mgm.c: Add sock_mgm_reload_lock (shared memory lock), allocated and initialized in mod_init(). In rpc_socket_reload_proc(), non-dynamic processes acquire the lock before the reload sequence and release it after all receive_fd() calls complete.
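In OpenSIPS terms, the lifecycle described in the bullet above might look like the following sketch. It is not the patch itself, just the shape of it, using the generic shared-memory locking API from locking.h (lock_alloc / lock_init / lock_get / lock_release); exact placement and error handling in the real change may differ:

```c
#include "../../locking.h"

static gen_lock_t *sock_mgm_reload_lock;

/* mod_init(): allocate and initialize the lock in shared memory */
sock_mgm_reload_lock = lock_alloc();
if (!sock_mgm_reload_lock || !lock_init(sock_mgm_reload_lock)) {
    LM_ERR("could not initialize sockets reload lock\n");
    return -1;
}

/* rpc_socket_reload_proc(), non-dynamic (worker) processes only */
lock_get(sock_mgm_reload_lock);
/* ... close stale fds, signal the mgm process,
 *     then receive_fd() for each dynamic socket ... */
lock_release(sock_mgm_reload_lock);
```

Dynamic (mgm) processes skip the lock_get()/lock_release() pair entirely, per the deadlock analysis above.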

Test plan

  • Start OpenSIPS with 2 dynamic UDP sockets from DB — 22 processes, 16 dynamic socket entries, all sleeping
  • 1st sockets_reload (add 3rd socket) — MI returns OK, 24 socket entries, no CPU spinning
  • 2nd sockets_reload (add 4th socket) — MI returns OK, 32 socket entries, no CPU spinning
  • All processes remain in S (sleeping) state throughout — verified with ps and top
  • Clean shutdown after testing
