cachedb_redis: add dynamic cluster topology management and observability by NormB · Pull Request #3855 · OpenSIPS/opensips

NormB · 2026-03-30T05:04:06Z

Summary

Replace static cluster topology with runtime discovery and automatic refresh
Add MI commands for cluster observability (redis_cluster_info, redis_cluster_refresh, redis_ping_nodes)
Add per-node statistics counters and module-level stat variables
Add hash tag {...} extraction and fix hash slot calculation
Add TCP keepalive for connection health detection

Details

The cachedb_redis module currently builds its cluster topology once during child_init via CLUSTER NODES and never updates it. Adding or removing a Redis node requires restarting OpenSIPS. This PR replaces that static model with full runtime topology management.

Topology discovery and refresh

At startup, the module probes the cluster using CLUSTER SHARDS (Redis 7+) with automatic fallback to CLUSTER SLOTS (Redis 3+). The topology is stored in an O(1) slot_table[16384] array mapping each slot directly to its owning master node.
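The slot table idea can be sketched as follows. This is a minimal illustration, not the module's code: `cluster_node`, `slot_map`, `slot_map_assign` and `slot_map_lookup` are hypothetical names, and the real node struct presumably carries connection state and counters as well.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REDIS_CLUSTER_SLOTS 16384

/* Hypothetical node record; only what the sketch needs. */
typedef struct cluster_node {
    const char *host;
    int port;
} cluster_node;

/* One entry per hash slot, pointing at the owning master (NULL = unmapped). */
typedef struct slot_map {
    cluster_node *slot_table[REDIS_CLUSTER_SLOTS];
} slot_map;

/* Assign an inclusive slot range to a master, as a CLUSTER SHARDS or
 * CLUSTER SLOTS reply would report it. Returns -1 on an invalid range. */
int slot_map_assign(slot_map *m, int first, int last, cluster_node *owner)
{
    assert(m != NULL);
    if (first < 0 || last >= REDIS_CLUSTER_SLOTS || first > last)
        return -1;
    for (int s = first; s <= last; s++)
        m->slot_table[s] = owner;
    return 0;
}

/* O(1) lookup. A NULL result means the slot has no known owner. */
cluster_node *slot_map_lookup(const slot_map *m, uint16_t slot)
{
    assert(m != NULL && slot < REDIS_CLUSTER_SLOTS);
    return m->slot_table[slot];
}
```

The array costs one pointer per slot (128 KiB with 8-byte pointers) in exchange for a lookup that does not depend on the number of nodes.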

The topology refreshes automatically when:

A MOVED redirection is received (permanent slot migration)
A connection failure occurs and the node cannot be reconnected
A query targets a slot with no known owner
An operator triggers redis_cluster_refresh via MI

Refreshes are rate-limited to once per second. The MI command bypasses the rate limit. During refresh, nodes not present in the new topology are pruned and their connections freed. New nodes are created and connected on demand.
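A rough sketch of the rate limit, assuming a per-process timestamp with one-second granularity. The names `refresh_state` and `refresh_allowed` are illustrative only, and the `force` flag stands in for the MI-triggered path that bypasses the limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define REFRESH_MIN_INTERVAL_SEC 1

typedef struct refresh_state {
    time_t last_refresh;      /* 0 = never refreshed */
    unsigned long refreshes;  /* number of refreshes actually started */
} refresh_state;

/* Returns true when a refresh may run at time 'now' and records it.
 * Automatic triggers pass force=false; the MI command passes force=true. */
bool refresh_allowed(refresh_state *st, time_t now, bool force)
{
    assert(st != NULL);
    if (!force && st->last_refresh != 0 &&
        now - st->last_refresh < REFRESH_MIN_INTERVAL_SEC)
        return false;         /* too soon after the previous refresh */
    st->last_refresh = now;
    st->refreshes++;
    return true;
}
```

With this shape, a burst of MOVED replies within the same second results in a single topology query, while an operator can still force one at any time.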

The redirect retry loop is capped at 5 total redirects to prevent a worker from hanging on a pathological cluster state.
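The cap can be pictured with a loop like the one below. It is a sketch under assumptions: `send_fn`, `query_with_redirects` and the stub sender are hypothetical, and the real loop also has to reconnect, switch nodes and send ASKING where needed.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_REDIRECTS 5

typedef enum { REPLY_OK, REPLY_REDIRECT, REPLY_ERROR } reply_kind;

/* Hypothetical callback: sends the query once and classifies the reply. */
typedef reply_kind (*send_fn)(void *ctx);

/* Follows at most MAX_REDIRECTS redirects. Returns 0 on success and -1 on
 * a hard error or when the cap is reached. */
int query_with_redirects(send_fn send, void *ctx, int *redirects_out)
{
    assert(send != NULL && redirects_out != NULL);
    int redirects = 0;
    int rc;
    for (;;) {
        reply_kind r = send(ctx);
        if (r == REPLY_OK) { rc = 0; break; }
        if (r == REPLY_ERROR) { rc = -1; break; }
        if (redirects == MAX_REDIRECTS) { rc = -1; break; } /* give up */
        redirects++;
    }
    *redirects_out = redirects;
    return rc;
}

/* Demo stub: redirects 'remaining' times, then succeeds. */
typedef struct stub_ctx { int remaining; int sends; } stub_ctx;

reply_kind stub_send(void *p)
{
    stub_ctx *c = p;
    c->sends++;
    if (c->remaining > 0) { c->remaining--; return REPLY_REDIRECT; }
    return REPLY_OK;
}
```

A cluster that keeps redirecting then costs the worker at most six sends per query instead of an unbounded loop.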

Hash slot correctness

The existing redisHash() uses crc16(key) & con->slots_assigned, which is equivalent to a modulo only when slots_assigned is 2^n - 1, and matches the Redis Cluster slot only when it is exactly 16383. This is replaced with crc16(key) % 16384 per the Redis Cluster specification.
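To make the difference concrete, the sketch below compares the two formulas over every possible 16-bit CRC value. The helper names are hypothetical, and the mask value 5460 is only an example of a non-2^n - 1 value, not something taken from the module.

```c
#include <assert.h>
#include <stdint.h>

#define REDIS_CLUSTER_SLOTS 16384u

/* Old style: bitwise AND with a mask value. */
uint16_t slot_by_mask(uint16_t crc, uint16_t mask)
{
    return (uint16_t)(crc & mask);
}

/* New style: modulo the fixed slot count from the Redis Cluster spec. */
uint16_t slot_by_modulo(uint16_t crc)
{
    return (uint16_t)(crc % REDIS_CLUSTER_SLOTS);
}

/* Counts the CRC values for which the two formulas disagree. */
unsigned count_mismatches(uint16_t mask)
{
    /* The AND trick relies on the slot count being a power of two. */
    assert((REDIS_CLUSTER_SLOTS & (REDIS_CLUSTER_SLOTS - 1u)) == 0);
    unsigned n = 0;
    for (uint32_t crc = 0; crc <= 0xFFFFu; crc++)
        if (slot_by_mask((uint16_t)crc, mask) != slot_by_modulo((uint16_t)crc))
            n++;
    return n;
}
```

For example, the CRC value 0x31C3 maps to slot 12739 with the modulo, and to the same slot with mask 16383, but to 4416 with mask 5460.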

Hash tag extraction is added: if a key contains {substring}, only the substring between the first { and the next } is hashed, enabling key co-location (e.g., {user:1000}.session and {user:1000}.profile land on the same slot).
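A minimal sketch of hash tag extraction followed by the slot calculation, written from the Redis Cluster specification rather than from the module's source. `crc16_xmodem` and `key_hash_slot` are illustrative names, and the bitwise CRC is used for brevity where a production implementation would normally use a lookup table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* CRC16-CCITT (XMODEM): polynomial 0x1021, initial value 0. */
uint16_t crc16_xmodem(const char *buf, size_t len)
{
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((unsigned char)buf[i] << 8);
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

/* Hashes only the text between the first '{' and the next '}' when that
 * text is non-empty; otherwise hashes the whole key. */
uint16_t key_hash_slot(const char *key, size_t len)
{
    assert(key != NULL);
    const char *open = memchr(key, '{', len);
    if (open != NULL) {
        const char *start = open + 1;
        size_t rest = len - (size_t)(start - key);
        const char *close = memchr(start, '}', rest);
        if (close != NULL && close > start)
            return (uint16_t)(crc16_xmodem(start, (size_t)(close - start)) % 16384);
    }
    return (uint16_t)(crc16_xmodem(key, len) % 16384);
}
```

An empty tag such as `foo{}bar` and an unterminated brace such as `{abc` both fall back to hashing the whole key, as the specification requires.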

MI commands

| Command | Description |
|---|---|
| `redis_cluster_info` | Topology dump: nodes, slots, per-node counters (queries, errors, moved), connection status, last activity. Optional `group` filter. |
| `redis_cluster_refresh` | Trigger topology refresh. Optional `group` filter. |
| `redis_ping_nodes` | PING each node with microsecond latency reporting. Optional `group` filter. |

Statistics

Module-level stat counters: redis_queries, redis_queries_failed, redis_moved, redis_topology_refreshes

New parameter

| Parameter | Type | Default | Description |
|---|---|---|---|
| `redis_keepalive` | integer | 10 | TCP keepalive interval in seconds (0 to disable) |
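For readers unfamiliar with TCP keepalive, the sketch below shows one way an interval like this could be applied to a socket with plain `setsockopt` calls. It is an assumption about the mechanism, not the module's implementation: `set_tcp_keepalive` is a hypothetical helper, the probe count of 3 is arbitrary, and the `TCP_KEEP*` options are platform-specific (guarded with `#ifdef`).

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Enables keepalive with the given interval in seconds; 0 disables it.
 * Returns 0 on success, -1 if any setsockopt call fails. */
int set_tcp_keepalive(int fd, int interval_sec)
{
    assert(interval_sec >= 0);
    int on = interval_sec > 0;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    if (!on)
        return 0;
#ifdef TCP_KEEPIDLE
    /* Idle time before the first probe is sent. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
                   &interval_sec, sizeof(interval_sec)) < 0)
        return -1;
#endif
#ifdef TCP_KEEPINTVL
    /* Time between unanswered probes. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
                   &interval_sec, sizeof(interval_sec)) < 0)
        return -1;
#endif
#ifdef TCP_KEEPCNT
    /* Unanswered probes before the connection is dropped (arbitrary 3). */
    int probes = 3;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof(probes)) < 0)
        return -1;
#endif
    return 0;
}
```

With keepalive enabled, a peer that disappears without closing the connection is eventually reported as a socket error instead of leaving the connection half-open indefinitely.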

Testing

170 tests across 7 suites:

| Suite | Tests | Description |
|---|---|---|
| test_hash (C) | 37 | Hash slot calculation, hash tags, edge cases |
| test_mi_counters (C) | 34 | MI counter helpers, slot counting, struct layout |
| test_topology_startup | 12 | Slot coverage, multi-node routing, random keys |
| test_topology_refresh | 13 | Dynamic topology updates after slot migration |
| test_mi_commands | 49 | All MI commands, statistics, health checks |
| test_edge_cases | 10 | Node add/remove, outage recovery |
| test_load | 6 | Memory leak detection under topology changes |

All integration tests include trap EXIT cleanup handlers to restore cluster state on unexpected exit.

Compatibility

Plain keys (no {...}) produce identical slot values — no behavioral change for existing deployments
Keys with hash tags will route to different (correct per Redis spec) slots
CLUSTER NODES parsing is replaced by CLUSTER SHARDS/CLUSTER SLOTS
The slots_assigned field is removed from redis_con

Dependencies

Depends on cachedb_redis: add Redis cluster hash tag support to redisHash #3815 (hash tag extraction and modulo fix)
Depends on cachedb_redis: add ASK redirect handling for cluster resharding #3852 (ASK redirect handling)
Depends on cachedb_redis: fix safety issues in cluster redirect parsing #3854 (safety fixes)

This PR includes hash tag support for standalone functionality. If #3815 merges first, the overlap resolves cleanly on rebase.

Closing issues

Partially addresses #2811

Fix several correctness and safety issues in parse_moved_reply() and the MOVED redirect handler:
- Add slot value overflow protection: return ERR_INVALID_SLOT when parsed slot exceeds 16383 during digit accumulation, preventing signed integer overflow on malformed MOVED replies.
- Add port value overflow protection: return ERR_INVALID_PORT when parsed port exceeds 65535 during digit accumulation, complementing the existing post-loop range check and preventing signed integer overflow on malformed input.
- Fix undefined behavior in the no-colon endpoint fallback path: replace comparison of potentially-NULL out->endpoint.s against end pointer with (p < end), which achieves the same logic using the scan position variable that is always valid.
- Replace pkg_malloc heap allocation of redis_moved struct with stack allocation in the MOVED handler. The struct is small (~24 bytes) and never outlives the enclosing scope, making heap allocation unnecessary. This eliminates the OOM error path and two pkg_free() calls.
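The overflow protection described in this commit amounts to checking the bound inside the digit loop rather than only after it. A sketch of that pattern follows; `parse_bounded` is a hypothetical helper, and the numeric values given to ERR_INVALID_SLOT and ERR_INVALID_PORT here are placeholders, since the commit message names the codes but not their values.

```c
#include <assert.h>
#include <stddef.h>

#define ERR_INVALID_SLOT (-2)  /* placeholder value for illustration */
#define ERR_INVALID_PORT (-3)  /* placeholder value for illustration */

/* Parses decimal digits starting at *pp, stopping at the first non-digit
 * or at 'end'. Returns the value and advances *pp, or returns 'err' when
 * there are no digits or the value exceeds 'max'. */
int parse_bounded(const char **pp, const char *end, int max, int err)
{
    assert(pp != NULL && *pp != NULL && end != NULL);
    const char *p = *pp;
    int val = 0;

    if (p >= end || *p < '0' || *p > '9')
        return err;
    while (p < end && *p >= '0' && *p <= '9') {
        val = val * 10 + (*p - '0');
        /* Checking here keeps val <= max before the next multiply, so the
         * accumulation cannot overflow a signed int for small bounds. */
        if (val > max)
            return err;
        p++;
    }
    *pp = p;
    return val;
}
```

A slot would be parsed with `max` 16383 and a port with `max` 65535. An input such as twenty consecutive 9s is rejected after the fifth or sixth digit instead of wrapping around.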

Replace the static cluster topology (built once at startup, never refreshed) with runtime discovery and automatic refresh:

Topology discovery and refresh:
- Probe CLUSTER SHARDS (Redis 7+) with fallback to CLUSTER SLOTS (Redis 3+) for backward compatibility
- O(1) slot_table[16384] lookup replaces per-query linked-list scan
- Automatic topology refresh on MOVED redirect, connection failure, or query targeting an unmapped slot (rate-limited to 1/sec)
- Dynamic node creation when MOVED points to an unknown endpoint
- Stale node pruning during refresh with safe connection cleanup
- Cap redirect loop at 5 max redirects to prevent worker hang on pathological cluster state

Cluster observability via MI commands:
- redis_cluster_info: full topology dump including per-node connection status, slot assignments, query/error/moved/ask counters, and last activity timestamp
- redis_cluster_refresh: trigger manual topology refresh (bypasses rate limit)
- redis_ping_nodes: per-node PING with microsecond latency reporting
- All MI commands support optional group filter parameter

Statistics:
- redis_queries, redis_queries_failed, redis_moved, redis_ask, redis_topology_refreshes (module-level stat counters)
- Per-node query, error, moved, ask counters in redis_cluster_info

Hash slot correctness:
- Hash tag {…} extraction per Redis Cluster specification
- CRC16 modulo 16384 replaces bitwise AND with slots_assigned

ASK redirect handling:
- Detect ASK responses alongside existing MOVED handling
- Send ASKING command to target node before retrying original query
- Do not update slot map (ASK is a temporary mid-migration redirect)
- Refactor parse_moved_reply into parse_redirect_reply with prefix parameter; inline wrappers for backward compatibility

Connection reliability:
- TCP keepalive via redis_keepalive parameter (default 10s)
- Stack allocation for redis_moved structs (eliminates OOM paths)
- NULL guards on malformed CLUSTER SHARDS/SLOTS reply elements
- Integer overflow protection in slot and port parsing
- NULL guards in MI command handlers for group_name/initial_url

Documentation:
- New section: Redis Cluster Support (topology discovery, automatic refresh, MOVED/ASK handling, hash tags)
- MI command reference: redis_cluster_info, redis_cluster_refresh, redis_ping_nodes
- Authentication URL format documentation (classic, ACL, no-auth)
- New parameter: redis_keepalive

Test suite (186 tests):
- C unit tests: hash slot calculation (37), MI counter helpers (41)
- Integration: topology startup (12), ASK redirect (16), topology refresh (13), MI commands (50), edge cases (16)
- Trap EXIT handlers for safe cluster state restoration
- python3 preflight checks for JSON-dependent tests

Depends on: OpenSIPS#3815 (hash tag + modulo fix), OpenSIPS#3852 (ASK redirect)
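The MOVED/ASK split can be illustrated with a single parser that takes the expected prefix, in the spirit of the parse_redirect_reply refactor mentioned above. Everything below is a hypothetical sketch: the struct layout, function names and return codes are assumptions, and it only handles the `PREFIX slot host:port` form, not the no-colon endpoint fallback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef struct redirect {
    int slot;
    char host[64];
    int port;
} redirect;

typedef enum { REDIR_NONE, REDIR_MOVED, REDIR_ASK } redir_kind;

/* Parses "<prefix> <slot> <host>:<port>". Returns 0 on success, -1 if the
 * reply does not start with the prefix or is malformed. */
int parse_redirect(const char *reply, const char *prefix, redirect *out)
{
    assert(reply != NULL && prefix != NULL && out != NULL);
    size_t n = strlen(prefix);
    char *end;

    if (strncmp(reply, prefix, n) != 0 || reply[n] != ' ')
        return -1;
    const char *p = reply + n + 1;
    long slot = strtol(p, &end, 10);          /* saturates, never overflows */
    if (end == p || *end != ' ' || slot < 0 || slot > 16383)
        return -1;
    p = end + 1;
    const char *colon = strrchr(p, ':');      /* last ':' separates the port */
    if (colon == NULL || colon == p ||
        (size_t)(colon - p) >= sizeof(out->host))
        return -1;
    long port = strtol(colon + 1, &end, 10);
    if (end == colon + 1 || *end != '\0' || port < 1 || port > 65535)
        return -1;
    memcpy(out->host, p, (size_t)(colon - p));
    out->host[colon - p] = '\0';
    out->slot = (int)slot;
    out->port = (int)port;
    return 0;
}

/* Tries both redirect kinds on an error reply string. */
redir_kind classify_redirect(const char *err, redirect *out)
{
    if (parse_redirect(err, "MOVED", out) == 0) return REDIR_MOVED;
    if (parse_redirect(err, "ASK", out) == 0) return REDIR_ASK;
    return REDIR_NONE;
}

/* Only MOVED is permanent; ASK applies to the next query alone. */
bool should_update_slot_map(redir_kind kind)
{
    return kind == REDIR_MOVED;
}
```

After an ASK result, the caller would send ASKING to the target node and retry the query there, leaving the slot table untouched.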

Debian added 2 commits March 30, 2026 04:30

This was referenced Mar 30, 2026

cachedb_redis: add Unix socket transport and lazy connection #3856

Open

cachedb_redis: add Redis cluster hash tag support to redisHash #3815

Open

NormB wants to merge 2 commits into OpenSIPS:master from NormB:mr/feature-redis-cluster-management
