Environment
- python-socketio 5.16.1 (server, async_mode='asgi'), redis-py 7.4.0
- Valkey 8 (alpine), used via AsyncRedisManager (redis://…/1)
- uvicorn with 8 workers (one AsyncServer + AsyncRedisManager per worker), Engine.IO v4, websocket-only transport
- Deployment: Open-WebUI v0.9.6 in production (~150 users)
Summary
After Redis force-closed all pub/sub subscriber connections for exceeding client-output-buffer-limit pubsub (a single ~33MB published message hit the 32mb hard limit), 7 of our 8 workers' listeners recovered via _redis_listen_with_retries, but one worker's listener died permanently. Its process stayed alive and kept accepting websocket connections, but it never re-subscribed to the channel, so every client attached to that worker silently stopped receiving events until the service was restarted. PUBSUB NUMSUB socketio reported 7 subscribers for 8 workers for ~24h.
With the default logger=False, the death is completely silent — there is nothing in the logs to distinguish a healthy worker from a deaf one.
What Redis logged (kill storm)
1:M 10 Jun 2026 20:51:58 # Client id=543906 … flags=P … cmd=subscribe …
omem=33554456 tot-mem=33557200 … scheduled to be closed ASAP for overcoming of output buffer limits.
(3 rounds of kills within 10 minutes; each round disconnected every subscriber. The retry path logged Cannot receive from redis... retrying in 1 secs in bursts and recovered — except for one worker.)
Code analysis (5.16.1)
- AsyncRedisManager._redis_listen_with_retries only catches the Redis client's error class (redis.exceptions.RedisError / ValkeyError). Any other exception escaping pubsub.subscribe() / pubsub.listen() (e.g. redis-py's asyncio PubSub can raise plain RuntimeError for connection-state issues) propagates out of the generator.
- In AsyncPubSubManager._thread():
  - the outer except Exception logs to self.server.logger, which is invisible with the default logger=False;
  - if the _listen() generator ever exits instead of raising, the loop hits self.server.logger.error('pubsub listen() exited unexpectedly') followed by break: the background task ends permanently, with no recovery and (by default) no visible trace.
Because logging was disabled, we could not capture the exception that actually escaped. The observable outcome was a permanently dead listener following the buffer-limit kill storm, while sibling workers recovered.
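The suspected failure mode can be reproduced in miniature. This is a simplified sketch, not the python-socketio source: FakeClientError, listen_with_retries and listener_task are hypothetical stand-ins for the client error class, the retrying generator and the background task. It only illustrates how a narrow except clause lets a different exception type end the listener for good.

```python
import asyncio


class FakeClientError(Exception):
    """Hypothetical stand-in for the Redis client's error base class."""


async def listen_with_retries(attempts):
    """Retry only on FakeClientError, as a narrow `except` clause would.

    `attempts` holds one entry per simulated connection attempt:
    an exception instance to raise, or None for a successful receive.
    """
    for outcome in attempts:
        try:
            if outcome is not None:
                raise outcome
            yield "message"
        except FakeClientError:
            # Expected client error: move on to the next connection attempt.
            continue


async def listener_task(attempts, trace):
    """Simplified background task: a single outer try and no restart."""
    try:
        async for msg in listen_with_retries(attempts):
            trace.append(msg)
    except Exception as exc:
        # If this were only written to a disabled logger, nothing would remain.
        trace.append(f"listener died: {type(exc).__name__}")


if __name__ == "__main__":
    trace = []
    attempts = [FakeClientError(), None, RuntimeError("bad state"), None]
    asyncio.run(listener_task(attempts, trace))
    print(trace)
```

The FakeClientError attempt is retried and the next attempt delivers a message, but the RuntimeError escapes the generator, so the final (healthy) attempt is never reached.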
Suggested improvements
Any of these would have avoided the silent permanent failure:
- Broaden the retry in _redis_listen_with_retries to also retry on connection-layer exceptions that are not subclasses of RedisError (or simply except Exception), since the loop already reconnects from scratch.
- In _thread(), restart _listen() instead of break-ing permanently when the generator exits.
- Log listener death at logging.getLogger('socketio') level regardless of the logger=False convenience flag, or expose a health indicator (e.g. a listening property / callback) so multi-worker deployments can monitor listener liveness. Today the only reliable external check we found is comparing PUBSUB NUMSUB <channel> against the worker count.
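As a rough illustration of the first three suggestions combined, here is a minimal supervisor sketch. It is not a patch against python-socketio: supervise, make_listener, handle, retry_delay and max_runs are hypothetical names, and the logger name is an assumption. The loop restarts the listener both when it raises and when it exits quietly, and it logs through the standard logging module so the event is visible independently of any per-server logger flag.

```python
import asyncio
import logging

# Assumed logger name, chosen for illustration only.
log = logging.getLogger("socketio.listener_supervisor")


async def supervise(make_listener, handle, *, retry_delay=1.0, max_runs=None):
    """Consume from make_listener() and restart it on any failure or exit.

    make_listener: zero-argument callable returning a fresh async iterator.
    handle: callback invoked with each received message.
    max_runs: optional cap on listener runs, intended for demos and tests;
              None means supervise forever.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        runs += 1
        try:
            async for message in make_listener():
                handle(message)
            # Reaching this line means the iterator ended without an error.
            log.error("listener exited without raising; restarting")
        except Exception:
            # CancelledError is a BaseException on Python 3.8+, so task
            # cancellation during shutdown still propagates past this clause.
            log.exception("listener failed; restarting")
        await asyncio.sleep(retry_delay)
```

A production version would likely add backoff and a liveness flag that a health endpoint could read; both are left out to keep the sketch short.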
Workaround we applied
Raised Valkey's client-output-buffer-limit pubsub (32mb→64mb hard) to stop the kills at the source, plus an external cron alert on PUBSUB NUMSUB socketio < workers.
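The external check can be sketched as below. This is an assumption-laden illustration rather than our exact cron script: deaf_workers and check are hypothetical helper names, and the logic assumes that the workers are the only subscribers on the channel. It relies on redis-py's pubsub_numsub(), which returns a list of (channel, count) pairs.

```python
def deaf_workers(numsub_reply, channel, expected_workers):
    """Return how many expected subscribers are missing from `channel`.

    numsub_reply: list of (channel, count) pairs as returned by redis-py's
    pubsub_numsub(); channel names may arrive as bytes or str.
    """
    counts = {
        (name.decode() if isinstance(name, bytes) else name): int(count)
        for name, count in numsub_reply
    }
    return max(expected_workers - counts.get(channel, 0), 0)


def check(url, channel="socketio", expected_workers=8):
    """Query the server at `url` and return the number of missing subscribers."""
    import redis  # imported lazily so deaf_workers() has no dependency

    client = redis.Redis.from_url(url)
    try:
        reply = client.pubsub_numsub(channel)
        return deaf_workers(reply, channel, expected_workers)
    finally:
        client.close()
```

A cron job could call check() with the deployment's connection URL and alert whenever the result is non-zero.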
Related