AsyncRedisManager listener can die permanently (and silently with default logger) after Redis closes pubsub clients for exceeding client-output-buffer-limit #1581

@lhgarciadev

Environment

  • python-socketio 5.16.1 (server, async_mode='asgi'), redis-py 7.4.0
  • Valkey 8 (alpine), used via AsyncRedisManager (redis://…/1)
  • uvicorn with 8 workers (one AsyncServer + AsyncRedisManager per worker), Engine.IO v4, websocket-only transport
  • Deployment: Open-WebUI v0.9.6 in production (~150 users)

Summary

Redis force-closed all pub/sub subscriber connections for exceeding client-output-buffer-limit pubsub (a single ~33MB published message hit the 32mb hard limit). The listeners on 7 of our 8 workers recovered via _redis_listen_with_retries, but one worker's listener died permanently. Its process stayed alive and kept accepting websocket connections, but it never re-subscribed to the channel, so every client attached to that worker silently stopped receiving events until the service was restarted. For ~24h, PUBSUB NUMSUB socketio reported 7 subscribers against 8 workers.

With the default logger=False, the death is completely silent: nothing in the logs distinguishes a healthy worker from a deaf one.

What Redis logged (kill storm)

1:M 10 Jun 2026 20:51:58 # Client id=543906 … flags=P … cmd=subscribe …
  omem=33554456 tot-mem=33557200 … scheduled to be closed ASAP for overcoming of output buffer limits.

(3 rounds of kills within 10 minutes; each round disconnected every subscriber. The retry path logged "Cannot receive from redis... retrying in 1 secs" in bursts and recovered, except on one worker.)
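To make the numbers in the log excerpt concrete, here is a small arithmetic check. It assumes the 32mb hard limit was the trigger, as the summary states, and relies on Redis/Valkey config units where "mb" means 1024 * 1024 bytes. The constant names are illustrative only.

```python
# Redis/Valkey config units: "mb" is 1024 * 1024 bytes (not 10**6).
HARD_LIMIT_BYTES = 32 * 1024 * 1024  # client-output-buffer-limit pubsub hard limit (32mb)
OMEM_BYTES = 33_554_456              # omem value copied from the log line above

# How far the subscriber's output buffer was past the hard limit when killed.
overshoot = OMEM_BYTES - HARD_LIMIT_BYTES

print(HARD_LIMIT_BYTES)               # the limit expressed in bytes
print(overshoot)                      # bytes over the limit
print(OMEM_BYTES > HARD_LIMIT_BYTES)  # True means the client is eligible to be closed
```

The buffer was only marginally past the limit at the moment the client was flagged, which is consistent with a single large message crossing the threshold rather than a slow backlog.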

Code analysis (5.16.1)

  1. AsyncRedisManager._redis_listen_with_retries catches only the Redis client's error base class (redis.exceptions.RedisError or ValkeyError). Any other exception escaping pubsub.subscribe() or pubsub.listen() propagates out of the generator; for example, redis-py's asyncio PubSub can raise a plain RuntimeError for connection-state issues.

  2. In AsyncPubSubManager._thread():

    • the outer except Exception logs to self.server.logger — invisible with the default logger=False;
    • if the _listen() generator ever exits instead of raising, the loop hits self.server.logger.error('pubsub listen() exited unexpectedly') followed by break — the background task ends permanently, with no recovery and (by default) no visible trace.

Because logging was disabled, we could not capture the exact exception that escaped. The observable outcome, however, was a permanently dead listener following the buffer-limit kill storm, while sibling workers recovered.
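The two points above can be illustrated with a simplified, self-contained model. This is not the python-socketio code: FakeRedisError, listen_with_narrow_retry, and consume are hypothetical stand-ins that only mimic the described behaviour (a retry that catches one error class, and a consumer that stops when the generator exits).

```python
import asyncio


class FakeRedisError(Exception):
    """Hypothetical stand-in for the Redis client's error base class."""


async def listen_with_narrow_retry(script):
    """Retry only on FakeRedisError; any other exception escapes.

    `script` is a list of events consumed one per step: an Exception
    instance is raised, any other value is yielded as a message.
    The generator exits once the script is exhausted.
    """
    while script:
        try:
            event = script.pop(0)
            if isinstance(event, Exception):
                raise event
            yield event
        except FakeRedisError:
            await asyncio.sleep(0)  # "reconnect" and keep going


async def consume(script):
    """Simplified model of the listening task: collect messages, report how it ended."""
    received = []
    try:
        async for message in listen_with_narrow_retry(script):
            received.append(message)
        outcome = "generator exited"  # the described code logs, then breaks
    except Exception as exc:
        outcome = f"escaped: {type(exc).__name__}"
    return received, outcome


# A RedisError-like failure is retried, so later messages still arrive.
recovered = asyncio.run(consume([FakeRedisError("closed"), "a", "b"]))
# A plain RuntimeError is not retried, so the message behind it is never delivered.
lost = asyncio.run(consume([RuntimeError("bad state"), "a"]))

print(recovered)
print(lost)
```

In the second run the listener stops without delivering anything, and unless the except branch logs somewhere visible, nothing records why.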

Suggested improvements

Any of these would have avoided the silent permanent failure:

  1. Broaden the retry in _redis_listen_with_retries to also retry on connection-layer exceptions that are not subclasses of RedisError (or simply except Exception), since the loop already reconnects from scratch.
  2. In _thread(), restart _listen() instead of break-ing permanently when the generator exits.
  3. Log listener death at logging.getLogger('socketio') level regardless of the logger=False convenience flag, or expose a health indicator (e.g. a listening property / callback) so multi-worker deployments can monitor listener liveness — today the only reliable external check we found is comparing PUBSUB NUMSUB <channel> against the worker count.
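A minimal sketch of how suggestions 1 to 3 could fit together, written as a standalone supervisor rather than a patch to the library. The names supervise, listen_factory, handle, and max_restarts are hypothetical; max_restarts exists only so the example terminates, and a real listener would run until cancelled.

```python
import asyncio
import logging

# Stdlib logger used directly, so failures are visible regardless of logger=False.
log = logging.getLogger("socketio")


async def supervise(listen_factory, handle, retry_delay=1.0, max_restarts=None):
    """Consume messages from listen_factory(), restarting it on any failure.

    The generator is restarted both when it raises and when it simply exits.
    Returns the number of restarts performed (only reachable with max_restarts).
    """
    restarts = 0
    while max_restarts is None or restarts <= max_restarts:
        try:
            async for message in listen_factory():
                await handle(message)
            log.error("listener generator exited; restarting")
        except asyncio.CancelledError:
            raise  # let normal shutdown proceed
        except Exception:
            log.exception("listener failed; restarting")
        restarts += 1
        await asyncio.sleep(retry_delay)
    return restarts


async def demo():
    attempts = []
    received = []

    async def flaky_listen():
        attempts.append(len(attempts) + 1)
        if len(attempts) == 1:
            raise RuntimeError("connection state error")  # not a RedisError
        if len(attempts) == 2:
            return  # generator exits without raising
        yield "event"

    async def handle(message):
        received.append(message)

    await supervise(flaky_listen, handle, retry_delay=0, max_restarts=2)
    return len(attempts), received


attempt_count, received = asyncio.run(demo())
print(attempt_count, received)
```

The design choice here is to treat "generator raised" and "generator exited" identically: both are logged and both lead to a fresh attempt after a delay, so neither path can end the listener silently.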

Workaround we applied

We raised Valkey's client-output-buffer-limit pubsub hard limit from 32mb to 64mb to stop the kills at the source, and added an external cron alert that fires when PUBSUB NUMSUB socketio reports fewer subscribers than workers.
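The subscriber-count check can be sketched as below. This is an illustration, not the script we run: missing_subscribers and FakeClient are hypothetical names, and the function assumes a redis-py style client whose pubsub_numsub(channel) returns a list of (channel, count) pairs.

```python
def missing_subscribers(client, channel, expected_workers):
    """Return how many expected subscribers are absent from `channel`.

    `client` is any object with a redis-py style pubsub_numsub(channel)
    method returning a list of (channel, count) pairs.
    """
    counts = dict(client.pubsub_numsub(channel))
    # Channel names come back as bytes unless decode_responses=True is set.
    subscribed = counts.get(channel, counts.get(channel.encode(), 0))
    return max(expected_workers - int(subscribed), 0)


class FakeClient:
    """Test double returning the count observed during the incident."""

    def pubsub_numsub(self, *channels):
        return [(b"socketio", 7)]


missing = missing_subscribers(FakeClient(), "socketio", expected_workers=8)
print(missing)  # non-zero means at least one worker is no longer listening
```

A real client object can be passed in place of FakeClient, and the cron job can alert whenever the result is non-zero. The check only counts subscribers, so any other process subscribed to the same channel would mask a deaf worker.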
