[grid] Add support Redis-backed for SessionQueue#17678
Conversation
Signed-off-by: Viet Nguyen Duc <nguyenducviet4496@gmail.com>
Code Review by Qodo
Context used 1. addToQueue not exception-safe
|
PR Summary by QodoAdd Redis-backed NewSessionQueue and fix SessionQueue readiness reporting WalkthroughsDescription• Add Redis-backed NewSessionQueue for stateless, horizontally scalable queue replicas. • Add sessionqueue backend-url config/flag to point external queue implementations at Redis. • Fix /status and /readyz to reflect real readiness during backend outages and shutdown draining. Diagramgraph TD
LB{{"Load balancer / K8s"}} -->|"/readyz, /status"| SQ["SessionQueue server"] -->|"delegates"| IMPL["NewSessionQueue impl"]
DIST["Distributor"] -->|"poll/complete"| SQ
IMPL -->|"Redis ops"| RC["GridRedisClient"] --> REDIS[("Redis")]
subgraph Legend
direction LR
_ext{{"External"}} ~~~ _svc["Service"] ~~~ _db[("Database")]
end
High-Level AssessmentThe following are alternative approaches to this PR: 1. Use Redis Streams (consumer groups) for queue ordering/claiming
2. Use blocking list pop (BLPOP/BRPOPLPUSH) with a claim queue
3. Publish completion via Redis Pub/Sub instead of polling result keys
Recommendation: The PR’s approach (Redis list for ordering + per-request keys with TTL + winner-takes-all completion marker) is a pragmatic fit for the existing SessionQueue API and matches the established Redis-backed Grid pattern. If Redis load becomes an issue at scale, consider moving claim semantics to Streams or a blocking-pop claim queue, but the current design is reasonable to land now given its test coverage and minimal additional dependencies. File ChangesEnhancement (2)
Bug fix (2)
Tests (1)
Other (5)
|
| try (Span ignored = context.createSpan("sessionqueue.add_to_queue")) { | ||
| contexts.put(reqId, context); | ||
|
|
||
| Instant endTime = request.getEnqueued().plus(requestTimeout); | ||
| CountDownLatch latch = new CountDownLatch(1); | ||
| // Register the latch before the request becomes claimable so a completion can never be | ||
| // missed. | ||
| waiters.put(reqId, latch); | ||
| redis.setWithTtl(endTimeKey(reqId), String.valueOf(endTime.toEpochMilli()), keyTtlMillis); | ||
| redis.setWithTtl(requestKey(reqId), JSON.toJson(request), keyTtlMillis); | ||
| redis.rpush(QUEUE_KEY, reqId.toString()); | ||
|
|
||
| if (isTimedOut(Instant.now(), endTime)) { | ||
| failDueToTimeout(reqId); | ||
| } | ||
|
|
||
| Either<SessionNotCreatedException, CreateSessionResponse> result = | ||
| awaitResult(reqId, latch, endTime); | ||
|
|
||
| waiters.remove(reqId); | ||
| contexts.remove(reqId); | ||
| redis.lrem(QUEUE_KEY, reqId.toString()); | ||
| clearAll(reqId); | ||
|
|
There was a problem hiding this comment.
1. addtoqueue not exception-safe 📘 Rule violation ☼ Reliability
RedisBackedNewSessionQueue.addToQueue registers per-request state in in-memory registries (waiters, contexts) and performs multiple Redis side effects, but the cleanup of those registries/keys only happens on the normal path after awaitResult returns. If any Redis write or other runtime error occurs before that point, the method can leak per-request registrations and leave partial/stale Redis queue state behind, especially during transient Redis failures/outages.
Agent Prompt
## Issue description
`RedisBackedNewSessionQueue.addToQueue` performs multi-step side effects (registering request state in `contexts`/`waiters`, writing Redis keys, pushing a queue entry) but only cleans up that state on the happy path after `awaitResult` returns. If a Redis operation (or any runtime exception) occurs before the normal cleanup block, the method can exit without removing the in-memory tracking entries and without best-effort cleanup of partially-created Redis keys/list entries, leaking memory and leaving stale queue state.
## Issue Context
This is concurrent/resource-heavy, Redis-backed queue code expected to run in failure-prone environments (timeouts, connection resets, closed connections), so cleanup must be bounded and exception-safe across all terminal paths, including partial failures while enqueuing. The current implementation lacks a `try/finally` to guarantee removal from `waiters`/`contexts`, and does not ensure best-effort Redis cleanup when failures happen early.
## Fix Focus Areas
- java/src/org/openqa/selenium/grid/sessionqueue/redis/RedisBackedNewSessionQueue.java[227-256]
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
Signed-off-by: Viet Nguyen Duc <nguyenducviet4496@gmail.com>
|
Code review by qodo was updated up to the latest commit 3f683ed |
🔗 Related Issues
💥 What does this PR do?
Adds a stateless, horizontally-scalable NewSessionQueue implementation backed by Redis, following the same pattern as the existing RedisBackedDistributor. Several SessionQueue replicas can now share one logical queue in Redis, so the queue tier can scale out behind a load balancer / Kubernetes Service instead of being a single in-memory instance.
Also fixes the SessionQueue server's health endpoints, which previously always reported "ready" — meaning a load balancer could not route traffic away from an unhealthy or draining instance.
🔧 Implementation Notes
🤖 AI assistance
💡 Additional Considerations
🔄 Types of changes