Skip to content

[grid] Add support Redis-backed for SessionQueue#17678

Open
VietND96 wants to merge 2 commits into
trunkfrom
sessionqueue-redis-backed
Open

[grid] Add support Redis-backed for SessionQueue#17678
VietND96 wants to merge 2 commits into
trunkfrom
sessionqueue-redis-backed

Conversation

@VietND96

Copy link
Copy Markdown
Member

🔗 Related Issues

💥 What does this PR do?

Adds a stateless, horizontally-scalable NewSessionQueue implementation backed by Redis, following the same pattern as the existing RedisBackedDistributor. Several SessionQueue replicas can now share one logical queue in Redis, so the queue tier can scale out behind a load balancer / Kubernetes Service instead of being a single in-memory instance.

Also fixes the SessionQueue server's health endpoints, which previously always reported "ready" — meaning a load balancer could not route traffic away from an unhealthy or draining instance.

🔧 Implementation Notes

🤖 AI assistance

  • No substantial AI assistance used
  • AI assisted (complete below)
    • Tool(s):
    • What was generated:
    • I reviewed all AI output and can explain the change

💡 Additional Considerations

🔄 Types of changes

  • Cleanup (formatting, renaming)
  • Bug fix (backwards compatible)
  • New feature (non-breaking change which adds functionality and tests!)
  • Breaking change (fix or feature that would cause existing functionality to change)

Signed-off-by: Viet Nguyen Duc <nguyenducviet4496@gmail.com>
@qodo-code-review

qodo-code-review Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Code Review by Qodo

🐞 Bugs (3) 📘 Rule violations (1) 📎 Requirement gaps (0)

Context used

Grey Divider


Action required

1. addToQueue not exception-safe 📘 Rule violation ☼ Reliability
Description
RedisBackedNewSessionQueue.addToQueue registers per-request state in in-memory registries
(waiters, contexts) and performs multiple Redis side effects, but the cleanup of those
registries/keys only happens on the normal path after awaitResult returns. If any Redis write or
other runtime error occurs before that point, the method can leak per-request registrations and
leave partial/stale Redis queue state behind, especially during transient Redis failures/outages.
Code

java/src/org/openqa/selenium/grid/sessionqueue/redis/RedisBackedNewSessionQueue.java[R233-256]

+    try (Span ignored = context.createSpan("sessionqueue.add_to_queue")) {
+      contexts.put(reqId, context);
+
+      Instant endTime = request.getEnqueued().plus(requestTimeout);
+      CountDownLatch latch = new CountDownLatch(1);
+      // Register the latch before the request becomes claimable so a completion can never be
+      // missed.
+      waiters.put(reqId, latch);
+      redis.setWithTtl(endTimeKey(reqId), String.valueOf(endTime.toEpochMilli()), keyTtlMillis);
+      redis.setWithTtl(requestKey(reqId), JSON.toJson(request), keyTtlMillis);
+      redis.rpush(QUEUE_KEY, reqId.toString());
+
+      if (isTimedOut(Instant.now(), endTime)) {
+        failDueToTimeout(reqId);
+      }
+
+      Either<SessionNotCreatedException, CreateSessionResponse> result =
+          awaitResult(reqId, latch, endTime);
+
+      waiters.remove(reqId);
+      contexts.remove(reqId);
+      redis.lrem(QUEUE_KEY, reqId.toString());
+      clearAll(reqId);
+
Evidence
Rule 16 calls for exception-safe cleanup in concurrent/resource-heavy code, and the cited
addToQueue flow shows per-request state being created (contexts.put, waiters.put) and Redis
keys being written before the later cleanup section that removes those entries (waiters.remove,
contexts.remove) and clears Redis state (clearAll). Because that cleanup is not protected by a
finally (it only runs after awaitResult completes), any exception thrown by Redis operations or
other runtime failures before the cleanup point can cause the method to exit early, leaving stale
in-memory registrations and partially-created Redis state.

java/src/org/openqa/selenium/grid/sessionqueue/redis/RedisBackedNewSessionQueue.java[227-256]
Best Practice: Learned patterns

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`RedisBackedNewSessionQueue.addToQueue` performs multi-step side effects (registering request state in `contexts`/`waiters`, writing Redis keys, pushing a queue entry) but only cleans up that state on the happy path after `awaitResult` returns. If a Redis operation (or any runtime exception) occurs before the normal cleanup block, the method can exit without removing the in-memory tracking entries and without best-effort cleanup of partially-created Redis keys/list entries, leaking memory and leaving stale queue state.

## Issue Context
This is concurrent/resource-heavy, Redis-backed queue code expected to run in failure-prone environments (timeouts, connection resets, closed connections), so cleanup must be bounded and exception-safe across all terminal paths, including partial failures while enqueuing. The current implementation lacks a `try/finally` to guarantee removal from `waiters`/`contexts`, and does not ensure best-effort Redis cleanup when failures happen early.

## Fix Focus Areas
- java/src/org/openqa/selenium/grid/sessionqueue/redis/RedisBackedNewSessionQueue.java[227-256]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

2. Latch loop can busy-spin 🐞 Bug ➹ Performance ⭐ New
Description
In RedisBackedNewSessionQueue.awaitResult, if the same-replica CountDownLatch is already counted
down but readResult(reqId) continues to return null (e.g., Redis keys were flushed/expired/cleared
unexpectedly), the loop loses all backoff because latch.await(...) returns immediately and the code
keeps re-reading Redis in a tight loop until endTime.
Code

java/src/org/openqa/selenium/grid/sessionqueue/redis/RedisBackedNewSessionQueue.java[R297-305]

+        // A countdown means a same-replica complete() fired, so read the freshly written result
+        // immediately. Otherwise we loop and poll Redis for a result a different replica may have
+        // written.
+        if (latch.await(Math.min(remaining, POLL_INTERVAL_MS), MILLISECONDS)) {
+          Either<SessionNotCreatedException, CreateSessionResponse> signalled = readResult(reqId);
+          if (signalled != null) {
+            return signalled;
+          }
+        }
Evidence
The only throttling inside the while (true) loop is latch.await(...); when it returns true but
readResult(reqId) is null, the loop continues without any other wait. Because the latch is
permanently counted down by complete(), subsequent iterations can repeatedly execute readResult
with no delay, producing a tight loop until the timeout check fires.

java/src/org/openqa/selenium/grid/sessionqueue/redis/RedisBackedNewSessionQueue.java[276-306]
java/src/org/openqa/selenium/grid/sessionqueue/redis/RedisBackedNewSessionQueue.java[465-472]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`awaitResult` uses a `CountDownLatch` as a fast-path and falls back to polling Redis. After the latch has fired, `await(...)` returns immediately forever; if `readResult` keeps returning `null` (exceptional Redis inconsistency), the loop can become a CPU-burning busy spin until timeout.

## Issue Context
This is triggered only when the latch is signalled but the Redis result is not readable (e.g., result keys missing due to flush/eviction/TTL bug or partial writes). Even though this is not the normal path (since `complete()` writes the result before counting down), it’s still worth guarding because the current loop becomes unthrottled.

## Fix Focus Areas
- java/src/org/openqa/selenium/grid/sessionqueue/redis/RedisBackedNewSessionQueue.java[276-306]

## Suggested change
Introduce an explicit sleep/backoff path when the latch is already at 0 but no result is visible, e.g.:
- compute `waitMs = Math.min(remaining, POLL_INTERVAL_MS)`
- if `latch.getCount() > 0`, `latch.await(waitMs, MILLISECONDS)`
- else `Thread.sleep(waitMs)` (or equivalent timed wait)
This preserves the fast-path while preventing a tight loop under inconsistent/missing-result scenarios.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


3. O(n) Redis queue scans 🐞 Bug ➹ Performance
Description
RedisBackedNewSessionQueue.getNextAvailable retrieves the entire Redis list via lrange(QUEUE_KEY, 0,
-1) on each poll and then iterates until batchSize is met. This scales poorly with large queues and
can create unnecessary Redis/network load for every distributor poll cycle.
Code

java/src/org/openqa/selenium/grid/sessionqueue/redis/RedisBackedNewSessionQueue.java[R406-417]

+    List<SessionRequest> availableRequests = new ArrayList<>();
+    for (String idStr : redis.lrange(QUEUE_KEY, 0, -1)) {
+      if (availableRequests.size() >= batchSize) {
+        break;
+      }
+
+      RequestId reqId = new RequestId(UUID.fromString(idStr));
+      String raw = redis.get(requestKey(reqId));
+      if (raw == null) {
+        // Request payload expired or was completed elsewhere — drop the stale id.
+        redis.lrem(QUEUE_KEY, idStr);
+        continue;
Evidence
Both getNextAvailable and the periodic timeout scanner iterate over `redis.lrange(QUEUE_KEY, 0,
-1)`, which necessarily loads the entire queue contents each time they run.

java/src/org/openqa/selenium/grid/sessionqueue/redis/RedisBackedNewSessionQueue.java[406-417]
java/src/org/openqa/selenium/grid/sessionqueue/redis/RedisBackedNewSessionQueue.java[546-557]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`getNextAvailable` and `timeoutSessions` use `lrange(key, 0, -1)` which loads the full queue into memory and transfers it over the network every time. With large queues or frequent polling, this becomes a bottleneck.

### Issue Context
In the local in-memory queue, `limit(batchSize)` avoids scanning the full queue. In Redis, `lrange(0, -1)` always returns the full list, even if only a small prefix is needed.

### Fix Focus Areas
- java/src/org/openqa/selenium/grid/sessionqueue/redis/RedisBackedNewSessionQueue.java[406-435]
- java/src/org/openqa/selenium/grid/sessionqueue/redis/RedisBackedNewSessionQueue.java[546-557]

Suggested implementation direction:
- In `getNextAvailable`, only fetch a bounded range (e.g., `lrange(0, K)` with K > batchSize to account for non-matching entries), and loop with pagination if needed.
- For `timeoutSessions`, avoid scanning the queue list; consider tracking deadlines in a sorted set keyed by endTime so you can query only expired request IDs.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


4. Null OK payload NPE 🐞 Bug ≡ Correctness
Description
RedisBackedNewSessionQueue.readResult calls JSON.toType(payload, CreateSessionResponse.class) when
the stored status is "OK" even if Redis returns a null payload, which will throw a
NullPointerException. That exception is caught by awaitResult and converted into a generic
session-not-created failure, masking the real Redis inconsistency and potentially causing incorrect
session teardown decisions upstream.
Code

java/src/org/openqa/selenium/grid/sessionqueue/redis/RedisBackedNewSessionQueue.java[R314-325]

+  private Either<SessionNotCreatedException, CreateSessionResponse> readResult(RequestId reqId) {
+    String status = redis.get(resultStatusKey(reqId));
+    if (status == null) {
+      return null;
+    }
+    String payload = redis.get(resultPayloadKey(reqId));
+    if ("OK".equals(status)) {
+      return Either.right(JSON.toType(payload, CreateSessionResponse.class));
+    }
+    return Either.left(
+        new SessionNotCreatedException(payload == null ? "Session not created" : payload));
+  }
Evidence
readResult fetches payload via redis.get(...) and passes it directly to JSON.toType(...)
when status is OK, but Json.toType will throw if the input string is null (it constructs a
StringReader).

java/src/org/openqa/selenium/grid/sessionqueue/redis/RedisBackedNewSessionQueue.java[314-325]
java/src/org/openqa/selenium/json/Json.java[153-170]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`readResult` assumes that if the status key is present and equals `OK`, then the payload key is non-null. If the payload is missing (null), `JSON.toType(null, ...)` will NPE.

### Issue Context
`Json.toType(String, Type)` constructs a `StringReader` from the input string, which will throw if the input is null. In `awaitResult`, runtime exceptions are caught and turned into a generic `SessionNotCreatedException`, which obscures the true cause.

### Fix Focus Areas
- java/src/org/openqa/selenium/grid/sessionqueue/redis/RedisBackedNewSessionQueue.java[314-325]
- java/src/org/openqa/selenium/json/Json.java[153-170]

Suggested implementation direction:
- If `status == "OK"` and `payload == null` (or empty), treat it as "not yet visible" and return `null` from `readResult` so the polling loop continues.
- Alternatively, return a left/ERR with a specific message like "Incomplete result in Redis" (but avoid NPE).
- Consider making the result publication atomic (e.g., `mset` + TTL or Lua) if you want to eliminate split-key visibility entirely.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Previous review results

Review updated until commit 3f683ed

Results up to commit 06dce29


🐞 Bugs (2) 📘 Rule violations (1) 📎 Requirement gaps (0) 🎨 UX issues (0) 🔗 Cross-repo conflicts (0)


Action required
1. addToQueue not exception-safe 📘 Rule violation ☼ Reliability
Description
RedisBackedNewSessionQueue.addToQueue registers per-request state in in-memory registries
(waiters, contexts) and performs multiple Redis side effects, but the cleanup of those
registries/keys only happens on the normal path after awaitResult returns. If any Redis write or
other runtime error occurs before that point, the method can leak per-request registrations and
leave partial/stale Redis queue state behind, especially during transient Redis failures/outages.
Code

java/src/org/openqa/selenium/grid/sessionqueue/redis/RedisBackedNewSessionQueue.java[R233-256]

+    try (Span ignored = context.createSpan("sessionqueue.add_to_queue")) {
+      contexts.put(reqId, context);
+
+      Instant endTime = request.getEnqueued().plus(requestTimeout);
+      CountDownLatch latch = new CountDownLatch(1);
+      // Register the latch before the request becomes claimable so a completion can never be
+      // missed.
+      waiters.put(reqId, latch);
+      redis.setWithTtl(endTimeKey(reqId), String.valueOf(endTime.toEpochMilli()), keyTtlMillis);
+      redis.setWithTtl(requestKey(reqId), JSON.toJson(request), keyTtlMillis);
+      redis.rpush(QUEUE_KEY, reqId.toString());
+
+      if (isTimedOut(Instant.now(), endTime)) {
+        failDueToTimeout(reqId);
+      }
+
+      Either<SessionNotCreatedException, CreateSessionResponse> result =
+          awaitResult(reqId, latch, endTime);
+
+      waiters.remove(reqId);
+      contexts.remove(reqId);
+      redis.lrem(QUEUE_KEY, reqId.toString());
+      clearAll(reqId);
+
Evidence
Rule 16 calls for exception-safe cleanup in concurrent/resource-heavy code, and the cited
addToQueue flow shows per-request state being created (contexts.put, waiters.put) and Redis
keys being written before the later cleanup section that removes those entries (waiters.remove,
contexts.remove) and clears Redis state (clearAll). Because that cleanup is not protected by a
finally (it only runs after awaitResult completes), any exception thrown by Redis operations or
other runtime failures before the cleanup point can cause the method to exit early, leaving stale
in-memory registrations and partially-created Redis state.

java/src/org/openqa/selenium/grid/sessionqueue/redis/RedisBackedNewSessionQueue.java[227-256]
Best Practice: Learned patterns

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`RedisBackedNewSessionQueue.addToQueue` performs multi-step side effects (registering request state in `contexts`/`waiters`, writing Redis keys, pushing a queue entry) but only cleans up that state on the happy path after `awaitResult` returns. If a Redis operation (or any runtime exception) occurs before the normal cleanup block, the method can exit without removing the in-memory tracking entries and without best-effort cleanup of partially-created Redis keys/list entries, leaking memory and leaving stale queue state.

## Issue Context
This is concurrent/resource-heavy, Redis-backed queue code expected to run in failure-prone environments (timeouts, connection resets, closed connections), so cleanup must be bounded and exception-safe across all terminal paths, including partial failures while enqueuing. The current implementation lacks a `try/finally` to guarantee removal from `waiters`/`contexts`, and does not ensure best-effort Redis cleanup when failures happen early.

## Fix Focus Areas
- java/src/org/openqa/selenium/grid/sessionqueue/redis/RedisBackedNewSessionQueue.java[227-256]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended
2. Null OK payload NPE 🐞 Bug ≡ Correctness
Description
RedisBackedNewSessionQueue.readResult calls JSON.toType(payload, CreateSessionResponse.class) when
the stored status is "OK" even if Redis returns a null payload, which will throw a
NullPointerException. That exception is caught by awaitResult and converted into a generic
session-not-created failure, masking the real Redis inconsistency and potentially causing incorrect
session teardown decisions upstream.
Code

java/src/org/openqa/selenium/grid/sessionqueue/redis/RedisBackedNewSessionQueue.java[R314-325]

+  private Either<SessionNotCreatedException, CreateSessionResponse> readResult(RequestId reqId) {
+    String status = redis.get(resultStatusKey(reqId));
+    if (status == null) {
+      return null;
+    }
+    String payload = redis.get(resultPayloadKey(reqId));
+    if ("OK".equals(status)) {
+      return Either.right(JSON.toType(payload, CreateSessionResponse.class));
+    }
+    return Either.left(
+        new SessionNotCreatedException(payload == null ? "Session not created" : payload));
+  }
Evidence
readResult fetches payload via redis.get(...) and passes it directly to JSON.toType(...)
when status is OK, but Json.toType will throw if the input string is null (it constructs a
StringReader).

java/src/org/openqa/selenium/grid/sessionqueue/redis/RedisBackedNewSessionQueue.java[314-325]
java/src/org/openqa/selenium/json/Json.java[153-170]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`readResult` assumes that if the status key is present and equals `OK`, then the payload key is non-null. If the payload is missing (null), `JSON.toType(null, ...)` will NPE.

### Issue Context
`Json.toType(String, Type)` constructs a `StringReader` from the input string, which will throw if the input is null. In `awaitResult`, runtime exceptions are caught and turned into a generic `SessionNotCreatedException`, which obscures the true cause.

### Fix Focus Areas
- java/src/org/openqa/selenium/grid/sessionqueue/redis/RedisBackedNewSessionQueue.java[314-325]
- java/src/org/openqa/selenium/json/Json.java[153-170]

Suggested implementation direction:
- If `status == "OK"` and `payload == null` (or empty), treat it as "not yet visible" and return `null` from `readResult` so the polling loop continues.
- Alternatively, return a left/ERR with a specific message like "Incomplete result in Redis" (but avoid NPE).
- Consider making the result publication atomic (e.g., `mset` + TTL or Lua) if you want to eliminate split-key visibility entirely.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


3. O(n) Redis queue scans 🐞 Bug ➹ Performance
Description
RedisBackedNewSessionQueue.getNextAvailable retrieves the entire Redis list via lrange(QUEUE_KEY, 0,
-1) on each poll and then iterates until batchSize is met. This scales poorly with large queues and
can create unnecessary Redis/network load for every distributor poll cycle.
Code

java/src/org/openqa/selenium/grid/sessionqueue/redis/RedisBackedNewSessionQueue.java[R406-417]

+    List<SessionRequest> availableRequests = new ArrayList<>();
+    for (String idStr : redis.lrange(QUEUE_KEY, 0, -1)) {
+      if (availableRequests.size() >= batchSize) {
+        break;
+      }
+
+      RequestId reqId = new RequestId(UUID.fromString(idStr));
+      String raw = redis.get(requestKey(reqId));
+      if (raw == null) {
+        // Request payload expired or was completed elsewhere — drop the stale id.
+        redis.lrem(QUEUE_KEY, idStr);
+        continue;
Evidence
Both getNextAvailable and the periodic timeout scanner iterate over `redis.lrange(QUEUE_KEY, 0,
-1)`, which necessarily loads the entire queue contents each time they run.

java/src/org/openqa/selenium/grid/sessionqueue/redis/RedisBackedNewSessionQueue.java[406-417]
java/src/org/openqa/selenium/grid/sessionqueue/redis/RedisBackedNewSessionQueue.java[546-557]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`getNextAvailable` and `timeoutSessions` use `lrange(key, 0, -1)` which loads the full queue into memory and transfers it over the network every time. With large queues or frequent polling, this becomes a bottleneck.

### Issue Context
In the local in-memory queue, `limit(batchSize)` avoids scanning the full queue. In Redis, `lrange(0, -1)` always returns the full list, even if only a small prefix is needed.

### Fix Focus Areas
- java/src/org/openqa/selenium/grid/sessionqueue/redis/RedisBackedNewSessionQueue.java[406-435]
- java/src/org/openqa/selenium/grid/sessionqueue/redis/RedisBackedNewSessionQueue.java[546-557]

Suggested implementation direction:
- In `getNextAvailable`, only fetch a bounded range (e.g., `lrange(0, K)` with K > batchSize to account for non-matching entries), and loop with pagination if needed.
- For `timeoutSessions`, avoid scanning the queue list; consider tracking deadlines in a sorted set keyed by endTime so you can query only expired request IDs.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Qodo Logo

@selenium-ci selenium-ci added B-grid Everything grid and server related C-java Java Bindings B-build Includes scripting, bazel and CI integrations labels Jun 12, 2026
@qodo-code-review

Copy link
Copy Markdown
Contributor

PR Summary by Qodo

Add Redis-backed NewSessionQueue and fix SessionQueue readiness reporting
✨ Enhancement 🐞 Bug fix 🧪 Tests ⚙️ Configuration changes 🕐 40+ Minutes

Grey Divider

Walkthroughs

Description
• Add Redis-backed NewSessionQueue for stateless, horizontally scalable queue replicas.
• Add sessionqueue backend-url config/flag to point external queue implementations at Redis.
• Fix /status and /readyz to reflect real readiness during backend outages and shutdown draining.
Diagram
graph TD
  LB{{"Load balancer / K8s"}} -->|"/readyz, /status"| SQ["SessionQueue server"] -->|"delegates"| IMPL["NewSessionQueue impl"]
  DIST["Distributor"] -->|"poll/complete"| SQ
  IMPL -->|"Redis ops"| RC["GridRedisClient"] --> REDIS[("Redis")]
  subgraph Legend
    direction LR
    _ext{{"External"}} ~~~ _svc["Service"] ~~~ _db[("Database")]
  end
Loading
High-Level Assessment

The following are alternative approaches to this PR:

1. Use Redis Streams (consumer groups) for queue ordering/claiming
  • ➕ Built-in claim/ack semantics and clearer ownership model
  • ➕ Better support for multiple consumers without list scans
  • ➕ Easier to implement visibility timeouts and re-delivery semantics
  • ➖ More complex implementation/migration than list-based queue
  • ➖ May require additional operational tuning (stream trimming, consumer group management)
2. Use blocking list pop (BLPOP/BRPOPLPUSH) with a claim queue
  • ➕ Avoids polling and reduces Redis load under low traffic
  • ➕ Natural work-queue pattern with an explicit in-progress list
  • ➖ Harder to implement stereotype matching without pre-sharding queues
  • ➖ Requires requeue/repair logic for crashed consumers and timed-out requests
3. Publish completion via Redis Pub/Sub instead of polling result keys
  • ➕ Eliminates result polling latency and load
  • ➕ Keeps the current storage model while improving responsiveness
  • ➖ Pub/Sub messages are ephemeral; requires fallback polling anyway
  • ➖ More moving parts (subscriptions, reconnect behavior)

Recommendation: The PR’s approach (Redis list for ordering + per-request keys with TTL + winner-takes-all completion marker) is a pragmatic fit for the existing SessionQueue API and matches the established Redis-backed Grid pattern. If Redis load becomes an issue at scale, consider moving claim semantics to Streams or a blocking-pop claim queue, but the current design is reasonable to land now given its test coverage and minimal additional dependencies.

Grey Divider

File Changes

Enhancement (2)
RedisBackedNewSessionQueue.java Introduce stateless Redis-backed NewSessionQueue implementation +597/-0

Introduce stateless Redis-backed NewSessionQueue implementation

• Implements a horizontally scalable NewSessionQueue storing request ordering, payloads, deadlines, cancellation, and results in Redis with TTL-based cleanup. Supports cross-replica completion via result keys plus same-replica latch fast-path, and uses a SET NX completion marker to resolve timeout-vs-success races deterministically.

java/src/org/openqa/selenium/grid/sessionqueue/redis/RedisBackedNewSessionQueue.java


GridRedisClient.java Add Redis list primitives needed by SessionQueue +29/-0

Add Redis list primitives needed by SessionQueue

• Adds rpush/lpush/lrange/lrem/llen helpers to support list-based queue operations and atomic claim-by-removal semantics.

java/src/org/openqa/selenium/redis/GridRedisClient.java


Bug fix (2)
NewSessionQueueServer.java Fix /status and /readyz to reflect real queue readiness +40/-14

Fix /status and /readyz to reflect real queue readiness

• Wires /status and /readyz responses to NewSessionQueue.isReady(), returning 200/503 for /readyz and accurate status messaging for /status so load balancers can drain unhealthy instances.

java/src/org/openqa/selenium/grid/sessionqueue/httpd/NewSessionQueueServer.java


LocalNewSessionQueue.java Report not-ready during shutdown draining +7/-1

Report not-ready during shutdown draining

• Adds a shutdown flag so LocalNewSessionQueue.isReady() flips to false during close(), preventing traffic from being routed to a draining instance.

java/src/org/openqa/selenium/grid/sessionqueue/local/LocalNewSessionQueue.java


Tests (1)
RedisBackedNewSessionQueueTest.java Add integration tests for Redis-backed queue behavior and races +388/-0

Add integration tests for Redis-backed queue behavior and races

• Uses a Redis Testcontainer to validate readiness behavior, cross-replica completion, matching/claim semantics, timeouts, clearQueue behavior, and backend-url configuration validation.

java/test/org/openqa/selenium/grid/sessionqueue/redis/RedisBackedNewSessionQueueTest.java


Other (5)
ci-grid.yml Run Redis-backed SessionQueue tests in CI +1/-0

Run Redis-backed SessionQueue tests in CI

• Adds the sessionqueue/redis test target to the grid CI workflow so the new Redis-backed queue is exercised in automation.

.github/workflows/ci-grid.yml


NewSessionQueueFlags.java Add backend-url flag/config for SessionQueue backends +12/-0

Add backend-url flag/config for SessionQueue backends

• Introduces --sessionqueue-backend-url and maps it to [sessionqueue] backend-url for external datastore-backed queue implementations (e.g., Redis).

java/src/org/openqa/selenium/grid/sessionqueue/config/NewSessionQueueFlags.java


NewSessionQueueOptions.java Parse SessionQueue backend-url as a validated URI +13/-0

Parse SessionQueue backend-url as a validated URI

• Adds getBackendUri() to retrieve and validate [sessionqueue] backend-url, throwing a ConfigException for invalid URIs.

java/src/org/openqa/selenium/grid/sessionqueue/config/NewSessionQueueOptions.java


BUILD.bazel Add Bazel target for Redis-backed SessionQueue implementation +27/-0

Add Bazel target for Redis-backed SessionQueue implementation

• Defines a new java_library for the Redis-backed SessionQueue package with dependencies on GridRedisClient, sessionqueue APIs, and tracing/logging utilities.

java/src/org/openqa/selenium/grid/sessionqueue/redis/BUILD.bazel


BUILD.bazel Add Bazel test suite for Redis-backed SessionQueue +33/-0

Add Bazel test suite for Redis-backed SessionQueue

• Creates a medium-sized JUnit5 test suite target for the Redis SessionQueue tests, including Testcontainers dependencies.

java/test/org/openqa/selenium/grid/sessionqueue/redis/BUILD.bazel


Grey Divider

Qodo Logo

Comment on lines +233 to +256
try (Span ignored = context.createSpan("sessionqueue.add_to_queue")) {
contexts.put(reqId, context);

Instant endTime = request.getEnqueued().plus(requestTimeout);
CountDownLatch latch = new CountDownLatch(1);
// Register the latch before the request becomes claimable so a completion can never be
// missed.
waiters.put(reqId, latch);
redis.setWithTtl(endTimeKey(reqId), String.valueOf(endTime.toEpochMilli()), keyTtlMillis);
redis.setWithTtl(requestKey(reqId), JSON.toJson(request), keyTtlMillis);
redis.rpush(QUEUE_KEY, reqId.toString());

if (isTimedOut(Instant.now(), endTime)) {
failDueToTimeout(reqId);
}

Either<SessionNotCreatedException, CreateSessionResponse> result =
awaitResult(reqId, latch, endTime);

waiters.remove(reqId);
contexts.remove(reqId);
redis.lrem(QUEUE_KEY, reqId.toString());
clearAll(reqId);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

1. addtoqueue not exception-safe 📘 Rule violation ☼ Reliability

RedisBackedNewSessionQueue.addToQueue registers per-request state in in-memory registries
(waiters, contexts) and performs multiple Redis side effects, but the cleanup of those
registries/keys only happens on the normal path after awaitResult returns. If any Redis write or
other runtime error occurs before that point, the method can leak per-request registrations and
leave partial/stale Redis queue state behind, especially during transient Redis failures/outages.
Agent Prompt
## Issue description
`RedisBackedNewSessionQueue.addToQueue` performs multi-step side effects (registering request state in `contexts`/`waiters`, writing Redis keys, pushing a queue entry) but only cleans up that state on the happy path after `awaitResult` returns. If a Redis operation (or any runtime exception) occurs before the normal cleanup block, the method can exit without removing the in-memory tracking entries and without best-effort cleanup of partially-created Redis keys/list entries, leaking memory and leaving stale queue state.

## Issue Context
This is concurrent/resource-heavy, Redis-backed queue code expected to run in failure-prone environments (timeouts, connection resets, closed connections), so cleanup must be bounded and exception-safe across all terminal paths, including partial failures while enqueuing. The current implementation lacks a `try/finally` to guarantee removal from `waiters`/`contexts`, and does not ensure best-effort Redis cleanup when failures happen early.

## Fix Focus Areas
- java/src/org/openqa/selenium/grid/sessionqueue/redis/RedisBackedNewSessionQueue.java[227-256]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Signed-off-by: Viet Nguyen Duc <nguyenducviet4496@gmail.com>
@qodo-code-review

qodo-code-review Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Code review by qodo was updated up to the latest commit 3f683ed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

B-build Includes scripting, bazel and CI integrations B-grid Everything grid and server related C-java Java Bindings

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants