Background
I am using a Sitemap crawler with the Redis storage client to manage the request queue.
Bug
During a long-running job, the pending_request_count went negative, preventing the crawler from terminating.
As you can see from my Redis state:
LLEN request_queues:<name>:queue = 0
HLEN request_queues:<name>:in_progress = 0
SCARD request_queues:<name>:pending_set = 0
JSON.GET ...:metadata $.pending_request_count = -6
JSON.GET ...:metadata $.total_request_count = 89582
JSON.GET ...:metadata $.handled_request_count = 89588 # > total
With pending_request_count = -6, RedisRequestQueueClient.is_empty() always returns False, because it checks:
return metadata.pending_request_count == 0
This results in crawlee endlessly querying Redis every 1-3 ms (on each connection).
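Independently of the root cause, a more defensive emptiness check would at least let the crawler terminate if the counter ever drifts below zero. A minimal sketch of such a variant (not the actual client code, just the idea applied to the check quoted above):

```python
def is_empty(pending_request_count: int) -> bool:
    # Hypothetical defensive variant of the emptiness check: treat any
    # non-positive counter as "empty" so a drifted-negative value cannot
    # keep the crawler polling forever. The real fix is still to prevent
    # the counter from going negative in the first place.
    return pending_request_count <= 0
```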
Possible cause
mark_request_as_handled guards on hexists(in_progress, unique_key) in a separate round-trip from the pipeline that applies delta_handled_request_count=+1 and delta_pending_request_count=-1. If two coroutines race on the same unique_key (or a retry/error path calls it twice), both hexists checks can pass before either pipeline executes, and the counter deltas get applied twice.