#784 - Support Retry-After in FetcherBolt #1944
Honour the Retry-After HTTP response header by delaying the next fetch from the affected internal queue. The header value is parsed as either a number of seconds or an HTTP date. A new fetcher.max.retry.after config caps the honoured delay (-1, the default, honours it as-is).
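As a rough illustration of the parsing described above, here is a minimal sketch, not the code from this PR. The class and method names are hypothetical, and the `maxSeconds` parameter stands in for `fetcher.max.retry.after`. The sketch assumes that "caps" means clamping an over-long delay to the maximum; the PR may handle that case differently.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Hypothetical helper, for illustration only.
class RetryAfterSketch {

    /**
     * Parses a Retry-After value (delta-seconds or HTTP date) into a delay
     * relative to 'now'. A negative maxSeconds means "no cap".
     * Returns empty if the value cannot be interpreted.
     */
    static Optional<Duration> parse(String value, Instant now, long maxSeconds) {
        if (value == null || value.trim().isEmpty()) {
            return Optional.empty();
        }
        String v = value.trim();
        Duration delay;
        try {
            // First form: a non-negative number of seconds
            long secs = Long.parseLong(v);
            if (secs < 0) {
                return Optional.empty();
            }
            delay = Duration.ofSeconds(secs);
        } catch (NumberFormatException notANumber) {
            try {
                // Second form: an HTTP date such as "Wed, 21 Oct 2015 07:28:00 GMT"
                Instant when =
                        ZonedDateTime.parse(v, DateTimeFormatter.RFC_1123_DATE_TIME)
                                .toInstant();
                delay = Duration.between(now, when);
                if (delay.isNegative()) {
                    delay = Duration.ZERO; // date already in the past
                }
            } catch (DateTimeParseException notADate) {
                return Optional.empty();
            }
        }
        // Assumption: an over-long delay is clamped to the configured maximum
        if (maxSeconds >= 0 && delay.getSeconds() > maxSeconds) {
            delay = Duration.ofSeconds(maxSeconds);
        }
        return Optional.of(delay);
    }
}
```

Passing `now` explicitly keeps the date branch deterministic in tests. With `maxSeconds` set to -1 the parsed delay is returned unchanged, which mirrors the default described above.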
thanks @rzo1
Contains Solr changes only, so most likely gone. I agree that purging the queue for the given host might be beneficial. Let me look into that in more depth later.
It was actually in branch 990, sorry
Agreed, the in-queue hold is the fragile part. A long [...] For this PR, I would take the purge route. On a [...] I'd treat the host stream / host-aware spout design (#867, your branch 990) as the proper long-term home for this; happy to follow up there once we settle the short-term behaviour. Does that split sound right to you?
Yes, the split makes sense.
Could set an explicit
Yes, this was my reasoning too, but I wanted to get some confirmation on it :)
// We remove the entry and put it at the end of the map
i.remove();

// reap empty queues
If the next fetch time is in the future, the queue cannot be released for now. Otherwise the retry-after is not honored for new fetch items of the same site arriving through the topology.
See Nutch's FetchItemQueues code to release queues in combination with the exponential back-off.
or should we set an explicit future nextFetchDate so the scheduler honors the exact back-off?
This would only help if there are not too many items from the same site.
I'd treat the host stream / host-aware spout design (#867, your branch 990) as the proper long-term home for this,
Definitely. Avoiding back-pressure is important. And in a broad crawl, the number of queues that need to be kept to enforce the Retry-After delay can grow large.
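The release rule raised in the review comment above (an empty queue must not be reaped while its next fetch time is still in the future) could look roughly like the following sketch. The `HostQueue` and `QueueReapSketch` types are hypothetical and much simpler than the real FetcherBolt internals; they only illustrate the condition.

```java
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified model of per-host fetch queues.
class QueueReapSketch {

    static final class HostQueue {
        final Deque<String> urls = new ArrayDeque<>();
        // Earliest time at which the next fetch from this host is allowed
        Instant nextFetchTime = Instant.EPOCH;
    }

    final Map<String, HostQueue> queues = new LinkedHashMap<>();

    /** Records a Retry-After style back-off for a host. */
    void retryAfter(String host, Instant until) {
        queues.computeIfAbsent(host, k -> new HostQueue()).nextFetchTime = until;
    }

    /**
     * Removes queues that are empty AND whose back-off has expired.
     * An empty queue with a future nextFetchTime is kept, so that URLs
     * arriving later for the same host still see the delay.
     */
    int reap(Instant now) {
        int removed = 0;
        Iterator<Map.Entry<String, HostQueue>> i = queues.entrySet().iterator();
        while (i.hasNext()) {
            HostQueue q = i.next().getValue();
            if (q.urls.isEmpty() && !q.nextFetchTime.isAfter(now)) {
                i.remove();
                removed++;
            }
        }
        return removed;
    }
}
```

The trade-off noted in the discussion is visible here: every host under back-off keeps an entry in the map until its delay expires, so in a broad crawl the map can grow large.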
For all changes
Is there an issue associated with this PR? Is it referenced in the commit message?
Does your PR title start with #XXXX where XXXX is the issue number you are trying to resolve?
Has your PR been rebased against the latest commit within the target branch (typically main)?
Is your initial contribution a single, squashed commit?
Is the code properly formatted with mvn git-code-format:format-code -Dgcf.globPattern="**/*" -Dskip.format.code=false?
For code changes
mvn clean verify?