#784 - Support Retry-After in FetcherBolt (#1944)

Draft
rzo1 wants to merge 2 commits into main from 784-retry-after-fetcherbolt
Conversation

@rzo1 rzo1 commented Jun 14, 2026

Contributor

Honour the Retry-After HTTP response header by delaying the next fetch from the affected internal queue. The header value is parsed as either a number of seconds or an HTTP date. A new fetcher.max.retry.after config caps the honoured delay (-1, the default, honours it as-is).
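
The parsing and capping described above could look roughly like the following. This is a minimal sketch, not the code from this PR: the class and method names are hypothetical, and it assumes the cap is expressed in seconds, which the description does not state.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper, for illustration only.
final class RetryAfterSketch {

    /** Returns the requested delay in milliseconds, or -1 if the value cannot be parsed. */
    static long parseRetryAfterMillis(String value, Instant now) {
        if (value == null) {
            return -1;
        }
        String v = value.trim();
        if (v.isEmpty()) {
            return -1;
        }
        // Form 1: a non-negative number of seconds, e.g. "120"
        try {
            long secs = Long.parseLong(v);
            if (secs < 0) {
                return -1;
            }
            // saturate instead of overflowing on absurdly large values
            return secs > Long.MAX_VALUE / 1000L ? Long.MAX_VALUE : secs * 1000L;
        } catch (NumberFormatException e) {
            // not a number, fall through to the date form
        }
        // Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        try {
            Instant when =
                    ZonedDateTime.parse(v, DateTimeFormatter.RFC_1123_DATE_TIME).toInstant();
            // a date in the past means no delay at all
            return Math.max(0L, Duration.between(now, when).toMillis());
        } catch (DateTimeParseException e) {
            return -1;
        }
    }

    /** Applies the cap; a negative cap (assumed to be in seconds) leaves the delay as-is. */
    static long applyCap(long delayMs, long maxRetryAfterSecs) {
        if (delayMs < 0 || maxRetryAfterSecs < 0) {
            return delayMs;
        }
        return Math.min(delayMs, maxRetryAfterSecs * 1000L);
    }
}
```

Passing `now` in explicitly keeps the date branch deterministic in unit tests.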

For all changes

  • Is there an issue associated with this PR? Is it referenced in the commit message?

  • Does your PR title start with #XXXX where XXXX is the issue number you are trying to resolve?

  • Has your PR been rebased against the latest commit within the target branch (typically main)?

  • Is your initial contribution a single, squashed commit?

  • Is the code properly formatted with mvn git-code-format:format-code -Dgcf.globPattern="**/*" -Dskip.format.code=false?

For code changes

  • Have you ensured that the full suite of tests is executed via mvn clean verify?
  • Have you written or updated unit tests to verify your changes?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE file, including the main LICENSE file?
  • If applicable, have you updated the NOTICE file, including the main NOTICE file?

@rzo1 rzo1 added this to the 3.6.1 milestone Jun 14, 2026
@rzo1 rzo1 requested review from mvolikas and sigee June 15, 2026 06:50
jnioche commented Jun 15, 2026

Contributor

Thanks @rzo1.
My concern is that this would cause unnecessary tuple timeouts. If we come across a Retry-After, we should probably purge the queue instead.
A more robust approach would be to implement it via the host stream (see #867) and have a host-aware spout implementation. I had made a start in the branch 851.

rzo1 commented Jun 15, 2026

Contributor Author

> I had made a start in the branch 851

Contains SOLR changes only, so most likely gone.

--

I agree that purging the queue for the given host might be beneficial. Let me look into that in more depth later.

@rzo1 rzo1 marked this pull request as draft June 15, 2026 09:27
jnioche commented Jun 15, 2026

Contributor

> > I had made a start in the branch 851
>
> Contains SOLR changes only, so most likely gone.

It was actually in branch 990, sorry.

@dpol1 dpol1 linked an issue Jun 15, 2026 that may be closed by this pull request
rzo1 commented Jun 16, 2026

Contributor Author

Agreed, the in-queue hold is the fragile part. A long Retry-After (and the default is uncapped) would park the queue's siblings past topology.message.timeout.secs and trigger replays.

For this PR, I would take the purge route. On a Retry-After, re-emit the affected queue's pending items rather than holding them, mirroring the existing crawl-delay-too-long path at FetcherBolt.java:682 so the frontier reschedules them. One thing I'd like your take on: should the re-emitted URLs go out as Status.ERROR (reuses the existing path, but carries error/retry semantics), or should we set an explicit future nextFetchDate so the scheduler honors the exact back-off?

I'd treat the host stream / host-aware spout design (#867, your branch 990) as the proper long-term home for this, happy to follow up there once we settle the short-term behaviour. Does that split sound right to you?
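
To make the purge route concrete, here is a minimal, self-contained sketch. It does not use the real FetcherBolt classes: `HostQueue`, `holdWouldTimeOut` and `purge` are hypothetical stand-ins, and the timeout is passed in as a plain number rather than read from the Storm configuration. The thread proposes purging on any Retry-After; the timeout check only illustrates why holding items in the queue is fragile.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical model of a per-host internal queue, for illustration only.
final class PurgeOnRetryAfterSketch {

    static final class HostQueue {
        final Deque<String> pending = new ArrayDeque<>();
    }

    /**
     * True if holding the queue for retryAfterMs would reach or exceed the tuple timeout
     * (topology.message.timeout.secs), i.e. the held tuples would be replayed.
     */
    static boolean holdWouldTimeOut(long retryAfterMs, long messageTimeoutSecs) {
        return retryAfterMs >= messageTimeoutSecs * 1000L;
    }

    /**
     * Drains every pending URL from the queue, in order. The caller would re-emit these on
     * the status stream so the frontier reschedules them, instead of holding them in memory.
     */
    static List<String> purge(HostQueue queue) {
        List<String> drained = new ArrayList<>(queue.pending.size());
        String url;
        while ((url = queue.pending.poll()) != null) {
            drained.add(url);
        }
        return drained;
    }
}
```

Which status the drained URLs are re-emitted with is the open question raised in the comment above, so the sketch deliberately stops at returning the list.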

jnioche commented Jun 16, 2026

Contributor

yes, the split makes sense

> One thing I'd like your take on: should the re-emitted URLs go out as Status.ERROR (reuses the existing path, but carries error/retry semantics), or should we set an explicit future nextFetchDate so the scheduler honors the exact back-off?

Status.ERROR is not the right status: it indicates an irremediable problem with the content of the document, like a pdf that would be unparsable for instance or a URL blocked by robots.txt

We could set an explicit nextFetchDate, but I think just mimicking what is done for crawl-delay-too-long would be good enough.

rzo1 commented Jun 16, 2026

Contributor Author

> Status.ERROR is not the right status: it indicates an irremediable problem with the content of the document, like a pdf that would be unparsable for instance or a URL blocked by robots.txt

Yes, this was my reasoning too, but I wanted to get some confirmation on it :)

// We remove the entry and put it at the end of the map
i.remove();

// reap empty queues

Contributor

If the next fetch time is in the future, the queue cannot be released for now. Otherwise the retry-after is not honored for new fetch items of the same site arriving through the topology.

See Nutch's FetchItemQueues code to release queues in combination with the exponential back-off.

> or should we set an explicit future nextFetchDate so the scheduler honors the exact back-off?

This would only help if there are not too many items from the same site.

> I'd treat the host stream / host-aware spout design (#867, your branch 990) as the proper long-term home for this,

Definitely. Avoiding back-pressure is important. And in a broad crawl, the number of queues that need to be kept to enforce the retry-after delay can grow large.
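
The point about not releasing a queue whose next fetch time is still in the future can be sketched as follows. This is a simplified model under stated assumptions, not the FetcherBolt or Nutch code: `HostQueue`, its fields, and `reap` are hypothetical names.

```java
import java.util.Iterator;
import java.util.Map;

// Hypothetical model of the "reap empty queues" step, for illustration only.
final class QueueReapSketch {

    static final class HostQueue {
        int pending; // items waiting to be fetched
        int inProgress; // items currently being fetched
        long nextFetchTimeMs; // earliest time the next fetch is allowed (epoch millis)
    }

    /** An empty queue may only be released once its back-off has expired. */
    static boolean canReap(HostQueue q, long nowMs) {
        return q.pending == 0 && q.inProgress == 0 && q.nextFetchTimeMs <= nowMs;
    }

    /** Removes all releasable queues and returns how many were removed. */
    static int reap(Map<String, HostQueue> queues, long nowMs) {
        int removed = 0;
        Iterator<Map.Entry<String, HostQueue>> it = queues.entrySet().iterator();
        while (it.hasNext()) {
            if (canReap(it.next().getValue(), nowMs)) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }
}
```

Keeping an empty queue alive until `nextFetchTimeMs` has passed is what lets new items for the same host see the back-off. The cost is that the map of queues grows with the number of hosts under a delay, which is the scaling concern raised above for broad crawls.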

Development

Successfully merging this pull request may close these issues.

support retry-after in FetcherBolt

3 participants