#784 - Support Retry-After in FetcherBolt by rzo1 · Pull Request #1944 · apache/stormcrawler

rzo1 · 2026-06-14T17:44:22Z

Honour the Retry-After HTTP response header by delaying the next fetch from the affected internal queue. The header value is parsed as either a number of seconds or an HTTP date. A new fetcher.max.retry.after config caps the honoured delay (-1, the default, honours it as-is).
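As an illustration of the parsing and capping described above, here is a minimal sketch. It is not the code from this PR: the class name, method name, and the choice to return an empty `OptionalLong` for unparseable values are assumptions made for the example. Only the two accepted formats (delay in seconds or HTTP date) and the meaning of a negative cap (honour the value as-is) come from the description.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.OptionalLong;

// Hypothetical helper, not part of StormCrawler.
final class RetryAfterSketch {

    /**
     * Parses a Retry-After value into a delay in milliseconds.
     *
     * @param value   raw header value (delay in seconds or an HTTP date)
     * @param now     reference time used to turn an HTTP date into a delay
     * @param maxSecs cap in seconds; a negative value means "no cap"
     * @return the delay, or empty if the value cannot be parsed (assumption)
     */
    static OptionalLong parseRetryAfterMillis(String value, Instant now, long maxSecs) {
        if (value == null || value.trim().isEmpty()) {
            return OptionalLong.empty();
        }
        String v = value.trim();
        long delayMs;
        if (v.chars().allMatch(c -> c >= '0' && c <= '9')) {
            // Form 1: a non-negative number of seconds
            try {
                delayMs = Math.multiplyExact(Long.parseLong(v), 1000L);
            } catch (NumberFormatException | ArithmeticException e) {
                return OptionalLong.empty();
            }
        } else {
            // Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
            try {
                Instant when =
                        ZonedDateTime.parse(v, DateTimeFormatter.RFC_1123_DATE_TIME).toInstant();
                // A date in the past means no further waiting is required
                delayMs = Math.max(0L, Duration.between(now, when).toMillis());
            } catch (DateTimeParseException e) {
                return OptionalLong.empty();
            }
        }
        if (maxSecs >= 0) {
            delayMs = Math.min(delayMs, maxSecs * 1000L);
        }
        return OptionalLong.of(delayMs);
    }
}
```

With a cap of 30 seconds, a header value of `120` would be reduced to a 30-second delay; with the default of -1 the full 120 seconds would be honoured.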

For all changes

Is there an issue associated with this PR? Is it referenced in the commit message?
Does your PR title start with #XXXX where XXXX is the issue number you are trying to resolve?
Has your PR been rebased against the latest commit within the target branch (typically main)?
Is your initial contribution a single, squashed commit?
Is the code properly formatted with mvn git-code-format:format-code -Dgcf.globPattern="**/*" -Dskip.format.code=false?

For code changes

Have you ensured that the full suite of tests is executed via mvn clean verify?
Have you written or updated unit tests to verify your changes?
If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
If applicable, have you updated the LICENSE file, including the main LICENSE file?
If applicable, have you updated the NOTICE file, including the main NOTICE file?


jnioche · 2026-06-15T09:06:41Z

Thanks @rzo1.
My concern is that this would cause unnecessary tuple timeouts. If we come across a Retry-After, we should probably purge the queue instead.
A more robust approach would be to implement it via the host stream (see #867) and have a host-aware spout implementation. I had made a start in the branch 851

rzo1 · 2026-06-15T09:25:17Z

> I had made a start in the branch 851

Contains SOLR changes only, so most likely gone.

--

I agree that purging the queue for the given host might be beneficial. Let me look into that in more depth later.

jnioche · 2026-06-15T09:27:57Z

> I had made a start in the branch 851
>
> Contains SOLR changes only, so most likely gone.

It was actually in branch 990, sorry.

rzo1 · 2026-06-16T09:22:53Z

Agreed, the in-queue hold is the fragile part. A long Retry-After (and the default is uncapped) would park the queue's siblings past topology.message.timeout.secs and trigger replays.

For this PR, I would take the purge route. On a Retry-After, re-emit the affected queue's pending items rather than holding them, mirroring the existing crawl-delay-too-long path at FetcherBolt.java:682 so the frontier reschedules them. One thing I'd like your take on: should the re-emitted URLs go out as Status.ERROR (reuses the existing path, but carries error/retry semantics), or should we set an explicit future nextFetchDate so the scheduler honors the exact back-off?
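To make the purge idea concrete, here is a rough sketch under stated assumptions. `HostQueue`, its fields, and `purge` are hypothetical stand-ins and do not mirror the actual FetcherBolt internals; the sketch only shows the shape of "drain the pending items and hand them back" as opposed to holding their tuples until the delay has elapsed. How the drained URLs are then emitted (status, metadata) is exactly the open question above and is deliberately left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical types for illustration only, not StormCrawler classes.
final class PurgeOnRetryAfterSketch {

    /** Stand-in for a per-host internal fetch queue. */
    static final class HostQueue {
        final Deque<String> pending = new ArrayDeque<>();
        long nextFetchTimeMs;
    }

    /**
     * Drains every pending URL from the queue and records when the host may
     * be fetched again. The caller is expected to re-emit the returned URLs
     * so that the frontier reschedules them.
     */
    static List<String> purge(HostQueue queue, long nowMs, long retryAfterMs) {
        List<String> drained = new ArrayList<>(queue.pending);
        queue.pending.clear();
        queue.nextFetchTimeMs = nowMs + retryAfterMs;
        return drained;
    }
}
```

Because nothing stays parked in the queue, no tuple has to wait for the back-off to expire, which is the property that avoids running past topology.message.timeout.secs.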

I'd treat the host stream / host-aware spout design (#867, your branch 990) as the proper long-term home for this, happy to follow up there once we settle the short-term behaviour. Does that split sound right to you?

jnioche · 2026-06-16T10:50:50Z

Yes, the split makes sense.

> One thing I'd like your take on: should the re-emitted URLs go out as Status.ERROR (reuses the existing path, but carries error/retry semantics), or should we set an explicit future nextFetchDate so the scheduler honors the exact back-off?

Status.ERROR is not the right status: it indicates an irremediable problem with the content of the document, like a pdf that would be unparsable for instance or a URL blocked by robots.txt

We could set an explicit nextFetchDate, but I think just mimicking what is done for crawl-delay-too-long would be good enough.

rzo1 · 2026-06-16T10:52:59Z

> Status.ERROR is not the right status: it indicates an irremediable problem with the content of the document, like a pdf that would be unparsable for instance or a URL blocked by robots.txt

Yes, this was my reasoning too, but I wanted to get some confirmation on it :)

sebastian-nagel · 2026-06-16T12:53:59Z

                // We remove the entry and put it at the end of the map
                i.remove();

                // reap empty queues


If the next fetch time is in the future, the queue cannot be released for now. Otherwise the retry-after is not honored for new fetch items of the same site arriving through the topology.

See Nutch's FetchItemQueues code to release queues in combination with the exponential back-off.
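The reaping condition described here could look roughly like the following sketch. The `HostQueue` type, its fields, and `reapIdleQueues` are hypothetical and are neither taken from FetcherBolt nor from Nutch; the only point illustrated is that an empty queue is kept as long as its next fetch time lies in the future.

```java
import java.util.Iterator;
import java.util.Map;

// Hypothetical types for illustration only.
final class ReapQueuesSketch {

    /** Stand-in for a per-host internal fetch queue. */
    static final class HostQueue {
        int pending;          // number of items waiting in the queue
        long nextFetchTimeMs; // earliest time the host may be fetched again
    }

    /**
     * Removes queues that are empty AND whose back-off has elapsed.
     * Empty queues with a future next fetch time are kept, so that items
     * arriving later for the same host still see the delay.
     *
     * @return the number of queues removed
     */
    static int reapIdleQueues(Map<String, HostQueue> queues, long nowMs) {
        int removed = 0;
        Iterator<Map.Entry<String, HostQueue>> it = queues.entrySet().iterator();
        while (it.hasNext()) {
            HostQueue q = it.next().getValue();
            boolean empty = q.pending == 0;
            boolean backoffElapsed = q.nextFetchTimeMs <= nowMs;
            if (empty && backoffElapsed) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }
}
```

The cost of this rule is that the map retains one entry per backed-off host until its delay expires, so the number of retained queues grows with the number of hosts sending Retry-After.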

> or should we set an explicit future nextFetchDate so the scheduler honors the exact back-off?

This would only help if there are not too many items from the same site.

> I'd treat the host stream / host-aware spout design (#867, your branch 990) as the proper long-term home for this,

Definitely. Avoiding back-pressure is important. And in a broad crawl, the number of queues that need to be kept to enforce the retry-after delay can grow large.

rzo1 added this to the 3.6.1 milestone Jun 14, 2026

rzo1 requested review from dpol1, jnioche and sebastian-nagel June 14, 2026 17:44

Apply code formatting

9fbe279

rzo1 requested review from mvolikas and sigee June 15, 2026 06:50

rzo1 marked this pull request as draft June 15, 2026 09:27

dpol1 linked an issue Jun 15, 2026 that may be closed by this pull request

support retry-after in FetcherBolt #784

Open

sebastian-nagel reviewed Jun 16, 2026

rzo1 wants to merge 2 commits into main from 784-retry-after-fetcherbolt


3 participants
