We need to crawl many Internet sites and have run into an issue with the www prefix:
some sites redirect to their domain without www, others do it the other way round.
Unfortunately, this case cannot be handled by NC in a general (global) way. We can normalize URLs by removing the www prefix, and if a site then redirects back to www.some.site, the collector will follow, since it is configured to include sub-domains. But there are sites that are only available with the www prefix (e.g. https://www.pony.at/ does not work without www), so we would miss those sites.
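To make the trade-off concrete, here is a minimal Python sketch of the normalization described above, plus one possible way around the www-only case: seeding the crawl with both host variants. This is only an illustration and not NC configuration or API. The function names `strip_www` and `www_variants` are made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_www(url: str) -> str:
    """Return the URL with a leading 'www.' removed from the host.

    Hypothetical helper for illustration. Any userinfo (user:pass@) is
    dropped and the host is lower-cased.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    # Keep an explicit port if the URL had one.
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


def www_variants(url: str) -> list[str]:
    """Return both the bare and the www-prefixed form of a URL.

    Assumption: using both as start URLs lets a crawler reach a site
    that only answers on one of the two hosts.
    """
    bare = strip_www(url)
    parts = urlsplit(bare)
    with_www = urlunsplit(
        (parts.scheme, "www." + parts.netloc, parts.path, parts.query, parts.fragment)
    )
    return [bare, with_www]


if __name__ == "__main__":
    # Normalizing alone yields a URL that, per the example above, does not work.
    print(strip_www("https://www.pony.at/"))
    # Emitting both variants keeps the working www form among the seeds.
    print(www_variants("https://www.pony.at/"))
```

The downside of the second approach is that sites answering on both hosts get requested twice at the start URL, so some de-duplication (for example by comparing `strip_www` results) would still be needed.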
So I'm looking for a general solution to this problem.
Any ideas are very welcome. Thank you!
Common requirements for a crawler:
<startURLs stayOnDomain="true" stayOnPort="false" stayOnProtocol="false" includeSubdomains="true">