ignore www prefix #614

@jetnet

We need to crawl many Internet sites and have run into an issue with the www prefix: some sites redirect to their domain without www, others do the opposite.
Unfortunately, this case cannot be handled by NC in a general way (globally). We can normalize URLs by removing the www prefix, and if a site redirects to www.some.site again, the collector will follow, since it is configured to follow sub-domains. But there will be cases where a site is available with the www prefix only (e.g. https://www.pony.at/ does not work without www), so we would miss such sites.
So I'm looking for a general solution to that problem.
Any ideas are very welcome! Thank you!

Common requirements for a crawler:

<startURLs stayOnDomain="true" stayOnPort="false" stayOnProtocol="false" includeSubdomains="true">
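To make the problem concrete, here is a minimal Python sketch of the www-stripping normalization described above. It is not NC code or configuration: the function names `strip_www` and `host_variants` are hypothetical and only illustrate the idea. The second helper lists both forms of a start URL, which a pre-check could probe before crawling; the probing itself is not shown.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_www(url):
    """Return the URL with a leading 'www.' removed from its host.

    Hypothetical helper for illustration only. Note that it lower-cases
    the host and drops any user:password part of the URL.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    # Keep an explicit port if the original URL had one.
    netloc = host if parts.port is None else "{}:{}".format(host, parts.port)
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


def host_variants(url):
    """Return both forms of the URL: bare host first, then with 'www.'."""
    bare = strip_www(url)
    parts = urlsplit(bare)
    with_www = urlunsplit(
        (parts.scheme, "www." + parts.netloc, parts.path, parts.query,
         parts.fragment)
    )
    return [bare, with_www]


if __name__ == "__main__":
    # The normalized form is the one that does not work for this site.
    print(strip_www("https://www.pony.at/"))
    print(host_variants("https://www.pony.at/"))
```

For https://www.pony.at/ the normalized URL is exactly the form that does not work, which is why stripping www globally is not enough on its own.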
