Question: crawling in similar domain #612

@FcrbPeter

Hi Pascal,

I am crawling a website that spans several domains, such as:

// Domains listed in the start URL section
www.rthk.hk
app3.rthk.hk
app4.rthk.hk
programme.rthk.hk
news.rthk.hk
podcast.rthk.hk
// Domains that also need to be crawled but are not listed above
app1.rthk.hk
app2.rthk.hk
... more with "rthk.hk"

In config.xml, I did something like this:

// stayOnDomain = false, because there are other similar domains to crawl
// stayOnPort & stayOnProtocol = false, because the site uses both http and https
<startURLs stayOnDomain="false" stayOnPort="false" stayOnProtocol="false">
<url>http://app3.rthk.hk/search/google/start.php</url>
<url>http://programme.rthk.hk/archivelist_gsa.php?channel=dtt31</url>
<url>https://www.rthk.hk/</url>
<url>https://news.rthk.hk/</url>
<url>http://podcast.rthk.hk/</url>
<url>http://app4.rthk.hk/special/rthkmemory/</url>
<url>http://app4.rthk.hk/elearning/healthpedia/</url>
</startURLs>

<referenceFilters>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
        .*rthk\.hk/.*
</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
        .*rthk\.org\.hk/.*
</filter>

... other exclude filters
</referenceFilters>
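
As a sanity check on the two include patterns above, here is a minimal Python sketch that tests them against a few URLs. It assumes the filter matches the pattern against the whole URL and ignores case, which is my understanding of RegexReferenceFilter but is not confirmed in this thread. `HOST_ANCHORED` is a hypothetical stricter pattern that is not part of the original config, and the `example.com` URL is made up for illustration.

```python
import re

# The two "include" patterns from the config above.
INCLUDE_PATTERNS = [r".*rthk\.hk/.*", r".*rthk\.org\.hk/.*"]

# Hypothetical stricter pattern (NOT from the original config):
# it only accepts URLs whose host is rthk.hk, rthk.org.hk, or a subdomain.
HOST_ANCHORED = r"https?://([^/?#]+\.)?rthk\.(org\.)?hk(:\d+)?/.*"


def is_included(url, patterns=INCLUDE_PATTERNS):
    """Return True if any pattern matches the whole URL (case-insensitive).

    Assumption: this mirrors a full-string match such as Java's
    Matcher.matches(); the real filter may behave differently.
    """
    return any(re.fullmatch(p, url, re.IGNORECASE) for p in patterns)


if __name__ == "__main__":
    samples = [
        "http://app1.rthk.hk/some/page",
        "http://gtob.ningbo.gov.cn/art/2018/9/5/art_316_944973.html",
        "http://example.com/redirect?to=rthk.hk/",  # made-up URL
    ]
    for url in samples:
        print(url, is_included(url), is_included(url, [HOST_ANCHORED]))
```

Under these assumptions, neither include pattern matches the gtob.ningbo.gov.cn URL from the log, so the patterns by themselves do not account for that fetch. The loose patterns do accept any URL that merely contains `rthk.hk/` somewhere, including in a query string, which the host-anchored variant avoids.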

I found this approach in past issues, but it does not seem to work in my case.

The following log shows an unwanted URL being fetched:

INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: http://gtob.ningbo.gov.cn/art/2018/9/5/art_316_944973.html
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: http://gtob.ningbo.gov.cn/art/2018/9/5/art_316_944973.html
INFO  [CrawlerEventManager]           REJECTED_FILTER: http://gtob.ningbo.gov.cn/picture/0/1d916bc2a14c46e2999138ed408fecb9.jpg (ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpeg,jpg,svg,gif,png,ico,caseSensitive=false])
INFO  [CrawlerEventManager]           REJECTED_FILTER: http://gtob.ningbo.gov.cn/picture/0/04dd334f9961456586f017a5c44ce7dc.jpg (ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpeg,jpg,svg,gif,png,ico,caseSensitive=false])
INFO  [CrawlerEventManager]           REJECTED_FILTER: http://gtob.ningbo.gov.cn/art/2018/9/5/../../../../module/visitcount/visit.jsp?type=3&i_webid=2&i_columnid=316&i_articleid=944973 (RegexReferenceFilter[onMatch=EXCLUDE,caseSensitive=false,regex=.*\/([^\/]*)\/\1\/\1\/.*])
INFO  [CrawlerEventManager]           REJECTED_FILTER: http://gtob.ningbo.gov.cn/images/10/gmz_dqwz_pic.jpg (ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpeg,jpg,svg,gif,png,ico,caseSensitive=false])
INFO  [CrawlerEventManager]            URLS_EXTRACTED: http://gtob.ningbo.gov.cn/art/2018/9/5/art_316_944973.html
INFO  [CrawlerEventManager]           REJECTED_FILTER: http://gtob.ningbo.gov.cn/art/2018/9/5/art_316_944973.html (No "include" document filters matched.)
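
To find other off-domain hits in a longer log, the lines can be grouped by host. The following is a rough Python sketch, assuming every event line has the `[CrawlerEventManager] EVENT_NAME: URL` shape shown above. The function name `off_domain_events` and the `allowed_suffixes` parameter are hypothetical helpers, not part of the crawler.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Matches lines shaped like:
#   INFO  [CrawlerEventManager]   DOCUMENT_FETCHED: http://host/path
EVENT_LINE = re.compile(r"\[CrawlerEventManager\]\s+([A-Z_]+):\s+(\S+)")


def off_domain_events(log_lines, allowed_suffixes=("rthk.hk", "rthk.org.hk")):
    """Map each host outside the allowed domains to the events logged for it."""
    hits = defaultdict(list)
    for line in log_lines:
        match = EVENT_LINE.search(line)
        if not match:
            continue  # not an event line in the assumed format
        event, url = match.groups()
        host = (urlsplit(url).hostname or "").lower()
        # A host is allowed if it equals a suffix or is a subdomain of it.
        allowed = any(
            host == suffix or host.endswith("." + suffix)
            for suffix in allowed_suffixes
        )
        if not allowed:
            hits[host].append(event)
    return dict(hits)


if __name__ == "__main__":
    sample = [
        "INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: "
        "http://gtob.ningbo.gov.cn/art/2018/9/5/art_316_944973.html",
    ]
    print(off_domain_events(sample))
```

Any host that shows up with a DOCUMENT_FETCHED event in the result was downloaded despite being outside the intended domains.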

Is there anything wrong with my config?

Thanks!
