Hi Pascal,
I am working on a website that spans several domains, such as:
// Domains listed in the start URLs section
www.rthk.hk
app3.rthk.hk
app4.rthk.hk
programme.rthk.hk
news.rthk.hk
podcast.rthk.hk
// Domains that need to be crawled but are not listed above
app1.rthk.hk
app2.rthk.hk
... more with "rthk.hk"
In the config.xml, I did something like...
// stayOnDomain = false, because there are other similar domains
// stayOnPort & stayOnProtocol = false, because there are both http and https URLs
<startURLs stayOnDomain="false" stayOnPort="false" stayOnProtocol="false">
<url>http://app3.rthk.hk/search/google/start.php</url>
<url>http://programme.rthk.hk/archivelist_gsa.php?channel=dtt31</url>
<url>https://www.rthk.hk/</url>
<url>https://news.rthk.hk/</url>
<url>http://podcast.rthk.hk/</url>
<url>http://app4.rthk.hk/special/rthkmemory/</url>
<url>http://app4.rthk.hk/elearning/healthpedia/</url>
</startURLs>
<referenceFilters>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
.*rthk\.hk/.*
</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
.*rthk\.org\.hk/.*
</filter>
... other exclude filters
</referenceFilters>
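To sanity-check the two include patterns outside the crawler, here is a minimal sketch in Python. It assumes the filter matches the pattern against the whole URL (imitated here with `re.fullmatch`) and that Python's regex behaviour is equivalent to the crawler's for these simple patterns. The helper name `is_included` is made up for illustration and is not part of Norconex.

```python
import re

# The two include patterns copied from the <referenceFilters> block.
INCLUDE_PATTERNS = [r".*rthk\.hk/.*", r".*rthk\.org\.hk/.*"]


def is_included(url: str) -> bool:
    """Return True if any include pattern matches the whole URL.

    Assumption: the filter compares the pattern against the entire
    reference, which re.fullmatch imitates here.
    """
    return any(re.fullmatch(p, url, re.IGNORECASE) for p in INCLUDE_PATTERNS)


if __name__ == "__main__":
    samples = [
        "https://www.rthk.hk/",
        "http://app3.rthk.hk/search/google/start.php",
        "http://gtob.ningbo.gov.cn/art/2018/9/5/art_316_944973.html",
    ]
    for url in samples:
        print(url, is_included(url))
```

Under these assumptions the rthk.hk URLs match and the ningbo.gov.cn URL does not. Note that the patterns require a `/` after the host, so a bare `https://www.rthk.hk` without the trailing slash would not match.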
I found this solution in past issues.
However, it does not seem to work in my case.
The following log shows that an unwanted URL was fetched:
INFO [CrawlerEventManager] DOCUMENT_FETCHED: http://gtob.ningbo.gov.cn/art/2018/9/5/art_316_944973.html
INFO [CrawlerEventManager] CREATED_ROBOTS_META: http://gtob.ningbo.gov.cn/art/2018/9/5/art_316_944973.html
INFO [CrawlerEventManager] REJECTED_FILTER: http://gtob.ningbo.gov.cn/picture/0/1d916bc2a14c46e2999138ed408fecb9.jpg (ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpeg,jpg,svg,gif,png,ico,caseSensitive=false])
INFO [CrawlerEventManager] REJECTED_FILTER: http://gtob.ningbo.gov.cn/picture/0/04dd334f9961456586f017a5c44ce7dc.jpg (ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpeg,jpg,svg,gif,png,ico,caseSensitive=false])
INFO [CrawlerEventManager] REJECTED_FILTER: http://gtob.ningbo.gov.cn/art/2018/9/5/../../../../module/visitcount/visit.jsp?type=3&i_webid=2&i_columnid=316&i_articleid=944973 (RegexReferenceFilter[onMatch=EXCLUDE,caseSensitive=false,regex=.*\/([^\/]*)\/\1\/\1\/.*])
INFO [CrawlerEventManager] REJECTED_FILTER: http://gtob.ningbo.gov.cn/images/10/gmz_dqwz_pic.jpg (ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpeg,jpg,svg,gif,png,ico,caseSensitive=false])
INFO [CrawlerEventManager] URLS_EXTRACTED: http://gtob.ningbo.gov.cn/art/2018/9/5/art_316_944973.html
INFO [CrawlerEventManager] REJECTED_FILTER: http://gtob.ningbo.gov.cn/art/2018/9/5/art_316_944973.html (No "include" document filters matched.)
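To find every off-domain fetch in a longer log, a small script along these lines might help. It is only a sketch: it assumes each fetch appears as a `DOCUMENT_FETCHED: <url>` entry as in the excerpt above, and the function name `off_domain_fetches` and its `suffixes` parameter are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Matches the URL that follows a DOCUMENT_FETCHED event in a log line.
FETCHED = re.compile(r"DOCUMENT_FETCHED: (\S+)")


def off_domain_fetches(log_lines, suffixes=("rthk.hk", "rthk.org.hk")):
    """Return fetched URLs whose host is not under any allowed suffix."""
    hits = []
    for line in log_lines:
        match = FETCHED.search(line)
        if not match:
            continue  # not a fetch event
        url = match.group(1)
        host = urlsplit(url).hostname or ""
        allowed = any(host == s or host.endswith("." + s) for s in suffixes)
        if not allowed:
            hits.append(url)
    return hits


if __name__ == "__main__":
    sample = [
        "INFO [CrawlerEventManager] DOCUMENT_FETCHED: "
        "http://gtob.ningbo.gov.cn/art/2018/9/5/art_316_944973.html",
        "INFO [CrawlerEventManager] URLS_EXTRACTED: "
        "http://gtob.ningbo.gov.cn/art/2018/9/5/art_316_944973.html",
    ]
    for url in off_domain_fetches(sample):
        print(url)
```

Comparing on the parsed host rather than the full URL avoids counting a URL as on-domain just because `rthk.hk/` appears somewhere in its path or query string.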
I would like to ask whether there is anything wrong with the config.
Thanks!