PHPCrawl Version-History / Changelog
Version 0.83
2015/01/27
* Fixed bug #74: The crawler now treats hostnames as case-insensitive and no longer follows URLs differing
just in hostname letter case more than once (like "www.foo.net/page" and "www.FOO.net/page")
* Fixed bug #38: Detection for socket-EOF sometimes failed on SSL-connections under windows-OS, causing an (infinite)
hang of the crawling-process.
* Fixed bug #59: If obeyRobotsTxt() is set to TRUE, the crawler now doesn't throw any errors anymore when parsing
malformed robots.txt-files.
* Fixed bug #36: The crawler now uses the correct user-agent identification for requests of robots.txt-documents after it
was changed by the user with setUserAgentString().
* A second, optional parameter "robots_txt_uri" is now available for obeyRobotsTxt().
This lets the user alternatively specify the location of the robots.txt-file to obey manually as a URI (URL or file).
* Fixed bug #71: When running phpcrawl from the test-interface-GUI and output "Content-Size" was chosen, a
"PHP Fatal error: Call to undefined method PHPCrawlerUtils::getHeaderTag()" was thrown.
* Fixed bug/issue #61: Renamed method setPageLimit() to setRequestLimit() (because that is what this method does,
it limits the number of requests). The method setPageLimit() is still present for compatibility reasons, but
was marked as deprecated.
Also, all requests resulting in a server answer other than "2xx" are marked as "document not received" now
(PHPCrawlerDocumentInfo::received property). This makes setRequestLimit() with the optional
"only_count_received_documents"-parameter set behave as expected.
* Feature Request #11: Added ability to set a limit to the crawling-depth, method setCrawlingDepthLimit() added.
Property PHPCrawlerDocumentInfo::url_link_depth added.
* Fixed bug #79: When using phpcrawl in multiprocess mode MPMODE_PARENT_EXECUTES_USERCODE, the array
PHPCrawlerDocumentInfo::links_found_url_descriptors sometimes was missing some found URLs (and contained a lot
of NULL entries).
* Fixed undocumented bug: Links with an unknown protocol containing a minus (like "android-app://...") get rebuilt
correctly now and won't be followed anymore.
* Fixed bug #76: Links containing numerical-reference-entities or hexadecimal-reference-entities (like "&#47;") get
rebuilt and followed correctly now.
* Fixed bug #25: Added an option/method "excludeLinkSearchDocumentSections()" to specify what HTML-document-sections
should get ignored by the internal link-finding algorithm.
This gives users the opportunity to prevent the crawler from finding links in HTML-comments and <script>-areas of
HTML-pages (as reported/asked in the bugreport).
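The 0.83 additions listed above can be combined as in the following sketch (a hypothetical usage example, not part of the release; the "MyCrawler" class, the URL and the concrete limit values are made up, method and property names follow the entries above):

```php
<?php
require_once("libs/PHPCrawler.class.php");

class MyCrawler extends PHPCrawler
{
  function handleDocumentInfo($DocInfo)
  {
    // PHPCrawlerDocumentInfo::url_link_depth was added in 0.83
    echo $DocInfo->url." (depth ".$DocInfo->url_link_depth.")\n";
  }
}

$crawler = new MyCrawler();
$crawler->setURL("www.example.com");

// New in 0.83: limit the crawling-depth
$crawler->setCrawlingDepthLimit(2);

// Renamed in 0.83 (was setPageLimit()); TRUE = only count received documents
$crawler->setRequestLimit(100, true);

// New optional second parameter: obey a manually specified robots.txt-URI
$crawler->obeyRobotsTxt(true, "http://www.example.com/robots.txt");

// New in 0.83: ignore links inside HTML-comments and <script>-sections
$crawler->excludeLinkSearchDocumentSections(
  PHPCrawlerLinkSearchDocumentSections::ALL_SPECIAL_SECTIONS
);

$crawler->go();
```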
Version 0.82
2014/01/16
* Feature Request #6: Added the possibility to delay HTTP-requests the crawler executes, method setRequestDelay() added.
* Fixed bug #39: Process resumption failed on Windows machines due to an unnecessary sem_get() call, a function that is
not present on windows-machines ("Call to undefined function sem_get()").
* Fixed bug #58: If using the sqlite-URL-cache (setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE)) on
Windows, the cleanup-method failed after the crawling-process was finished and threw an "unlink() permission denied"
error on two db3-files.
* Feature Request #9: Added support for gzip-encoded content, method requestGzipContent() added. By default
the crawler is not requesting gzip-encoded content for compatibility reasons.
* Fixed bug #57: Internal detection of the system's temp-directory sometimes failed/threw an error.
* Added new summarizing benchmarks for a crawling-process: added PHPCrawlerProcessReport properties
"avg_proc_data_transfer_rate", "avg_server_connect_time" and "avg_server_response_time".
* Added new benchmarks for single page-requests, added properties PHPCrawlerDocumentInfo::server_connect_time
and PHPCrawlerDocumentInfo::server_response_time
* Adjusted/corrected some page benchmark results to give some "more correct" results
(PHPCrawlerDocumentInfo::data_transfer_rate and PHPCrawlerDocumentInfo::data_transfer_time)
* Fixed bug #56: Requests for SSL/https-sites sometimes didn't return (hung up) due to failed detection of socket-EOF
* Fixed bug #54: The crawler didn't recognize newlines in server-responses correctly in some rare cases and returned
a "Host doesn't respond with a HTTP-header"-error.
* Overridable and deprecated method handlePageData() removed because it used a lot of memory. For compatibility reasons
this function will still be called if defined by the user (data gets internally prepared for it only in this case).
* Adjusted some internal timing and buffer settings for some better performance in multiprocessing-mode
* Implemented the HTTP 1.1 protocol including chunked transfer encoding (because some servers don't support anything
other than HTTP 1.1 and chunked encoding anymore). HTTP 1.1 is now the default protocol. Method setHTTPProtocolVersion()
added for optionally switching back to HTTP 1.0.
* Fixed bug #49: Some servers didn't accept request-headers from phpcrawl because of the HTTP 1.0 protocol and the
uppercase "HOST"-word in request-headers. Changed to HTTP 1.1 and "HOST" to "Host".
* Fixed undocumented bug: If there were no cookies to send, an empty header-line was sent to the
server in the request-headers (didn't seem to cause any problems though).
* Fixed undocumented bug: If a cookie had the domain-attribute ".domainname.com" (with a leading dot), the cookie
was NOT sent with any requests for the host "domainname.com" (without leading "www." or similar subdomains).
* Unnecessary slashes get removed now generally at the end of URLs if the URL has no path ("www.site.com/" -> "www.site.com")
* Fixed bug #33: Missing "Accept"-directive in request-headers caused problems with some websites/webservers
(no response/no delivered content)
* Increased the default stream- and connection-timeout settings (a lot of people reported "slow" websites that
don't respond within the given timeouts).
* Fixed bug #29: Fixed a memory-leak caused by the stream_socket_client()-function (together with stream_context_create()),
which suffers from a memory leak in PHP version 5.3.X and maybe other versions (see https://bugs.php.net/bug.php?id=61371).
* Applied some other minor memory tweaks (for some reduced memory usage)
* Fixed "bug" #37: Fixed/removed some "code style errors" detected by PhpStorm's code inspector.
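The throttling and encoding options introduced in 0.82 can be sketched as follows (an illustrative example; the URL and the concrete delay values are made up, identifiers follow the entries above):

```php
<?php
require_once("libs/PHPCrawler.class.php");

$crawler = new PHPCrawler();
$crawler->setURL("www.example.com");

// Feature Request #6: delay between HTTP-requests
// (here: a random delay from a range of 1 to 3 seconds per request)
$crawler->setRequestDelay(array(1, 3));

// Feature Request #9: request gzip-encoded content (off by default
// for compatibility reasons)
$crawler->requestGzipContent(true);

// HTTP 1.1 is the new default; optionally switch back to HTTP 1.0
$crawler->setHTTPProtocolVersion(PHPCrawlerHTTPProtocols::HTTP_VERSION_1_0);

$crawler->go();
```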
Version 0.81
2012/10/14
* Added the ability to resume aborted/terminated crawling-processes (Feature request 3500669).
Methods getCrawlerId(), enableResumption() and resume() added.
* Fixed bug 3530805: Links that are partially urlencoded and partially not get rebuilt/encoded
correctly now.
* Fixed bug 3526202: Removed an unnecessary debug var_dump() from PHPCrawlerRobotsTxtParser.class.php
* Fixed bug 3533872: Server-name-indication in TLS/SSL works correctly now.
* Fixed bug 3533877: "base-href"-tags in websites get interpreted correctly now again.
Version 0.80 beta
2012/04/23
Version 0.80 was (almost) completely rewritten and ported to "proper" object-oriented PHP5-code.
It brings new features like the ability to use multiple processes to spider a website and a new,
internal, hdd-based caching-mechanism, making it possible to spider even very huge websites. Even
though some methods and options were renamed and redone, phpcrawl 0.8 should be fully compatible
with older versions.
The changes in detail:
* Code was completely refactored, ported to PHP5-OO-code and a lot of code was rewritten.
* Added the ability to use multiple processes to spider a website. Method "goMultiProcessed()"
added.
* New overridable method "initChildProcess()" added for initiating child-processes when using the
crawler in multi-process-mode.
* Implemented an alternative, internal SQLite caching-mechanism for URLs, making it possible to spider
very large websites. Method "setUrlCacheType()" added.
* New method setWorkingDirectory() added for manually defining the location of the crawler's temporary
working-directory. Therefore, method "setTmpFile()" is marked as deprecated (has no function
anymore).
* New method "addContentTypeReceiveRule()" replaces the old method "addReceiveContentType()".
The function "addReceiveContentType()" still is present, but was marked as deprecated.
* New method "addURLFollowRule()" replaces the old method "addFollowMatch()".
The function "addFollowMatch()" still is present, but was marked as deprecated.
* New method "addURLFilterRule()" replaces the old method "addNonFollowMatch()".
The function "addNonFollowMatch()" still is present, but was marked as deprecated.
* The crawler now is able to parse and obey "nofollow"-tags defined as meta-tags or rel-options
(like "<a href="page.html" rel="nofollow">"). Method "obeyNoFollowTags()" added.
* Overridable user-method "handleDocumentInfo()" replaces the old method "handlePageData()".
Method "handlePageData()" still exists, but was marked as deprecated.
The new method now provides information about a document as a PHPCrawlerDocumentInfo-object instead of an array.
* New method "enableAggressiveLinkSearch()" replaces the old method "setAggressiveLinkExtraction()".
The function "setAggressiveLinkExtraction()" still is present, but was marked as deprecated.
* New method "setLinkExtractionTags()" replaces the old method "addLinkExtractionTags()" (since it
was named wrong; it didn't add tags, it overwrites the tag-list).
The function "addLinkExtractionTags()" still is present, but was marked as deprecated.
* New method "addStreamToFileContentType()" replaces the old method "addReceiveToTmpFileMatch()".
The function "addReceiveToTmpFileMatch()" still is present, but was marked as deprecated.
* New method "enableCookieHandling()" replaces the old method "setCookieHandling()".
The function "setCookieHandling()" still is present, but was marked as deprecated.
* It's possible now to use a proxy-server for spidering websites. Method "setProxy()" added.
* Method "addReceiveToMemoryMatch()" has no function anymore and is marked as deprecated (since it
was redundant). Users should use addStreamToFileContentType() now instead.
* Method "disableExtendedLinkInfo()" has no function anymore and is marked as deprecated.
To reduce the memory-usage of the crawler users now should use the internal SQLite-urlcache
(by calling "setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE)").
* New method "getProcessReport()" replaces the old method "getReport()".
The function "getReport()" still is present, but was marked as deprecated.
* Method "setFollowRedirectsTillContent()" has no function anymore and was marked as deprecated
(since it completely interfered with other settings and rules that can be applied to the crawler).
* New overridable method "handleHeaderInfo()" added. This method will be called after the header
of a document was received and BEFORE the content will be received. Gives the user the opportunity
to abort the request (based on the http-header the server responded with).
* Added some more information about found documents to the array (respectively PHPCrawlerDocumentInfo-object)
that is passed to the overridable user-function handleDocumentInfo(): cookies (all cookies the server sent
with the document), responseHeader (the header the webserver responded, as a PHPCrawlerResponseHeader-object),
error_occured (flag indicating whether an error occured during the request), data_transfer_rate (average
data-transfer-rate for receiving the document), data_transfer_time (time it took to receive the document)
and meta_attributes (all meta-tag attributes found in the source of the document).
* Added the ability to send post-data with requests for defined URLs, method "addPostData()" added.
* Fixed bug 3368719: When creating two (or more) instances of the crawler and obeyRobotsTxt() was
set to true in both of them, the second call of the go()-method caused a fatal error.
* Fixed bug 3368722: Flag "file_limit_reached" in the array that getReport() returns wasn't set to
true when a page-limit (setPageLimit()) was hit.
* Fixed bug 3389965: phpcrawl now shouldn't throw any notices and warnings anymore when error_reporting
is set to E_STRICT or E_ALL.
* Fixed bug 3413627: Links like "<a href="?page=3">" get recognized and rebuilt correctly now.
* Fixed bug 3465964: Robots.txt-documents get parsed correctly now, even if the "User-agent"-directive
in robots.txt-files is written like "user-agent" or "User-Agent".
* Fixed bug 3485155: Links like "test.htm?redirect=http://www.foo.ie&a=b" get recognized and rebuilt
correctly now.
* Fixed bug 3504517: Requests for links containing special characters (like "äöü") sometimes didn't work
as expected due to wrong character-encoding (server responded 404 NOT FOUND).
* Fixed bug 3510270: The array "links_found" of the array $page_data passed to the function
handlePageData() never contained a linktext.
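The reworked 0.80 API described above can be sketched like this (an illustrative example, not part of the release; the "MyCrawler" class, the URL, the size threshold and the process count are made up, method and class names follow the entries above):

```php
<?php
require_once("libs/PHPCrawler.class.php");

class MyCrawler extends PHPCrawler
{
  // Replaces the deprecated handlePageData(); receives a
  // PHPCrawlerDocumentInfo-object instead of an array
  function handleDocumentInfo($DocInfo)
  {
    echo $DocInfo->url." -> HTTP ".$DocInfo->http_status_code."\n";
  }

  // New in 0.80: inspect the response-header BEFORE the content is
  // received; a negative return-value aborts the request
  function handleHeaderInfo(PHPCrawlerResponseHeader $header)
  {
    if ($header->content_length > 10 * 1024 * 1024) return -1;
    return 1;
  }
}

$crawler = new MyCrawler();
$crawler->setURL("www.example.com");

// Use the internal SQLite-urlcache to keep memory-usage low on big sites
$crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);

// Obey rel="nofollow" options and nofollow-metatags
$crawler->obeyNoFollowTags(true);

// Spider the site with 5 processes instead of calling go()
$crawler->goMultiProcessed(5);
```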
Version 0.71
2011/06/30
* Bugfix: Empty links found in documents (like <a href="">) are rebuilt correctly now
so that they lead to the same document they were found in.
* Bugfix: It's possible now to initiate more than one instance of the phpcrawler-class
without getting a "Fatal error: Call to undefined method stdClass::receivePage()" error.
* A new method "setLinkExtractionTags()" replaces the "addLinkExtractionTags()" method.
The old method addLinkExtractionTags() is named wrong since it doesn't ADD new tags, it
OVERWRITES the tag-list.
addLinkExtractionTags() still exists for compatibility-reasons, but was marked as deprecated.
* Bugfix: Links containing spaces like <a href="any file.html"> are recognized and
processed correctly now.
* Bugfix: Links containing quotes like <a href="any'file.html"> are recognized and
processed correctly now.
* The search-patterns used for aggressive link-extraction (if setAggressiveLinkExtraction()
is set to TRUE) were redone. They should give some better results now.
* Bugfix: Phpcrawl doesn't crash anymore with a segmentation fault when parsing links that
are laid over very long text or html
(like <a href="foo.htm"> ... very very long text goes here ...</a>)
* Method addLinkSearchContentType() added.
It's possible now to manually define what kind of documents should get parsed in order to
find links in them (according to their content/mime-type).
Before (and still by default) only documents of type "text/html" get checked for links.
* Bugfix: Links containing linebreaks are recognized and processed correctly now.
* Default value for the socket-stream-timeout increased to 5 seconds.
Although this can be set manually by using the setStreamTimeout()-method, the default value of 2 seconds
was a little too low.
* Fixed some minor bugs in the test-interface and updated the included example-setup (since
the old one didn't work anymore because of site-changes over at php.net)
Version 0.70
2007/01/05
* New function setLinkPriority() added. It's possible now to set different
priorities (priority-levels) for different links.
* Found links are cached now in a different way so that the general
performance was improved, especially when crawling a lot of pages (big sites)
in one process.
* Added support for Secure Sockets Layer (SSL), so links like "https://.."
will be followed correctly now.
Please note that the PHP OpenSSL-extension is required for it to work.
* The link-extraction and other parts were redone and should give better results
now.
* Methods setAggressiveLinkExtraction() and addLinkExtractionTag() added for
setting up the link extraction manually.
* Added support for robots.txt-files, method obeyRobotsTxt() added.
* More information about found pages and files will be passed now to the method
handlePageData(), especially information about all found links in the current
page is available now. Method disableExtendedLinkInfo() added.
* The crawler now is able to handle links like "http://www.foo.com:123" correctly
and will send requests to the correct port. Method setPort() added.
* The content of pages and files now can be received/streamed directly to a temporary
file instead of to memory, so now it shouldn't be a problem anymore to let the
crawler receive big files.
Methods setTmpFile(), addReceiveToTmpFileMatch() and addReceiveToMemoryMatch() added.
* Added support for basic authentication. Now the crawler is able to crawl protected
content. Method addBasicAuthentication() added.
* Now it's possible to abort the crawling-process by letting the overridable function
handlePageData() return any negative value.
* Method setUserAgentString() added for setting the user-agent-string in
request-headers the crawler sends.
* An html-testinterface for testing the different setups of the crawler is included in
the package now.
* The crawler doesn't do DNS-lookups anymore for every single page-request; a lookup is only
done once when the host changes. This improves performance a little bit.
* The crawler doesn't look for links anymore in every file it finds, only
"text/html"-content will be checked for links.
* Fixed problem with rebuilding links like "*foo".
* Fixed problem with splitting/rebuilding URLs like
"http://foo.bar.com/?http://foo/bar.html".
* Fixed problem that the crawler handled e.g. "foo.com" and "www.foo.com" as different
hosts.
Version 0.65_pl2
2005/08/22
* Phpcrawl now doesn't throw any notices anymore when error-reporting
is set to E_ALL. Thanks to elitec0der!
* Also there shouldn't be any notices anymore when "allow_call_time_pass_reference"
is set to OFF. Thanks philipx!
* Fixed a bug that appeared when running the crawler in followmode 3.
(The crawler never went "path up" anymore)
* Now the crawler sends the referer-header with requests correctly again.
Version 0.65_pl1 (patchlevel 1)
2004/08/06
Just a few bugs were fixed in this version:
(Yes, it took a time, sorry)
* The crawler doesn't send an empty "Cookie:" header anymore if there's no cookie to send.
(A few webservers returned a "bad request" header to this) Thanks Mark!
* The crawler doesn't send one and the same cookie-header several times with a request
anymore if it was set more than one time by the server. (Didn't really matter though)
* Crawler will find links in metatags used with single quotation marks (') like
<META HTTP-EQUIV='refresh' CONTENT='foo; URL=http://bar/'> now. Thanks Arsa!
* HTTP 1.0 requests will be sent now because of problems with HTTP 1.1 headers and
chunked content. Thanks Arsa again!