match: bound the hash_search() chain walk (fixes #217)#1017
Open
dr-who wants to merge 1 commit into
Open
Conversation
hash_search() walks the entire hash-table chain for the current rolling checksum at every byte offset of the source file. Disk and VM images contain large runs of identical blocks, so a single weak checksum (get_checksum1) can collide thousands of times and pile every one of those blocks onto one chain. When the sender then rolls across a region whose weak checksum keeps landing on that chain without ever producing a strong-checksum match, it re-walks the whole chain for every byte, giving O(file_size * chain_length) behaviour. The result is rsync sitting at 100% CPU for hours with no apparent progress -- the long-standing "rsync hangs on large files" reports. Cap the number of same-weak-checksum candidates examined per offset at MAX_CHAIN_LEN. Once the cap is hit we treat the offset as a non-match and roll forward a byte; any block skipped this way is simply sent as literal data, so the transferred result is always correct -- only the transfer size is marginally affected. This is purely a sender-side search limit: it changes no checksum, emitted byte, or protocol field, so a capped sender interoperates with an unmodified receiver and vice versa. On a synthetic 40000-block basis sharing one weak checksum, syncing a 60KB source dropped from ~18.4s to ~0.7s; the unbounded cost grows with the square of the file size. testsuite/hashsearch-chain_test.py reproduces the pathology with a tiny basis of weak-checksum-colliding decoy blocks and asserts, via the existing false_alarms counter (--debug=deltasum1), that the per-hash-hit chain walk stays bounded. The assertion is exact and machine-independent rather than timing-based.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem (#217)
hash_search()walks the entire hash-table chain for the current rolling checksum at every byte offset of the source file. Disk and VM images contain large runs of identical blocks, so a single weak checksum (get_checksum1) can collide thousands of times and pile every one of those blocks onto one chain. When the sender then rolls across a region whose weak checksum keeps landing on that chain without ever producing a strong-checksum match, it re-walks the whole chain for every byte, giving O(file_size × chain_length) behaviour.The result is rsync sitting at 100% CPU for hours with no apparent progress — the long-standing "rsync hangs on large files / qcow2 / VM images" reports in #217. It matches the diagnostics in that thread: ltrace showing endless
memcmpinsidehash_search(), gdb stuck inhash_search(), and instrumentation showing the offset advancing by 1 byte while the inner loop runs ~14644 times (one very common weak checksum appearing ~14643× in the basis).Fix
Cap the number of same-weak-checksum candidates examined per offset at
MAX_CHAIN_LEN(1024, overridable#define). Once the cap is hit we treat the offset as a non-match and roll forward a byte.Key properties:
--inplace).Numbers
On a synthetic 40000-block basis whose blocks all share one weak checksum, syncing a 60 KB source dropped from ~18.4s to ~0.7s; the unbounded cost grows with the square of the file size, which is what produced the multi-hour hangs on real multi-GB images.
Test
testsuite/hashsearch-chain_test.pyreproduces the pathology with a tiny basis of weak-checksum-colliding decoy blocks (an all-Cblock perturbed by(+1,-2,+1), which preservesget_checksum1but changes the strong checksum). Rather than measure wall-clock time, it asserts on the existingfalse_alarmscounter (--debug=deltasum1): with the cap,false_alarms / hash_hitsstays bounded; without it, the ratio equals the full chain length. The assertion is exact and machine-independent. The test passes with this change and fails (the unbounded walk) without it.Full suite: 107 passed, 6 skipped, 0 failed.