match: bound the hash_search() chain walk (fixes #217) #1017

Open
dr-who wants to merge 1 commit into RsyncProject:master from dr-who:fix-217-hash-chain-cap

dr-who commented Jun 30, 2026

Problem (#217)

hash_search() walks the entire hash-table chain for the current rolling checksum at every byte offset of the source file. Disk and VM images contain large runs of identical blocks, so a single weak checksum (get_checksum1) can collide thousands of times and pile every one of those blocks onto one chain. When the sender then rolls across a region whose weak checksum keeps landing on that chain without ever producing a strong-checksum match, it re-walks the whole chain for every byte, giving O(file_size × chain_length) behaviour.

The result is rsync sitting at 100% CPU for hours with no apparent progress: the long-standing "rsync hangs on large files / qcow2 / VM images" reports in #217. This matches the diagnostics in that thread: ltrace showing endless memcmp inside hash_search(), gdb stuck in hash_search(), and instrumentation showing the offset advancing by 1 byte while the inner loop runs ~14644 times (one very common weak checksum appearing ~14643× in the basis).

Fix

Cap the number of same-weak-checksum candidates examined per offset at MAX_CHAIN_LEN (1024 by default, an overridable #define). Once the cap is hit, the offset is treated as a non-match and the search rolls forward by one byte.

Key properties:

  • Always correct. Any block skipped by the cap is simply sent as literal data, never corrupted. Only the transfer size is marginally affected, and only in the pathological case.
  • Protocol-compatible. This is purely a sender-side search limit — it changes no checksum, emitted byte, or protocol field — so a capped sender interoperates with an unmodified receiver and vice versa.
  • No effect on normal data. Well-distributed checksums never reach a chain length of 1024, so ordinary transfers are byte-for-byte and stat-for-stat identical (verified across small/big/sparse files, with and without --inplace).
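
The capped walk described above can be modelled in a few lines of Python. This is a toy sketch, not the C code in this PR: `build_chains`, `search_offset`, `total_work` and the `strong_matches` callback are hypothetical names, and the exact comparison used at the cap boundary is an assumption. Only the value 1024 comes from the description.

```python
from collections import defaultdict

MAX_CHAIN_LEN = 1024  # value quoted in the description; the rest is illustrative


def build_chains(weak_sums):
    """Map each weak checksum to the basis block indices that share it."""
    chains = defaultdict(list)
    for block_index, weak in enumerate(weak_sums):
        chains[weak].append(block_index)
    return chains


def search_offset(chains, weak, strong_matches, max_chain_len=None):
    """Walk the chain for one source offset.

    Returns (matched_block_or_None, candidates_examined). `strong_matches`
    is a hypothetical callback standing in for the strong-checksum compare.
    """
    examined = 0
    for block_index in chains.get(weak, ()):
        if max_chain_len is not None and examined >= max_chain_len:
            break  # cap hit: treat the offset as a non-match
        examined += 1
        if strong_matches(block_index):
            return block_index, examined
    return None, examined


def total_work(chains, source_weak_sums, max_chain_len=None):
    """Count candidates examined over all offsets when nothing ever matches."""
    def never(block_index):
        return False

    return sum(
        search_offset(chains, weak, never, max_chain_len)[1]
        for weak in source_weak_sums
    )
```

With a 5000-block chain and 100 source offsets that all land on it, `total_work` reports 500000 candidates without a cap and 102400 with `MAX_CHAIN_LEN`. A block sitting beyond position 1024 in the chain is not found by the capped search, which corresponds to the "sent as literal data" case in the first bullet.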

Numbers

On a synthetic 40000-block basis whose blocks all share one weak checksum, syncing a 60 KB source dropped from ~18.4s to ~0.7s; the unbounded cost grows with the square of the file size, which is what produced the multi-hour hangs on real multi-GB images.
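
As a rough sanity check on those timings, the number of candidate comparisons can be estimated directly from the figures above. The sketch assumes every source offset lands on the colliding chain and reads "60 KB" as 60,000 bytes; both are assumptions, and `candidate_checks` is a hypothetical helper, so treat the result as an upper bound rather than a measurement.

```python
MAX_CHAIN_LEN = 1024      # cap quoted in the description
BASIS_BLOCKS = 40_000     # blocks sharing one weak checksum (from the benchmark)
SOURCE_BYTES = 60 * 1000  # "60 KB" read as decimal kilobytes: an assumption


def candidate_checks(offsets, chain_length, cap=None):
    """Upper bound on candidates examined if every offset hits the chain."""
    per_offset = chain_length if cap is None else min(chain_length, cap)
    return offsets * per_offset


unbounded = candidate_checks(SOURCE_BYTES, BASIS_BLOCKS)
capped = candidate_checks(SOURCE_BYTES, BASIS_BLOCKS, MAX_CHAIN_LEN)

print(f"unbounded: {unbounded:,} candidate checks")
print(f"capped:    {capped:,} candidate checks")
print(f"ratio:     {unbounded / capped:.1f}x")
```

Under these assumptions the model gives about 2.4 billion checks unbounded against about 61 million capped, a ratio near 39×. That is the same order of magnitude as the measured ~18.4s to ~0.7s (roughly 26×); fixed per-transfer costs are not modelled here.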

Test

testsuite/hashsearch-chain_test.py reproduces the pathology with a tiny basis of weak-checksum-colliding decoy blocks (an all-C block perturbed by (+1,-2,+1), which preserves get_checksum1 but changes the strong checksum). Rather than measure wall-clock time, it asserts on the existing false_alarms counter (--debug=deltasum1): with the cap, false_alarms / hash_hits stays bounded; without it, the ratio equals the full chain length. The assertion is exact and machine-independent. The test passes with this change and fails without it, because the chain walk is then unbounded.
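
The decoy construction can be checked in isolation. The sketch below is not the test script from this PR: `weak_checksum` is a simplified model of an Adler-style rolling sum (a byte sum plus a sum of running sums, each kept to 16 bits), not a byte-exact port of get_checksum1, and SHA-256 stands in for whichever strong checksum rsync negotiates. It also assumes that "all-C block" means a block filled with the byte `C` and that the (+1,-2,+1) perturbation applies to three adjacent bytes.

```python
import hashlib


def weak_checksum(block: bytes) -> int:
    """Simplified Adler-style weak sum: s1 = byte sum, s2 = sum of running s1."""
    s1 = 0
    s2 = 0
    for byte in block:
        s1 = (s1 + byte) & 0xFFFF
        s2 = (s2 + s1) & 0xFFFF
    return s1 | (s2 << 16)


def make_decoy(block: bytes, pos: int) -> bytes:
    """Perturb three adjacent bytes by (+1, -2, +1) starting at pos."""
    out = bytearray(block)
    out[pos] += 1
    out[pos + 1] -= 2
    out[pos + 2] += 1
    return bytes(out)


def strong_checksum(block: bytes) -> str:
    """Stand-in strong checksum; the real algorithm depends on the rsync build."""
    return hashlib.sha256(block).hexdigest()


base = b"C" * 700  # assumed reading of "all-C block"; 700 is an arbitrary size
decoys = [make_decoy(base, pos) for pos in range(0, 697, 3)]

same_weak = all(weak_checksum(d) == weak_checksum(base) for d in decoys)
distinct_strong = len({strong_checksum(d) for d in decoys}) == len(decoys)
print(same_weak, distinct_strong)
```

The perturbation works because the three deltas sum to zero, which leaves s1 unchanged, and because their position-weighted sum w - 2(w-1) + (w-2) is also zero, which leaves s2 unchanged. Each decoy still differs in content, so its strong checksum differs.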

Full suite: 107 passed, 6 skipped, 0 failed.

hash_search() walks the entire hash-table chain for the current rolling
checksum at every byte offset of the source file. Disk and VM images
contain large runs of identical blocks, so a single weak checksum
(get_checksum1) can collide thousands of times and pile every one of
those blocks onto one chain. When the sender then rolls across a region
whose weak checksum keeps landing on that chain without ever producing a
strong-checksum match, it re-walks the whole chain for every byte, giving
O(file_size * chain_length) behaviour. The result is rsync sitting at
100% CPU for hours with no apparent progress -- the long-standing "rsync
hangs on large files" reports.

Cap the number of same-weak-checksum candidates examined per offset at
MAX_CHAIN_LEN. Once the cap is hit we treat the offset as a non-match and
roll forward a byte; any block skipped this way is simply sent as literal
data, so the transferred result is always correct -- only the transfer
size is marginally affected. This is purely a sender-side search limit:
it changes no checksum, emitted byte, or protocol field, so a capped
sender interoperates with an unmodified receiver and vice versa.

On a synthetic 40000-block basis sharing one weak checksum, syncing a
60KB source dropped from ~18.4s to ~0.7s; the unbounded cost grows with
the square of the file size.

testsuite/hashsearch-chain_test.py reproduces the pathology with a tiny
basis of weak-checksum-colliding decoy blocks and asserts, via the
existing false_alarms counter (--debug=deltasum1), that the per-hash-hit
chain walk stays bounded. The assertion is exact and machine-independent
rather than timing-based.