match: bound the hash_search() chain walk (fixes #217) by dr-who · Pull Request #1017 · RsyncProject/rsync

dr-who · 2026-06-30T19:30:04Z

Problem (#217)

hash_search() walks the entire hash-table chain for the current rolling checksum at every byte offset of the source file. Disk and VM images contain large runs of identical blocks, so a single weak checksum (get_checksum1) can collide thousands of times and pile every one of those blocks onto one chain. When the sender then rolls across a region whose weak checksum keeps landing on that chain without ever producing a strong-checksum match, it re-walks the whole chain for every byte, giving O(file_size × chain_length) behaviour.

The result is rsync sitting at 100% CPU for hours with no apparent progress: the long-standing "rsync hangs on large files / qcow2 / VM images" reports in #217. This behaviour matches the diagnostics in that thread: ltrace showing endless memcmp inside hash_search(), gdb stuck in hash_search(), and instrumentation showing the offset advancing by 1 byte while the inner loop runs ~14644 times (one very common weak checksum appearing ~14643× in the basis).

Fix

Cap the number of same-weak-checksum candidates examined per offset at MAX_CHAIN_LEN (1024, overridable #define). Once the cap is hit we treat the offset as a non-match and roll forward a byte.

Key properties:

Always correct. Any block skipped by the cap is simply sent as literal data, never corrupted. Only the transfer size is marginally affected, and only in the pathological case.
Protocol-compatible. This is purely a sender-side search limit — it changes no checksum, emitted byte, or protocol field — so a capped sender interoperates with an unmodified receiver and vice versa.
No effect on normal data. Well-distributed checksums never reach a chain length of 1024, so ordinary transfers are byte-for-byte and stat-for-stat identical (verified across small/big/sparse files, with and without --inplace).
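The cap described above can be illustrated with a small Python model. This is a sketch under stated assumptions, not the patched C code: `build_table`, `search_offset` and the `weak`/`strong` callables are hypothetical names introduced here, and only `MAX_CHAIN_LEN = 1024` comes from the description.

```python
from collections import defaultdict

MAX_CHAIN_LEN = 1024  # cap value quoted in the PR description


def build_table(basis_blocks, weak, strong):
    """Hypothetical helper: map weak checksum -> chain of (index, strong checksum)."""
    table = defaultdict(list)
    for idx, block in enumerate(basis_blocks):
        table[weak(block)].append((idx, strong(block)))
    return table


def search_offset(table, window, weak, strong, cap=MAX_CHAIN_LEN):
    """Model of the search at one offset.

    Returns (matching block index or None, number of candidates examined).
    With cap=None the whole chain is walked, as in the unbounded behaviour.
    """
    examined = 0
    want = None
    for idx, block_strong in table.get(weak(window), ()):
        if cap is not None and examined >= cap:
            # Cap hit: give up on this offset and report a non-match.
            # The caller would then emit literal data and roll forward a byte.
            break
        examined += 1
        if want is None:
            want = strong(window)  # strong checksum computed only when needed
        if block_strong == want:
            return idx, examined
    return None, examined
```

In this model, a window that matches nothing on a 5000-entry chain costs 5000 examinations with `cap=None` and 1024 with the default cap. A chain shorter than the cap is walked in full either way, which is why ordinary data is unaffected.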

Numbers

On a synthetic 40000-block basis whose blocks all share one weak checksum, syncing a 60 KB source dropped from ~18.4s to ~0.7s. Because the chain length itself scales with the size of the basis, the unbounded cost grows roughly with the square of the file size, which is what produced the multi-hour hangs on real multi-GB images.
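As a rough back-of-envelope check on these figures, the count of candidate examinations can be modelled as offsets × min(chain length, cap). The sketch below assumes the worst case, where every offset lands on the colliding chain, and reads 60 KB as 60 × 1024 bytes. Both are assumptions for illustration, not statements from the benchmark, and `candidate_examinations` is a hypothetical helper.

```python
def candidate_examinations(offsets, chain_length, cap=None):
    """Worst-case model: every offset walks the chain, up to the cap if one is set."""
    per_offset = chain_length if cap is None else min(chain_length, cap)
    return offsets * per_offset


offsets = 60 * 1024          # assumed byte count of the 60 KB source
chain_length = 40_000        # blocks sharing one weak checksum

unbounded = candidate_examinations(offsets, chain_length)
capped = candidate_examinations(offsets, chain_length, cap=1024)
ratio = unbounded / capped   # 40000 / 1024

# Doubling both the source and the basis quadruples the unbounded count,
# but only doubles the capped one.
unbounded_2x = candidate_examinations(2 * offsets, 2 * chain_length)
capped_2x = candidate_examinations(2 * offsets, 2 * chain_length, cap=1024)
```

Under this model the cap reduces the examinations by a factor of about 39. The measured improvement (~18.4s to ~0.7s, roughly 26×) is of the same order, which is plausible given that the model ignores all work other than the chain walk.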

Test

testsuite/hashsearch-chain_test.py reproduces the pathology with a tiny basis of weak-checksum-colliding decoy blocks (an all-C block perturbed by (+1,-2,+1), which preserves get_checksum1 but changes the strong checksum). Rather than measure wall-clock time, it asserts on the existing false_alarms counter (--debug=deltasum1): with the cap, false_alarms / hash_hits stays bounded; without it, the ratio equals the full chain length. The assertion is exact and machine-independent. The test passes with this change and fails without it, because the chain walk is then unbounded.
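The reason the (+1,-2,+1) perturbation preserves the weak checksum can be checked with a simplified model. The sketch below is not the test's code: `weak_checksum` is a plain-Python approximation of the rolling checksum (assuming a zero character offset and unsigned bytes), `make_decoy` is a hypothetical helper, the block is assumed to be filled with the byte `C`, and MD5 stands in for whichever strong checksum is negotiated.

```python
import hashlib


def weak_checksum(buf: bytes) -> int:
    """Simplified model of the rolling weak checksum.

    s1 is the byte sum and s2 the sum of the running s1 values; the result
    packs the low 16 bits of each. Assumes a zero character offset.
    """
    s1 = s2 = 0
    for b in buf:
        s1 = (s1 + b) & 0xFFFFFFFF
        s2 = (s2 + s1) & 0xFFFFFFFF
    return (s1 & 0xFFFF) | ((s2 & 0xFFFF) << 16)


def make_decoy(block: bytes, pos: int) -> bytes:
    """Apply (+1, -2, +1) to three adjacent bytes starting at pos.

    The deltas sum to zero, so s1 is unchanged; their position-weighted
    sum is also zero, so s2 is unchanged as well.
    """
    out = bytearray(block)
    out[pos] += 1
    out[pos + 1] -= 2
    out[pos + 2] += 1
    return bytes(out)


base = b"C" * 700
decoys = [make_decoy(base, pos) for pos in range(len(base) - 2)]

same_weak = all(weak_checksum(d) == weak_checksum(base) for d in decoys)
distinct_strong = len({hashlib.md5(d).digest() for d in decoys}) == len(decoys)
```

Every decoy differs from the others in content, and therefore in strong checksum, yet all share one weak checksum, so they all land on a single chain.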

Full suite: 107 passed, 6 skipped, 0 failed.


dr-who wants to merge 1 commit into RsyncProject:master from dr-who:fix-217-hash-chain-cap
