Add configurable on-disk value compression (zstd + lz4) #29

Open
christianparpart wants to merge 1 commit into master from feature/on-disk-compression

christianparpart (Member) commented Jul 2, 2026

Cache values written to the persistent CoW-tree are now compressed, shrinking the on-disk footprint and improving throughput for I/O-bound workloads. The codec is chosen per store and defaults to zstd; administrators opt out with --compression none, or pick lz4 for a CPU-minimal codec. Reads always return plaintext — decompression happens only on an L1 cache miss, off the hot path — and each record is tagged with its own codec, so a store may freely mix codecs and changing the setting only affects new writes (no migration).

Compression lives at the value level inside CowTreeStorage (compress in StoreEntry, decompress in LoadEntry), so keys stay cleartext for B+tree navigation, the fixed page-offset layout is untouched, and the L1 cache and wire protocols only ever see the original bytes. A threshold plus a shrink-check keep the zstd default safe: values below --compression-min-bytes, and any value that does not actually get smaller (incompressible or already-compressed data), are stored verbatim under an Identity tag, so compression never enlarges a value or burns CPU pointlessly.
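
The threshold and shrink-check described above can be sketched roughly as follows. This is a minimal illustration, not the CowTreeStorage code: `zlib` from the Python standard library stands in for zstd/lz4, the codec ids are hypothetical, and `store_entry`/`load_entry` only borrow their names from StoreEntry/LoadEntry. The 256-byte default is the --compression-min-bytes default quoted in the findings.

```python
import os
import zlib

IDENTITY, DEFLATE = 0, 1  # hypothetical codec ids; zlib stands in for zstd/lz4


def store_entry(value: bytes, min_bytes: int = 256) -> tuple[int, int, bytes]:
    """Return (codec_id, original_length, payload) for one value."""
    if len(value) < min_bytes:
        return IDENTITY, len(value), value   # below threshold: no attempt at all
    packed = zlib.compress(value, 3)
    if len(packed) >= len(value):
        return IDENTITY, len(value), value   # shrink-check failed: store verbatim
    return DEFLATE, len(value), packed


def load_entry(codec_id: int, original_length: int, payload: bytes) -> bytes:
    """Return the plaintext and verify it against the recorded length."""
    data = payload if codec_id == IDENTITY else zlib.decompress(payload)
    if len(data) != original_length:
        raise ValueError("length mismatch: corrupt record")
    return data


text = b"compile-cache " * 1024   # highly compressible
noise = os.urandom(16 * 1024)     # effectively incompressible
for value in (b"tiny", text, noise):
    codec_id, original_length, payload = store_entry(value)
    assert len(payload) <= len(value)   # the payload is never larger than the value
    assert load_entry(codec_id, original_length, payload) == value
```

Because the tag travels with each record, a reader never needs to know which codec the store is currently configured with.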

Changes

  • New Core/Compression codec registry (stateless, modelled on Core/Crc32c): a descriptor table drives name↔id, availability, and the compress/decompress dispatch, so adding a codec is a single table row.
  • CowTreeStorage leaf-record format bumped v3→v4 with a per-entry codec id + original length; StoreEntry compresses once (threshold + shrink-check) and LoadEntry decompresses to plaintext. Metadata-only Touch/MarkStale preserve the codec without recompressing.
  • Old, pre-v4 stores are detected via a version sentinel and refused on open rather than mis-parsed (detect-and-bail; a migrator is out of scope, and dovetails with the existing "expose on-disk format version" TODO).
  • Config plumbing following the existing StorageDurability pattern: --compression, --compression-level, and --compression-min-bytes across CLI, YAML, ConfigMerge, and the startup banner.
  • Gated behind FASTCACHED_ENABLE_COMPRESSION (ON by default; the standard build fetches/links zstd + lz4 via CPM, preferring a system package). Building with it OFF degrades the codec table to Identity-only and makes selecting zstd/lz4 a clean startup error, linking no compression symbols.
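
As a rough sketch of the descriptor-table idea from the first bullet, and of how an OFF build could turn a codec selection into a startup error, consider the following. Everything here is an assumption made for illustration: the field names, the `COMPRESSION_ENABLED` constant (standing in for the FASTCACHED_ENABLE_COMPRESSION build flag), and the use of `zlib` in place of the real zstd/lz4 bindings.

```python
import zlib
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class CodecDescriptor:
    codec_id: int
    name: str
    available: bool
    compress: Callable[[bytes, int], bytes]
    decompress: Callable[[bytes], bytes]


COMPRESSION_ENABLED = True  # hypothetical stand-in for the build flag

# One row per codec; adding a codec means adding one row here.
CODECS = (
    CodecDescriptor(0, "none", True, lambda data, level: data, lambda data: data),
    CodecDescriptor(1, "zlib-standin", COMPRESSION_ENABLED, zlib.compress, zlib.decompress),
)

BY_NAME = {codec.name: codec for codec in CODECS}
BY_ID = {codec.codec_id: codec for codec in CODECS}


def select_codec(name: str) -> CodecDescriptor:
    """Resolve a configured codec name, failing cleanly if it cannot be used."""
    codec = BY_NAME.get(name)
    if codec is None:
        raise ValueError(f"unknown codec: {name!r}")
    if not codec.available:
        raise ValueError(f"codec {name!r} is not compiled into this build")
    return codec


codec = select_codec("zlib-standin")
packed = codec.compress(b"abc" * 1000, 3)
# Reads dispatch on the id stored with the record, not on the configured name.
assert BY_ID[codec.codec_id].decompress(packed) == b"abc" * 1000
```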

Performance: writes compress once and are fsync-bound in the default Batched durability, so lz4 and zstd level 3 add negligible wall-clock; Append/Prepend are read-modify-write and so decompress+recompress. End-to-end, a 200 KiB compressible value shrinks the store from 278 KiB to 64 KiB (~4.25×) with an identical client-visible round-trip, verified across a restart.


Performance comparison

Performance comparison: master vs compression modes vs value size

Setup. Storage-layer micro-benchmark driving CowTreeStorage Set/Get
directly (isolates the codec from network/protocol noise), plus an end-to-end
sanity pass over the memcached text protocol against the running daemon.
Release builds (clang-release, -O2 -DNDEBUG, no sanitizers), pinned to one
core, durability=batched (default), single shard. master = origin/master
(no compression code); the other three are this branch with the named
--compression. Backing store on tmpfs, so throughput reflects CPU + fsync
cost rather than disk-seek latency — on real disk the compression throughput
win is larger (fewer bytes through the actual I/O path). On-disk footprint is
filesystem-independent. Machine: i9-14900HX.

Storage micro-benchmark — compressible data

| value | codec | set MiB/s | get MiB/s | set ops/s | get ops/s | on-disk | ratio |
|---|---|---|---|---|---|---|---|
| 64 B | master | 0.4 | 0.8 | 6,223 | 13,152 | 271 MiB | 0.04× |
| 64 B | none | 0.4 | 0.8 | 6,235 | 13,143 | 276 MiB | 0.04× |
| 64 B | lz4 | 0.4 | 0.8 | 6,094 | 12,978 | 276 MiB | 0.04× |
| 64 B | zstd | 0.4 | 0.8 | 6,177 | 13,240 | 276 MiB | 0.04× |
| 1 KiB | master | 6 | 13 | 6,107 | 13,293 | 157 MiB | 0.62× |
| 1 KiB | none | 6 | 13 | 6,117 | 13,337 | 157 MiB | 0.62× |
| 1 KiB | lz4 | 7 | 13 | 6,980 | 13,489 | 129 MiB | 0.76× |
| 1 KiB | zstd | 7 | 13 | 6,839 | 13,334 | 131 MiB | 0.75× |
| 16 KiB | master | 110 | 200 | 7,045 | 12,825 | 634 MiB | 0.49× |
| 16 KiB | none | 111 | 202 | 7,088 | 12,916 | 634 MiB | 0.49× |
| 16 KiB | lz4 | 138 | 297 | 8,863 | 18,980 | 16.5 MiB | 18.9× |
| 16 KiB | zstd | 140 | 300 | 8,994 | 19,224 | 6.6 MiB | 47.4× |
| 256 KiB | master | 457 | 541 | 1,826 | 2,165 | 536 MiB | 0.93× |
| 256 KiB | none | 457 | 542 | 1,828 | 2,166 | 536 MiB | 0.93× |
| 256 KiB | lz4 | 2,054 | 3,117 | 8,218 | 12,467 | 4.9 MiB | 101.6× |
| 256 KiB | zstd | 1,883 | 3,275 | 7,530 | 13,100 | 4.9 MiB | 101.3× |
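
The two throughput columns in this table are tied together by the value size: MiB/s should equal ops/s times the value size. A small check over three rows copied from the table illustrates this (the helper name `mib_per_s` is just for this sketch):

```python
KIB = 1024
MIB = 1024 * 1024


def mib_per_s(ops_per_s: float, value_bytes: int) -> float:
    """Convert an operation rate into payload throughput."""
    return ops_per_s * value_bytes / MIB


# (value size in bytes, set ops/s, reported set MiB/s), copied from the table
rows = [
    (16 * KIB, 7045, 110),     # 16 KiB master
    (256 * KIB, 1826, 457),    # 256 KiB master
    (256 * KIB, 8218, 2054),   # 256 KiB lz4
]
for value_bytes, ops, reported in rows:
    computed = mib_per_s(ops, value_bytes)
    print(f"{computed:8.1f} MiB/s computed vs {reported} MiB/s reported")
```

The computed figures land within rounding of the reported ones, which suggests the MiB/s columns count uncompressed payload bytes rather than bytes written to disk.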

Storage micro-benchmark — random (incompressible) data

| value | codec | set MiB/s | get MiB/s | on-disk | ratio |
|---|---|---|---|---|---|
| 1 KiB | master | 6 | 13 | 157 MiB | 0.62× |
| 1 KiB | zstd | 6 | 13 | 157 MiB | 0.62× |
| 16 KiB | master | 110 | 201 | 634 MiB | 0.49× |
| 16 KiB | zstd | 104 | 201 | 634 MiB | 0.49× |
| 256 KiB | master | 457 | 541 | 536 MiB | 0.93× |
| 256 KiB | lz4 | 446 | 540 | 536 MiB | 0.93× |
| 256 KiB | zstd | 426 | 541 | 536 MiB | 0.93× |

(The remaining random-data rows for none and lz4 are omitted; they are identical to master/zstd.)

End-to-end (daemon, memcached text, single connection) — corroboration

| config | 1 KiB compressible GET ops/s | 256 KiB compressible SET ops/s | 256 KiB compressible GET ops/s | 256 KiB random SET ops/s |
|---|---|---|---|---|
| master | 50,478 | 709 | 1,340 | 595 |
| none | 50,918 | ~580* | ~997* | 555 |
| lz4 | 50,477 | 1,654 | 2,042 | 572 |
| zstd | 111,700 | 1,744 | 2,082 | 600 |

* The single-connection Python driver adds noise at large sizes; the storage
micro-benchmark above is the clean signal (there, none and master match).

Findings

  • Small values (≤ 1 KiB) are commit-bound, not codec-bound. Throughput is
    effectively identical across all four configs: the fsync/commit dominates and
    the codec is in the noise (at 1 KiB the compressed configs are marginally
    ahead, roughly 6,900 vs 6,100 set ops/s). Footprint is dominated by the fixed
    16 KiB page plus per-entry overhead, so the ratio stays below 1 regardless
    (compression cannot beat the page granularity here). This is the case the
    256-byte --compression-min-bytes default and the shrink-check are designed
    for: they avoid spending CPU where compression cannot pay off.
  • Large compressible values are where compression pays, on both axes. At
    256 KiB, zstd/lz4 cut the on-disk footprint ~100× and run Set/Get
    ~4–6× faster than master — fewer bytes to serialize, write, and fsync more
    than repay the compression CPU. At 16 KiB, zstd is 47× smaller and ~1.3–1.5×
    faster.
  • zstd vs lz4: on this text-like data zstd gives a better or equal ratio at
    comparable speed (level 3). lz4 has a slight edge in raw compress throughput
    at 256 KiB (2,054 vs 1,883 MiB/s set) but a worse ratio at 16 KiB (18.9× vs
    47.4×). zstd is the right general-purpose default; lz4 is the pick when CPU is
    the scarce resource.
  • Incompressible data pays essentially nothing. The shrink-check stores
    random values verbatim under Identity, so footprint and read throughput match
    master exactly. The only cost is one wasted compression attempt on the write
    path — visible as ~7% lower Set MiB/s for zstd at 256 KiB random (426 vs 457),
    and negligible for lz4.
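  • The branch adds no throughput overhead when compression is off.
    --compression none tracks master within measurement noise across every size
    and data type, because the new code sits entirely behind the codec check. The
    one visible difference is footprint at 64 B (276 vs 271 MiB), which is
    consistent with the per-entry codec id and original length added in v4.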
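
The headline numbers in these findings can be re-derived from the benchmark rows. The sketch below only redoes the arithmetic on figures copied from the tables; the helper names are made up for this example.

```python
# (set MiB/s, get MiB/s, on-disk MiB) for 256 KiB compressible values
master_256k = (457, 541, 536)
lz4_256k = (2054, 3117, 4.9)
zstd_256k = (1883, 3275, 4.9)


def speedup(codec, base):
    """Return (set speedup, get speedup) of a codec row over a baseline row."""
    return codec[0] / base[0], codec[1] / base[1]


def footprint_reduction(codec, base):
    """Return how many times smaller the codec's store is than the baseline's."""
    return base[2] / codec[2]


print(speedup(lz4_256k, master_256k))                # about 4.5x set, 5.8x get
print(speedup(zstd_256k, master_256k))               # about 4.1x set, 6.1x get
print(footprint_reduction(zstd_256k, master_256k))   # about 109x

# Cost of the wasted compression attempt: zstd vs master, 256 KiB random data
overhead = 1 - 426 / 457
print(f"{overhead:.1%}")                             # 6.8%
```

Note that the footprint reduction here is measured against master's on-disk size, whereas the ratio column in the tables appears to be measured against the logical data size, which is why the two differ slightly.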

sccache-sized values (compile-cache workload)

sccache is the project's stated primary workload, so this section adds a dedicated
sweep over compile-cache-representative sizes and shapes. Two shapes per size:

  • objfile — a raw compiled object file: semi-compressible (repetitive
    symbol/debug strings + zero padding interleaved with incompressible machine
    code), zstd ratio ~1.3–1.7×.
  • precompressed — a blob sccache already compressed itself (its default is
    zstd/lz4): effectively incompressible; exercises the shrink-check.

Same setup as above (release, one core, batched durability, single shard, store
on tmpfs — so on real disk the compression throughput picture is more favourable
than shown, since fewer bytes reach the actual I/O path).

objfile (semi-compressible raw .o)

| value | codec | set MiB/s | get MiB/s | set ops/s | get ops/s | on-disk | ratio |
|---|---|---|---|---|---|---|---|
| 256 B | master | 2 | 3 | 6,241 | 13,352 | 117 MiB | 0.21× |
| 256 B | none | 2 | 3 | 6,279 | 13,361 | 126 MiB | 0.19× |
| 256 B | lz4 | 2 | 3 | 6,263 | 13,359 | 126 MiB | 0.19× |
| 256 B | zstd | 2 | 3 | 6,083 | 13,316 | 126 MiB | 0.19× |
| 8 KiB | master | 58 | 88 | 7,424 | 11,272 | 649 MiB | 0.48× |
| 8 KiB | none | 57 | 87 | 7,273 | 11,193 | 653 MiB | 0.48× |
| 8 KiB | lz4 | 56 | 89 | 7,142 | 11,404 | 653 MiB | 0.48× |
| 8 KiB | zstd | 50 | 84 | 6,460 | 10,816 | 653 MiB | 0.48× |
| 64 KiB | master | 283 | 404 | 4,522 | 6,466 | 632 MiB | 0.79× |
| 64 KiB | none | 283 | 404 | 4,528 | 6,459 | 631 MiB | 0.79× |
| 64 KiB | lz4 | 239 | 444 | 3,824 | 7,097 | 381 MiB | 1.31× |
| 64 KiB | zstd | 178 | 389 | 2,841 | 6,220 | 381 MiB | 1.31× |
| 512 KiB | master | 500 | 569 | 999 | 1,139 | 779 MiB | 0.96× |
| 512 KiB | none | 500 | 572 | 999 | 1,143 | 779 MiB | 0.96× |
| 512 KiB | lz4 | 342 | 668 | 684 | 1,337 | 568 MiB | 1.32× |
| 512 KiB | zstd | 230 | 635 | 461 | 1,271 | 474 MiB | 1.58× |
| 2 MiB | master | 540 | 589 | 270 | 294 | 808 MiB | 0.99× |
| 2 MiB | none | 539 | 587 | 270 | 294 | 808 MiB | 0.99× |
| 2 MiB | lz4 | 356 | 686 | 178 | 343 | 589 MiB | 1.36× |
| 2 MiB | zstd | 236 | 660 | 118 | 330 | 477 MiB | 1.68× |

precompressed (already sccache-compressed)

| value | codec | set MiB/s | get MiB/s | set ops/s | get ops/s | on-disk | ratio |
|---|---|---|---|---|---|---|---|
| 256 B | master | 2 | 3 | 6,262 | 13,387 | 117 MiB | 0.21× |
| 256 B | none | 2 | 3 | 6,272 | 13,355 | 126 MiB | 0.19× |
| 256 B | lz4 | 2 | 3 | 6,209 | 13,271 | 126 MiB | 0.19× |
| 256 B | zstd | 2 | 3 | 6,159 | 13,241 | 126 MiB | 0.19× |
| 8 KiB | master | 58 | 88 | 7,452 | 11,314 | 649 MiB | 0.48× |
| 8 KiB | none | 58 | 88 | 7,363 | 11,241 | 653 MiB | 0.48× |
| 8 KiB | lz4 | 57 | 88 | 7,311 | 11,234 | 653 MiB | 0.48× |
| 8 KiB | zstd | 55 | 88 | 7,049 | 11,273 | 653 MiB | 0.48× |
| 64 KiB | master | 283 | 403 | 4,524 | 6,454 | 632 MiB | 0.79× |
| 64 KiB | none | 284 | 404 | 4,543 | 6,457 | 631 MiB | 0.79× |
| 64 KiB | lz4 | 279 | 404 | 4,463 | 6,464 | 631 MiB | 0.79× |
| 64 KiB | zstd | 266 | 404 | 4,255 | 6,464 | 631 MiB | 0.79× |
| 512 KiB | master | 498 | 571 | 997 | 1,141 | 779 MiB | 0.96× |
| 512 KiB | none | 501 | 572 | 1,002 | 1,143 | 779 MiB | 0.96× |
| 512 KiB | lz4 | 488 | 570 | 975 | 1,140 | 779 MiB | 0.96× |
| 512 KiB | zstd | 464 | 571 | 928 | 1,142 | 779 MiB | 0.96× |
| 2 MiB | master | 540 | 589 | 270 | 295 | 808 MiB | 0.99× |
| 2 MiB | none | 539 | 587 | 270 | 293 | 808 MiB | 0.99× |
| 2 MiB | lz4 | 523 | 586 | 261 | 293 | 808 MiB | 0.99× |
| 2 MiB | zstd | 501 | 586 | 251 | 293 | 808 MiB | 0.99× |

Findings (sccache)

  • Small entries (256 B – 8 KiB) are commit-bound — all four configs
    equal; the 16 KiB page granularity dominates footprint (ratio < 1). The min-
    bytes default + shrink-check keep compression from wasting CPU here.
  • Object files ≥ 64 KiB compress ~1.3–1.7×, a real footprint saving
    (zstd 2 MiB: 1.68×, ~40% smaller on disk). On tmpfs this comes with a write
    CPU cost (zstd 2 MiB set: 236 vs 540 MiB/s for master) because there is no
    disk-write time to reclaim; GET is faster (660 vs 589 MiB/s, fewer bytes
    read). On a real disk the write cost is partly to fully offset by writing
    ~40% fewer bytes.
  • lz4 is the better default for object files: near zstd's ratio (1.36× vs
    1.68× at 2 MiB) at markedly higher write throughput (356 vs 236 MiB/s), so it
    reclaims disk with a smaller CPU hit. zstd wins when space is the priority.
  • Already-compressed sccache blobs cost almost nothing — shrink-check stores
    them verbatim, so footprint and reads equal master; the only cost is one
    wasted compress attempt on write (zstd 2 MiB: 501 vs 540 MiB/s, ~7%; lz4 ~3%).
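  • --compression none and master match in throughput across every sccache size
    and shape, so the feature causes no speed regression when it is off. Footprint
    differs only at the small end (256 B: 126 vs 117 MiB; 8 KiB: 653 vs 649 MiB),
    consistent with the per-entry fields added in v4.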
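
The disk-versus-write-CPU trade-off for object files can be read straight off the 2 MiB objfile rows. The sketch below just redoes that arithmetic; the `tradeoff` helper is a name invented for this example.

```python
# 2 MiB objfile rows from the table: codec -> (set MiB/s, on-disk MiB)
rows = {"master": (540, 808), "lz4": (356, 589), "zstd": (236, 477)}


def tradeoff(codec: str) -> tuple[float, float]:
    """Return (fraction of disk saved, fraction of set throughput lost) vs master."""
    base_set, base_disk = rows["master"]
    set_rate, disk = rows[codec]
    return 1 - disk / base_disk, 1 - set_rate / base_set


for codec in ("lz4", "zstd"):
    saved, slower = tradeoff(codec)
    print(f"{codec}: {saved:.0%} less disk, {slower:.0%} lower set throughput")
# lz4: 27% less disk, 34% lower set throughput
# zstd: 41% less disk, 56% lower set throughput
```

These are tmpfs figures, so the throughput loss shown is the upper bound the findings describe; on a real disk part of it would be offset by the smaller writes.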

Recommendation for sccache deployments: object files compress modestly, so
compression trades write CPU for ~30–40% less disk. Use lz4 when the box
is CPU-constrained or write-heavy (best throughput-per-byte-saved), zstd
when disk is the constraint, and none if the cache stores predominantly
sccache-precompressed blobs (compression can't help those, and the shrink-check
already makes the default safe if left on).
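
One way to encode this recommendation in deployment tooling is a small helper that maps the dominant constraint to the flag value. This is only a sketch: the function and parameter names are hypothetical, and the precedence between the three conditions is an assumption, since the recommendation does not say what to do when several apply at once.

```python
def compression_flags(mostly_precompressed: bool,
                      cpu_or_write_bound: bool,
                      disk_bound: bool) -> list[str]:
    """Map a workload description to --compression arguments (assumed precedence)."""
    if mostly_precompressed:
        return ["--compression", "none"]   # compression cannot help these blobs
    if cpu_or_write_bound:
        return ["--compression", "lz4"]    # cheapest write path
    if disk_bound:
        return ["--compression", "zstd"]   # best ratio on object files
    return []                              # keep the built-in default (zstd)


print(compression_flags(False, True, False))   # ['--compression', 'lz4']
```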

Compress cache values before they hit the persistent CoW-tree, shrinking
disk footprint and improving throughput for I/O-bound workloads. The
codec is chosen per store and defaults to zstd; admins opt out with
`--compression none` or pick `lz4` for a CPU-minimal codec.

Design (value level, not page level):
- Compression lives in CowTreeStorage's encode/decode boundary
  (StoreEntry compresses; LoadEntry decompresses), so keys stay cleartext
  for B+tree navigation, the fixed page-offset layout is untouched, and
  the L1 cache / wire protocols only ever see plaintext.
- Threshold + shrink-check: values below --compression-min-bytes, and any
  value that does not actually get smaller (incompressible / already
  compressed), are stored verbatim under an Identity tag — a zstd default
  never enlarges a value or burns CPU pointlessly.
- Per-entry codec tag + original length in the leaf record (format v3->v4):
  a store may freely mix codecs, changing the config only affects new
  writes, and no migration is ever required. Old-format stores are
  detected via a version sentinel and refused rather than mis-parsed.
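
To make the per-entry tag and the detect-and-bail check concrete, here is a minimal sketch. The byte layouts, the magic value, and all names are hypothetical and chosen only for illustration; the actual v4 leaf-record encoding is not shown in this PR description.

```python
import struct

MAGIC = b"COWT"                        # hypothetical store magic
FORMAT_VERSION = 4                     # first format with per-entry codec tags
STORE_HEADER = struct.Struct("<4sI")   # hypothetical: magic, format version
ENTRY_TAG = struct.Struct("<BI")       # hypothetical: codec id, original length


def is_supported_store(header: bytes) -> bool:
    """True only for a header carrying the expected magic and version."""
    magic, version = STORE_HEADER.unpack_from(header)
    return magic == MAGIC and version == FORMAT_VERSION


def encode_entry(codec_id: int, original_length: int, payload: bytes) -> bytes:
    """Prefix the (possibly compressed) payload with its codec tag."""
    return ENTRY_TAG.pack(codec_id, original_length) + payload


def decode_entry(record: bytes) -> tuple[int, int, bytes]:
    """Split a record back into codec id, original length, and payload."""
    codec_id, original_length = ENTRY_TAG.unpack_from(record)
    return codec_id, original_length, record[ENTRY_TAG.size:]


assert is_supported_store(STORE_HEADER.pack(MAGIC, 4))
assert not is_supported_store(STORE_HEADER.pack(MAGIC, 3))   # old store: refuse to open
assert decode_entry(encode_entry(1, 10, b"xyz")) == (1, 10, b"xyz")
```

Since each record names its own codec, switching the configured codec never invalidates records written earlier.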

Data-driven, DI-respecting:
- New Core/Compression codec registry (stateless, modelled on Core/Crc32c)
  with a descriptor table driving name<->id, availability, and the
  compress/decompress dispatch — adding a codec is one table row.
- Config threads through the existing StorageDurability enum pattern:
  --compression / --compression-level / --compression-min-bytes across
  CLI, YAML, ConfigMerge, and the startup banner.

Build: gated behind FASTCACHED_ENABLE_COMPRESSION (ON by default; the
standard build fetches/links zstd + lz4 via CPM, preferring a system
package). Building OFF degrades the codec table to Identity-only and
turns selecting zstd/lz4 into a clean startup error — no compression
symbols linked.

Performance: writes compress once and are fsync-bound in the default
Batched durability, so lz4/zstd-3 add negligible wall-clock; reads
decompress only on an L1 miss, off the hot path. Append/Prepend are
read-modify-write and therefore decompress+recompress. End-to-end: a
200 KiB compressible value shrinks the store from 278 KiB to 64 KiB
(~4.25x) with an identical client-visible round-trip, verified across a
restart.
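
The read-modify-write cost of Append can be illustrated with a short sketch. It assumes `zlib` as a stand-in for the configured codec, a boolean in place of the per-entry codec tag, and it skips the min-bytes threshold for brevity; none of this is the actual CowTreeStorage code.

```python
import zlib


def append(stored: bytes, is_compressed: bool, suffix: bytes) -> tuple[bytes, bool]:
    """Append to a stored value: decompress, concatenate, then recompress."""
    plain = zlib.decompress(stored) if is_compressed else stored
    plain += suffix
    packed = zlib.compress(plain, 3)
    if len(packed) < len(plain):
        return packed, True    # compression still pays off
    return plain, False        # shrink-check failed: keep the value verbatim


stored, compressed = append(zlib.compress(b"a" * 1000, 3), True, b"b" * 1000)
assert compressed
assert zlib.decompress(stored) == b"a" * 1000 + b"b" * 1000
```

A plain Set pays for one compression; an Append on a compressed entry pays for a decompression plus a compression of the whole new value, which is why the two are called out separately.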

Risk: on-disk record format bumped v3->v4 (pre-release; no external
stores). A v3 store is detected and refused (detect-and-bail; migrator
out of scope). New third-party deps (zstd, lz4) are contained behind the
build flag.

Tests: 13 new codec + storage cases (per-codec round-trip inline/overflow
across reopen, footprint shrink, incompressible->Identity fallback,
corrupt-input / wrong-length rejection, Append/Prepend on compressed
entries, Touch preserves value+codec, mixed-codec store) plus CLI/YAML/
merge config cases; all green in both ON and OFF builds (829 tests pass).

Coverage: the clang-coverage preset does not instrument the FastCache
library target (pre-existing: --coverage is link-only there, no .gcno
emitted), so no numeric delta is available; the new paths are exercised
by the dedicated cases above in both build configurations.

Signed-off-by: Christian Parpart <christian@parpart.family>
Claude-Session: https://claude.ai/code/session_01H8xUNRPoP6JfAif4742zJv