Add configurable on-disk value compression (zstd + lz4) by christianparpart · Pull Request #29 · LASTRADA-Software/fastcached

christianparpart · 2026-07-02T05:44:47Z

Cache values written to the persistent CoW-tree are now compressed, shrinking the on-disk footprint and improving throughput for I/O-bound workloads. The codec is chosen per store and defaults to zstd; administrators opt out with --compression none, or pick lz4 for a CPU-minimal codec. Reads always return plaintext — decompression happens only on an L1 cache miss, off the hot path — and each record is tagged with its own codec, so a store may freely mix codecs and changing the setting only affects new writes (no migration).

Compression lives at the value level inside CowTreeStorage (compress in StoreEntry, decompress in LoadEntry), so keys stay cleartext for B+tree navigation, the fixed page-offset layout is untouched, and the L1 cache and wire protocols only ever see the original bytes. A threshold plus a shrink-check keep the zstd default safe: values below --compression-min-bytes, and any value that does not actually get smaller (incompressible or already-compressed data), are stored verbatim under an Identity tag, so compression never enlarges a value or burns CPU pointlessly.
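The threshold and shrink-check described above can be sketched as a single decision function. This is a minimal illustration under stated assumptions, not the code from this PR: `CodecId`, `CompressFn`, `EncodedValue`, and `EncodeValue` are hypothetical names, the numeric codec ids are invented, and the compressor is passed in as a callback so the sketch has no zstd/lz4 dependency.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <utility>

// Hypothetical codec ids; the real on-disk values are not given in the PR text.
enum class CodecId : std::uint8_t { Identity = 0, Zstd = 1, Lz4 = 2 };

// Stand-in for a real codec call: returns the compressed bytes, or nullopt on failure.
using CompressFn = std::function<std::optional<std::string>(std::string const&)>;

struct EncodedValue
{
    CodecId codec;            // tag stored with the record
    std::size_t originalSize; // plaintext length, needed to decompress
    std::string payload;      // bytes that actually go to disk
};

// Applies the two guards: skip values below the threshold, and keep the
// compressed form only if it is strictly smaller than the input (shrink-check).
inline EncodedValue EncodeValue(std::string const& value,
                                CodecId codec,
                                std::size_t minBytes,
                                CompressFn const& compress)
{
    auto const verbatim = [&] { return EncodedValue { CodecId::Identity, value.size(), value }; };

    if (codec == CodecId::Identity || value.size() < minBytes)
        return verbatim();

    auto compressed = compress(value);
    if (!compressed || compressed->size() >= value.size())
        return verbatim();

    return EncodedValue { codec, value.size(), std::move(*compressed) };
}
```

With this shape, a record can only ever carry a non-Identity tag when its payload is smaller than the plaintext, which is the property the paragraph above relies on.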

Changes

- New Core/Compression codec registry (stateless, modelled on Core/Crc32c): a descriptor table drives name↔id, availability, and the compress/decompress dispatch, so adding a codec is a single table row.
- CowTreeStorage leaf-record format bumped v3→v4 with a per-entry codec id + original length; StoreEntry compresses once (threshold + shrink-check) and LoadEntry decompresses to plaintext. Metadata-only Touch/MarkStale preserve the codec without recompressing.
- Old, pre-v4 stores are detected via a version sentinel and refused on open rather than mis-parsed (detect-and-bail; a migrator is out of scope, and dovetails with the existing "expose on-disk format version" TODO).
- Config plumbing following the existing StorageDurability pattern: --compression, --compression-level, and --compression-min-bytes across CLI, YAML, ConfigMerge, and the startup banner.
- Gated behind FASTCACHED_ENABLE_COMPRESSION (ON by default; the standard build fetches/links zstd + lz4 via CPM, preferring a system package). Building with it OFF degrades the codec table to Identity-only and makes selecting zstd/lz4 a clean startup error, linking no compression symbols.
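A descriptor-table registry of the kind listed above might look roughly like the following. This is a sketch under assumptions, not the PR's Core/Compression code: the struct layout, the function names, the numeric ids, and the `SKETCH_ENABLE_COMPRESSION` macro are all hypothetical (the PR names only the build option FASTCACHED_ENABLE_COMPRESSION), and the compress/decompress function pointers are left out to keep it dependency-free.

```cpp
#include <array>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical ids; the real values are not given in the PR text.
enum class CodecId : std::uint8_t { Identity = 0, Zstd = 1, Lz4 = 2 };

struct CodecDescriptor
{
    CodecId id;
    std::string_view name; // name accepted by --compression
    bool available;        // false when the codec was compiled out
};

// Hypothetical macro standing in for the build-time switch.
#if defined(SKETCH_ENABLE_COMPRESSION)
constexpr bool CompressionBuilt = true;
#else
constexpr bool CompressionBuilt = false;
#endif

// One row per codec: adding a codec means adding a row here.
constexpr std::array<CodecDescriptor, 3> Codecs { {
    { CodecId::Identity, "none", true },
    { CodecId::Zstd, "zstd", CompressionBuilt },
    { CodecId::Lz4, "lz4", CompressionBuilt },
} };

constexpr CodecDescriptor const* FindByName(std::string_view name) noexcept
{
    for (auto const& codec: Codecs)
        if (codec.name == name)
            return &codec;
    return nullptr;
}

constexpr CodecDescriptor const* FindById(std::uint8_t rawId) noexcept
{
    for (auto const& codec: Codecs)
        if (static_cast<std::uint8_t>(codec.id) == rawId)
            return &codec;
    return nullptr;
}

// Resolves a configured codec name. An empty result corresponds to the
// startup error for an unknown or compiled-out codec.
constexpr std::optional<CodecId> ResolveConfigured(std::string_view name) noexcept
{
    auto const* codec = FindByName(name);
    if (codec == nullptr || !codec->available)
        return std::nullopt;
    return codec->id;
}
```

Because availability is just a column in the table, an OFF build in this sketch still recognises the names `zstd` and `lz4` but refuses to resolve them, which mirrors the Identity-only degradation described in the last bullet.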

Performance: writes compress once and are fsync-bound in the default Batched durability, so lz4 and zstd level 3 add negligible wall-clock time; Append/Prepend are read-modify-write and so decompress and recompress. End-to-end, a 200 KiB compressible value shrinks the store from 278 KiB to 64 KiB (~4.3×) with an identical client-visible round-trip, verified across a restart.
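To illustrate why Append/Prepend cannot avoid the decompress and recompress step, here is a small self-contained sketch. Everything in it is an assumption for illustration: a toy run-length codec stands in for zstd/lz4, a `bool` stands in for the per-entry codec tag, and `StoredEntry`, `Store`, `Load`, and `Append` are hypothetical names rather than the CowTreeStorage API.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Toy run-length codec: emits (run length, byte) pairs, runs capped at 255.
inline std::string RleCompress(std::string const& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size();)
    {
        std::size_t run = 1;
        while (i + run < in.size() && in[i + run] == in[i] && run < 255)
            ++run;
        out.push_back(static_cast<char>(run));
        out.push_back(in[i]);
        i += run;
    }
    return out;
}

inline std::string RleDecompress(std::string const& in)
{
    std::string out;
    for (std::size_t i = 0; i + 1 < in.size(); i += 2)
        out.append(static_cast<std::size_t>(static_cast<unsigned char>(in[i])), in[i + 1]);
    return out;
}

struct StoredEntry
{
    bool compressed;          // stands in for the per-entry codec tag
    std::size_t originalSize; // plaintext length
    std::string payload;      // bytes as they would sit on disk
};

// Compress once, keeping the result only if it is smaller than the input.
inline StoredEntry Store(std::string const& value)
{
    auto packed = RleCompress(value);
    if (packed.size() < value.size())
        return { true, value.size(), std::move(packed) };
    return { false, value.size(), value };
}

inline std::string Load(StoredEntry const& entry)
{
    return entry.compressed ? RleDecompress(entry.payload) : entry.payload;
}

// Bytes cannot be spliced onto a compressed payload, so Append is a
// read-modify-write: decompress, concatenate, then compress again.
inline StoredEntry Append(StoredEntry const& entry, std::string const& suffix)
{
    return Store(Load(entry) + suffix);
}
```

The same reasoning applies to Prepend. Metadata-only operations such as Touch never call `Load` or `Store` in this model, which is why they can preserve the codec without recompressing.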

Performance comparison

Performance comparison: master vs compression modes vs value size

Setup. Storage-layer micro-benchmark driving CowTreeStorage Set/Get
directly (isolates the codec from network/protocol noise), plus an end-to-end
sanity pass over the memcached text protocol against the running daemon.
Release builds (clang-release, -O2 -DNDEBUG, no sanitizers), pinned to one
core, durability=batched (default), single shard. master = origin/master
(no compression code); the other three are this branch with the named
--compression. Backing store on tmpfs, so throughput reflects CPU + fsync
cost rather than disk-seek latency — on real disk the compression throughput
win is larger (fewer bytes through the actual I/O path). On-disk footprint is
filesystem-independent. Machine: i9-14900HX.

Storage micro-benchmark — compressible data

value	codec	set MiB/s	get MiB/s	set ops/s	get ops/s	on-disk	ratio
64 B	master	0.4	0.8	6,223	13,152	271 MiB	0.04×
64 B	none	0.4	0.8	6,235	13,143	276 MiB	0.04×
64 B	lz4	0.4	0.8	6,094	12,978	276 MiB	0.04×
64 B	zstd	0.4	0.8	6,177	13,240	276 MiB	0.04×
1 KiB	master	6	13	6,107	13,293	157 MiB	0.62×
1 KiB	none	6	13	6,117	13,337	157 MiB	0.62×
1 KiB	lz4	7	13	6,980	13,489	129 MiB	0.76×
1 KiB	zstd	7	13	6,839	13,334	131 MiB	0.75×
16 KiB	master	110	200	7,045	12,825	634 MiB	0.49×
16 KiB	none	111	202	7,088	12,916	634 MiB	0.49×
16 KiB	lz4	138	297	8,863	18,980	16.5 MiB	18.9×
16 KiB	zstd	140	300	8,994	19,224	6.6 MiB	47.4×
256 KiB	master	457	541	1,826	2,165	536 MiB	0.93×
256 KiB	none	457	542	1,828	2,166	536 MiB	0.93×
256 KiB	lz4	2,054	3,117	8,218	12,467	4.9 MiB	101.6×
256 KiB	zstd	1,883	3,275	7,530	13,100	4.9 MiB	101.3×

Storage micro-benchmark — random (incompressible) data

value	codec	set MiB/s	get MiB/s	on-disk	ratio
1 KiB	master	6	13	157 MiB	0.62×
1 KiB	zstd	6	13	157 MiB	0.62×
16 KiB	master	110	201	634 MiB	0.49×
16 KiB	zstd	104	201	634 MiB	0.49×
256 KiB	master	457	541	536 MiB	0.93×
256 KiB	lz4	446	540	536 MiB	0.93×
256 KiB	zstd	426	541	536 MiB	0.93×

(Remaining random rows for none/lz4 omitted; they are identical to master/zstd.)

End-to-end (daemon, memcached text, single connection) — corroboration

config	1 KiB compressible GET ops/s	256 KiB compressible SET ops/s	256 KiB compressible GET ops/s	256 KiB random SET ops/s
master	50,478	709	1,340	595
none	50,918	~580*	~997*	555
lz4	50,477	1,654	2,042	572
zstd	111,700	1,744	2,082	600

* The single-connection Python driver adds noise at large sizes; the storage
micro-benchmark above is the clean signal (there none ≈ master exactly).

Findings

- Small values (≤ 1 KiB) are commit-bound, not codec-bound. Throughput is
  identical across all four configs: the fsync/commit dominates, and the codec
  is in the noise. Footprint is dominated by the fixed 16 KiB page + per-entry
  overhead, so the ratio is < 1 regardless (compression can't beat the page
  granularity here). This is exactly what the 256-byte --compression-min-bytes
  default and the shrink-check exist to avoid wasting CPU on.
- Large compressible values are where compression pays, on both axes. At
  256 KiB, zstd/lz4 cut the on-disk footprint ~100× and run Set/Get
  ~4–6× faster than master: fewer bytes to serialize, write, and fsync more
  than repay the compression CPU. At 16 KiB, zstd is 47× smaller and ~1.3–1.5×
  faster.
- zstd vs lz4: on this text-like data zstd gives a better or equal ratio at
  comparable speed (level 3). lz4 has a slight edge in raw compress throughput
  at 256 KiB (2,054 vs 1,883 MiB/s set) but a worse ratio at 16 KiB (18.9× vs
  47.4×). zstd is the right general-purpose default; lz4 is the pick when CPU is
  the scarce resource.
- Incompressible data pays essentially nothing. The shrink-check stores
  random values verbatim under Identity, so footprint and read throughput match
  master exactly. The only cost is one wasted compression attempt on the write
  path, visible as ~7% lower Set MiB/s for zstd at 256 KiB random (426 vs 457),
  and negligible for lz4.
- The branch adds no overhead when compression is off. --compression none
  tracks master within measurement noise across every size and data type; the
  new code sits entirely behind the codec check.

sccache-sized values (compile-cache workload)

sccache is the project's stated primary workload, so this section adds a
dedicated sweep over compile-cache-representative sizes and shapes. Two shapes
per size:

- objfile: a raw compiled object file, semi-compressible (repetitive
  symbol/debug strings + zero padding interleaved with incompressible machine
  code), zstd ratio ~1.3–1.7×.
- precompressed: a blob sccache already compressed itself (its default is
  zstd/lz4), effectively incompressible; exercises the shrink-check.

Same setup as above (release, one core, batched durability, single shard, store
on tmpfs — so on real disk the compression throughput picture is more favourable
than shown, since fewer bytes reach the actual I/O path).

objfile (semi-compressible raw `.o`)

value	codec	set MiB/s	get MiB/s	set ops/s	get ops/s	on-disk	ratio
256 B	master	2	3	6,241	13,352	117 MiB	0.21×
256 B	none	2	3	6,279	13,361	126 MiB	0.19×
256 B	lz4	2	3	6,263	13,359	126 MiB	0.19×
256 B	zstd	2	3	6,083	13,316	126 MiB	0.19×
8 KiB	master	58	88	7,424	11,272	649 MiB	0.48×
8 KiB	none	57	87	7,273	11,193	653 MiB	0.48×
8 KiB	lz4	56	89	7,142	11,404	653 MiB	0.48×
8 KiB	zstd	50	84	6,460	10,816	653 MiB	0.48×
64 KiB	master	283	404	4,522	6,466	632 MiB	0.79×
64 KiB	none	283	404	4,528	6,459	631 MiB	0.79×
64 KiB	lz4	239	444	3,824	7,097	381 MiB	1.31×
64 KiB	zstd	178	389	2,841	6,220	381 MiB	1.31×
512 KiB	master	500	569	999	1,139	779 MiB	0.96×
512 KiB	none	500	572	999	1,143	779 MiB	0.96×
512 KiB	lz4	342	668	684	1,337	568 MiB	1.32×
512 KiB	zstd	230	635	461	1,271	474 MiB	1.58×
2 MiB	master	540	589	270	294	808 MiB	0.99×
2 MiB	none	539	587	270	294	808 MiB	0.99×
2 MiB	lz4	356	686	178	343	589 MiB	1.36×
2 MiB	zstd	236	660	118	330	477 MiB	1.68×

precompressed (already sccache-compressed)

value	codec	set MiB/s	get MiB/s	set ops/s	get ops/s	on-disk	ratio
256 B	master	2	3	6,262	13,387	117 MiB	0.21×
256 B	none	2	3	6,272	13,355	126 MiB	0.19×
256 B	lz4	2	3	6,209	13,271	126 MiB	0.19×
256 B	zstd	2	3	6,159	13,241	126 MiB	0.19×
8 KiB	master	58	88	7,452	11,314	649 MiB	0.48×
8 KiB	none	58	88	7,363	11,241	653 MiB	0.48×
8 KiB	lz4	57	88	7,311	11,234	653 MiB	0.48×
8 KiB	zstd	55	88	7,049	11,273	653 MiB	0.48×
64 KiB	master	283	403	4,524	6,454	632 MiB	0.79×
64 KiB	none	284	404	4,543	6,457	631 MiB	0.79×
64 KiB	lz4	279	404	4,463	6,464	631 MiB	0.79×
64 KiB	zstd	266	404	4,255	6,464	631 MiB	0.79×
512 KiB	master	498	571	997	1,141	779 MiB	0.96×
512 KiB	none	501	572	1,002	1,143	779 MiB	0.96×
512 KiB	lz4	488	570	975	1,140	779 MiB	0.96×
512 KiB	zstd	464	571	928	1,142	779 MiB	0.96×
2 MiB	master	540	589	270	295	808 MiB	0.99×
2 MiB	none	539	587	270	293	808 MiB	0.99×
2 MiB	lz4	523	586	261	293	808 MiB	0.99×
2 MiB	zstd	501	586	251	293	808 MiB	0.99×

Findings (sccache)

- Small entries (256 B – 8 KiB) are commit-bound: all four configs are
  equal, and the 16 KiB page granularity dominates footprint (ratio < 1). The
  min-bytes default + shrink-check keep compression from wasting CPU here.
- Object files ≥ 64 KiB compress ~1.3–1.7×, a real footprint saving
  (zstd 2 MiB: 1.68×, ~40% smaller on disk). On tmpfs this comes with a write
  CPU cost (zstd 2 MiB set: 236 vs 540 MiB/s for master) because there is no
  disk-write time to reclaim; GET is faster (660 vs 589, fewer bytes read).
  On a real disk the write cost is partly-to-fully offset by writing ~40% fewer
  bytes.
- lz4 is the better default for object files: near zstd's ratio (1.36× vs
  1.68× at 2 MiB) at markedly higher write throughput (356 vs 236 MiB/s), so it
  reclaims disk with a smaller CPU hit. zstd wins when space is the priority.
- Already-compressed sccache blobs cost almost nothing: shrink-check stores
  them verbatim, so footprint and reads equal master; the only cost is one
  wasted compress attempt on write (zstd 2 MiB: 501 vs 540 MiB/s, ~7%; lz4 ~3%).
- --compression none and master are identical across every sccache size
  and shape, so there is no regression from the feature when it is off.

Recommendation for sccache deployments: object files compress modestly, so
compression trades write CPU for ~30–40% less disk. Use lz4 when the box
is CPU-constrained or write-heavy (best throughput-per-byte-saved), zstd
when disk is the constraint, and none if the cache stores predominantly
sccache-precompressed blobs (compression can't help those, and the shrink-check
already makes the default safe if left on).

Compress cache values before they hit the persistent CoW-tree, shrinking disk footprint and improving throughput for I/O-bound workloads. The codec is chosen per store and defaults to zstd; admins opt out with `--compression none` or pick `lz4` for a CPU-minimal codec.

Design (value level, not page level):
- Compression lives in CowTreeStorage's encode/decode boundary (StoreEntry compresses; LoadEntry decompresses), so keys stay cleartext for B+tree navigation, the fixed page-offset layout is untouched, and the L1 cache / wire protocols only ever see plaintext.
- Threshold + shrink-check: values below --compression-min-bytes, and any value that does not actually get smaller (incompressible / already compressed), are stored verbatim under an Identity tag, so a zstd default never enlarges a value or burns CPU pointlessly.
- Per-entry codec tag + original length in the leaf record (format v3->v4): a store may freely mix codecs, changing the config only affects new writes, and no migration is ever required. Old-format stores are detected via a version sentinel and refused rather than mis-parsed.

Data-driven, DI-respecting:
- New Core/Compression codec registry (stateless, modelled on Core/Crc32c) with a descriptor table driving name<->id, availability, and the compress/decompress dispatch; adding a codec is one table row.
- Config threads through the existing StorageDurability enum pattern: --compression / --compression-level / --compression-min-bytes across CLI, YAML, ConfigMerge, and the startup banner.

Build: gated behind FASTCACHED_ENABLE_COMPRESSION (ON by default; the standard build fetches/links zstd + lz4 via CPM, preferring a system package). Building OFF degrades the codec table to Identity-only and turns selecting zstd/lz4 into a clean startup error, with no compression symbols linked.

Performance: writes compress once and are fsync-bound in the default Batched durability, so lz4/zstd-3 add negligible wall-clock time; reads decompress only on an L1 miss, off the hot path. Append/Prepend are read-modify-write and therefore decompress and recompress. End-to-end: a 200 KiB compressible value shrinks the store from 278 KiB to 64 KiB (~4.3x) with an identical client-visible round-trip, verified across a restart.

Risk: on-disk record format bumped v3->v4 (pre-release; no external stores). A v3 store is detected and refused (detect-and-bail; migrator out of scope). New third-party deps (zstd, lz4) are contained behind the build flag.

Tests: 13 new codec + storage cases (per-codec round-trip inline/overflow across reopen, footprint shrink, incompressible->Identity fallback, corrupt-input / wrong-length rejection, Append/Prepend on compressed entries, Touch preserves value+codec, mixed-codec store) plus CLI/YAML/merge config cases; all green in both ON and OFF builds (829 tests pass).

Coverage: the clang-coverage preset does not instrument the FastCache library target (pre-existing: --coverage is link-only there, no .gcno emitted), so no numeric delta is available; the new paths are exercised by the dedicated cases above in both build configurations.

Signed-off-by: Christian Parpart <christian@parpart.family>
Claude-Session: https://claude.ai/code/session_01H8xUNRPoP6JfAif4742zJv

christianparpart wants to merge 1 commit into master from feature/on-disk-compression
