mzip

Detection-based compression that finds patterns zstd, brotli, and bzip2 miss.

Store the formula, not the data.

TL;DR

A C++17 single-header library that detects mathematical structure and per-file patterns before reaching for an entropy coder. On data that has structure (numeric sequences, templates, columns, prose, audio, gradients, logs) it produces meaningfully smaller output than zstd:19, brotli:11, bzip2:9, xz:9, 7z, and rar. On generic small source code it is competitive but rarely beats brotli. Every output is round-trip-verified before the encoder commits to it.

Benchmark	mzip wins	Notes
250 synthetic tests (50 types × 5 sizes)	233 / 250 (93.2%)	Avg ratio 8.25×, the highest average among the 8 compressors tested. All 250 roundtrips verified.
47 real GitHub files (7.2 MB)	13 / 47 (27.7%)	Wins everywhere data has structure (logs, CSV, JSON, large mixed). Loses to brotli on small handwritten source code. All 47 roundtrips verified.
enwik9 10 MB Wikipedia prose	2,671,197 bytes	Beats brotli:11 by 5.9%, bzip2:9 by 14.4%. Smallest output of any standard library compressor on this benchmark.

Sections: Why · Strengths & limits · Synthetic results · enwik9 prose · Real-world files · Strategies · Quick start

Why mzip?

Most general-purpose compressors treat their input as an undifferentiated byte stream and rely on LZ77 plus entropy coding. Real data is rarely that shapeless: it has structure that LZ77 cannot exploit directly. A sequential-ID column is a formula v[i] = a + b·i, a JSON API response is a template with variables, and an audio waveform is a smooth function. mzip detects that structure first and substitutes the minimal description before the entropy coder runs.
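
To make the formula idea concrete, here is a minimal sketch of detecting v[i] = a + b·i in an integer column and storing only (a, b, n). It is an illustration, not mzip's actual detector: the names `LinearFormula`, `detect_linear`, and `expand_linear` are hypothetical, and the sketch assumes 64-bit signed values whose successive differences do not overflow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical description of an arithmetic sequence v[i] = a + b*i, 0 <= i < n.
struct LinearFormula {
    std::int64_t a;  // first value
    std::int64_t b;  // constant step between neighbours
    std::size_t n;   // number of values
};

// Returns the formula if every value fits v[i] = a + b*i exactly, else nullopt.
// Assumption: differences between neighbours fit in int64_t.
std::optional<LinearFormula> detect_linear(const std::vector<std::int64_t>& v) {
    if (v.size() < 2) return std::nullopt;  // too short to infer a step
    const std::int64_t a = v[0];
    const std::int64_t b = v[1] - v[0];
    for (std::size_t i = 2; i < v.size(); ++i) {
        if (v[i] - v[i - 1] != b) return std::nullopt;  // step changed: not linear
    }
    return LinearFormula{a, b, v.size()};
}

// Rebuilds the original values from the stored formula.
std::vector<std::int64_t> expand_linear(const LinearFormula& f) {
    assert(f.n >= 2);  // detect_linear never produces shorter sequences
    std::vector<std::int64_t> out;
    out.reserve(f.n);
    std::int64_t value = f.a;
    for (std::size_t i = 0; i < f.n; ++i) {
        out.push_back(value);
        if (i + 1 < f.n) value += f.b;  // avoid stepping past the last value
    }
    return out;
}
```

In this sketch a column of any length collapses to three integers when the step is constant, which is the kind of fixed-size description the sequential-ID example above refers to. A single outlier makes `detect_linear` return `std::nullopt`, at which point a real encoder would fall through to another strategy.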

A few representative wins (1 MB inputs, vs the best of zstd:19 / brotli:11 / bzip2:9 / xz:9 / 7z / rar):

Input pattern	mzip output	Best other	Advantage
Sequential database IDs	32 B	bzip2: 3.4 KB	106× smaller
Repeating JSON API templates	10 KB	brotli: 49 KB	4.9× smaller
16-bit audio PCM	1.7 KB	bzip2: 4.0 KB	2.4× smaller
RGB image gradient	124 B	brotli: 397 B	3.2× smaller
10 MB enwik9 Wikipedia prose	2.67 MB	brotli:11: 2.83 MB	brotli:11 is 5.9% larger

The first four wins come from formula or template detection — algorithmic substitutions LZ77 has no path to. The fifth win comes from a tuned BWT pipeline (capfold + word dict + LZP-after-dict + multi-tree Huffman with per-block dynamic trial) that puts mzip at the top of standard library compressors on long-form English prose.

Key Strengths

mzip is a specialist, not a general archiver. On data that has structure — numeric sequences, templates, columns, prose, audio, gradients, logs — it produces meaningfully smaller output than zstd:19, brotli:11, bzip2:9, xz:9, 7z, or rar. On generic small handwritten source code (a few KB of TypeScript, Markdown, or Python) brotli's 120 KB static dictionary usually wins. The 93.2% synthetic win rate (formula-friendly suite) and 27.7% real-world win rate (47-file GitHub corpus) are both honest measurements of those two regimes. Read both, then pick the tool for the data you actually have.

When to use mzip

Long-form English text (≥ 1 MB): prose, books, email archives, Wikipedia-class content. mzip beats brotli:11 by 5.9% on enwik9 10 MB.
Generated / templated data: K8s manifests, OpenAPI output, log streams, repeating JSON, SQL INSERTs, generated docs.
Numeric arrays and time series: sequential IDs, timestamps, counters, sensor data, audio PCM, GPS coordinates, float arrays.
Stream-separable formats: CSV, fixed-width logs, HTML, large XML, JSON Lines.
Anything ≥ 256 KB where compression ratio is the budget you care about.

When not to use mzip

Latency-bound read paths: mzip decompresses ~38× slower than zstd. If your hot path opens compressed files repeatedly, use zstd.
Small handwritten source code (a few KB of code/config/markdown): brotli usually wins by 3–20%.
Already-compressed or encrypted data: there is nothing to detect, so the RAW path falls back to zstd:19. Just use zstd directly.
General folder backup / archival: ZPAQ, 7z LZMA2, and xz are mature CLI archivers with broader format support. mzip is a library, not an archive format.

Where mzip wins big

Data category	Synthetic win rate	Why
Numeric sequences (IDs, timestamps, counters)	100% (40/40)	Formula compression: `v[i] = a + b·i` beats any LZ77
Binary / audio / sensor	100% (20/20)	Delta + ALP for floats, Paeth/E8E9 predictors
Long-form text (prose / markdown)	100% (15/15)	BWT pipeline + capfold + word dict + LZP-after-dict
Code (JS, Py, Go, Rust, ...) at scale	94.5% (52/55)	Identifier-stream separation, BWT, ZSTD_DICT trial
Logs (access / syslog / JSON log)	80–96%	Columnar separation + BWT per column
Anything ≥ 256 KB	97% (97/100)	More data → more patterns to detect
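
As an illustration of the "Delta" entry in the binary / audio / sensor row, the sketch below applies first-order delta coding to 16-bit PCM samples. It is a generic textbook transform, not mzip's predictor: the function names are hypothetical, and arithmetic is done modulo 2^16 so the transform is exactly invertible even when a difference would not fit in a signed 16-bit value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Portable conversion of a raw 16-bit word back to a signed sample.
static std::int16_t to_signed(std::uint16_t u) {
    return u >= 0x8000u
        ? static_cast<std::int16_t>(static_cast<int>(u) - 0x10000)
        : static_cast<std::int16_t>(u);
}

// residual[0] = samples[0]; residual[i] = samples[i] - samples[i-1] (mod 2^16).
std::vector<std::uint16_t> delta_encode(const std::vector<std::int16_t>& samples) {
    std::vector<std::uint16_t> residual(samples.size());
    std::uint16_t prev = 0;
    for (std::size_t i = 0; i < samples.size(); ++i) {
        const std::uint16_t cur = static_cast<std::uint16_t>(samples[i]);
        residual[i] = static_cast<std::uint16_t>(cur - prev);  // wraps modulo 2^16
        prev = cur;
    }
    return residual;
}

// Inverse transform: a running sum of the residuals, again modulo 2^16.
std::vector<std::int16_t> delta_decode(const std::vector<std::uint16_t>& residual) {
    std::vector<std::int16_t> samples(residual.size());
    std::uint16_t prev = 0;
    for (std::size_t i = 0; i < residual.size(); ++i) {
        prev = static_cast<std::uint16_t>(prev + residual[i]);
        samples[i] = to_signed(prev);
    }
    return samples;
}
```

On a smooth signal the residuals cluster near zero, which gives a later entropy stage a much more skewed distribution to work with than the raw samples.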

Where mzip loses

Scenario	Winner	Why
Small handwritten source code (4–30 KB)	brotli:11	brotli ships a 120 KB pre-built English/web dictionary; per-file approaches can't fully match it without shipping their own
Random / encrypted / already-compressed data	zstd	No patterns to detect — entropy coder is all that's left
Decompression speed	zstd	mzip is ~38× slower to decompress (BWT vs LZ77 inverse). Trade-off, not a bug. See Decompression Speed (the trade-off).

Benchmark Results

All synthetic benchmarks below are generated deterministically by generators.hpp (seeded RNG, seed=42). Click sample links to download the exact input bytes the table reports on.

Roundtrip-verified. mzip's encoder verifies its own output decodes back to the input before returning. Strategies that produce non-roundtrippable bytes are auto-discarded in favor of the next-best valid candidate. All 250 synthetic + 47 real-world results below pass verification.
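
The verify-before-commit behaviour described above can be sketched as a loop over candidate strategies. This is an assumption about the general shape only: `Strategy`, `Candidate`, and `pick_verified` are hypothetical names for illustration and are not part of mzip's API.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Hypothetical strategy interface: an encoder paired with its decoder.
struct Strategy {
    std::string name;
    std::function<Bytes(const Bytes&)> encode;
    std::function<Bytes(const Bytes&)> decode;
};

struct Candidate {
    std::string name;
    Bytes payload;
};

// Tries every strategy, discards any whose output does not decode back to the
// input, and returns the smallest surviving candidate (nullopt if none survive).
std::optional<Candidate> pick_verified(const Bytes& input,
                                       const std::vector<Strategy>& strategies) {
    std::optional<Candidate> best;
    for (const auto& s : strategies) {
        Bytes encoded = s.encode(input);
        if (s.decode(encoded) != input) continue;  // not round-trippable: discard
        if (!best || encoded.size() < best->payload.size()) {
            best = Candidate{s.name, std::move(encoded)};
        }
    }
    return best;
}
```

The design choice worth noting is that verification happens per candidate, before size comparison, so a strategy that produces a tiny but wrong output can never win.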

Overall Compressor Scoreboard (250 tests: 50 types × 5 sizes)

Compressor	Avg Ratio	Range	Wins	Win%	Rank
mzip	8.25x	1.0–32768x	233	93.2%	1
brotli:11	5.78x	1.0–1716x	15	6.0%	2
bzip2:9	5.66x	1.0–1001x	5	2.0%	3
rar:m5	5.97x	1.0–1014x	0	0.0%	4
xz:9	5.89x	1.0–997x	0	0.0%	5
7z:mx9	5.88x	1.0–922x	0	0.0%	6
zstd:19	5.14x	1.0–2641x	0	0.0%	7
gzip:9	4.78x	1.0–240x	0	0.0%	8

Verified: 250/250 roundtrips pass. Total input: 66.60 MB. lz4/snappy excluded (speed-focused; they target a different point on the speed/ratio trade-off).

Decompression Speed (the trade-off)

mzip optimizes for the smallest output and accepts a slower decode in return. If your workload is read-heavy and latency-bound, this is the wrong tool — use zstd. If you write once and store / transfer often, the savings show up on every later read.

Compressor	Decode time (66.6 MB)	Decode speed
zstd:19	92 ms	722 MB/s
mzip	3,523 ms	19 MB/s

zstd is roughly 38× faster to decompress. Most of the gap comes from BWT-inverse versus LZ77-inverse, a structural difference rather than a tuning gap. On the same suite mzip compresses about 3.5× slower than zstd:19; that gap is narrower because both encoders do substantial work at these settings. Streaming / incremental decode is on the roadmap but not in main yet.
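
For readers unfamiliar with the transform, the following is a deliberately naive sketch of a forward and inverse BWT. It is a textbook version with hypothetical names (`BwtResult`, `bwt_forward`, `bwt_inverse`), not mzip's pipeline, and the forward pass sorts rotations directly, which is far too slow for real inputs. It is included mainly to show the shape of the inverse: each output byte requires a data-dependent jump through the `lf` table, which is one commonly cited reason BWT decoders are slower than LZ77 decoders that mostly copy contiguous runs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

struct BwtResult {
    std::string last_column;    // last byte of every sorted rotation
    std::size_t primary_index;  // row holding the original string
};

// Naive forward BWT: sort the start offsets of all rotations of s.
BwtResult bwt_forward(const std::string& s) {
    const std::size_t n = s.size();
    if (n == 0) return {std::string(), 0};
    std::vector<std::size_t> rot(n);
    std::iota(rot.begin(), rot.end(), std::size_t{0});
    std::stable_sort(rot.begin(), rot.end(), [&](std::size_t a, std::size_t b) {
        for (std::size_t k = 0; k < n; ++k) {
            const unsigned char ca = s[(a + k) % n];
            const unsigned char cb = s[(b + k) % n];
            if (ca != cb) return ca < cb;
        }
        return false;  // identical rotations keep their original order
    });
    BwtResult r{std::string(n, '\0'), 0};
    for (std::size_t i = 0; i < n; ++i) {
        r.last_column[i] = s[(rot[i] + n - 1) % n];
        if (rot[i] == 0) r.primary_index = i;
    }
    return r;
}

// Inverse BWT via the LF mapping: walk backwards from the primary row.
std::string bwt_inverse(const BwtResult& r) {
    const std::string& last = r.last_column;
    const std::size_t n = last.size();
    if (n == 0) return std::string();
    assert(r.primary_index < n);
    std::vector<std::size_t> count(256, 0);
    for (unsigned char c : last) ++count[c];
    std::vector<std::size_t> start(256, 0);  // start[c] = number of bytes < c
    std::size_t sum = 0;
    for (int c = 0; c < 256; ++c) { start[c] = sum; sum += count[c]; }
    std::vector<std::size_t> lf(n);
    for (std::size_t i = 0; i < n; ++i) {
        const unsigned char c = last[i];
        lf[i] = start[c]++;  // k-th occurrence of c in last maps to k-th in first
    }
    std::string out(n, '\0');
    std::size_t row = r.primary_index;
    for (std::size_t k = n; k-- > 0;) {
        out[k] = last[row];
        row = lf[row];  // data-dependent jump: poor cache locality on large blocks
    }
    return out;
}
```

For the input "banana" this sketch yields the last column "nnbaaa" with primary index 3, and `bwt_inverse` restores the original string.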

Win Rate by Size

Size	Wins	Total	Win%
4KB	43	50	86.0%
16KB	48	50	96.0%
64KB	45	50	90.0%
256KB	48	50	96.0%
1MB	49	50	98.0%

Top 10 mzip Wins

Type	mzip	2nd Best	Advantage
Database IDs (1MB)	32B (32768x)	3.4KB	106.8x better
Timestamps (1MB)	32B (32768x)	2.7KB	84.2x better
Database IDs (256KB)	32B (8192x)	937B	29.3x better
Timestamps (256KB)	32B (8192x)	772B	24.1x better
Database IDs (64KB)	32B (2048x)	301B	9.4x better
Timestamps (64KB)	32B (2048x)	287B	9.0x better
Image gradient (256KB)	53B (4946x)	323B	6.1x better
Image gradient (64KB)	39B (1680x)	212B	5.4x better
Timestamps (16KB)	32B (512x)	160B	5.0x better
JSON API (1MB)	10KB (104x)	49KB	4.9x better

Where Others Win (Top 10 Gaps)

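Most of the remaining gaps are on small (4–16KB) text/code/config inputs, where brotli's pre-built 120KB English dictionary gives an edge no per-file approach can fully match without shipping its own dictionary. The exceptions in the table below are Terraform at 64KB to 1MB and INI config at 64KB, where bzip2 leads by 87 bytes.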

Type	Size	mzip	Best	Gap
Terraform	1MB	41KB	brotli: 40KB	+825B
Terraform	64KB	3.4KB	brotli: 3.0KB	+379B
HTML	4KB	1004B	brotli: 821B	+183B
Terraform	256KB	10KB	brotli: 10KB	+161B
JSON log	16KB	2.4KB	brotli: 2.3KB	+148B
Makefile	4KB	809B	brotli: 697B	+112B
JSON log	4KB	866B	brotli: 768B	+98B
INI config	64KB	10KB	bzip2: 10KB	+87B
Unicode text	4KB	626B	brotli: 549B	+77B
Bash	4KB	825B	brotli: 765B	+60B

Long-Form Prose (enwik9)

enwik9 is the canonical benchmark for text compressors: the first 10⁹ bytes of an English Wikipedia dump. The table below compares output sizes, in bytes, for mzip and the standard library compressors on the 1 MB and 10 MB prefixes.

Compressor	enwik9 1 MB	enwik9 10 MB	Class
zstd:19	312,639	2,921,957	LZ77 + entropy
gzip:9	356,643	3,720,323	LZ77 + Huffman
bzip2:9	294,484	3,054,639	BWT + multi-tree Huffman
xz:9	302,832	2,844,360	LZMA
7z:mx9	302,910	2,844,433	LZMA2
brotli:11	293,057	2,827,632	LZ77 + Huffman + 120 KB static dict
mzip	286,307	2,671,197	BWT + capfold + word dict + LZP-after-dict, per-block dynamic trial
bzip3 †	~245,000	~2,300,000	BWT + LZP + arithmetic + 1-symbol context model (CLI, not a library)
bsc-m03 †	—	—	LZP + BWT + M03 context coder (CLI, research-grade, not a library)

mzip produces the smallest output on both prefixes among standard library compressors. Measured relative to mzip's output size, on 10 MB brotli:11 is 5.9% larger, xz/7z 6.5% larger, and bzip2:9 14.4% larger; on 1 MB brotli:11 is 2.4% larger, xz/7z 5.8% larger, and bzip2:9 2.9% larger.
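
The percentages can be reproduced from the byte counts in the table. The worked example below uses the 10 MB brotli:11 row and shows that the quoted 5.9% is obtained when the difference is divided by mzip's output size; this base is inferred from the numbers, not stated by the benchmark itself. Dividing by brotli's size instead gives the slightly smaller figure on the second line.

```latex
\[
\frac{2{,}827{,}632 - 2{,}671{,}197}{2{,}671{,}197}
  = \frac{156{,}435}{2{,}671{,}197} \approx 0.0586
  \quad (\text{brotli:11 is } 5.9\% \text{ larger than mzip})
\]
\[
\frac{2{,}827{,}632 - 2{,}671{,}197}{2{,}827{,}632}
  = \frac{156{,}435}{2{,}827{,}632} \approx 0.0553
  \quad (\text{mzip is } 5.5\% \text{ smaller than brotli:11})
\]
```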

† The honest ceiling. bzip3 and bsc-m03 are CLI archive tools (not embeddable libraries) that replace the entropy backend with adaptive arithmetic coding driven by a post-BWT context model. They beat mzip on enwik9 by ~14% (bzip3) to ~20% (bsc-m03) at the cost of much slower compression and no library API. ZPAQ achieves better ratios still (closer to cmix) at the cost of being orders of magnitude slower. Closing the gap to bzip3/bsc-class without giving up the library API is the goal of BWT_ROADMAP.md.