Detection-based compression that finds patterns zstd, brotli, and bzip2 miss.
Store the formula, not the data.
A C++17 single-header library that detects mathematical structure and per-file patterns before reaching for an entropy coder. On data that has structure (numeric sequences, templates, columns, prose, audio, gradients, logs) it produces meaningfully smaller output than zstd:19, brotli:11, bzip2:9, xz:9, 7z, and rar. On generic small source code it is competitive but rarely beats brotli. Every output is round-trip-verified before the encoder commits to it.
| Benchmark | mzip wins | Notes |
|---|---|---|
| 250 synthetic tests (50 types × 5 sizes) | 233 / 250 (93.2%) | Avg ratio 8.25× — top non-neural ratio among the 8 compressors tested. All 250 roundtrips verified. |
| 47 real GitHub files (7.2 MB) | 13 / 47 (27.7%) | Wins everywhere data has structure (logs, CSV, JSON, large mixed). Loses to brotli on small handwritten source code. All 47 roundtrips verified. |
| enwik9 10 MB Wikipedia prose | 2,671,197 bytes | Beats brotli:11 by 5.9%, bzip2:9 by 14.4%. Smallest output of any standard library compressor on this benchmark. |
Sections: Why · Strengths & limits · Synthetic results · enwik9 prose · Real-world files · Strategies · Quick start
Most general-purpose compressors treat their input as an opaque byte stream and rely on LZ77 plus entropy coding. Real data is rarely random. It has structure that LZ77 cannot reach: a sequential-ID column is a formula v[i] = a + b·i, a JSON API response is a template with variables, an audio waveform is a smooth function. mzip detects that structure first and substitutes the minimal description before the entropy coder ever runs.
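To make "store the formula, not the data" concrete, here is a minimal sketch of detecting a sequence of the form v[i] = a + b·i and regenerating it from three numbers. The names `LinearFormula`, `detect_linear`, and `expand_linear` are hypothetical and are not part of mzip's API; the real detector presumably works on raw bytes and handles many more cases.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical illustration only, not mzip's actual detection code.
// Describes a sequence as v[i] = a + b*i for i in [0, n).
struct LinearFormula {
    std::int64_t a;  // first value
    std::int64_t b;  // constant step between neighbours
    std::size_t n;   // element count
};

// Returns the formula if every element fits it exactly, otherwise nullopt.
// Assumes values are small enough that differences do not overflow.
std::optional<LinearFormula> detect_linear(const std::vector<std::int64_t>& v) {
    if (v.size() < 2) return std::nullopt;
    const std::int64_t a = v[0];
    const std::int64_t b = v[1] - v[0];
    for (std::size_t i = 2; i < v.size(); ++i) {
        if (v[i] - v[i - 1] != b) return std::nullopt;
    }
    return LinearFormula{a, b, v.size()};
}

// Regenerates the original sequence from the three stored numbers.
std::vector<std::int64_t> expand_linear(const LinearFormula& f) {
    std::vector<std::int64_t> out(f.n);
    for (std::size_t i = 0; i < f.n; ++i) {
        out[i] = f.a + f.b * static_cast<std::int64_t>(i);
    }
    return out;
}
```

When detection succeeds, the stored description has a fixed size no matter how long the sequence is. That matches the constant 32 B outputs reported for sequential IDs at every input size.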
A few representative wins (1 MB inputs, vs the best of zstd:19 / brotli:11 / bzip2:9 / xz:9 / 7z / rar):
| Input pattern | mzip output | Best other | Advantage |
|---|---|---|---|
| Sequential database IDs | 32 B | bzip2: 3.4 KB | 106× smaller |
| Repeating JSON API templates | 10 KB | brotli: 49 KB | 4.9× smaller |
| 16-bit audio PCM | 1.7 KB | bzip2: 4.0 KB | 2.4× smaller |
| RGB image gradient | 124 B | brotli: 397 B | 3.2× smaller |
| 10 MB enwik9 Wikipedia prose | 2.67 MB | brotli:11: 2.83 MB | 5.9% smaller |
The first four wins come from formula or template detection — algorithmic substitutions LZ77 has no path to. The fifth win comes from a tuned BWT pipeline (capfold + word dict + LZP-after-dict + multi-tree Huffman with per-block dynamic trial) that puts mzip at the top of standard library compressors on long-form English prose.
mzip is a specialist, not a general archiver. On data that has structure — numeric sequences, templates, columns, prose, audio, gradients, logs — it produces meaningfully smaller output than zstd:19, brotli:11, bzip2:9, xz:9, 7z, or rar. On generic small handwritten source code (a few KB of TypeScript, Markdown, or Python) brotli's 120 KB static dictionary usually wins. The 93.2% synthetic win rate (formula-friendly suite) and 27.7% real-world win rate (47-file GitHub corpus) are both honest measurements of those two regimes. Read both, then pick the tool for the data you actually have.
Reach for mzip when you have:
- Long-form English text (≥ 1 MB): prose, books, email archives, Wikipedia-class content. mzip beats brotli:11 by 5.9% on enwik9 10 MB.
- Generated / templated data: K8s manifests, OpenAPI output, log streams, repeating JSON, SQL INSERTs, generated docs.
- Numeric arrays and time series: sequential IDs, timestamps, counters, sensor data, audio PCM, GPS coordinates, float arrays.
- Stream-separable formats: CSV, fixed-width logs, HTML, large XML, JSON Lines.
- Anything ≥ 256 KB where compression ratio is the budget you care about.

Prefer another tool for:
- Latency-bound read paths: mzip decompresses ~38× slower than zstd. If your hot path opens compressed files repeatedly, use zstd.
- Small handwritten source code (a few KB of code/config/markdown): brotli usually wins by 3–20%.
- Already-compressed or encrypted data: there is nothing to detect, so the RAW path falls back to zstd:19. Just use zstd directly.
- General folder backup / archival: ZPAQ, 7z LZMA2, and xz are mature CLI archivers with broader format support. mzip is a library, not an archive format.
| Data category | Synthetic win rate | Why |
|---|---|---|
| Numeric sequences (IDs, timestamps, counters) | 100% (40/40) | Formula compression: v[i] = a + b·i beats any LZ77 |
| Binary / audio / sensor | 100% (20/20) | Delta + ALP for floats, Paeth/E8E9 predictors |
| Long-form text (prose / markdown) | 100% (15/15) | BWT pipeline + capfold + word dict + LZP-after-dict |
| Code (JS, Py, Go, Rust, ...) at scale | 94.5% (52/55) | Identifier-stream separation, BWT, ZSTD_DICT trial |
| Logs (access / syslog / JSON log) | 80–96% | Columnar separation + BWT per column |
| Anything ≥ 256 KB | 97% (97/100) | More data → more patterns to detect |
| Scenario | Winner | Why |
|---|---|---|
| Small handwritten source code (4–30 KB) | brotli:11 | brotli ships a 120 KB pre-built English/web dictionary; per-file approaches can't fully match it without shipping their own |
| Random / encrypted / already-compressed data | zstd | No patterns to detect — entropy coder is all that's left |
| Decompression speed | zstd | mzip is ~38× slower to decompress (BWT vs LZ77 inverse). Trade-off, not a bug. See Decompression Speed (the trade-off). |
All synthetic benchmarks below are generated deterministically by generators.hpp (seeded RNG, seed=42). Click sample links to download the exact input bytes the table reports on.
Roundtrip-verified. mzip's encoder verifies its own output decodes back to the input before returning. Strategies that produce non-roundtrippable bytes are auto-discarded in favor of the next-best valid candidate. All 250 synthetic + 47 real-world results below pass verification.
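The verify-then-commit rule can be sketched as follows. This is an illustration under stated assumptions, not mzip's internals: `Candidate` and `pick_verified` are hypothetical names, and each candidate carries its own decoder as a `std::function` purely to keep the example self-contained.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Hypothetical candidate: one strategy's output plus the decoder that reverses it.
struct Candidate {
    Bytes encoded;
    std::function<Bytes(const Bytes&)> decode;
};

// Returns the smallest candidate whose decoded bytes equal the input.
// Candidates that fail the round trip are skipped, so a faulty strategy
// can cost compression ratio but never correctness.
std::optional<Bytes> pick_verified(const Bytes& input, std::vector<Candidate> candidates) {
    std::stable_sort(candidates.begin(), candidates.end(),
                     [](const Candidate& x, const Candidate& y) {
                         return x.encoded.size() < y.encoded.size();
                     });
    for (const Candidate& c : candidates) {
        if (c.decode(c.encoded) == input) return c.encoded;
    }
    return std::nullopt;  // no candidate survived verification
}
```

Sorting by size first means the loop stops at the first candidate that decodes correctly, which is also the smallest valid one.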
| Compressor | Avg Ratio | Range | Wins | Win% | Rank |
|---|---|---|---|---|---|
| mzip | 8.25x | 1.0–32768x | 233 | 93.2% | 1 |
| brotli:11 | 5.78x | 1.0–1716x | 15 | 6.0% | 2 |
| bzip2:9 | 5.66x | 1.0–1001x | 5 | 2.0% | 3 |
| rar:m5 | 5.97x | 1.0–1014x | 0 | 0.0% | 4 |
| xz:9 | 5.89x | 1.0–997x | 0 | 0.0% | 5 |
| 7z:mx9 | 5.88x | 1.0–922x | 0 | 0.0% | 6 |
| zstd:19 | 5.14x | 1.0–2641x | 0 | 0.0% | 7 |
| gzip:9 | 4.78x | 1.0–240x | 0 | 0.0% | 8 |
Verified: 250/250 roundtrips pass. Total input: 66.60 MB. lz4/snappy excluded (speed-focused, optimize for a different point on the curve).
mzip optimizes for the smallest output and accepts a slower decode in return. If your workload is read-heavy and latency-bound, this is the wrong tool — use zstd. If you write once and store / transfer often, the savings show up on every later read.
| Compressor | Decode time (66.6 MB) | Decode speed |
|---|---|---|
| zstd:19 | 92 ms | 722 MB/s |
| mzip | 3,523 ms | 19 MB/s |
zstd is roughly 38× faster to decompress. Most of the gap is BWT-inverse vs LZ77-inverse — a structural difference, not a tuning gap. Compression time on the same suite is mzip ≈ 3.5× slower than zstd:19, which is closer because both are doing real work. Streaming / incremental decode is on the roadmap but not in main yet.
| Size | Wins | Total | Win% |
|---|---|---|---|
| 4KB | 43 | 50 | 86.0% |
| 16KB | 48 | 50 | 96.0% |
| 64KB | 45 | 50 | 90.0% |
| 256KB | 48 | 50 | 96.0% |
| 1MB | 49 | 50 | 98.0% |
| Type | mzip | 2nd Best | Advantage |
|---|---|---|---|
| Database IDs (1MB) | 32B (32768x) | 3.4KB | 106.8x better |
| Timestamps (1MB) | 32B (32768x) | 2.7KB | 84.2x better |
| Database IDs (256KB) | 32B (8192x) | 937B | 29.3x better |
| Timestamps (256KB) | 32B (8192x) | 772B | 24.1x better |
| Database IDs (64KB) | 32B (2048x) | 301B | 9.4x better |
| Timestamps (64KB) | 32B (2048x) | 287B | 9.0x better |
| Image gradient (256KB) | 53B (4946x) | 323B | 6.1x better |
| Image gradient (64KB) | 39B (1680x) | 212B | 5.4x better |
| Timestamps (16KB) | 32B (512x) | 160B | 5.0x better |
| JSON API (1MB) | 10KB (104x) | 49KB | 4.9x better |
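Most of the remaining gaps are on small (4–16 KB) text/code/config, where brotli's pre-built 120 KB static dictionary gives an edge that no per-file approach can fully match without shipping its own dictionary. The larger-input exceptions in the table below are Terraform (64 KB to 1 MB, behind brotli) and INI config (64 KB, behind bzip2).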
| Type | Size | mzip | Best | Gap |
|---|---|---|---|---|
| Terraform | 1MB | 41KB | brotli: 40KB | +825B |
| Terraform | 64KB | 3.4KB | brotli: 3.0KB | +379B |
| HTML | 4KB | 1004B | brotli: 821B | +183B |
| Terraform | 256KB | 10KB | brotli: 10KB | +161B |
| JSON log | 16KB | 2.4KB | brotli: 2.3KB | +148B |
| Makefile | 4KB | 809B | brotli: 697B | +112B |
| JSON log | 4KB | 866B | brotli: 768B | +98B |
| INI config | 64KB | 10KB | bzip2: 10KB | +87B |
| Unicode text | 4KB | 626B | brotli: 549B | +77B |
| Bash | 4KB | 825B | brotli: 765B | +60B |
enwik9 is the canonical benchmark for text compressors — the first 10⁹ bytes of an English Wikipedia dump. The numbers below are mzip vs every standard library compressor on the first 1 MB and 10 MB prefixes.
| Compressor | enwik9 1 MB | enwik9 10 MB | Class |
|---|---|---|---|
| zstd:19 | 312,639 | 2,921,957 | LZ77 + entropy |
| gzip:9 | 356,643 | 3,720,323 | LZ77 + Huffman |
| bzip2:9 | 294,484 | 3,054,639 | BWT + multi-tree Huffman |
| xz:9 | 302,832 | 2,844,360 | LZMA |
| 7z:mx9 | 302,910 | 2,844,433 | LZMA2 |
| brotli:11 | 293,057 | 2,827,632 | LZ77 + Huffman + 120 KB static dict |
| mzip | 286,307 | 2,671,197 | BWT + capfold + word dict + LZP-after-dict, per-block dynamic trial |
| bzip3 † | ~245,000 | ~2,300,000 | BWT + LZP + arithmetic + 1-symbol context model (CLI, not a library) |
| bsc-m03 † | — | — | LZP + BWT + M03 context coder (CLI, research-grade, not a library) |
mzip produces the smallest output on both prefixes among standard library compressors — beats brotli:11 by 5.9%, xz/7z by 6.5%, bzip2:9 by 14.4% on 10 MB; beats brotli:11 by 2.4%, xz/7z by 5.8%, bzip2:9 by 2.9% on 1 MB.
† The honest ceiling. bzip3 and bsc-m03 are CLI archive tools (not embeddable libraries) that replace the entropy backend with adaptive arithmetic coding driven by a post-BWT context model. They beat mzip on enwik9 by ~14% (bzip3) to ~20% (bsc-m03) at the cost of much slower compression and no library API. ZPAQ achieves better ratios still (closer to cmix) at the cost of being orders of magnitude slower. Closing the gap to bzip3/bsc-class without giving up the library API is the goal of BWT_ROADMAP.md.
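The "beats X by N%" figures for the standard library compressors are reproducible from the byte counts in the table if the gap is measured relative to mzip's output size, i.e. how much larger the other compressor's output is. The helper below is a small sketch of that arithmetic; `percent_larger` is a name introduced here for illustration.

```cpp
#include <cmath>
#include <cstdint>

// Percent by which `other` exceeds `mzip`, rounded to one decimal place.
double percent_larger(std::uint64_t other, std::uint64_t mzip) {
    const double pct = 100.0 *
        (static_cast<double>(other) - static_cast<double>(mzip)) /
        static_cast<double>(mzip);
    return std::round(pct * 10.0) / 10.0;
}
```

For example, `percent_larger(2827632, 2671197)` gives 5.9 for brotli:11 on the 10 MB prefix. Measured the other way round (bytes saved as a fraction of brotli's output), the same pair works out to roughly 5.5%, so it matters which convention a comparison uses.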
Synthetic data, generated at each size by generators.hpp so results are reproducible.
| Type | 64KB | 256KB | 1MB | Samples |
|---|---|---|---|---|
| Timestamps | 32B vs 287B | 32B vs 772B | 32B vs 2.6KB | 64k 256k 1m |
| Database IDs | 32B vs 301B | 32B vs 937B | 32B vs 3.3KB | 64k 256k 1m |
| Integer array | 3.3KB vs 4.4KB | 12KB vs 17KB | 51KB vs 67KB | 64k 256k 1m |
| GPS coordinates | 9.7KB vs 11KB | 38KB vs 44KB | 154KB vs 179KB | 64k 256k 1m |
| Float temperature | 11KB vs 22KB | 40KB vs 87KB | 151KB vs 331KB | 64k 256k 1m |
| Sensor 16-bit | 26KB vs 27KB | 107KB vs 111KB | 430KB vs 445KB | 64k 256k 1m |
Format: mzip vs 2nd-best. Bold = winner.
| Type | 64KB | 256KB | 1MB | Samples |
|---|---|---|---|---|
| GraphQL queries | 2.8KB vs 2.8KB | 7.5KB vs 7.8KB | 25KB vs 28KB | 64k 256k 1m |
| SQL dump | 4.6KB vs 4.7KB | 14KB vs 15KB | 53KB vs 56KB | 64k 256k 1m |
| JSON API | 1016B vs 3.7KB | 2.8KB vs 12KB | 9.8KB vs 48KB | 64k 256k 1m |
| XML document | 1020B vs 2.2KB | 2.9KB vs 8.0KB | 10KB vs 29KB | 64k 256k 1m |
| CSV data | 8.1KB vs 9.8KB | 27KB vs 33KB | 100KB vs 122KB | 64k 256k 1m |
| Base64 data | 47KB vs 48KB | 189KB vs 192KB | 758KB vs 771KB | 64k 256k 1m |
Format: mzip vs 2nd-best. Bold = winner.
| Type | 64KB | 256KB | 1MB | Samples |
|---|---|---|---|---|
| JavaScript | 4.0KB vs 5.2KB | 10KB vs 12KB | 37KB vs 43KB | 64k 256k 1m |
| Python | 5.2KB vs 5.3KB | 12KB vs 12KB | 33KB vs 40KB | 64k 256k 1m |
| TypeScript | 4.2KB vs 4.4KB | 12KB vs 12KB | 40KB vs 45KB | 64k 256k 1m |
| HTML | 5.4KB vs 5.7KB | 16KB vs 18KB | 61KB vs 68KB | 64k 256k 1m |
| CSS | 4.1KB vs 4.1KB | 11KB vs 11KB | 40KB vs 43KB | 64k 256k 1m |
| Go | 3.3KB vs 3.4KB | 8.1KB vs 8.5KB | 24KB vs 28KB | 64k 256k 1m |
| Rust | 3.5KB vs 3.5KB | 8.9KB vs 9.1KB | 28KB vs 31KB | 64k 256k 1m |
| Java | 3.7KB vs 3.9KB | 10.0KB vs 10KB | 29KB vs 36KB | 64k 256k 1m |
| C | 4.9KB vs 5.2KB | 14KB vs 15KB | 47KB vs 54KB | 64k 256k 1m |
| Bash | 3.5KB vs 3.7KB | 9.9KB vs 10KB | 34KB vs 37KB | 64k 256k 1m |
| PHP | 3.2KB vs 3.3KB | 7.8KB vs 8.6KB | 23KB vs 27KB | 64k 256k 1m |
Format: mzip vs 2nd-best. Bold = winner.
| Type | 64KB | 256KB | 1MB | Samples |
|---|---|---|---|---|
| Docker Compose | 2.1KB vs 2.1KB | 5.4KB vs 5.6KB | 17KB vs 19KB | 64k 256k 1m |
| Terraform | 3.4KB vs 3.0KB | 10KB vs 10KB | 41KB vs 40KB | 64k 256k 1m |
| K8s manifests | 3.2KB vs 3.3KB | 7.4KB vs 7.6KB | 21KB vs 24KB | 64k 256k 1m |
| YAML config | 3.8KB vs 3.8KB | 11KB vs 11KB | 38KB vs 41KB | 64k 256k 1m |
Format: mzip vs 2nd-best. Bold = winner.
| Type | 64KB | 256KB | 1MB | Samples |
|---|---|---|---|---|
| Access log | 6.3KB vs 6.8KB | 21KB vs 24KB | 85KB vs 94KB | 64k 256k 1m |
| Nginx access log | 6.5KB vs 6.8KB | 21KB vs 22KB | 82KB vs 87KB | 64k 256k 1m |
Format: mzip vs 2nd-best. Bold = winner.
| Type | 64KB | 256KB | 1MB | Samples |
|---|---|---|---|---|
| Image gradient | 39B vs 212B | 53B vs 323B | 124B vs 397B | 64k 256k 1m |
| Audio PCM | 1.7KB vs 4.0KB | 1.7KB vs 4.0KB | 1.7KB vs 4.0KB | 64k 256k 1m |
| Sparse bitmap | 689B vs 880B | 2.6KB vs 3.0KB | 10KB vs 11KB | 64k 256k 1m |
| Protobuf-like | 40KB vs 41KB | 160KB vs 163KB | 640KB vs 650KB | 64k 256k 1m |
Format: mzip vs 2nd-best. Bold = winner.
| Type | 64KB | 256KB | 1MB | Samples |
|---|---|---|---|---|
| Natural text | 6.7KB vs 7.2KB | 26KB vs 28KB | 104KB vs 111KB | 64k 256k 1m |
| Markdown docs | 3.6KB vs 3.8KB | 10KB vs 11KB | 36KB vs 41KB | 64k 256k 1m |
| Email headers | 6.6KB vs 6.8KB | 20KB vs 21KB | 75KB vs 79KB | 64k 256k 1m |
| Unicode text | 2.4KB vs 2.4KB | 7.5KB vs 7.5KB | 27KB vs 28KB | 64k 256k 1m |
| Syslog | 8.8KB vs 9.4KB | 32KB vs 34KB | 126KB vs 133KB | 64k 256k 1m |
| Metrics | 7.8KB vs 7.8KB | 29KB vs 29KB | 117KB vs 117KB | 64k 256k 1m |
| JSON log | 7.2KB vs 8.0KB | 29KB vs 30KB | 116KB vs 122KB | 64k 256k 1m |
| Timestamps (jitter) | 14KB vs 15KB | 56KB vs 61KB | 224KB vs 244KB | 64k 256k 1m |
Format: mzip vs 2nd-best. Bold = winner.
| Type | 64KB | 256KB | 1MB | Samples |
|---|---|---|---|---|
| Makefile | 3.8KB vs 4.6KB | 12KB vs 16KB | 46KB vs 61KB | 64k 256k 1m |
| package.json | 4.6KB vs 4.8KB | 14KB vs 15KB | 54KB vs 57KB | 64k 256k 1m |
| Cargo.toml | 3.3KB vs 3.5KB | 9.5KB vs 10KB | 32KB vs 38KB | 64k 256k 1m |
Format: mzip vs 2nd-best. Bold = winner.
47 files (7.2 MB total) pulled from public GitHub repos — React, Linux kernel, Django, Bootstrap, lodash, plus 20+ programming-language files. Mix of source code, configs, logs, JSON, CSV, markdown. All 47 round-trip-verified.
Full per-file table: real_bench_summary.md. Raw output: real_bench_results.txt.
- mzip wins or ties on 13 / 47 files (27.7%)
- brotli:11 wins 32, bzip2:9 wins 1, xz:9 wins 1, all others 0
The synthetic 93.2% does not survive intact on real GitHub source code — the synthetic suite includes a lot of formula-compressible content (sequential IDs, timestamps, gradients, audio, generated templates) that hand-written code rarely contains. Read both numbers, not just one.
| File | Size | mzip ratio | 2nd best | Advantage |
|---|---|---|---|---|
| apache_log_sample.log | 2.26MB | 22.83x | brotli: 116KB | +12.5% |
| dashboard.html | 42KB | 34.04x | brotli: 1.3KB | +7.1% |
| docker-compose.yml | 3.9KB | 4.23x | brotli: 1.0KB | +8.2% |
| nginx_access.log | 417KB | 12.11x | bzip2: 35KB | +1.3% |
| app.log | 464KB | 7.90x | bzip2: 60KB | +2.5% |
| events.csv | 578KB | 7.13x | bzip2: 82KB | +1.1% |
| lodash.js | 532KB | 7.85x | bzip2: 69KB | +2.1% |
| metrics.prom | 176KB | 10.12x | bzip2: 17KB | +1.1% |
| users.json | 170KB | 10.11x | bzip2: 17KB | +1.0% |
| linux_kernel.c | 281KB | 4.41x | bzip2: 64KB | +1.0% |
| go_http.go | 128KB | 3.74x | brotli: 34KB | +0.2% |
| handlers.go | 14KB | 17.72x | brotli: 814B | +0.9% |
| styles.css | 19.6KB | 9.28x | bzip2: 2.1KB | +1.3% |
Brotli's 120KB pre-built static English/web dictionary gives it a structural edge on small code/config/markdown that no per-file approach can fully match.
| File | Size | mzip gap |
|---|---|---|
| contributing.md | 6.6KB | +21.7% |
| sql_schema.sql | 4.1KB | +20.0% |
| k8s_deployments.yaml | 21KB | +18.8% |
| api_docs.md | 17KB | +17.4% |
| ruby_rails.rb | 15KB | +16.6% |
…and 27 more, mostly handwritten source code 4–50 KB, gaps typically +3% to +12%. Full list in real_bench_summary.md.
| Category | mzip wins | Total | Win% |
|---|---|---|---|
| Logs | 3 | 3 | 100% |
| CSV / columnar | 1 | 1 | 100% |
| Metrics | 1 | 1 | 100% |
| Web (HTML/CSS) | 2 | 3 | 67% |
| JSON | 1 | 2 | 50% |
| Config files | 1 | 4 | 25% |
| Source code (general) | 4 | 25 | 16% |
| Markdown | 0 | 3 | 0% |
| SQL | 0 | 2 | 0% |
The encoder runs detection on every block, picks a candidate strategy from one of five families below, then trials multiple variants per block and keeps the smallest output that round-trip-verifies.
| Family | Picked when | Headline strategies |
|---|---|---|
| Formula / numeric | Bytes look like a generator output: linear, periodic, geometric, modular, smooth | LINEAR_GEN, NUMERIC (delta / strided / ALP), PERIODIC, MODULAR, LINEAR_PRED |
| Templates / structured text | Repeating lines or blocks with a few varying tokens (logs, generated docs, K8s, SQL INSERTs) | TEMPLATE, SECTION_TEMPLATE, ML_TEMPLATE, WORD_TEMPLATE / MULTI_WORD_TEMPLATE, LINE_GROUP_TEMPLATE |
| Stream separation | Fixed columns, tag/content split, key/value records | COLUMNAR / BLOCK_COLUMNAR, CSV_COLUMNAR, JSON_COLUMNAR, HTML_STREAM, URL_STREAM, DBF_CONSTCOL |
| Binary / executable | x86 code, raw RGB, sparse bitmaps, base64 text | E8E9_X86 + LZMA_OPTIMAL, PAETH_RGB, SPARSE, BASE64_DECODE |
| Text backends | None of the above fits — prose, code, config, mixed | BWT_TEXT, BG (single-block BWT for ≥1 MB), MC (per-chunk pick), ZSTD_DICT (4–16 KB code/config), WORD_ENCODED, KV_CONFIG, RAW (fallback to zstd:19) |
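As an illustration of the stream-separation family, the sketch below splits CSV text into one stream per column and joins it back. It is a simplified stand-in, not mzip's CSV_COLUMNAR format: `split_columns` and `join_columns` are hypothetical names, and the code assumes rectangular input with no quoted fields and a trailing newline on every row.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Splits rectangular CSV text into one newline-delimited stream per column.
// Assumes no quoted fields and that every row ends with '\n'.
std::vector<std::string> split_columns(const std::string& csv) {
    std::vector<std::string> columns;
    std::istringstream rows(csv);
    std::string row;
    while (std::getline(rows, row)) {
        std::size_t col = 0, start = 0;
        while (true) {
            const std::size_t comma = row.find(',', start);
            if (col == columns.size()) columns.emplace_back();
            // substr clamps the length, so npos here means "to end of row".
            columns[col] += row.substr(start, comma - start);
            columns[col] += '\n';
            ++col;
            if (comma == std::string::npos) break;
            start = comma + 1;
        }
    }
    return columns;
}

// Inverse of split_columns: interleaves the column streams back into rows.
std::string join_columns(const std::vector<std::string>& columns) {
    std::vector<std::istringstream> readers;
    for (const std::string& c : columns) readers.emplace_back(c);
    std::string csv, field;
    while (!readers.empty() && std::getline(readers[0], field)) {
        csv += field;
        for (std::size_t i = 1; i < readers.size(); ++i) {
            std::getline(readers[i], field);
            csv += ',';
            csv += field;
        }
        csv += '\n';
    }
    return csv;
}
```

After the split, each stream holds values of one kind (all IDs, all status strings), so a per-column backend sees much more regular input than it would in the interleaved rows.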
The BWT text backend (BWT_TEXT / BG) is itself a per-block trial over: pre-RLE on/off, dict size ∈ {64, 128, 192, 255}, number of Huffman trees ∈ {3..7}, LZP-after-dict min-match ∈ {10, 20, 40}, capfold on/off. The smallest valid combination wins.
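If that trial is exhaustive, the grid has 2 × 4 × 5 × 3 × 2 = 240 combinations per block. The sketch below enumerates it and keeps the smallest valid result. `BwtParams`, `best_params`, and the `encoded_size` callback are assumptions made for illustration; whether mzip walks the full grid or prunes it is not stated here.

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <initializer_list>
#include <optional>

// One point in the trial grid. Field names are illustrative.
struct BwtParams {
    bool pre_rle;
    int dict_size;
    int huffman_trees;
    int lzp_min_match;
    bool capfold;
};

// Tries every combination and returns the one with the smallest encoded size.
// `encoded_size` is a caller-supplied stand-in for "encode the block with
// these settings, verify the round trip, report the size"; it returns
// nullopt when the combination does not produce a valid result.
std::optional<BwtParams> best_params(
    const std::function<std::optional<std::size_t>(const BwtParams&)>& encoded_size,
    std::size_t* combinations_tried = nullptr) {
    constexpr std::array<int, 4> dict_sizes{64, 128, 192, 255};
    constexpr std::array<int, 3> min_matches{10, 20, 40};
    std::optional<BwtParams> best;
    std::size_t best_size = 0;
    std::size_t tried = 0;
    for (bool pre_rle : {false, true})
        for (int dict : dict_sizes)
            for (int trees = 3; trees <= 7; ++trees)
                for (int min_match : min_matches)
                    for (bool capfold : {false, true}) {
                        const BwtParams p{pre_rle, dict, trees, min_match, capfold};
                        ++tried;
                        const auto size = encoded_size(p);
                        if (size && (!best || *size < best_size)) {
                            best_size = *size;
                            best = p;
                        }
                    }
    if (combinations_tried) *combinations_tried = tried;
    return best;
}
```

A per-block search of this kind trades encode time for ratio, which fits the write-once, read-often workloads the library targets.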
For the full enum of named strategies (~30) with selection rules, grep BlockType:: and case BlockType:: in mzip.hpp — each block-type carries a one-line // what it does comment at its definition.
Requirements: a C++17 compiler and the zstd library headers + shared object. Install zstd if you don't already have it:
```bash
# macOS
brew install zstd
# Debian / Ubuntu
sudo apt install libzstd-dev
# Fedora
sudo dnf install libzstd-devel
# Windows / MSYS2
pacman -S mingw-w64-x86_64-zstd
```

The amalgamated header bundles mzip + the BWT pipeline + libsais. You only need to add zstd.
```cpp
// In ONE translation unit:
#include <zstd.h>
#define MZIP_IMPLEMENTATION
#include "mzip_amalgamated.hpp"
// In every other translation unit that uses mzip:
#include "mzip_amalgamated.hpp"
// Usage:
auto compressed = mzip::compress(data.data(), data.size());
auto decompressed = mzip::decompress(compressed.data(), compressed.size());
```

```cpp
#include <zstd.h>  // include zstd first
#include "mzip.hpp"
auto compressed = mzip::compress(data.data(), data.size());
auto decompressed = mzip::decompress(compressed.data(), compressed.size());
```

```bash
# Single-header build (libsais bundled inside)
g++ -std=c++17 -O3 -march=native -o mzip_cli mzip_cli.cpp -lzstd
# Separate-headers build (needs libsais.c too)
g++ -std=c++17 -O3 -march=native -o mzip_cli mzip_cli.cpp libsais.c -lzstd
```

If your zstd headers are not on the default search path, add `-I/path/to/zstd/include -L/path/to/zstd/lib`.
```bash
# Compress
./mzip_cli compress input.bin output.mzip
# Decompress
./mzip_cli decompress output.mzip restored.bin
```

```bash
# Build (assumes zstd is installed; otherwise add -I/-L flags as in Quick Start above)
g++ -std=c++17 -O3 -march=native -o mzip_bench mzip_bench.cpp libsais.c -lzstd
# Synthetic suite: 50 types × 5 sizes = 250 tests, ~10–15 min
./mzip_bench --csv full_bench.csv
# Quick (64 KB only)
./mzip_bench --quick
# Single type, all sizes
./mzip_bench --type graphql
# Single real file
./mzip_bench --file path/to/file.bin
# All 47 real-world files
./mzip_bench --file real_bench/*
# Regenerate the README tables from a fresh CSV
python generate_readme_tables.py full_bench.csv
```

| File | Description |
|---|---|
| `mzip.hpp` | Main library; include this |
| `mzip_amalgamated.hpp` | Single-header build (mzip + BWT + libsais bundled) |
| `bwt_compress_v5.hpp` / `v8.hpp` / `v9.hpp` | BWT pipelines (v5 = current prose backend) |
| `word_dict.hpp` | Per-file word-dictionary preprocessor (BWT_TEXT helper) |
| `cap_fold.hpp` | Capital-letter folding (BWT_TEXT helper) |
| `bigram_dict.hpp`, `xml_entity.hpp` | Pre-BWT preprocessing candidates (auto-deselected when they don't help) |
| `range_coder.hpp` | LZMA-style binary range coder (variant backend) |
| `mzip_dicts.h` | Pre-trained zstd group dictionaries (ZSTD_DICT strategy) |
| `lzma_optimal2.hpp`, `lzma_decoder.hpp` | LZMA optimal encoder + decoder (LZMA_OPTIMAL strategy) |
| `mzip_base64.hpp` | Base64 detect / decode helper (BASE64_DECODE strategy) |
| `generators.hpp` | Single source of truth for benchmark / test data |
| `libsais.h` | BWT suffix array (Apache 2.0) |
| `stb_image.h`, `stb_image_write.h` | Image IO (Public Domain), for image strategies |
| `mzip_bench.cpp` | Benchmark tool; `--csv` exports results |
| `mzip_cli.cpp` | Command-line interface |
| `mzip_test.cpp` | Quick debug / single-type test |
| `mzip_unit_tests.cpp` | Unit tests for core strategies |
| `generate_readme_tables.py` | Auto-generate the markdown tables above from `full_bench.csv` |
| `summarize_real_bench.py` | Auto-generate `real_bench_summary.md` from `real_bench_results.txt` |
| `samples/` | Sample files at 4 / 16 / 64 / 256 KB and 1 MB |
| `real_bench/` | 47 real-world files used by the real-world benchmark |
| `full_bench.csv` | Latest synthetic benchmark CSV (one row per (type, size)) |
| `real_bench_results.txt` | Latest real-files benchmark raw output |
| `real_bench_summary.md` | Auto-generated summary of `real_bench_results.txt` |
| `BWT_ROADMAP.md` | Path from current BWT pipeline toward bzip3 / bsc-m03 class |
| `HUNT_LOG.md` | Per-loss diagnosis log (Loss / Strategy / Claim / Verdict / Action) |
| `IMPROVEMENTS.md` | Architectural / adoption-focused improvement plan |
Dual-licensed: AGPL-3.0 OR commercial.
- AGPL-3.0: free for open-source projects. If you deploy mzip as part of a network service (SaaS, hosted API, etc.), the AGPL requires you to make your source available.
- Commercial: for proprietary or closed-source use, open a GitHub issue tagged `commercial-license` and I'll follow up. Same for bug reports, benchmarks on your own data, or proposing a new strategy.
Third-party code bundled in the repo: libsais (Apache 2.0), stb_image (Public Domain). zstd is required at link-time but not bundled (BSD).