Cranot/mzip

mzip

Detection-based compression that finds patterns zstd, brotli, and bzip2 miss.

Store the formula, not the data.

TL;DR

A C++17 single-header library that detects mathematical structure and per-file patterns before reaching for an entropy coder. On data that has structure (numeric sequences, templates, columns, prose, audio, gradients, logs) it produces meaningfully smaller output than zstd:19, brotli:11, bzip2:9, xz:9, 7z, and rar. On generic small source code it is competitive but rarely beats brotli. Every output is round-trip-verified before the encoder commits to it.

| Benchmark | mzip wins | Notes |
|---|---|---|
| 250 synthetic tests (50 types × 5 sizes) | 233 / 250 (93.2%) | Avg ratio 8.25×, the top non-neural ratio among the 8 compressors tested. All 250 roundtrips verified. |
| 47 real GitHub files (7.2 MB) | 13 / 47 (27.7%) | Wins everywhere data has structure (logs, CSV, JSON, large mixed). Loses to brotli on small handwritten source code. All 47 roundtrips verified. |
| enwik9 10 MB Wikipedia prose | 2,671,197 bytes | Beats brotli:11 by 5.9%, bzip2:9 by 14.4%. Smallest output of any standard library compressor on this benchmark. |

Sections: Why · Strengths & limits · Synthetic results · enwik9 prose · Real-world files · Strategies · Quick start


Why mzip?

Most general-purpose compressors treat their input as an opaque byte stream and rely on LZ77 + entropy coding. Real data is rarely random: it has structure that LZ77 cannot reach. A sequential-ID column is a formula v[i] = a + b·i, a JSON API response is a template with variables, and an audio waveform is a smooth function. mzip detects that structure first and substitutes the minimal description before the entropy coder ever runs.
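To make the v[i] = a + b·i idea concrete, here is a minimal sketch of linear-sequence detection. It is an illustration only, not mzip's LINEAR_GEN code: the names `LinearFormula`, `detect_linear`, and `expand_linear` are hypothetical, and the sketch assumes the input has already been parsed into 64-bit integers.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical illustration; not taken from mzip.hpp.
struct LinearFormula {
    std::uint64_t a;  // first value
    std::uint64_t b;  // constant step (mod 2^64, so negative steps wrap around)
    std::size_t   n;  // number of values
};

// Returns the formula if every value satisfies v[i] = a + b*i, otherwise nullopt.
std::optional<LinearFormula> detect_linear(const std::vector<std::uint64_t>& v) {
    if (v.size() < 2) return std::nullopt;
    const std::uint64_t a = v[0];
    const std::uint64_t b = v[1] - v[0];  // unsigned wraparound is well-defined
    for (std::size_t i = 2; i < v.size(); ++i) {
        if (v[i] - v[i - 1] != b) return std::nullopt;
    }
    return LinearFormula{a, b, v.size()};
}

// Rebuilds the original sequence from the three stored numbers.
std::vector<std::uint64_t> expand_linear(const LinearFormula& f) {
    std::vector<std::uint64_t> out(f.n);
    std::uint64_t cur = f.a;
    for (std::size_t i = 0; i < f.n; ++i) {
        out[i] = cur;
        cur += f.b;
    }
    return out;
}
```

When detection succeeds, only `a`, `b`, and `n` need to be stored, regardless of how long the sequence is. A single out-of-pattern value makes detection fail, at which point a real encoder would fall through to another strategy.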

A few representative wins (1 MB inputs, vs the best of zstd:19 / brotli:11 / bzip2:9 / xz:9 / 7z / rar):

| Input pattern | mzip output | Best other | Advantage |
|---|---|---|---|
| Sequential database IDs | 32 B | bzip2: 3.4 KB | 106× smaller |
| Repeating JSON API templates | 10 KB | brotli: 49 KB | 4.9× smaller |
| 16-bit audio PCM | 1.7 KB | bzip2: 4.0 KB | 2.4× smaller |
| RGB image gradient | 124 B | brotli: 397 B | 3.2× smaller |
| 10 MB enwik9 Wikipedia prose | 2.67 MB | brotli:11: 2.83 MB | 5.9% smaller |

The first four wins come from formula or template detection — algorithmic substitutions LZ77 has no path to. The fifth win comes from a tuned BWT pipeline (capfold + word dict + LZP-after-dict + multi-tree Huffman with per-block dynamic trial) that puts mzip at the top of standard library compressors on long-form English prose.
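As an illustration of the capfold step named above, the sketch below folds capital letters to lowercase behind an escape byte, so that "The" and "the" look alike to later stages. This is an assumption about the general technique, not the contents of cap_fold.hpp: the marker value, the function names, and the per-letter (rather than per-word) scheme are all hypothetical.

```cpp
#include <optional>
#include <string>

// Hypothetical escape byte; the sketch refuses input that already contains it.
constexpr char kCapMarker = '\x01';

// Replaces each ASCII capital with marker + lowercase letter.
std::optional<std::string> capfold(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (char ch : text) {
        if (ch == kCapMarker) return std::nullopt;  // cannot be escaped unambiguously here
        if (ch >= 'A' && ch <= 'Z') {
            out.push_back(kCapMarker);
            out.push_back(static_cast<char>(ch - 'A' + 'a'));
        } else {
            out.push_back(ch);
        }
    }
    return out;
}

// Inverse transform: a marker upper-cases the letter that follows it.
std::string capunfold(const std::string& folded) {
    std::string out;
    out.reserve(folded.size());
    bool upper_next = false;
    for (char ch : folded) {
        if (ch == kCapMarker) {
            upper_next = true;
            continue;
        }
        if (upper_next && ch >= 'a' && ch <= 'z') {
            ch = static_cast<char>(ch - 'a' + 'A');
        }
        upper_next = false;
        out.push_back(ch);
    }
    return out;
}
```

The folded text is slightly longer than the input, so a transform like this only pays off if the downstream stages gain more from the unified contexts than the markers cost. That is presumably why the README lists capfold as an on/off trial rather than an always-on step.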


Key Strengths

mzip is a specialist, not a general archiver. On data that has structure — numeric sequences, templates, columns, prose, audio, gradients, logs — it produces meaningfully smaller output than zstd:19, brotli:11, bzip2:9, xz:9, 7z, or rar. On generic small handwritten source code (a few KB of TypeScript, Markdown, or Python) brotli's 120 KB static dictionary usually wins. The 93.2% synthetic win rate (formula-friendly suite) and 27.7% real-world win rate (47-file GitHub corpus) are both honest measurements of those two regimes. Read both, then pick the tool for the data you actually have.

When to use mzip

  • Long-form English text (≥ 1 MB): prose, books, email archives, Wikipedia-class content. mzip beats brotli:11 by 5.9% on enwik9 10 MB.
  • Generated / templated data: K8s manifests, OpenAPI output, log streams, repeating JSON, SQL INSERTs, generated docs.
  • Numeric arrays and time series: sequential IDs, timestamps, counters, sensor data, audio PCM, GPS coordinates, float arrays.
  • Stream-separable formats: CSV, fixed-width logs, HTML, large XML, JSON Lines.
  • Anything ≥ 256 KB where compression ratio is the budget you care about.

When not to use mzip

  • Latency-bound read paths: mzip decompresses ~38× slower than zstd. If your hot path opens compressed files repeatedly, use zstd.
  • Small handwritten source code (a few KB of code/config/markdown): brotli usually wins by 3–20%.
  • Already-compressed or encrypted data: nothing to detect, RAW path falls back to zstd:19 — just use zstd directly.
  • General folder backup / archival: ZPAQ, 7z LZMA2, and xz are mature CLI archivers with broader format support. mzip is a library, not an archive format.

Where mzip wins big

| Data category | Synthetic win rate | Why |
|---|---|---|
| Numeric sequences (IDs, timestamps, counters) | 100% (40/40) | Formula compression: v[i] = a + b·i beats any LZ77 |
| Binary / audio / sensor | 100% (20/20) | Delta + ALP for floats, Paeth/E8E9 predictors |
| Long-form text (prose / markdown) | 100% (15/15) | BWT pipeline + capfold + word dict + LZP-after-dict |
| Code (JS, Py, Go, Rust, ...) at scale | 94.5% (52/55) | Identifier-stream separation, BWT, ZSTD_DICT trial |
| Logs (access / syslog / JSON log) | 80–96% | Columnar separation + BWT per column |
| Anything ≥ 256 KB | 97% (97/100) | More data → more patterns to detect |
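The Paeth predictor mentioned in the table is the one defined by the PNG specification. The sketch below applies it to a single-channel image and shows the inverse; it is a simplified illustration, not mzip's PAETH_RGB code (which, per its name, targets RGB data). The function names and the single-channel layout are assumptions, and `width` must be non-zero.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Paeth predictor from the PNG specification:
// a = left, b = above, c = upper-left neighbour.
inline std::uint8_t paeth(std::uint8_t a, std::uint8_t b, std::uint8_t c) {
    const int p  = int(a) + int(b) - int(c);
    const int pa = std::abs(p - int(a));
    const int pb = std::abs(p - int(b));
    const int pc = std::abs(p - int(c));
    if (pa <= pb && pa <= pc) return a;
    return (pb <= pc) ? b : c;
}

// Replaces every pixel with its prediction residual (mod 256).
// Neighbours outside the image are treated as 0.
std::vector<std::uint8_t> paeth_filter(const std::vector<std::uint8_t>& img, std::size_t width) {
    std::vector<std::uint8_t> out(img.size());
    for (std::size_t i = 0; i < img.size(); ++i) {
        const bool has_left = (i % width) != 0;
        const bool has_up   = i >= width;
        const std::uint8_t a = has_left ? img[i - 1] : std::uint8_t(0);
        const std::uint8_t b = has_up ? img[i - width] : std::uint8_t(0);
        const std::uint8_t c = (has_left && has_up) ? img[i - width - 1] : std::uint8_t(0);
        out[i] = static_cast<std::uint8_t>(img[i] - paeth(a, b, c));
    }
    return out;
}

// Inverse: rebuilds pixels in raster order from already-decoded neighbours.
std::vector<std::uint8_t> paeth_unfilter(const std::vector<std::uint8_t>& res, std::size_t width) {
    std::vector<std::uint8_t> img(res.size());
    for (std::size_t i = 0; i < res.size(); ++i) {
        const bool has_left = (i % width) != 0;
        const bool has_up   = i >= width;
        const std::uint8_t a = has_left ? img[i - 1] : std::uint8_t(0);
        const std::uint8_t b = has_up ? img[i - width] : std::uint8_t(0);
        const std::uint8_t c = (has_left && has_up) ? img[i - width - 1] : std::uint8_t(0);
        img[i] = static_cast<std::uint8_t>(res[i] + paeth(a, b, c));
    }
    return img;
}
```

On a smooth gradient the residuals collapse to a few small repeated values, which an entropy coder handles far better than the raw pixels.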

Where mzip loses

| Scenario | Winner | Why |
|---|---|---|
| Small handwritten source code (4–30 KB) | brotli:11 | brotli ships a 120 KB pre-built English/web dictionary; per-file approaches can't fully match it without shipping their own |
| Random / encrypted / already-compressed data | zstd | No patterns to detect; the entropy coder is all that's left |
| Decompression speed | zstd | mzip is ~38× slower to decompress (BWT vs LZ77 inverse). Trade-off, not a bug. See Decompression Speed (the trade-off). |

Benchmark Results

All synthetic benchmarks below are generated deterministically by generators.hpp (seeded RNG, seed=42). Click sample links to download the exact input bytes the table reports on.

Roundtrip-verified. mzip's encoder verifies its own output decodes back to the input before returning. Strategies that produce non-roundtrippable bytes are auto-discarded in favor of the next-best valid candidate. All 250 synthetic + 47 real-world results below pass verification.
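The verify-then-commit rule can be sketched as follows. The `Strategy` struct and `pick_smallest_verified` are hypothetical names for illustration; mzip's internal candidate types are not shown in this README.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Hypothetical strategy interface, for illustration only.
struct Strategy {
    std::function<Bytes(const Bytes&)> encode;
    std::function<Bytes(const Bytes&)> decode;
};

// Tries every strategy and returns the index of the one with the smallest
// output whose decode(encode(input)) equals the input.
// Returns nullopt if no strategy round-trips.
std::optional<std::size_t> pick_smallest_verified(const Bytes& input,
                                                  const std::vector<Strategy>& strategies) {
    std::optional<std::size_t> best;
    std::size_t best_size = 0;
    for (std::size_t i = 0; i < strategies.size(); ++i) {
        const Bytes packed = strategies[i].encode(input);
        if (strategies[i].decode(packed) != input) continue;  // discard non-roundtrippable output
        if (!best || packed.size() < best_size) {
            best = i;
            best_size = packed.size();
        }
    }
    return best;
}
```

The cost of this design is that every surviving candidate is decoded once at compression time, which is one reason an encoder built this way spends more time compressing than a single-strategy encoder.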

Overall Compressor Scoreboard (250 tests: 50 types × 5 sizes)

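| Compressor | Avg Ratio | Range | Wins | Win% | Rank |
|---|---|---|---|---|---|
| mzip | 8.25x | 1.0–32768x | 233 | 93.2% | 1 |
| brotli:11 | 5.78x | 1.0–1716x | 15 | 6.0% | 2 |
| bzip2:9 | 5.66x | 1.0–1001x | 5 | 2.0% | 3 |
| rar:m5 | 5.97x | 1.0–1014x | 0 | 0.0% | 4 |
| xz:9 | 5.89x | 1.0–997x | 0 | 0.0% | 5 |
| 7z:mx9 | 5.88x | 1.0–922x | 0 | 0.0% | 6 |
| zstd:19 | 5.14x | 1.0–2641x | 0 | 0.0% | 7 |
| gzip:9 | 4.78x | 1.0–240x | 0 | 0.0% | 8 |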

Verified: 250/250 roundtrips pass. Total input: 66.60 MB. lz4/snappy excluded (speed-focused, optimize for a different point on the curve).

Decompression Speed (the trade-off)

mzip optimizes for the smallest output and accepts a slower decode in return. If your workload is read-heavy and latency-bound, this is the wrong tool — use zstd. If you write once and store / transfer often, the savings show up on every later read.

| Compressor | Decode time (66.6 MB) | Decode speed |
|---|---|---|
| zstd:19 | 92 ms | 722 MB/s |
| mzip | 3,523 ms | 19 MB/s |

zstd is roughly 38× faster to decompress. Most of the gap is BWT-inverse vs LZ77-inverse — a structural difference, not a tuning gap. Compression time on the same suite is mzip ≈ 3.5× slower than zstd:19, which is closer because both are doing real work. Streaming / incremental decode is on the roadmap but not in main yet.
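To show what "BWT-inverse" involves, here is a textbook Burrows-Wheeler transform and its inverse. This is a teaching sketch, not mzip's pipeline: the forward pass sorts rotations naively (the repo lists libsais for suffix-array construction), and the function names and the `'\0'` end marker are assumptions of this example.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Forward BWT by sorting all rotations. Assumes `text` contains no '\0',
// which is appended as a unique end marker.
std::string bwt_forward(std::string text) {
    text.push_back('\0');
    const std::size_t n = text.size();
    std::vector<std::size_t> rot(n);
    std::iota(rot.begin(), rot.end(), std::size_t{0});
    std::sort(rot.begin(), rot.end(), [&](std::size_t a, std::size_t b) {
        for (std::size_t k = 0; k < n; ++k) {
            const unsigned char ca = text[(a + k) % n];
            const unsigned char cb = text[(b + k) % n];
            if (ca != cb) return ca < cb;
        }
        return false;
    });
    std::string last(n, '\0');
    for (std::size_t i = 0; i < n; ++i) last[i] = text[(rot[i] + n - 1) % n];
    return last;
}

// Inverse BWT via the last-to-first (LF) mapping.
std::string bwt_inverse(const std::string& last) {
    const std::size_t n = last.size();
    if (n == 0) return {};
    std::array<std::size_t, 256> start{};  // first sorted row beginning with each byte
    for (unsigned char c : last) ++start[c];
    std::size_t sum = 0;
    for (std::size_t c = 0; c < 256; ++c) {
        const std::size_t count = start[c];
        start[c] = sum;
        sum += count;
    }
    std::vector<std::size_t> lf(n);
    for (std::size_t i = 0; i < n; ++i) {
        lf[i] = start[static_cast<unsigned char>(last[i])]++;
    }
    // Row 0 starts with the end marker, so its last byte is the final text byte.
    std::string text(n - 1, '\0');
    std::size_t row = 0;
    for (std::size_t k = n - 1; k-- > 0;) {
        text[k] = last[row];
        row = lf[row];  // data-dependent jump to an unpredictable row
    }
    return text;
}
```

The final loop is where the decode cost comes from: each step depends on the previous one and jumps to an arbitrary position in the block, whereas an LZ77 decoder mostly copies contiguous runs of recent bytes.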

Win Rate by Size

| Size | Wins | Total | Win% |
|---|---|---|---|
| 4KB | 43 | 50 | 86.0% |
| 16KB | 48 | 50 | 96.0% |
| 64KB | 45 | 50 | 90.0% |
| 256KB | 48 | 50 | 96.0% |
| 1MB | 49 | 50 | 98.0% |

Top 10 mzip Wins

| Type | mzip | 2nd Best | Advantage |
|---|---|---|---|
| Database IDs (1MB) | 32B (32768x) | 3.4KB | 106.8x better |
| Timestamps (1MB) | 32B (32768x) | 2.7KB | 84.2x better |
| Database IDs (256KB) | 32B (8192x) | 937B | 29.3x better |
| Timestamps (256KB) | 32B (8192x) | 772B | 24.1x better |
| Database IDs (64KB) | 32B (2048x) | 301B | 9.4x better |
| Timestamps (64KB) | 32B (2048x) | 287B | 9.0x better |
| Image gradient (256KB) | 53B (4946x) | 323B | 6.1x better |
| Image gradient (64KB) | 39B (1680x) | 212B | 5.4x better |
| Timestamps (16KB) | 32B (512x) | 160B | 5.0x better |
| JSON API (1MB) | 10KB (104x) | 49KB | 4.9x better |

Where Others Win (Top 10 Gaps)

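Most of the remaining gaps are on small (4–16KB) text/code/config, where brotli's pre-built 120KB English dictionary gives an edge no per-file approach can fully match without shipping its own dictionary. The exceptions in the table are Terraform (64KB to 1MB) and INI config (64KB), each of which loses by less than 1KB.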

| Type | Size | mzip | Best | Gap |
|---|---|---|---|---|
| Terraform | 1MB | 41KB | brotli: 40KB | +825B |
| Terraform | 64KB | 3.4KB | brotli: 3.0KB | +379B |
| HTML | 4KB | 1004B | brotli: 821B | +183B |
| Terraform | 256KB | 10KB | brotli: 10KB | +161B |
| JSON log | 16KB | 2.4KB | brotli: 2.3KB | +148B |
| Makefile | 4KB | 809B | brotli: 697B | +112B |
| JSON log | 4KB | 866B | brotli: 768B | +98B |
| INI config | 64KB | 10KB | bzip2: 10KB | +87B |
| Unicode text | 4KB | 626B | brotli: 549B | +77B |
| Bash | 4KB | 825B | brotli: 765B | +60B |

Long-Form Prose (enwik9)

enwik9 is the canonical benchmark for text compressors — the first 10⁹ bytes of an English Wikipedia dump. The numbers below are mzip vs every standard library compressor on the first 1 MB and 10 MB prefixes.

| Compressor | enwik9 1 MB | enwik9 10 MB | Class |
|---|---|---|---|
| zstd:19 | 312,639 | 2,921,957 | LZ77 + entropy |
| gzip:9 | 356,643 | 3,720,323 | LZ77 + Huffman |
| bzip2:9 | 294,484 | 3,054,639 | BWT + multi-tree Huffman |
| xz:9 | 302,832 | 2,844,360 | LZMA |
| 7z:mx9 | 302,910 | 2,844,433 | LZMA2 |
| brotli:11 | 293,057 | 2,827,632 | LZ77 + Huffman + 120 KB static dict |
| mzip | 286,307 | 2,671,197 | BWT + capfold + word dict + LZP-after-dict, per-block dynamic trial |
| bzip3 | ~245,000 | ~2,300,000 | BWT + LZP + arithmetic + 1-symbol context model (CLI, not a library) |
| bsc-m03 | | | LZP + BWT + M03 context coder (CLI, research-grade, not a library) |

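mzip produces the smallest output on both prefixes among standard library compressors. On 10 MB, brotli:11's output is 5.9% larger than mzip's, xz/7z is 6.5% larger, and bzip2:9 is 14.4% larger; on 1 MB the figures are 2.4%, 5.8%, and 2.9%. These percentages are measured against mzip's output size (for example, 2,827,632 / 2,671,197 ≈ 1.059 for brotli:11 on 10 MB).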

The honest ceiling. bzip3 and bsc-m03 are CLI archive tools (not embeddable libraries) that replace the entropy backend with adaptive arithmetic coding driven by a post-BWT context model. They beat mzip on enwik9 by ~14% (bzip3) to ~20% (bsc-m03) at the cost of much slower compression and no library API. ZPAQ achieves better ratios still (closer to cmix) at the cost of being orders of magnitude slower. Closing the gap to bzip3/bsc-class without giving up the library API is the goal of BWT_ROADMAP.md.


Per-Category Tables

Synthetic data, generated at each size by generators.hpp so results are reproducible.

NUMERIC

| Type | 64KB | 256KB | 1MB | Samples |
|---|---|---|---|---|
| Timestamps | 32B vs 287B | 32B vs 772B | 32B vs 2.6KB | 64k 256k 1m |
| Database IDs | 32B vs 301B | 32B vs 937B | 32B vs 3.3KB | 64k 256k 1m |
| Integer array | 3.3KB vs 4.4KB | 12KB vs 17KB | 51KB vs 67KB | 64k 256k 1m |
| GPS coordinates | 9.7KB vs 11KB | 38KB vs 44KB | 154KB vs 179KB | 64k 256k 1m |
| Float temperature | 11KB vs 22KB | 40KB vs 87KB | 151KB vs 331KB | 64k 256k 1m |
| Sensor 16-bit | 26KB vs 27KB | 107KB vs 111KB | 430KB vs 445KB | 64k 256k 1m |

Format: mzip vs 2nd-best. Bold = winner.
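For numeric rows that are almost but not exactly linear, delta coding is the usual first step. The sketch below combines deltas with ZigZag mapping so that small positive and negative steps both become small unsigned values. It is a generic illustration with hypothetical function names, not mzip's NUMERIC strategy.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// ZigZag maps signed values to unsigned ones: 0, -1, 1, -2 -> 0, 1, 2, 3.
inline std::uint64_t zigzag(std::int64_t v) {
    const std::uint64_t u = static_cast<std::uint64_t>(v);
    return (u << 1) ^ (v < 0 ? ~std::uint64_t{0} : std::uint64_t{0});
}

inline std::int64_t unzigzag(std::uint64_t u) {
    return static_cast<std::int64_t>((u >> 1) ^ (std::uint64_t{0} - (u & 1)));
}

// First element is stored as a delta from 0; the rest as deltas from the previous value.
std::vector<std::uint64_t> delta_encode(const std::vector<std::int64_t>& v) {
    std::vector<std::uint64_t> out(v.size());
    std::uint64_t prev = 0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        const std::uint64_t cur = static_cast<std::uint64_t>(v[i]);
        out[i] = zigzag(static_cast<std::int64_t>(cur - prev));  // unsigned math avoids overflow UB
        prev = cur;
    }
    return out;
}

std::vector<std::int64_t> delta_decode(const std::vector<std::uint64_t>& enc) {
    std::vector<std::int64_t> out(enc.size());
    std::uint64_t prev = 0;
    for (std::size_t i = 0; i < enc.size(); ++i) {
        prev += static_cast<std::uint64_t>(unzigzag(enc[i]));
        out[i] = static_cast<std::int64_t>(prev);
    }
    return out;
}
```

After this transform a jittery timestamp column becomes a stream of small integers, which is what the later entropy stage is good at.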

STRUCTURED

| Type | 64KB | 256KB | 1MB | Samples |
|---|---|---|---|---|
| GraphQL queries | 2.8KB vs 2.8KB | 7.5KB vs 7.8KB | 25KB vs 28KB | 64k 256k 1m |
| SQL dump | 4.6KB vs 4.7KB | 14KB vs 15KB | 53KB vs 56KB | 64k 256k 1m |
| JSON API | 1016B vs 3.7KB | 2.8KB vs 12KB | 9.8KB vs 48KB | 64k 256k 1m |
| XML document | 1020B vs 2.2KB | 2.9KB vs 8.0KB | 10KB vs 29KB | 64k 256k 1m |
| CSV data | 8.1KB vs 9.8KB | 27KB vs 33KB | 100KB vs 122KB | 64k 256k 1m |
| Base64 data | 47KB vs 48KB | 189KB vs 192KB | 758KB vs 771KB | 64k 256k 1m |

Format: mzip vs 2nd-best. Bold = winner.
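The JSON API and XML rows benefit from template detection. A deliberately simplified version of the idea: if every line shares a common prefix and suffix, store those once and keep only the varying middle of each line. The `LineTemplate` type and both functions are hypothetical; mzip's TEMPLATE strategies are not described in enough detail here to reproduce, and real templates usually have more than one variable slot.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical single-slot template: prefix + field + suffix per line.
struct LineTemplate {
    std::string prefix;
    std::string suffix;
    std::vector<std::string> fields;
};

LineTemplate extract_template(const std::vector<std::string>& lines) {
    LineTemplate t;
    if (lines.empty()) return t;
    std::size_t min_len = lines[0].size();
    for (const std::string& l : lines) min_len = std::min(min_len, l.size());

    // Longest prefix shared by every line.
    std::size_t pre = 0;
    auto same_from_start = [&](std::size_t k) {
        for (const std::string& l : lines)
            if (l[k] != lines[0][k]) return false;
        return true;
    };
    while (pre < min_len && same_from_start(pre)) ++pre;

    // Longest suffix shared by every line, not overlapping the prefix.
    std::size_t suf = 0;
    auto same_from_end = [&](std::size_t k) {
        const char ref = lines[0][lines[0].size() - 1 - k];
        for (const std::string& l : lines)
            if (l[l.size() - 1 - k] != ref) return false;
        return true;
    };
    while (suf < min_len - pre && same_from_end(suf)) ++suf;

    t.prefix = lines[0].substr(0, pre);
    t.suffix = lines[0].substr(lines[0].size() - suf);
    for (const std::string& l : lines) t.fields.push_back(l.substr(pre, l.size() - pre - suf));
    return t;
}

std::vector<std::string> expand_template(const LineTemplate& t) {
    std::vector<std::string> lines;
    for (const std::string& f : t.fields) lines.push_back(t.prefix + f + t.suffix);
    return lines;
}
```

Once the fixed text is factored out, the variable fields form their own stream, and if they happen to be sequential IDs they can in turn be handed to a numeric strategy.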

CODE

| Type | 64KB | 256KB | 1MB | Samples |
|---|---|---|---|---|
| JavaScript | 4.0KB vs 5.2KB | 10KB vs 12KB | 37KB vs 43KB | 64k 256k 1m |
| Python | 5.2KB vs 5.3KB | 12KB vs 12KB | 33KB vs 40KB | 64k 256k 1m |
| TypeScript | 4.2KB vs 4.4KB | 12KB vs 12KB | 40KB vs 45KB | 64k 256k 1m |
| HTML | 5.4KB vs 5.7KB | 16KB vs 18KB | 61KB vs 68KB | 64k 256k 1m |
| CSS | 4.1KB vs 4.1KB | 11KB vs 11KB | 40KB vs 43KB | 64k 256k 1m |
| Go | 3.3KB vs 3.4KB | 8.1KB vs 8.5KB | 24KB vs 28KB | 64k 256k 1m |
| Rust | 3.5KB vs 3.5KB | 8.9KB vs 9.1KB | 28KB vs 31KB | 64k 256k 1m |
| Java | 3.7KB vs 3.9KB | 10.0KB vs 10KB | 29KB vs 36KB | 64k 256k 1m |
| C | 4.9KB vs 5.2KB | 14KB vs 15KB | 47KB vs 54KB | 64k 256k 1m |
| Bash | 3.5KB vs 3.7KB | 9.9KB vs 10KB | 34KB vs 37KB | 64k 256k 1m |
| PHP | 3.2KB vs 3.3KB | 7.8KB vs 8.6KB | 23KB vs 27KB | 64k 256k 1m |

Format: mzip vs 2nd-best. Bold = winner.

CONFIG

| Type | 64KB | 256KB | 1MB | Samples |
|---|---|---|---|---|
| Docker Compose | 2.1KB vs 2.1KB | 5.4KB vs 5.6KB | 17KB vs 19KB | 64k 256k 1m |
| Terraform | 3.4KB vs 3.0KB | 10KB vs 10KB | 41KB vs 40KB | 64k 256k 1m |
| K8s manifests | 3.2KB vs 3.3KB | 7.4KB vs 7.6KB | 21KB vs 24KB | 64k 256k 1m |
| YAML config | 3.8KB vs 3.8KB | 11KB vs 11KB | 38KB vs 41KB | 64k 256k 1m |

Format: mzip vs 2nd-best. Bold = winner.

LOG

| Type | 64KB | 256KB | 1MB | Samples |
|---|---|---|---|---|
| Access log | 6.3KB vs 6.8KB | 21KB vs 24KB | 85KB vs 94KB | 64k 256k 1m |
| Nginx access log | 6.5KB vs 6.8KB | 21KB vs 22KB | 82KB vs 87KB | 64k 256k 1m |

Format: mzip vs 2nd-best. Bold = winner.

BINARY

| Type | 64KB | 256KB | 1MB | Samples |
|---|---|---|---|---|
| Image gradient | 39B vs 212B | 53B vs 323B | 124B vs 397B | 64k 256k 1m |
| Audio PCM | 1.7KB vs 4.0KB | 1.7KB vs 4.0KB | 1.7KB vs 4.0KB | 64k 256k 1m |
| Sparse bitmap | 689B vs 880B | 2.6KB vs 3.0KB | 10KB vs 11KB | 64k 256k 1m |
| Protobuf-like | 40KB vs 41KB | 160KB vs 163KB | 640KB vs 650KB | 64k 256k 1m |

Format: mzip vs 2nd-best. Bold = winner.

ADDITIONAL

| Type | 64KB | 256KB | 1MB | Samples |
|---|---|---|---|---|
| Natural text | 6.7KB vs 7.2KB | 26KB vs 28KB | 104KB vs 111KB | 64k 256k 1m |
| Markdown docs | 3.6KB vs 3.8KB | 10KB vs 11KB | 36KB vs 41KB | 64k 256k 1m |
| Email headers | 6.6KB vs 6.8KB | 20KB vs 21KB | 75KB vs 79KB | 64k 256k 1m |
| Unicode text | 2.4KB vs 2.4KB | 7.5KB vs 7.5KB | 27KB vs 28KB | 64k 256k 1m |
| Syslog | 8.8KB vs 9.4KB | 32KB vs 34KB | 126KB vs 133KB | 64k 256k 1m |
| Metrics | 7.8KB vs 7.8KB | 29KB vs 29KB | 117KB vs 117KB | 64k 256k 1m |
| JSON log | 7.2KB vs 8.0KB | 29KB vs 30KB | 116KB vs 122KB | 64k 256k 1m |
| Timestamps (jitter) | 14KB vs 15KB | 56KB vs 61KB | 224KB vs 244KB | 64k 256k 1m |

Format: mzip vs 2nd-best. Bold = winner.

BUILD

| Type | 64KB | 256KB | 1MB | Samples |
|---|---|---|---|---|
| Makefile | 3.8KB vs 4.6KB | 12KB vs 16KB | 46KB vs 61KB | 64k 256k 1m |
| package.json | 4.6KB vs 4.8KB | 14KB vs 15KB | 54KB vs 57KB | 64k 256k 1m |
| Cargo.toml | 3.3KB vs 3.5KB | 9.5KB vs 10KB | 32KB vs 38KB | 64k 256k 1m |

Format: mzip vs 2nd-best. Bold = winner.


Real-World File Benchmark

47 files (7.2 MB total) pulled from public GitHub repos — React, Linux kernel, Django, Bootstrap, lodash, plus 20+ programming-language files. Mix of source code, configs, logs, JSON, CSV, markdown. All 47 round-trip-verified.

Full per-file table: real_bench_summary.md. Raw output: real_bench_results.txt.

Result

  • mzip wins or ties on 13 / 47 files (27.7%)
  • brotli:11 wins 32, bzip2:9 wins 1, xz:9 wins 1, all others 0

The synthetic 93.2% does not survive intact on real GitHub source code — the synthetic suite includes a lot of formula-compressible content (sequential IDs, timestamps, gradients, audio, generated templates) that hand-written code rarely contains. Read both numbers, not just one.

Where mzip wins

| File | Size | mzip ratio | 2nd best | Advantage |
|---|---|---|---|---|
| apache_log_sample.log | 2.26MB | 22.83x | brotli: 116KB | +12.5% |
| dashboard.html | 42KB | 34.04x | brotli: 1.3KB | +7.1% |
| docker-compose.yml | 3.9KB | 4.23x | brotli: 1.0KB | +8.2% |
| nginx_access.log | 417KB | 12.11x | bzip2: 35KB | +1.3% |
| app.log | 464KB | 7.90x | bzip2: 60KB | +2.5% |
| events.csv | 578KB | 7.13x | bzip2: 82KB | +1.1% |
| lodash.js | 532KB | 7.85x | bzip2: 69KB | +2.1% |
| metrics.prom | 176KB | 10.12x | bzip2: 17KB | +1.1% |
| users.json | 170KB | 10.11x | bzip2: 17KB | +1.0% |
| linux_kernel.c | 281KB | 4.41x | bzip2: 64KB | +1.0% |
| go_http.go | 128KB | 3.74x | brotli: 34KB | +0.2% |
| handlers.go | 14KB | 17.72x | brotli: 814B | +0.9% |
| styles.css | 19.6KB | 9.28x | bzip2: 2.1KB | +1.3% |

Where brotli wins

Brotli's 120KB pre-built static English/web dictionary gives it a structural edge on small code/config/markdown that no per-file approach can fully match.

| File | Size | mzip gap |
|---|---|---|
| contributing.md | 6.6KB | +21.7% |
| sql_schema.sql | 4.1KB | +20.0% |
| k8s_deployments.yaml | 21KB | +18.8% |
| api_docs.md | 17KB | +17.4% |
| ruby_rails.rb | 15KB | +16.6% |

…and 27 more, mostly handwritten source code 4–50 KB, gaps typically +3% to +12%. Full list in real_bench_summary.md.

Per-category breakdown

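| Category | mzip wins | Total | Win% |
|---|---|---|---|
| Logs | 3 | 3 | 100% |
| CSV / columnar | 1 | 1 | 100% |
| Metrics | 1 | 1 | 100% |
| Web (HTML/CSS) | 2 | 3 | 67% |
| JSON | 1 | 2 | 50% |
| Config files | 1 | 4 | 25% |
| Source code (general) | 4 | 25 | 16% |
| Markdown | 0 | 3 | 0% |
| SQL | 0 | 2 | 0% |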

Compression Strategies

The encoder runs detection on every block, picks a candidate strategy from one of five families below, then trials multiple variants per block and keeps the smallest output that round-trip-verifies.

| Family | Picked when | Headline strategies |
|---|---|---|
| Formula / numeric | Bytes look like a generator output: linear, periodic, geometric, modular, smooth | LINEAR_GEN, NUMERIC (delta / strided / ALP), PERIODIC, MODULAR, LINEAR_PRED |
| Templates / structured text | Repeating lines or blocks with a few varying tokens (logs, generated docs, K8s, SQL INSERTs) | TEMPLATE, SECTION_TEMPLATE, ML_TEMPLATE, WORD_TEMPLATE / MULTI_WORD_TEMPLATE, LINE_GROUP_TEMPLATE |
| Stream separation | Fixed columns, tag/content split, key/value records | COLUMNAR / BLOCK_COLUMNAR, CSV_COLUMNAR, JSON_COLUMNAR, HTML_STREAM, URL_STREAM, DBF_CONSTCOL |
| Binary / executable | x86 code, raw RGB, sparse bitmaps, base64 text | E8E9_X86 + LZMA_OPTIMAL, PAETH_RGB, SPARSE, BASE64_DECODE |
| Text backends | None of the above fits: prose, code, config, mixed | BWT_TEXT, BG (single-block BWT for ≥1 MB), MC (per-chunk pick), ZSTD_DICT (4–16 KB code/config), WORD_ENCODED, KV_CONFIG, RAW (fallback to zstd:19) |
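As an example of the stream-separation family, the sketch below transposes fixed-width records so that all bytes from the same column sit next to each other. It assumes a known record size and is a generic illustration with hypothetical function names, not mzip's COLUMNAR implementation (which also has to detect the record layout).

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Rearranges row-major fixed-width records into column-major order.
std::vector<std::uint8_t> to_columns(const std::vector<std::uint8_t>& data,
                                     std::size_t record_size) {
    if (record_size == 0 || data.size() % record_size != 0) {
        throw std::invalid_argument("data is not a whole number of records");
    }
    const std::size_t rows = data.size() / record_size;
    std::vector<std::uint8_t> out(data.size());
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < record_size; ++c)
            out[c * rows + r] = data[r * record_size + c];
    return out;
}

// Inverse of to_columns for the same record size.
std::vector<std::uint8_t> from_columns(const std::vector<std::uint8_t>& cols,
                                       std::size_t record_size) {
    if (record_size == 0 || cols.size() % record_size != 0) {
        throw std::invalid_argument("data is not a whole number of records");
    }
    const std::size_t rows = cols.size() / record_size;
    std::vector<std::uint8_t> out(cols.size());
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < record_size; ++c)
            out[r * record_size + c] = cols[c * rows + r];
    return out;
}
```

After the transpose, a constant column turns into one long run and a counter column into a smooth ramp, so each column can be handed to whichever strategy suits it.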

The BWT text backend (BWT_TEXT / BG) is itself a per-block trial over: pre-RLE on/off, dict size ∈ {64, 128, 192, 255}, number of Huffman trees ∈ {3..7}, LZP-after-dict min-match ∈ {10, 20, 40}, capfold on/off. The smallest valid combination wins.
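The trial space described above is small enough to enumerate: 2 × 4 × 5 × 3 × 2 = 240 combinations. The sketch below builds that grid and picks the cheapest entry. The field names and the `encoded_size` callback are hypothetical; the callback stands in for "run the pipeline with this setting and return the verified output size". Whether mzip tries every combination or prunes the search is not stated here.

```cpp
#include <cstddef>
#include <functional>
#include <initializer_list>
#include <vector>

// Field names are hypothetical; only the value sets come from the text above.
struct BwtTrial {
    bool pre_rle;
    int  dict_size;
    int  huffman_trees;
    int  lzp_min_match;
    bool capfold;
};

// Enumerates every combination of the five knobs.
std::vector<BwtTrial> all_trials() {
    std::vector<BwtTrial> out;
    for (bool pre_rle : {false, true})
        for (int dict : {64, 128, 192, 255})
            for (int trees = 3; trees <= 7; ++trees)
                for (int lzp : {10, 20, 40})
                    for (bool capfold : {false, true})
                        out.push_back({pre_rle, dict, trees, lzp, capfold});
    return out;
}

// Returns the setting with the smallest reported size (first one wins on ties).
BwtTrial pick_smallest(const std::function<std::size_t(const BwtTrial&)>& encoded_size) {
    const std::vector<BwtTrial> trials = all_trials();
    BwtTrial best = trials.front();
    std::size_t best_size = encoded_size(best);
    for (std::size_t i = 1; i < trials.size(); ++i) {
        const std::size_t size = encoded_size(trials[i]);
        if (size < best_size) {
            best = trials[i];
            best_size = size;
        }
    }
    return best;
}
```

An exhaustive search like this multiplies compression time by the number of trials, which is consistent with the README's position that mzip trades encode and decode speed for output size.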

For the full enum of named strategies (~30) with selection rules, grep BlockType:: and case BlockType:: in mzip.hpp — each block-type carries a one-line // what it does comment at its definition.


Quick Start

Requirements: a C++17 compiler and the zstd library headers + shared object. Install zstd if you don't already have it:

```bash
# macOS
brew install zstd

# Debian / Ubuntu
sudo apt install libzstd-dev

# Fedora
sudo dnf install libzstd-devel

# Windows / MSYS2
pacman -S mingw-w64-x86_64-zstd
```

Option 1: Single-header (recommended)

The amalgamated header bundles mzip + the BWT pipeline + libsais. You only need to add zstd.

```cpp
// In ONE translation unit:
#include <zstd.h>
#define MZIP_IMPLEMENTATION
#include "mzip_amalgamated.hpp"

// In every other translation unit that uses mzip:
#include "mzip_amalgamated.hpp"

// Usage:
auto compressed   = mzip::compress(data.data(), data.size());
auto decompressed = mzip::decompress(compressed.data(), compressed.size());
```

Option 2: Separate headers

```cpp
#include <zstd.h>      // include zstd first
#include "mzip.hpp"

auto compressed   = mzip::compress(data.data(), data.size());
auto decompressed = mzip::decompress(compressed.data(), compressed.size());
```

Build

```bash
# Single-header build (libsais bundled inside)
g++ -std=c++17 -O3 -march=native -o mzip_cli mzip_cli.cpp -lzstd

# Separate-headers build (needs libsais.c too)
g++ -std=c++17 -O3 -march=native -o mzip_cli mzip_cli.cpp libsais.c -lzstd
```

If your zstd headers are not on the default search path, add -I/path/to/zstd/include -L/path/to/zstd/lib.

CLI Usage

```bash
# Compress
./mzip_cli compress input.bin output.mzip

# Decompress
./mzip_cli decompress output.mzip restored.bin
```

Run Benchmarks

```bash
# Build (assumes zstd is installed; otherwise add -I/-L flags as in Quick Start above)
g++ -std=c++17 -O3 -march=native -o mzip_bench mzip_bench.cpp libsais.c -lzstd

# Synthetic suite: 50 types × 5 sizes = 250 tests, ~10–15 min
./mzip_bench --csv full_bench.csv

# Quick (64 KB only)
./mzip_bench --quick

# Single type, all sizes
./mzip_bench --type graphql

# Single real file
./mzip_bench --file path/to/file.bin

# All 47 real-world files
./mzip_bench --file real_bench/*

# Regenerate the README tables from a fresh CSV
python generate_readme_tables.py full_bench.csv
```

Files

| File | Description |
|---|---|
| mzip.hpp | Main library; include this |
| mzip_amalgamated.hpp | Single-header build (mzip + BWT + libsais bundled) |
| bwt_compress_v5.hpp / v8.hpp / v9.hpp | BWT pipelines (v5 = current prose backend) |
| word_dict.hpp | Per-file word-dictionary preprocessor (BWT_TEXT helper) |
| cap_fold.hpp | Capital-letter folding (BWT_TEXT helper) |
| bigram_dict.hpp, xml_entity.hpp | Pre-BWT preprocessing candidates (auto-deselected when they don't help) |
| range_coder.hpp | LZMA-style binary range coder (variant backend) |
| mzip_dicts.h | Pre-trained zstd group dictionaries (ZSTD_DICT strategy) |
| lzma_optimal2.hpp, lzma_decoder.hpp | LZMA optimal encoder + decoder (LZMA_OPTIMAL strategy) |
| mzip_base64.hpp | Base64 detect / decode helper (BASE64_DECODE strategy) |
| generators.hpp | Single source of truth for benchmark / test data |
| libsais.h | BWT suffix array (Apache 2.0) |
| stb_image.h, stb_image_write.h | Image IO (Public Domain), for image strategies |
| mzip_bench.cpp | Benchmark tool; --csv exports results |
| mzip_cli.cpp | Command-line interface |
| mzip_test.cpp | Quick debug / single-type test |
| mzip_unit_tests.cpp | Unit tests for core strategies |
| generate_readme_tables.py | Auto-generate the markdown tables above from full_bench.csv |
| summarize_real_bench.py | Auto-generate real_bench_summary.md from real_bench_results.txt |
| samples/ | Sample files at 4 / 16 / 64 / 256 KB and 1 MB |
| real_bench/ | 47 real-world files used by the real-world benchmark |
| full_bench.csv | Latest synthetic benchmark CSV (one row per (type, size)) |
| real_bench_results.txt | Latest real-files benchmark raw output |
| real_bench_summary.md | Auto-generated summary of real_bench_results.txt |
| BWT_ROADMAP.md | Path from current BWT pipeline toward bzip3 / bsc-m03 class |
| HUNT_LOG.md | Per-loss diagnosis log (Loss / Strategy / Claim / Verdict / Action) |
| IMPROVEMENTS.md | Architectural / adoption-focused improvement plan |

License & Contact

Dual-licensed: AGPL-3.0 OR commercial.

  • AGPL-3.0 — free for open-source projects. If you deploy mzip as part of a network service (SaaS, hosted API, etc.), the AGPL requires you to make your source available.
  • Commercial — for proprietary or closed-source use, open a GitHub issue tagged commercial-license and I'll follow up. Same for bug reports, benchmarks on your own data, or proposing a new strategy.

Third-party code bundled in the repo: libsais (Apache 2.0), stb_image (Public Domain). zstd is required at link-time but not bundled (BSD).

About

Store the formula, not the data. Detection-based compression: 32 B for 1 MB of sequential IDs (32768x). Beats zstd/brotli/bzip2 on structured data.
