Generated from 3-perspective Opus analysis: Compression Ratio, User Adoption, Architecture.
- Win rate: 74% (186/250 tests), 98% at 1MB
- Weakness: bzip2 wins 64 tests at 4-16KB (14-88 byte gaps)
- Decompression: 49x slower than zstd (BWT vs LZ77 trade-off)
- Codebase: 15K line monolith, 400+ line if/else cascades
Why: Enables safe refactoring, prevents regressions like "Audio PCM 2nd best was made up"
What to test:
- LINEAR_GEN: detection (base/delta params), roundtrip, expected compression (<50 bytes for 256KB IDs)
- TEMPLATE: roundtrip, variable extraction
- BWT_TEXT: roundtrip at various sizes
- COLUMNAR: 9-column format, roundtrip
Framework: googletest or catch2
Why: +8-12% win rate, flip ~20-30 bzip2 losses to wins
Current overhead: 17-31 bytes
- magic (4) + version (1) + original_size (8) + block_count (4) + block_headers
New format: ~5 bytes
- 2-byte magic ("MZ") + 1-byte strategy + varint size
Targets:
| File | Current | New (est) | Saved |
|---|---|---|---|
| Natural text 4KB | 741 | ~715 | 26 |
| Markdown 4KB | 751 | ~725 | 26 |
| TOML 4KB | 1064 | ~1040 | 24 |
Implementation: Branch in compress() for single-block files <16KB
Why: "30 seconds vs 30 minutes setup" determines adoption
Actions:
- Create `mzip_amalgamated.hpp` (like SQLite, stb) - bundle libsais inline
- Make zstd optional with graceful degradation
- CMake `find_package` support - vcpkg/conan package submission
Why: +3-5% on code files, close bzip2 gap
Current: Fixed English prose model (55% zeros, exponential decay)
Problem: Code has different MTF patterns
New models:
- PROSE (current default)
- CODE (higher symbol diversity)
- CONFIG (key-value patterns)
- BINARY (uniform distribution)
Detection: Count {, }, :, indentation, keywords
Overhead: 1 byte in header for model selection
Why: Extensibility, testability, less duplication
Current problem:

```cpp
// analyze_block - 400+ lines of cascading detection
if (detect_linear_gen(...)) return result;
if (detect_geometric(...)) return result;
// ... 30+ more
```

New design:

```cpp
struct ICompressionStrategy {
    virtual ~ICompressionStrategy() = default;
    virtual BlockType type() const = 0;
    virtual int priority() const = 0;
    virtual std::optional<BlockAnalysis> detect(const uint8_t* data, size_t n) = 0;
    virtual std::vector<uint8_t> encode(const uint8_t* data, size_t n,
                                        const BlockAnalysis& analysis) = 0;
    virtual std::vector<uint8_t> decode(const uint8_t* encoded, size_t size,
                                        size_t orig_size) = 0;
};
```

Structure:
```
mzip/
├── mzip.hpp                # Main API
├── core/
│   ├── block_types.hpp     # Enums, BlockAnalysis
│   ├── varint.hpp          # Encoding utilities
│   └── mdl.hpp             # MDL scoring
├── strategies/
│   ├── linear_gen.hpp
│   ├── template.hpp
│   ├── columnar.hpp
│   └── ...
└── strategy_registry.hpp
```
Do AFTER strategy interface is in place.
Problem: 49x slower than zstd - dealbreaker for read-heavy workloads
Root cause: BWT inverse transform O(n), MTF decode O(n*k)
Options:
- SIMD-accelerated MTF
- Cache-friendly inverse BWT
- Hybrid: zstd for hot paths, BWT only when ratio gain > threshold
Problem: Current API requires a full in-memory buffer, fails on large files
Solution: Expose MC (Multi-Chunk) mode through the API
Challenge: Detection assumes full-data visibility
Problem: YAML/TOML/INI files miss template detection
Solution: CONFIG_TEMPLATE strategy - extract keys, separate value streams, LINEAR_GEN on numerics
Potential: +10-15% on config files
- bzip2 wins are header overhead - 64 losses at 4-16KB with 14-88 byte gaps. Micro format fixes most.
- 49x decompression is the adoption killer - "write-once-read-rarely" is a narrow niche. Must address for mainstream adoption.
- Tests enable everything - can't safely refactor the 15K-line monolith without tests. Do tests FIRST.
- BWT model mismatch - the English prose model on code files loses 30-100 bytes vs bzip2's per-file adaptation.