@@ -18,76 +18,56 @@ drop-in CLI for gzip/zstd/brotli workflows. Written in Rust.
1818## Features
1919
2020- ** Adaptive pipeline** — SSR analysis auto-selects transforms and entropy backend per-file
21- - ** 11 transforms** — delta, zigzag, transpose, ROLZ, float-split, field-LZ, range-pack,
22- tokenize, prefix-strip, dedup, parse-int
21+ - ** 26+ transforms** — BWT (SA-IS), delta, zigzag, transpose, ROLZ, float-split, field-LZ,
22+ range-pack, tokenize, prefix-strip, dedup, parse-int, normalize, conditioned-BWT, predict,
23+ byte-plane, const-elim, and more
2324- ** SIMD acceleration** — runtime dispatch: AVX-512 → AVX2 → SSE4.1 → SSE2 → NEON → scalar
2425- ** DAG profiles** — composable transform chains with auto-select and 5 built-in profiles
25- - ** Block-parallel** — rayon-based parallel compress/decompress (CPBL wire format )
26+ - ** Block-parallel** — rayon-based parallel compress/decompress (CPBL v1/v2/v3 wire formats )
2627- ** Memory-mapped I/O** — auto-mmap for files > 64 MB, manual ` --mmap ` flag
2728- ** Streaming** — block-based streaming with progress callbacks and adaptive block sizing
2829- ** 12 entropy backends** — Zstd, Brotli, Gzip, LZMA, XZ, LZ4, Snappy, LZHAM, Lizard, zlib-ng, OpenZL, Raw
2930- ** Encryption** — ChaCha20-Poly1305, AES-256-GCM, Argon2 KDF
3031- ** Post-quantum crypto** — ML-KEM-768 + X25519 hybrid encryption (CPHE), ML-DSA-65 signatures
3132- ** Archives** — multi-file ` .cpar ` format with per-entry compression
32- - ** Domain handlers** — CSV, JSON, XML, YAML, log file specializations
33+ - ** MSN domain handlers** — CSV, JSON, XML, YAML, syslog, Apache, and log file specializations (opt-in)
3334- ** CAS analysis** — constraint inference (range, enum, constant, monotonic, functional dependency)
34- - ** Benchmarking** — built-in benchmark suite with baselines (gzip-9, zstd-3, brotli-11, lzma-6),
35- lossless verification, memory tracking, Criterion microbenchmarks, ** industry-standard corpora**
36- (Canterbury, Silesia, Calgary), automated batch runner with CSV/Markdown reports
37- - ** Corpus management** — YAML-driven corpus configs, automatic HTTP/ZIP/TAR.GZ downloads,
38- progress bars, 18+ curated benchmark datasets
39- - ** Host detection** — CPU, cores, RAM, SIMD tier detection with safe auto-defaults
40- - ** Cross-platform** — Windows (primary), Linux, macOS; x86_64 and aarch64
4135- ** Transcode compression** — lossless image (PNG/BMP/TIFF/WebP) pixel-domain compression
4236- ** Auto-analysis** — directory-level analysis with YAML config generation
37+ - ** Benchmarking** — profile-driven benchmark suite with 17+ curated corpora, 12 baseline
38+ backends, YAML-driven corpus configs, automatic HTTP/ZIP/TAR.GZ downloads
39+ - ** Host detection** — CPU, cores, RAM, SIMD tier detection with safe auto-defaults
4340- ** Hardware acceleration** — pluggable accel layer (QAT, IAA, GPU, FPGA, SVE2 stubs)
44- - ** 539+ tests** — comprehensive regression suite, golden vectors, property-based tests,
45- determinism validation, transform-specific tests
46-
47- ## Agent Quick Start
48-
49- This repository supports AI agent workflows with session commands.
50-
51- When starting a new conversation with an AI agent (Warp, Claude, etc.) in this
52- repository, use the following prompt to establish context:
53-
54- ** Standard load session:**
55- ```
56- You are in the cpac repository (CPAC compression engine). First read AGENTS.md
57- and WARP.md, then verify build with cargo build --workspace && cargo test --workspace.
58- ```
59-
60- On startup, an agent SHOULD:
61- 1 . Read ` AGENTS.md ` (architecture, entry points, conventions)
62- 2 . Read ` WARP.md ` (build commands, presubmit checklist, commit style)
63- 3 . Set Windows PATH if needed: ` $env:PATH = "$env:USERPROFILE\.cargo\bin;$env:PATH" `
64- 4 . Verify build: ` cargo build --workspace && cargo test --workspace `
65- 5 . Run presubmit before any commits: build, test, clippy, fmt (see WARP.md)
66-
67- For full agent conventions and session behavior, see ` AGENTS.md ` .
41+ - ** Cross-platform** — Windows (primary), Linux, macOS; x86_64 and aarch64
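
Two of the transforms named in the feature list, delta and zigzag, are often chained so that slowly varying integers become small unsigned codes that an entropy backend can pack tightly. The following is a minimal sketch of that general technique in Rust, not CPAC's implementation: the function names and the `i64`/`u64` types are assumptions made for illustration, and the actual `cpac-transforms` API may look quite different.

```rust
/// Map a signed integer to an unsigned one so that values close to zero
/// (positive or negative) get small codes: 0, -1, 1, -2 map to 0, 1, 2, 3.
pub fn zigzag_encode(v: i64) -> u64 {
    ((v << 1) ^ (v >> 63)) as u64
}

/// Inverse of `zigzag_encode`.
pub fn zigzag_decode(u: u64) -> i64 {
    ((u >> 1) as i64) ^ -((u & 1) as i64)
}

/// Hypothetical helper: delta-encode a sequence (each value minus its
/// predecessor, starting from 0), then zigzag each delta.
pub fn delta_zigzag_encode(values: &[i64]) -> Vec<u64> {
    let mut prev = 0i64;
    values
        .iter()
        .map(|&v| {
            // Wrapping arithmetic keeps the transform lossless at the i64 limits.
            let delta = v.wrapping_sub(prev);
            prev = v;
            zigzag_encode(delta)
        })
        .collect()
}

/// Hypothetical helper: undo `delta_zigzag_encode` by decoding each delta
/// and accumulating a running sum.
pub fn delta_zigzag_decode(codes: &[u64]) -> Vec<i64> {
    let mut prev = 0i64;
    codes
        .iter()
        .map(|&c| {
            prev = prev.wrapping_add(zigzag_decode(c));
            prev
        })
        .collect()
}
```

For the input `[100, 102, 101, 105]` the deltas are 100, 2, -1, 4, which zigzag to 200, 4, 1, 8; decoding those codes returns the original sequence.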
6842
6943## Quick Start
7044
7145``` bash
7246# Build
73- cargo build --workspace
47+ cargo build --workspace --release
7448
75- # Run tests (289+)
49+ # Run tests
7650cargo test --workspace
7751
78- # Install the CLI
79- cargo install --path crates/cpac-cli
80-
8152# Compress a file
8253cpac compress myfile.txt
8354
8455# Decompress
8556cpac decompress myfile.txt.cpac
8657```
8758
88- ### Windows note
59+ ### Windows (PowerShell)
60+
61+ All commands go through the unified build system:
62+
63+ ``` powershell
64+ .\shell.ps1 build --release
65+ .\shell.ps1 test
66+ .\shell.ps1 benchmark-all # balanced profile (default)
67+ .\shell.ps1 benchmark-all --profile full # full profile (17 corpora)
68+ ```
8969
90- If `cargo` is not on your PATH in PowerShell:
70+ If `cargo` is not on your PATH:
9171
9272``` powershell
9373$env:PATH = "$env:USERPROFILE\.cargo\bin;$env:PATH"
@@ -143,15 +123,15 @@ cpac completions powershell > cpac.ps1
143123
144124## Architecture
145125
146- CPAC is a 16-crate Cargo workspace. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)
126+ CPAC is a 21-crate Cargo workspace. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)
147127for the full design.
148128
149129```
150130cpac-types Shared types, CpacError, ResourceConfig
151131cpac-ssr Structural Summary Record analysis
152- cpac-transforms 11 encoding transforms + SIMD kernels
132+ cpac-transforms 26+ encoding transforms + SIMD kernels
153133cpac-dag DAG composition, profiles, auto-select
154- cpac-entropy Zstd / Brotli / Gzip / LZMA / Raw backends
134+ cpac-entropy 12 entropy backends (Zstd, Brotli, Gzip, LZMA, XZ, LZ4, ...)
155135cpac-frame Wire format encode/decode (CP frame)
156136cpac-engine Top-level API, host detection, parallel, benchmarks
157137cpac-cli Command-line interface (clap)
@@ -162,6 +142,12 @@ cpac-domains Domain-aware handlers
162142cpac-cas Constraint-Aware Schema inference
163143cpac-archive Multi-file .cpar archive format
164144cpac-dict Dictionary training (Zstd)
145+ cpac-lab Benchmarking, calibration, auto-analysis
146+ cpac-conditioning Data conditioning / partitioning
147+ cpac-predict Prediction transforms
148+ cpac-transcode Lossless image transcode compression (CPTC)
149+ cpac-lizard-sys Lizard C library sys crate
150+ cpac-lzham-sys LZHAM C library sys crate
165151cpac-ffi C/C++ FFI bindings
166152```
167153
@@ -172,7 +158,7 @@ Compression pipeline: `SSR → Preprocess (transforms) → Entropy coding → Fr
172158See [docs/SPEC.md](docs/SPEC.md) for complete wire format specifications.
173159
174160- ** CP** — standard CPAC frame (single-block)
175- - ** CPBL** — block-parallel frame (multi-block, rayon )
161+ - ** CPBL** — block-parallel frame (v1 basic, v2 shared MSN metadata, v3 auto-dictionary )
176162- ** TP** — transform preprocess frame
177163- ** CS** — streaming frame
178164- ** CPHE** — hybrid post-quantum encryption frame
@@ -193,126 +179,68 @@ cargo build --profile release-small
193179
194180## Performance Benchmarks
195181
196- ### Latest Results (Session 21 — Post Phase 1–6 Optimizations)
197-
198- ** Date:** March 2026 | ** Version:** 0.3.0 | ** Platform:** Windows x86_64, Rust 1.93+
199- ** Methodology:** Balanced profile, 3 iterations, 12 entropy backends, 777 files across 8 standard corpora. All results verified lossless.
200-
201- #### Headline Numbers (Best Ratio per Corpus)
202-
203- | Corpus | Files | Avg Best | Range |
204- | --------| -------| ----------| -------|
205- | loghub2_2k | 14 | ** 16.63×** | 6.97–32.63× |
206- | nasa_logs | 4 | ** 8.56×** | 1.00–16.23× |
207- | canterbury | 11 | ** 5.84×** | 2.86–20.96× |
208- | silesia | 12 | ** 4.30×** | 1.63–12.42× |
209- | calgary | 18 | ** 4.03×** | 1.85–12.56× |
210- | enwik8 | 1 | ** 3.75×** | — |
211- | cloud_configs | 691 | ** 3.63×** | 0.00–17.98× |
212-
213- #### Zstd Ratio by CPAC Level (corpus average)
214-
215- | Corpus | Fast | Default | High | Best |
216- | --------| ------| ---------| ------| ------|
217- | loghub2_2k | 11.04× | 12.87× | 13.61× | ** 15.25×** |
218- | nasa_logs | 5.41× | 6.30× | 7.09× | ** 8.41×** |
219- | silesia | 3.17× | 3.46× | 3.71× | ** 4.04×** |
220- | canterbury | 3.96× | 4.17× | 4.22× | ** 5.06×** |
221- | enwik8 | 2.82× | 3.06× | 3.27× | ** 3.64×** |
222-
223- #### Key Achievements
182+ For full benchmark results, methodology, and corpus descriptions, see
183+ **[docs/BENCHMARKING.md](docs/BENCHMARKING.md)**.
224184
225- - ** 16.63×** on loghub2_2k — MSN semantic extraction + BWT + auto-dictionary
226- - ** 777 files benchmarked** across 8 standard corpora (773 OK, 4 timeout)
227- - ** 12 entropy backends** — Zstd, Brotli, Gzip, LZMA, XZ, LZ4, Snappy, LZHAM, Lizard, zlib-ng, OpenZL, Raw
228- - ** 100% lossless** — verified roundtrip across all measurements
229- - ** Parallel + streaming** — auto-parallel >256 KB, streaming with bounded memory
230- - ** Post-quantum encryption** — ML-KEM-768 + X25519 hybrid, ML-DSA-65 signatures
231-
232- #### Optimization Phases
185+ ** Headline numbers** (Session 21, balanced profile, 8 corpora, 777 files):
233186
234- ** Phases 1–2:** Re-enabled parallel smart transforms, CPBL v2 shared MSN header
235- ** Phases 3–4:** Auto-dictionary (CPBL v3), ConditionedBwtTransform (ID 26)
236- ** Phases 5–6:** Per-block backend selection, CAS bridge for MSN fields
187+ - **loghub2_2k**: 16.63× average best ratio (log data)
188+ - **nasa_logs**: 8.56× (HTTP access logs)
189+ - **silesia**: 4.30× (mixed content)
190+ - **enwik8**: 3.75× (Wikipedia XML)
191+ - **776/776 files OK** on full profile (Session 30, 0 timeouts)
192+ - **12 entropy backends**, 100% lossless verified
237193
238- ### Benchmark Profiles
194+ Benchmark profiles: quick (1 iteration), balanced (3 iterations), full (10 iterations, 17 corpora).
239195
240- ``` bash
241- # Single file benchmark with baselines
242- cpac benchmark myfile.txt
196+ ## Compression Presets
243197
244- # Profile options (matches Python engine)
245- # Quick: 1 iteration (fast validation)
246- # Balanced: 3 iterations (default, reliable)
247- # Full: 10 iterations (publication-grade)
248- ```
249-
250- ### Criterion Microbenchmarks
198+ | Preset | Level | Smart | Use Case |
199+ |--------|-------|-------|----------|
200+ | `turbo` | Fast | off | Maximum throughput, real-time pipelines |
201+ | `balanced` | Default | on | General purpose, good ratio/speed balance |
202+ | `maximum` | High | on | Best ratio with reasonable speed |
203+ | `archive` | Best | on | Cold storage, archival workloads |
251204
252205``` bash
253- # Full Criterion suite
254- cargo bench -p cpac-engine
255-
256- # Individual bench suites
257- cargo bench -p cpac-engine --bench compress # pipeline + backends
258- cargo bench -p cpac-engine --bench simd # SIMD vs scalar
259- cargo bench -p cpac-engine --bench dag # DAG compile + execute
206+ cpac compress --preset archive big_dataset.tar
207+ cpac compress --preset turbo streaming_logs.jsonl
260208```
261209
262- ## Completed Features (Phase 1+2) ✓
263-
264- - ✓ ** Dictionary training** — Zstd dictionary compression/decompression via stream API
265- - ✓ ** SIMD acceleration** — AVX2 kernels for delta encoding with runtime CPU detection
266- - ✓ ** Streaming API** — Block-based streaming with progress callbacks (CS format)
267- - ✓ ** C/C++ FFI** — Complete bindings in ` cpac-ffi ` crate with cbindgen headers
268- - ✓ ** Python bindings** — PyO3-based bindings in ` cpac-py ` (submodule)
269- - ✓ ** Additional transforms** — BWT, MTF added to transform library
270- - ✓ ** ARM SIMD** — NEON scaffolding and SVE/SVE2 infrastructure
271- - ✓ ** Memory pool** — Buffer pool infrastructure (signal-driven activation)
272- - ✓ ** Parallel compression** — Block-parallel CPBL format with auto-enable >1MB
273-
274- ## Planned Features
275-
276- ### Near-Term (Signal-Driven, Phase 3+)
277-
278- All future optimizations are ** bottleneck signal-driven** . See ` BENCHMARKING.md ` for the full corpus results.
279-
280- ** Top Priorities** (when signals indicate):
281- - ** Memory pool activation** — When profiling shows >10% time in allocator
282- - ** Dictionary caching** — When training overhead >1s on repeated corpora
283- - ** ARM NEON implementation** — When profiling shows significant scalar fallback time
284- - ** Preprocessing cache** — When >5% time in transform trial logic
285-
286- ### Long-Term (Phase 4+)
287-
288- - ** GPU acceleration** — CUDA/ROCm kernels for high-throughput systems (>10 GB/s)
289- - ** Networked compression** — client/server mode with delta sync
290- - ** WASM target** — browser-based compression with SIMD.js fallback
291- - ** ML-based selection** — trained models for backend/transform selection
292-
293- ### Long-term
294-
295- - ** Approximate compression** — lossy modes for numerical data
296- - ** Neural codec integration** — learned compression for specific domains
297- - ** Distributed compression** — map/reduce across cluster
298- - ** Hardware offload** — FPGA/ASIC integration for high-throughput
299- - ** Format versioning** — backward-compatible wire format evolution
300-
301210## Requirements
302211
303212- **Rust** 1.75+ stable (tested on 1.93)
304213- **Platforms**: Windows x86_64 (primary), Linux x86_64/aarch64, macOS x86_64/aarch64
305214- **Optional**: Gnuplot (for Criterion HTML reports)
306215
307- ## Project Files
216+ ## Agent Quick Start
217+
218+ This repository supports AI agent workflows. When starting a new conversation
219+ with an AI agent in this repository:
220+
221+ ```
222+ Read AGENTS.md and WARP.md, then verify build with .\shell.ps1 build && .\shell.ps1 test.
223+ ```
224+
225+ For full agent conventions and session behavior, see ` AGENTS.md ` .
226+
227+ ## Documentation
308228
309229- ` AGENTS.md ` — AI agent onboarding guide
310230- ` WARP.md ` — Warp IDE project rules
311- - ` BENCHMARKING.md ` — Industry benchmark results and guide
312- - ` docs/SPEC.md ` — Wire format specification
231+ - ` docs/BENCHMARKING.md ` — Benchmark results and methodology
313232- ` docs/ARCHITECTURE.md ` — System architecture
233+ - ` docs/MANUAL.md ` — User manual and CLI reference
234+ - ` docs/SPEC.md ` — Wire format specification
235+ - ` docs/TRANSFORMS.md ` — Transform pipeline status and calibration
236+ - ` docs/MSN_GUIDE.md ` — Multi-Scale Normalization user guide
237+ - ` docs/ROADMAP.md ` — Feature roadmap and known issues
238+ - ` docs/HARDWARE_ACCEL.md ` — Hardware acceleration backends
239+ - ` docs/RELEASE.md ` — Release process and CI/CD
314240- ` CONTRIBUTING.md ` — Contribution guidelines
315241- ` SECURITY.md ` — Security policy
242+ - ` CHANGELOG.md ` — Release notes
243+ - ` LEDGER.md ` — Session-by-session development record
316244
317245## License
318246