Commit 7859d51

tbitcs and oz-agent committed
docs: comprehensive documentation overhaul for v0.3.0 (Session 31)
Update all 9 maintained docs to reflect current codebase state:

- 21 crates, 12 entropy backends, 26 transforms, MSN disabled by default
- README: strip inline benchmarks, link to BENCHMARKING.md
- ARCHITECTURE: accurate crate tree, remove stale roadmap
- MANUAL: full 12-backend table, MSN opt-in callout
- MSN_SPEC: fix msn_metadata_len u16->u32, add FLAG_MSN_INLINE
- SPEC: correct all 26 transform IDs and 12 backend IDs
- RELEASE: 21-crate publish order, v0.3.0 examples
- ROADMAP: mark transcode Phase 1 and auto-analysis Phase 1 done
- BENCHMARKING: remove stale sessions, add corpora table
- MSN_GUIDE: add MSN-off-by-default notice

Also includes corpus download handler fixes and _safe_rglob_files() helper for Windows symlink enumeration in scripts/cpac.py.

Co-Authored-By: Oz <oz-agent@warp.dev>
1 parent dd001e3 commit 7859d51

11 files changed

Lines changed: 506 additions & 1160 deletions

LEDGER.md

Lines changed: 71 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -2,6 +2,77 @@
22

33
Session-by-session record of significant changes, investigations, and decisions.
44

5+
## Session 31 — 2026-03-15 (Documentation Overhaul + Corpus Download Fixes)
6+
7+
### Focus
8+
Comprehensive documentation overhaul: bring all 9 maintained docs up to date
9+
with the current v0.3.0 codebase (21 crates, 12 backends, 26 transforms,
10+
MSN disabled by default). Also fix corpus download handlers and Windows
11+
symlink enumeration in `cpac.py`.
12+
13+
### Documentation Updates (9 files)
14+
15+
1. **README.md** — Stripped all inline benchmark tables; added link to
16+
`docs/BENCHMARKING.md`. Updated crate count 16→21, features (12 backends,
17+
26+ transforms, MSN opt-in). Removed stale "Completed/Planned Features".
18+
19+
2. **docs/BENCHMARKING.md** — Removed Sessions 11–20 historical results; kept
20+
Session 21 as latest comprehensive benchmark. Added Session 22–30
21+
infrastructure notes. Full corpora table (17 corpora with sizes). Updated
22+
download commands to `--profile full`.
23+
24+
3. **docs/ARCHITECTURE.md** — v0.3.0 header, accurate 21-crate tree with new
25+
crates (cpac-lab, cpac-conditioning, cpac-predict, cpac-transcode, sys
26+
crates), 12 backends in pipeline diagram. Removed stale v0.2.0/v0.3.0
27+
roadmap. Updated references section.
28+
29+
4. **docs/MANUAL.md** — v0.3.0 header. Expanded §6 Entropy Backends from 5
30+
to 12 with full table. Added MSN-disabled-by-default callout in §7.
31+
Updated `--backend` help text.
32+
33+
5. **docs/MSN_GUIDE.md** — Added prominent opt-in notice at top. Updated
34+
version history: 19 domain handlers, u32 metadata, MSN off by default.
35+
36+
6. **docs/MSN_SPEC.md** — Fixed `msn_metadata_len` from u16 (2 bytes) to u32
37+
(4 bytes) matching code (`cpac-frame/src/lib.rs`). CP2 minimum header now
38+
correctly documented as 16 bytes. Added `FLAG_MSN_INLINE` (0x0001).
39+
Updated backend ID range 0x00–0x0B.
40+
41+
7. **docs/RELEASE.md** — v0.3.0 examples throughout. Updated crate publish
42+
order from 15 to 21 crates (added sys crates, cpac-conditioning,
43+
cpac-predict, cpac-lab, cpac-transcode).
44+
45+
8. **docs/SPEC.md** — v1.1. All 12 backend IDs (0x00–0x0B). All 26
46+
transform IDs corrected to match `cpac-transforms` source (old SPEC had
47+
wrong ID assignments for Delta/ZigZag/Transpose/etc.).
48+
49+
9. **docs/ROADMAP.md** — Marked transcode Phase 1 (cpac-transcode crate) and
50+
auto-analysis Phase 1 (cpac-lab auto_analyze module) as done.
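
The width change in item 6 affects any reader that parses the header by offset: a 2-byte length field cannot represent metadata longer than 65,535 bytes, and it occupies two fewer bytes than the 4-byte field the code writes. The Python sketch below illustrates only that difference. It assumes little-endian byte order and uses a hypothetical length of 70,000 bytes; the real byte order and field offsets are whatever `docs/MSN_SPEC.md` and `cpac-frame/src/lib.rs` define.

```python
import struct

# Assumption: little-endian is used purely for illustration; the actual
# byte order of the CP2 header is defined by the spec, not by this sketch.
U16_FMT = "<H"  # 2-byte unsigned length field (what the old spec text said)
U32_FMT = "<I"  # 4-byte unsigned length field (what the code uses)


def max_len(fmt: str) -> int:
    """Largest value an unsigned field with this struct format can hold."""
    return (1 << (8 * struct.calcsize(fmt))) - 1


def encode_len(fmt: str, n: int) -> bytes:
    """Pack a length value into the given fixed-width field."""
    return struct.pack(fmt, n)


metadata_len = 70_000  # hypothetical metadata size, larger than a u16 can hold
encoded = encode_len(U32_FMT, metadata_len)

try:
    encode_len(U16_FMT, metadata_len)
    fits_u16 = True
except struct.error:
    # struct refuses to pack a value outside the field's range
    fits_u16 = False
```

A parser built from the old u16 description would read only the first two bytes of this field and then misinterpret everything after it, which is why the spec fix matters even when metadata is small.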
51+
52+
### Corpus Download Fixes (scripts/cpac.py)
53+
54+
- Fixed `http_targz` handler for multi-URL corpora (docker_layers)
55+
- Fixed `http_gzip_multi` handler (github_events_large)
56+
- Fixed `http_zip_nested` handler (loghub2_full)
57+
- Added `_safe_rglob_files()` helper to skip Windows symlinks/reparse points
58+
(Alpine minirootfs untrusted reparse points caused WinError 448)
59+
- Applied safe enumeration to benchmark file collection, download summary,
60+
and "already present" check
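
The general idea behind a helper like `_safe_rglob_files()` can be sketched as follows. This is not the code in `scripts/cpac.py`; the function name `safe_rglob_files_sketch` is hypothetical, and the approach shown (walk with `os.scandir`, never follow links, skip anything flagged as a reparse point, swallow `OSError`) is one plausible way to avoid errors such as WinError 448.

```python
import os
import stat
from pathlib import Path
from typing import Iterator


def safe_rglob_files_sketch(root: Path) -> Iterator[Path]:
    """Yield regular files under root, skipping symlinks and reparse points."""
    stack = [Path(root)]
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as it:
                entries = list(it)
        except OSError:
            # Directory could not be listed (permissions, untrusted
            # reparse point, ...): skip it rather than abort the walk.
            continue
        for entry in entries:
            try:
                if entry.is_symlink():
                    continue
                st = entry.stat(follow_symlinks=False)
                # st_file_attributes exists only on Windows; default to 0 elsewhere.
                attrs = getattr(st, "st_file_attributes", 0)
                if attrs & stat.FILE_ATTRIBUTE_REPARSE_POINT:
                    continue
                if entry.is_dir(follow_symlinks=False):
                    stack.append(Path(entry.path))
                elif entry.is_file(follow_symlinks=False):
                    yield Path(entry.path)
            except OSError:
                # A single unreadable entry should not stop enumeration.
                continue
```

Because the generator never descends through a link, it also cannot loop on cyclic links, which plain recursive globbing may do on extracted root filesystems.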
61+
62+
### Files Modified
63+
- `README.md`
64+
- `docs/ARCHITECTURE.md`
65+
- `docs/BENCHMARKING.md`
66+
- `docs/MANUAL.md`
67+
- `docs/MSN_GUIDE.md`
68+
- `docs/MSN_SPEC.md`
69+
- `docs/RELEASE.md`
70+
- `docs/ROADMAP.md`
71+
- `docs/SPEC.md`
72+
- `scripts/cpac.py`
73+
74+
---
75+
576
## Session 30 — 2026-03-15 (Benchmark Infrastructure Fixes + YAML Parser Bug)
677

778
### Focus

README.md

Lines changed: 73 additions & 145 deletions
@@ -18,76 +18,56 @@ drop-in CLI for gzip/zstd/brotli workflows. Written in Rust.
1818
## Features
1919

2020
- **Adaptive pipeline** — SSR analysis auto-selects transforms and entropy backend per-file
21-
- **11 transforms** — delta, zigzag, transpose, ROLZ, float-split, field-LZ, range-pack,
22-
tokenize, prefix-strip, dedup, parse-int
21+
- **26+ transforms** — BWT (SA-IS), delta, zigzag, transpose, ROLZ, float-split, field-LZ,
22+
range-pack, tokenize, prefix-strip, dedup, parse-int, normalize, conditioned-BWT, predict,
23+
byte-plane, const-elim, and more
2324
- **SIMD acceleration** — runtime dispatch: AVX-512 → AVX2 → SSE4.1 → SSE2 → NEON → scalar
2425
- **DAG profiles** — composable transform chains with auto-select and 5 built-in profiles
25-
- **Block-parallel** — rayon-based parallel compress/decompress (CPBL wire format)
26+
- **Block-parallel** — rayon-based parallel compress/decompress (CPBL v1/v2/v3 wire formats)
2627
- **Memory-mapped I/O** — auto-mmap for files > 64 MB, manual `--mmap` flag
2728
- **Streaming** — block-based streaming with progress callbacks and adaptive block sizing
2829
- **12 entropy backends** — Zstd, Brotli, Gzip, LZMA, XZ, LZ4, Snappy, LZHAM, Lizard, zlib-ng, OpenZL, Raw
2930
- **Encryption** — ChaCha20-Poly1305, AES-256-GCM, Argon2 KDF
3031
- **Post-quantum crypto** — ML-KEM-768 + X25519 hybrid encryption (CPHE), ML-DSA-65 signatures
3132
- **Archives** — multi-file `.cpar` format with per-entry compression
32-
- **Domain handlers** — CSV, JSON, XML, YAML, log file specializations
33+
- **MSN domain handlers** — CSV, JSON, XML, YAML, syslog, Apache, and log file specializations (opt-in)
3334
- **CAS analysis** — constraint inference (range, enum, constant, monotonic, functional dependency)
34-
- **Benchmarking** — built-in benchmark suite with baselines (gzip-9, zstd-3, brotli-11, lzma-6),
35-
lossless verification, memory tracking, Criterion microbenchmarks, **industry-standard corpora**
36-
(Canterbury, Silesia, Calgary), automated batch runner with CSV/Markdown reports
37-
- **Corpus management** — YAML-driven corpus configs, automatic HTTP/ZIP/TAR.GZ downloads,
38-
progress bars, 18+ curated benchmark datasets
39-
- **Host detection** — CPU, cores, RAM, SIMD tier detection with safe auto-defaults
40-
- **Cross-platform** — Windows (primary), Linux, macOS; x86_64 and aarch64
4135
- **Transcode compression** — lossless image (PNG/BMP/TIFF/WebP) pixel-domain compression
4236
- **Auto-analysis** — directory-level analysis with YAML config generation
37+
- **Benchmarking** — profile-driven benchmark suite with 17+ curated corpora, 12 baseline
38+
backends, YAML-driven corpus configs, automatic HTTP/ZIP/TAR.GZ downloads
39+
- **Host detection** — CPU, cores, RAM, SIMD tier detection with safe auto-defaults
4340
- **Hardware acceleration** — pluggable accel layer (QAT, IAA, GPU, FPGA, SVE2 stubs)
44-
- **539+ tests** — comprehensive regression suite, golden vectors, property-based tests,
45-
determinism validation, transform-specific tests
46-
47-
## Agent Quick Start
48-
49-
This repository supports AI agent workflows with session commands.
50-
51-
When starting a new conversation with an AI agent (Warp, Claude, etc.) in this
52-
repository, use the following prompt to establish context:
53-
54-
**Standard load session:**
55-
```
56-
You are in the cpac repository (CPAC compression engine). First read AGENTS.md
57-
and WARP.md, then verify build with cargo build --workspace && cargo test --workspace.
58-
```
59-
60-
On startup, an agent SHOULD:
61-
1. Read `AGENTS.md` (architecture, entry points, conventions)
62-
2. Read `WARP.md` (build commands, presubmit checklist, commit style)
63-
3. Set Windows PATH if needed: `$env:PATH = "$env:USERPROFILE\.cargo\bin;$env:PATH"`
64-
4. Verify build: `cargo build --workspace && cargo test --workspace`
65-
5. Run presubmit before any commits: build, test, clippy, fmt (see WARP.md)
66-
67-
For full agent conventions and session behavior, see `AGENTS.md`.
41+
- **Cross-platform** — Windows (primary), Linux, macOS; x86_64 and aarch64
6842

6943
## Quick Start
7044

7145
```bash
7246
# Build
73-
cargo build --workspace
47+
cargo build --workspace --release
7448

75-
# Run tests (289+)
49+
# Run tests
7650
cargo test --workspace
7751

78-
# Install the CLI
79-
cargo install --path crates/cpac-cli
80-
8152
# Compress a file
8253
cpac compress myfile.txt
8354

8455
# Decompress
8556
cpac decompress myfile.txt.cpac
8657
```
8758

88-
### Windows note
59+
### Windows (PowerShell)
60+
61+
All commands go through the unified build system:
62+
63+
```powershell
64+
.\shell.ps1 build --release
65+
.\shell.ps1 test
66+
.\shell.ps1 benchmark-all # balanced profile (default)
67+
.\shell.ps1 benchmark-all --profile full # full profile (17 corpora)
68+
```
8969

90-
If `cargo` is not on your PATH in PowerShell:
70+
If `cargo` is not on your PATH:
9171

9272
```powershell
9373
$env:PATH = "$env:USERPROFILE\.cargo\bin;$env:PATH"
@@ -143,15 +123,15 @@ cpac completions powershell > cpac.ps1
143123

144124
## Architecture
145125

146-
CPAC is a 16-crate Cargo workspace. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)
126+
CPAC is a 21-crate Cargo workspace. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)
147127
for the full design.
148128

149129
```
150130
cpac-types Shared types, CpacError, ResourceConfig
151131
cpac-ssr Structural Summary Record analysis
152-
cpac-transforms 11 encoding transforms + SIMD kernels
132+
cpac-transforms 26+ encoding transforms + SIMD kernels
153133
cpac-dag DAG composition, profiles, auto-select
154-
cpac-entropy Zstd / Brotli / Gzip / LZMA / Raw backends
134+
cpac-entropy 12 entropy backends (Zstd, Brotli, Gzip, LZMA, XZ, LZ4, ...)
155135
cpac-frame Wire format encode/decode (CP frame)
156136
cpac-engine Top-level API, host detection, parallel, benchmarks
157137
cpac-cli Command-line interface (clap)
@@ -162,6 +142,12 @@ cpac-domains Domain-aware handlers
162142
cpac-cas Constraint-Aware Schema inference
163143
cpac-archive Multi-file .cpar archive format
164144
cpac-dict Dictionary training (Zstd)
145+
cpac-lab Benchmarking, calibration, auto-analysis
146+
cpac-conditioning Data conditioning / partitioning
147+
cpac-predict Prediction transforms
148+
cpac-transcode Lossless image transcode compression (CPTC)
149+
cpac-lizard-sys Lizard C library sys crate
150+
cpac-lzham-sys LZHAM C library sys crate
165151
cpac-ffi C/C++ FFI bindings
166152
```
167153

@@ -172,7 +158,7 @@ Compression pipeline: `SSR → Preprocess (transforms) → Entropy coding → Fr
172158
See [docs/SPEC.md](docs/SPEC.md) for complete wire format specifications.
173159

174160
- **CP** — standard CPAC frame (single-block)
175-
- **CPBL** — block-parallel frame (multi-block, rayon)
161+
- **CPBL** — block-parallel frame (v1 basic, v2 shared MSN metadata, v3 auto-dictionary)
176162
- **TP** — transform preprocess frame
177163
- **CS** — streaming frame
178164
- **CPHE** — hybrid post-quantum encryption frame
@@ -193,126 +179,68 @@ cargo build --profile release-small
193179

194180
## Performance Benchmarks
195181

196-
### Latest Results (Session 21 — Post Phase 1–6 Optimizations)
197-
198-
**Date:** March 2026 | **Version:** 0.3.0 | **Platform:** Windows x86_64, Rust 1.93+
199-
**Methodology:** Balanced profile, 3 iterations, 12 entropy backends, 777 files across 8 standard corpora. All results verified lossless.
200-
201-
#### Headline Numbers (Best Ratio per Corpus)
202-
203-
| Corpus | Files | Avg Best | Range |
204-
|--------|-------|----------|-------|
205-
| loghub2_2k | 14 | **16.63×** | 6.97–32.63× |
206-
| nasa_logs | 4 | **8.56×** | 1.00–16.23× |
207-
| canterbury | 11 | **5.84×** | 2.86–20.96× |
208-
| silesia | 12 | **4.30×** | 1.63–12.42× |
209-
| calgary | 18 | **4.03×** | 1.85–12.56× |
210-
| enwik8 | 1 | **3.75×** ||
211-
| cloud_configs | 691 | **3.63×** | 0.00–17.98× |
212-
213-
#### Zstd Ratio by CPAC Level (corpus average)
214-
215-
| Corpus | Fast | Default | High | Best |
216-
|--------|------|---------|------|------|
217-
| loghub2_2k | 11.04× | 12.87× | 13.61× | **15.25×** |
218-
| nasa_logs | 5.41× | 6.30× | 7.09× | **8.41×** |
219-
| silesia | 3.17× | 3.46× | 3.71× | **4.04×** |
220-
| canterbury | 3.96× | 4.17× | 4.22× | **5.06×** |
221-
| enwik8 | 2.82× | 3.06× | 3.27× | **3.64×** |
222-
223-
#### Key Achievements
182+
For full benchmark results, methodology, and corpus descriptions, see
183+
**[docs/BENCHMARKING.md](docs/BENCHMARKING.md)**.
224184

225-
- **16.63×** on loghub2_2k — MSN semantic extraction + BWT + auto-dictionary
226-
- **777 files benchmarked** across 8 standard corpora (773 OK, 4 timeout)
227-
- **12 entropy backends** — Zstd, Brotli, Gzip, LZMA, XZ, LZ4, Snappy, LZHAM, Lizard, zlib-ng, OpenZL, Raw
228-
- **100% lossless** — verified roundtrip across all measurements
229-
- **Parallel + streaming** — auto-parallel >256 KB, streaming with bounded memory
230-
- **Post-quantum encryption** — ML-KEM-768 + X25519 hybrid, ML-DSA-65 signatures
231-
232-
#### Optimization Phases
185+
**Headline numbers** (Session 21, balanced profile, 8 corpora, 777 files):
233186

234-
**Phases 1–2:** Re-enabled parallel smart transforms, CPBL v2 shared MSN header
235-
**Phases 3–4:** Auto-dictionary (CPBL v3), ConditionedBwtTransform (ID 26)
236-
**Phases 5–6:** Per-block backend selection, CAS bridge for MSN fields
187+
- **loghub2_2k**: 16.63× average best ratio (log data)
188+
- **nasa_logs**: 8.56× (HTTP access logs)
189+
- **silesia**: 4.30× (mixed content)
190+
- **enwik8**: 3.75× (Wikipedia XML)
191+
- **776/776 files OK** on full profile (Session 30, 0 timeouts)
192+
- **12 entropy backends**, 100% lossless verified
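
For readers unfamiliar with the metric, the sketch below shows one plausible reading of "average best ratio": for each file, take the highest original-size to compressed-size ratio across backends, then average over files. The backends here are Python standard-library codecs used as stand-ins, not CPAC's own backends, and the function names are hypothetical.

```python
import bz2
import lzma
import zlib
from statistics import mean

# Stand-in backends from the standard library (assumption: these only
# illustrate the metric and are unrelated to CPAC's 12 backends).
BACKENDS = {
    "zlib": lambda data: zlib.compress(data, 9),
    "bz2": lambda data: bz2.compress(data, 9),
    "lzma": lambda data: lzma.compress(data),
}


def best_ratio(data: bytes) -> tuple[str, float]:
    """Return (backend name, ratio) for the backend with the highest ratio."""
    ratios = {name: len(data) / len(fn(data)) for name, fn in BACKENDS.items()}
    name = max(ratios, key=ratios.get)
    return name, ratios[name]


def average_best_ratio(files: list[bytes]) -> float:
    """Mean of per-file best ratios; empty files are ignored."""
    return mean(best_ratio(data)[1] for data in files if data)


# Two tiny synthetic "files": a repetitive log-like buffer and a byte ramp.
sample = [b"GET /index.html 200\n" * 500, bytes(range(256)) * 4]
winner, ratio = best_ratio(sample[0])
corpus_average = average_best_ratio(sample)
```

A ratio of 16.63× under this definition would mean the compressed output is roughly 6% of the input size on average for that corpus.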
237193

238-
### Benchmark Profiles
194+
Benchmark profiles: quick (1 iter), balanced (3 iter), full (10 iter, 17 corpora).
239195

240-
```bash
241-
# Single file benchmark with baselines
242-
cpac benchmark myfile.txt
196+
## Compression Presets
243197

244-
# Profile options (matches Python engine)
245-
# Quick: 1 iteration (fast validation)
246-
# Balanced: 3 iterations (default, reliable)
247-
# Full: 10 iterations (publication-grade)
248-
```
249-
250-
### Criterion Microbenchmarks
198+
| Preset | Level | Smart | Use Case |
199+
|--------|-------|-------|----------|
200+
| `turbo` | Fast | off | Maximum throughput, real-time pipelines |
201+
| `balanced` | Default | on | General purpose, good ratio/speed balance |
202+
| `maximum` | High | on | Best ratio with reasonable speed |
203+
| `archive` | Best | on | Cold storage, archival workloads |
251204

252205
```bash
253-
# Full Criterion suite
254-
cargo bench -p cpac-engine
255-
256-
# Individual bench suites
257-
cargo bench -p cpac-engine --bench compress # pipeline + backends
258-
cargo bench -p cpac-engine --bench simd # SIMD vs scalar
259-
cargo bench -p cpac-engine --bench dag # DAG compile + execute
206+
cpac compress --preset archive big_dataset.tar
207+
cpac compress --preset turbo streaming_logs.jsonl
260208
```
261209

262-
## Completed Features (Phase 1+2) ✓
263-
264-
- **Dictionary training** — Zstd dictionary compression/decompression via stream API
265-
- **SIMD acceleration** — AVX2 kernels for delta encoding with runtime CPU detection
266-
- **Streaming API** — Block-based streaming with progress callbacks (CS format)
267-
- **C/C++ FFI** — Complete bindings in `cpac-ffi` crate with cbindgen headers
268-
- **Python bindings** — PyO3-based bindings in `cpac-py` (submodule)
269-
- **Additional transforms** — BWT, MTF added to transform library
270-
- **ARM SIMD** — NEON scaffolding and SVE/SVE2 infrastructure
271-
- **Memory pool** — Buffer pool infrastructure (signal-driven activation)
272-
- **Parallel compression** — Block-parallel CPBL format with auto-enable >1MB
273-
274-
## Planned Features
275-
276-
### Near-Term (Signal-Driven, Phase 3+)
277-
278-
All future optimizations are **bottleneck signal-driven**. See `BENCHMARKING.md` for the full corpus results.
279-
280-
**Top Priorities** (when signals indicate):
281-
- **Memory pool activation** — When profiling shows >10% time in allocator
282-
- **Dictionary caching** — When training overhead >1s on repeated corpora
283-
- **ARM NEON implementation** — When profiling shows significant scalar fallback time
284-
- **Preprocessing cache** — When >5% time in transform trial logic
285-
286-
### Long-Term (Phase 4+)
287-
288-
- **GPU acceleration** — CUDA/ROCm kernels for high-throughput systems (>10 GB/s)
289-
- **Networked compression** — client/server mode with delta sync
290-
- **WASM target** — browser-based compression with SIMD.js fallback
291-
- **ML-based selection** — trained models for backend/transform selection
292-
293-
### Long-term
294-
295-
- **Approximate compression** — lossy modes for numerical data
296-
- **Neural codec integration** — learned compression for specific domains
297-
- **Distributed compression** — map/reduce across cluster
298-
- **Hardware offload** — FPGA/ASIC integration for high-throughput
299-
- **Format versioning** — backward-compatible wire format evolution
300-
301210
## Requirements
302211

303212
- **Rust** 1.75+ stable (tested on 1.93)
304213
- **Platforms**: Windows x86_64 (primary), Linux x86_64/aarch64, macOS x86_64/aarch64
305214
- **Optional**: Gnuplot (for Criterion HTML reports)
306215

307-
## Project Files
216+
## Agent Quick Start
217+
218+
This repository supports AI agent workflows. When starting a new conversation
219+
with an AI agent in this repository:
220+
221+
```
222+
Read AGENTS.md and WARP.md, then verify build with .\shell.ps1 build && .\shell.ps1 test.
223+
```
224+
225+
For full agent conventions and session behavior, see `AGENTS.md`.
226+
227+
## Documentation
308228

309229
- `AGENTS.md` — AI agent onboarding guide
310230
- `WARP.md` — Warp IDE project rules
311-
- `BENCHMARKING.md` — Industry benchmark results and guide
312-
- `docs/SPEC.md` — Wire format specification
231+
- `docs/BENCHMARKING.md` — Benchmark results and methodology
313232
- `docs/ARCHITECTURE.md` — System architecture
233+
- `docs/MANUAL.md` — User manual and CLI reference
234+
- `docs/SPEC.md` — Wire format specification
235+
- `docs/TRANSFORMS.md` — Transform pipeline status and calibration
236+
- `docs/MSN_GUIDE.md` — Multi-Scale Normalization user guide
237+
- `docs/ROADMAP.md` — Feature roadmap and known issues
238+
- `docs/HARDWARE_ACCEL.md` — Hardware acceleration backends
239+
- `docs/RELEASE.md` — Release process and CI/CD
314240
- `CONTRIBUTING.md` — Contribution guidelines
315241
- `SECURITY.md` — Security policy
242+
- `CHANGELOG.md` — Release notes
243+
- `LEDGER.md` — Session-by-session development record
316244

317245
## License
318246
