diff --git a/LEDGER.md b/LEDGER.md
Lines changed: 71 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 73 additions & 145 deletions
@@ -2,6 +2,77 @@
 
 Session-by-session record of significant changes, investigations, and decisions.
 
+## Session 31 — 2026-03-15 (Documentation Overhaul + Corpus Download Fixes)
+
+### Focus
+Comprehensive documentation overhaul: bring all 9 maintained docs up to date
+with the current v0.3.0 codebase (21 crates, 12 backends, 26 transforms,
+MSN disabled by default).  Also fix corpus download handlers and Windows
+symlink enumeration in `cpac.py`.
+
+### Documentation Updates (9 files)
+
+1. **README.md** — Stripped all inline benchmark tables; added link to
+   `docs/BENCHMARKING.md`.  Updated crate count 16→21, features (12 backends,
+   26+ transforms, MSN opt-in).  Removed stale "Completed/Planned Features".
+
+2. **docs/BENCHMARKING.md** — Removed Sessions 11–20 historical results; kept
+   Session 21 as latest comprehensive benchmark.  Added Session 22–30
+   infrastructure notes.  Full corpora table (17 corpora with sizes).  Updated
+   download commands to `--profile full`.
+
+3. **docs/ARCHITECTURE.md** — v0.3.0 header, accurate 21-crate tree with new
+   crates (cpac-lab, cpac-conditioning, cpac-predict, cpac-transcode, sys
+   crates), 12 backends in pipeline diagram.  Removed stale v0.2.0/v0.3.0
+   roadmap.  Updated references section.
+
+4. **docs/MANUAL.md** — v0.3.0 header.  Expanded §6 Entropy Backends from 5
+   to 12 with full table.  Added MSN-disabled-by-default callout in §7.
+   Updated `--backend` help text.
+
+5. **docs/MSN_GUIDE.md** — Added prominent opt-in notice at top.  Updated
+   version history: 19 domain handlers, u32 metadata, MSN off by default.
+
+6. **docs/MSN_SPEC.md** — Fixed `msn_metadata_len` from u16 (2 bytes) to u32
+   (4 bytes) matching code (`cpac-frame/src/lib.rs`).  CP2 minimum header now
+   correctly documented as 16 bytes.  Added `FLAG_MSN_INLINE` (0x0001).
+   Updated backend ID range 0x00–0x0B.
+
+7. **docs/RELEASE.md** — v0.3.0 examples throughout.  Updated crate publish
+   order from 15 to 21 crates (added sys crates, cpac-conditioning,
+   cpac-predict, cpac-lab, cpac-transcode).
+
+8. **docs/SPEC.md** — v1.1.  All 12 backend IDs (0x00–0x0B).  All 26
+   transform IDs corrected to match `cpac-transforms` source (old SPEC had
+   wrong ID assignments for Delta/ZigZag/Transpose/etc.).
+
+9. **docs/ROADMAP.md** — Marked transcode Phase 1 (cpac-transcode crate) and
+   auto-analysis Phase 1 (cpac-lab auto_analyze module) as done.
+
+### Corpus Download Fixes (scripts/cpac.py)
+
+- Fixed `http_targz` handler for multi-URL corpora (docker_layers)
+- Fixed `http_gzip_multi` handler (github_events_large)
+- Fixed `http_zip_nested` handler (loghub2_full)
+- Added `_safe_rglob_files()` helper to skip Windows symlinks/reparse points
+  (Alpine minirootfs untrusted reparse points caused WinError 448)
+- Applied safe enumeration to benchmark file collection, download summary,
+  and "already present" check
+
+### Files Modified
+- `README.md`
+- `docs/ARCHITECTURE.md`
+- `docs/BENCHMARKING.md`
+- `docs/MANUAL.md`
+- `docs/MSN_GUIDE.md`
+- `docs/MSN_SPEC.md`
+- `docs/RELEASE.md`
+- `docs/ROADMAP.md`
+- `docs/SPEC.md`
+- `scripts/cpac.py`
+
+---
+
 ## Session 30 — 2026-03-15 (Benchmark Infrastructure Fixes + YAML Parser Bug)
 
 ### Focus
 
@@ -18,76 +18,56 @@ drop-in CLI for gzip/zstd/brotli workflows. Written in Rust.
 ## Features
 
 - **Adaptive pipeline** — SSR analysis auto-selects transforms and entropy backend per-file
-- **11 transforms** — delta, zigzag, transpose, ROLZ, float-split, field-LZ, range-pack,
-  tokenize, prefix-strip, dedup, parse-int
+- **26+ transforms** — BWT (SA-IS), delta, zigzag, transpose, ROLZ, float-split, field-LZ,
+  range-pack, tokenize, prefix-strip, dedup, parse-int, normalize, conditioned-BWT, predict,
+  byte-plane, const-elim, and more
 - **SIMD acceleration** — runtime dispatch: AVX-512 → AVX2 → SSE4.1 → SSE2 → NEON → scalar
 - **DAG profiles** — composable transform chains with auto-select and 5 built-in profiles
-- **Block-parallel** — rayon-based parallel compress/decompress (CPBL wire format)
+- **Block-parallel** — rayon-based parallel compress/decompress (CPBL v1/v2/v3 wire formats)
 - **Memory-mapped I/O** — auto-mmap for files > 64 MB, manual `--mmap` flag
 - **Streaming** — block-based streaming with progress callbacks and adaptive block sizing
 - **12 entropy backends** — Zstd, Brotli, Gzip, LZMA, XZ, LZ4, Snappy, LZHAM, Lizard, zlib-ng, OpenZL, Raw
 - **Encryption** — ChaCha20-Poly1305, AES-256-GCM, Argon2 KDF
 - **Post-quantum crypto** — ML-KEM-768 + X25519 hybrid encryption (CPHE), ML-DSA-65 signatures
 - **Archives** — multi-file `.cpar` format with per-entry compression
-- **Domain handlers** — CSV, JSON, XML, YAML, log file specializations
+- **MSN domain handlers** — CSV, JSON, XML, YAML, syslog, Apache, and log file specializations (opt-in)
 - **CAS analysis** — constraint inference (range, enum, constant, monotonic, functional dependency)
-- **Benchmarking** — built-in benchmark suite with baselines (gzip-9, zstd-3, brotli-11, lzma-6),
-  lossless verification, memory tracking, Criterion microbenchmarks, **industry-standard corpora**
-  (Canterbury, Silesia, Calgary), automated batch runner with CSV/Markdown reports
-- **Corpus management** — YAML-driven corpus configs, automatic HTTP/ZIP/TAR.GZ downloads,
-  progress bars, 18+ curated benchmark datasets
-- **Host detection** — CPU, cores, RAM, SIMD tier detection with safe auto-defaults
-- **Cross-platform** — Windows (primary), Linux, macOS; x86_64 and aarch64
 - **Transcode compression** — lossless image (PNG/BMP/TIFF/WebP) pixel-domain compression
 - **Auto-analysis** — directory-level analysis with YAML config generation
+- **Benchmarking** — profile-driven benchmark suite with 17+ curated corpora, 12 baseline
+  backends, YAML-driven corpus configs, automatic HTTP/ZIP/TAR.GZ downloads
+- **Host detection** — CPU, cores, RAM, SIMD tier detection with safe auto-defaults
 - **Hardware acceleration** — pluggable accel layer (QAT, IAA, GPU, FPGA, SVE2 stubs)
-- **539+ tests** — comprehensive regression suite, golden vectors, property-based tests,
-  determinism validation, transform-specific tests
-
-## Agent Quick Start
-
-This repository supports AI agent workflows with session commands.
-
-When starting a new conversation with an AI agent (Warp, Claude, etc.) in this
-repository, use the following prompt to establish context:
-
-**Standard load session:**
-```
-You are in the cpac repository (CPAC compression engine). First read AGENTS.md
-and WARP.md, then verify build with cargo build --workspace && cargo test --workspace.
-```
-
-On startup, an agent SHOULD:
-1. Read `AGENTS.md` (architecture, entry points, conventions)
-2. Read `WARP.md` (build commands, presubmit checklist, commit style)
-3. Set Windows PATH if needed: `$env:PATH = "$env:USERPROFILE\.cargo\bin;$env:PATH"`
-4. Verify build: `cargo build --workspace && cargo test --workspace`
-5. Run presubmit before any commits: build, test, clippy, fmt (see WARP.md)
-
-For full agent conventions and session behavior, see `AGENTS.md`.
+- **Cross-platform** — Windows (primary), Linux, macOS; x86_64 and aarch64
 
 ## Quick Start
 
 ```bash
 # Build
-cargo build --workspace
+cargo build --workspace --release
 
-# Run tests (289+)
+# Run tests
 cargo test --workspace
 
-# Install the CLI
-cargo install --path crates/cpac-cli
-
 # Compress a file
 cpac compress myfile.txt
 
 # Decompress
 cpac decompress myfile.txt.cpac
 ```
 
-### Windows note
+### Windows (PowerShell)
+
+All commands go through the unified build script, `shell.ps1`:
+
+```powershell
+.\shell.ps1 build --release
+.\shell.ps1 test
+.\shell.ps1 benchmark-all                  # balanced profile (default)
+.\shell.ps1 benchmark-all --profile full   # full profile (17 corpora)
+```
 
-If `cargo` is not on your PATH in PowerShell:
+If `cargo` is not on your PATH:
 
 ```powershell
 $env:PATH = "$env:USERPROFILE\.cargo\bin;$env:PATH"
@@ -143,15 +123,15 @@ cpac completions powershell > cpac.ps1
 
 ## Architecture
 
-CPAC is a 16-crate Cargo workspace. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)
+CPAC is a 21-crate Cargo workspace. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)
 for the full design.
 
 ```
 cpac-types          Shared types, CpacError, ResourceConfig
 cpac-ssr            Structural Summary Record analysis
-cpac-transforms     11 encoding transforms + SIMD kernels
+cpac-transforms     26+ encoding transforms + SIMD kernels
 cpac-dag            DAG composition, profiles, auto-select
-cpac-entropy        Zstd / Brotli / Gzip / LZMA / Raw backends
+cpac-entropy        12 entropy backends (Zstd, Brotli, Gzip, LZMA, XZ, LZ4, ...)
 cpac-frame          Wire format encode/decode (CP frame)
 cpac-engine         Top-level API, host detection, parallel, benchmarks
 cpac-cli            Command-line interface (clap)
@@ -162,6 +142,12 @@ cpac-domains        Domain-aware handlers
 cpac-cas            Constraint-Aware Schema inference
 cpac-archive        Multi-file .cpar archive format
 cpac-dict           Dictionary training (Zstd)
+cpac-lab            Benchmarking, calibration, auto-analysis
+cpac-conditioning   Data conditioning / partitioning
+cpac-predict        Prediction transforms
+cpac-transcode      Lossless image transcode compression (CPTC)
+cpac-lizard-sys     Lizard C library sys crate
+cpac-lzham-sys      LZHAM C library sys crate
 cpac-ffi            C/C++ FFI bindings
 ```
 
@@ -172,7 +158,7 @@ Compression pipeline: `SSR → Preprocess (transforms) → Entropy coding → Fr
 See [docs/SPEC.md](docs/SPEC.md) for complete wire format specifications.
 
 - **CP** — standard CPAC frame (single-block)
-- **CPBL** — block-parallel frame (multi-block, rayon)
+- **CPBL** — block-parallel frame (v1 basic, v2 shared MSN metadata, v3 auto-dictionary)
 - **TP** — transform preprocess frame
 - **CS** — streaming frame
 - **CPHE** — hybrid post-quantum encryption frame
@@ -193,126 +179,68 @@ cargo build --profile release-small
 
 ## Performance Benchmarks
 
-### Latest Results (Session 21 — Post Phase 1–6 Optimizations)
-
-**Date:** March 2026 | **Version:** 0.3.0 | **Platform:** Windows x86_64, Rust 1.93+  
-**Methodology:** Balanced profile, 3 iterations, 12 entropy backends, 777 files across 8 standard corpora. All results verified lossless.
-
-#### Headline Numbers (Best Ratio per Corpus)
-
-| Corpus | Files | Avg Best | Range |
-|--------|-------|----------|-------|
-| loghub2_2k | 14 | **16.63×** | 6.97–32.63× |
-| nasa_logs | 4 | **8.56×** | 1.00–16.23× |
-| canterbury | 11 | **5.84×** | 2.86–20.96× |
-| silesia | 12 | **4.30×** | 1.63–12.42× |
-| calgary | 18 | **4.03×** | 1.85–12.56× |
-| enwik8 | 1 | **3.75×** | — |
-| cloud_configs | 691 | **3.63×** | 0.00–17.98× |
-
-#### Zstd Ratio by CPAC Level (corpus average)
-
-| Corpus | Fast | Default | High | Best |
-|--------|------|---------|------|------|
-| loghub2_2k | 11.04× | 12.87× | 13.61× | **15.25×** |
-| nasa_logs | 5.41× | 6.30× | 7.09× | **8.41×** |
-| silesia | 3.17× | 3.46× | 3.71× | **4.04×** |
-| canterbury | 3.96× | 4.17× | 4.22× | **5.06×** |
-| enwik8 | 2.82× | 3.06× | 3.27× | **3.64×** |
-
-#### Key Achievements
+For full benchmark results, methodology, and corpus descriptions, see
+**[docs/BENCHMARKING.md](docs/BENCHMARKING.md)**.
 
-- **16.63×** on loghub2_2k — MSN semantic extraction + BWT + auto-dictionary
-- **777 files benchmarked** across 8 standard corpora (773 OK, 4 timeout)
-- **12 entropy backends** — Zstd, Brotli, Gzip, LZMA, XZ, LZ4, Snappy, LZHAM, Lizard, zlib-ng, OpenZL, Raw
-- **100% lossless** — verified roundtrip across all measurements
-- **Parallel + streaming** — auto-parallel >256 KB, streaming with bounded memory
-- **Post-quantum encryption** — ML-KEM-768 + X25519 hybrid, ML-DSA-65 signatures
-
-#### Optimization Phases
+**Headline numbers** (Session 21, balanced profile, 8 corpora, 777 files):
 
-**Phases 1–2:** Re-enabled parallel smart transforms, CPBL v2 shared MSN header  
-**Phases 3–4:** Auto-dictionary (CPBL v3), ConditionedBwtTransform (ID 26)  
-**Phases 5–6:** Per-block backend selection, CAS bridge for MSN fields
+- **loghub2_2k**: 16.63× average best ratio (log data)
+- **nasa_logs**: 8.56× (HTTP access logs)
+- **silesia**: 4.30× (mixed content)
+- **enwik8**: 3.75× (Wikipedia XML)
+- **776/776 files OK** on full profile (Session 30, 0 timeouts)
+- **12 entropy backends**, 100% lossless verified
 
-### Benchmark Profiles
+Benchmark profiles: quick (1 iteration), balanced (3 iterations), full (10 iterations, 17 corpora).
 
-```bash
-# Single file benchmark with baselines
-cpac benchmark myfile.txt
+## Compression Presets
 
-# Profile options (matches Python engine)
-# Quick: 1 iteration (fast validation)
-# Balanced: 3 iterations (default, reliable)
-# Full: 10 iterations (publication-grade)
-```
-
-### Criterion Microbenchmarks
+| Preset | Level | Smart | Use Case |
+|--------|-------|-------|----------|
+| `turbo` | Fast | off | Maximum throughput, real-time pipelines |
+| `balanced` | Default | on | General purpose, good ratio/speed balance |
+| `maximum` | High | on | Best ratio with reasonable speed |
+| `archive` | Best | on | Cold storage, archival workloads |
 
 ```bash
-# Full Criterion suite
-cargo bench -p cpac-engine
-
-# Individual bench suites
-cargo bench -p cpac-engine --bench compress    # pipeline + backends
-cargo bench -p cpac-engine --bench simd        # SIMD vs scalar
-cargo bench -p cpac-engine --bench dag         # DAG compile + execute
+cpac compress --preset archive big_dataset.tar
+cpac compress --preset turbo streaming_logs.jsonl
 ```
 
-## Completed Features (Phase 1+2) ✓
-
-- ✓ **Dictionary training** — Zstd dictionary compression/decompression via stream API
-- ✓ **SIMD acceleration** — AVX2 kernels for delta encoding with runtime CPU detection
-- ✓ **Streaming API** — Block-based streaming with progress callbacks (CS format)
-- ✓ **C/C++ FFI** — Complete bindings in `cpac-ffi` crate with cbindgen headers
-- ✓ **Python bindings** — PyO3-based bindings in `cpac-py` (submodule)
-- ✓ **Additional transforms** — BWT, MTF added to transform library
-- ✓ **ARM SIMD** — NEON scaffolding and SVE/SVE2 infrastructure
-- ✓ **Memory pool** — Buffer pool infrastructure (signal-driven activation)
-- ✓ **Parallel compression** — Block-parallel CPBL format with auto-enable >1MB
-
-## Planned Features
-
-### Near-Term (Signal-Driven, Phase 3+)
-
-All future optimizations are **bottleneck signal-driven**. See `BENCHMARKING.md` for the full corpus results.
-
-**Top Priorities** (when signals indicate):
-- **Memory pool activation** — When profiling shows >10% time in allocator
-- **Dictionary caching** — When training overhead >1s on repeated corpora
-- **ARM NEON implementation** — When profiling shows significant scalar fallback time
-- **Preprocessing cache** — When >5% time in transform trial logic
-
-### Long-Term (Phase 4+)
-
-- **GPU acceleration** — CUDA/ROCm kernels for high-throughput systems (>10 GB/s)
-- **Networked compression** — client/server mode with delta sync
-- **WASM target** — browser-based compression with SIMD.js fallback
-- **ML-based selection** — trained models for backend/transform selection
-
-### Long-term
-
-- **Approximate compression** — lossy modes for numerical data
-- **Neural codec integration** — learned compression for specific domains
-- **Distributed compression** — map/reduce across cluster
-- **Hardware offload** — FPGA/ASIC integration for high-throughput
-- **Format versioning** — backward-compatible wire format evolution
-
 ## Requirements
 
 - **Rust** 1.75+ stable (tested on 1.93)
 - **Platforms**: Windows x86_64 (primary), Linux x86_64/aarch64, macOS x86_64/aarch64
 - **Optional**: Gnuplot (for Criterion HTML reports)
 
-## Project Files
+## Agent Quick Start
+
+This repository supports AI agent workflows. When starting a new conversation
+with an AI agent in this repository, open with this prompt:
+
+```
+Read AGENTS.md and WARP.md, then verify build with .\shell.ps1 build && .\shell.ps1 test.
+```
+
+For full agent conventions and session behavior, see `AGENTS.md`.
+
+## Documentation
 
 - `AGENTS.md` — AI agent onboarding guide
 - `WARP.md` — Warp IDE project rules
-- `BENCHMARKING.md` — Industry benchmark results and guide
-- `docs/SPEC.md` — Wire format specification
+- `docs/BENCHMARKING.md` — Benchmark results and methodology
 - `docs/ARCHITECTURE.md` — System architecture
+- `docs/MANUAL.md` — User manual and CLI reference
+- `docs/SPEC.md` — Wire format specification
+- `docs/TRANSFORMS.md` — Transform pipeline status and calibration
+- `docs/MSN_GUIDE.md` — Multi-Scale Normalization user guide
+- `docs/ROADMAP.md` — Feature roadmap and known issues
+- `docs/HARDWARE_ACCEL.md` — Hardware acceleration backends
+- `docs/RELEASE.md` — Release process and CI/CD
 - `CONTRIBUTING.md` — Contribution guidelines
 - `SECURITY.md` — Security policy
+- `CHANGELOG.md` — Release notes
+- `LEDGER.md` — Session-by-session development record
 
 ## License