-
Notifications
You must be signed in to change notification settings - Fork 1
Ixon serialization rewrite with large speed and size reduction #288
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Major rewrite of the Ixon serialization format optimized for space efficiency.
Now Ix + Lean in Lean environment compiles in 10.21 seconds with
serialized size of 1.6 GB
Key changes:
- Split ixon module into separate files (constant, env, expr, univ, metadata,
serialize, sharing, tag, comm, error)
- New tag-based encoding (Tag0, Tag2, Tag4) for compact representation
- Refs table: deduplicates constant references using indices instead of
inline 32-byte addresses
- Univs table: deduplicates universe terms per constant
- Global name indexing: shares names across entire environment, metadata
uses u64 indices instead of 32-byte addresses
- Sharing analysis: identifies shared subexpressions within constants
- Pre-order traversal indexing for expression metadata
The old ixon module preserved as ixon_old.rs for migration reference.
Main fix: build_sharing_vec now re-sorts shared hashes in topological order (leaves first) instead of using gross-benefit order from decide_sharing. Previously, large compound expressions came first and couldn't reference their children via Share(idx), causing each entry to inline entire subtrees (30-35KB per entry instead of 3-13 bytes). This caused 1140x blow-up for pathological blocks like denote_blastDivSubtractShift_q. Also fixes hash-consed size calculation to include structural overhead for Inductive/Constructor/Recursor data structures, not just expressions. This makes the comparison fair - previously simple enums like SymbolKind showed Ixon as worse than hash-consing (1.91x) when it's actually 21x better. Results after fix: - Overall ratio: 0.071x (Ixon is 14x more efficient than hash-consing) - 0 blocks with ratio >= 1.0x (previously 4) - 11,722 blocks with 20x+ compression - Total serialized: 159MB vs 2.24GB hash-consed Other changes: - Remove early-break bug in decide_sharing that skipped profitable candidates - Add TRACK_HASH_CONSED_SIZE and ANALYZE_SHARING debug flags - Add analyze_sharing_stats() for debugging pathological cases - Add test_early_break_bug unit test - Add granular ratio distribution buckets in size analysis output
…tion,
refactor FFI layer into focused modules, and migrate pipeline to Ix types
Introduce a pure Lean compiler (CompileM) that produces bit-identical
Ixon output to the existing Rust compiler, enabling cross-implementation
verification. The compiler supports both sequential and wave-based
parallel compilation with optional fail-fast mismatch detection against
the Rust reference implementation.
Lean-side additions:
- Ix/CompileM.lean: ~2000-line pure Lean compiler with reader/state/except
monad stack. Supports parallel compilation via wave-based scheduling with
dedicated worker threads and channels.
- Ix/DecompileV2.lean: Decompiler from Ixon format back to Lean expressions,
resolving refs/univs tables and expanding sharing references.
- Ix/ShardMap.lean: Concurrent hashmap with power-of-2 sharding, cache-line
padding, double-checked locking, and bulk insert parallelism.
- Ix/Sharing.lean: Pure Lean sharing analysis using alpha-invariant
Merkle-tree hashing (blake3). Two-phase O(n) algorithm with DAG
construction via post-order traversal and profitability heuristic.
Lean-side migrations:
- GraphM, CondenseM, Mutual: migrated from Lean.Name/Lean.Expr to
Ix.Name/Ix.Expr throughout the compilation pipeline.
- Ix/Ixon.lean: expanded with metadata types (ExprMetaData, ExprMetaArena,
ConstantMeta, Named, Comm), full serialization for metadata, and
array-based RawEnv types for FFI boundaries.
- MutCtx definition moved from Common.lean to Environment.lean.
Rust-side refactoring:
- Split monolithic src/lean/ffi/ixon.rs (1959 lines) into ixon/ module
with focused files: compare, constant, enums, env, expr, meta,
serialize, sharing, univ.
- Split src/lean/ffi/lean_env_builder.rs (2551 lines) into builder.rs,
compile.rs, graph.rs, and ix/ submodule (address, constant, data, env,
expr, level, name).
- Added primitives.rs for basic Lean type FFI roundtrips.
- Preserved legacy FFI in ixon_legacy.rs.
- Updated compile.rs and decompile.rs for new metadata and canonicalization
pipeline.
Test infrastructure:
- Reorganized test generators into Tests/Gen/{Basic,Ix,Ixon}.lean.
- Reorganized FFI tests into Tests/FFI/{Basic,Ix,Ixon}.lean.
- Added property-based roundtrip tests for all FFI types using SlimCheck.
- Added cross-implementation serialization comparison tests (Lean vs Rust
byte-level equality for univs, exprs, constants, and envs).
- New domain tests: Tests/Ix/{CondenseM,GraphM,Sharing}.lean,
Tests/ShardMap.lean.
- Removed obsolete Tests/FFIConsistency.lean.
Key design decisions:
- Arena-based expression metadata separates semantic content (alpha-invariant
constants) from non-semantic metadata (binder names, reducibility hints).
- Array-based "Raw" types bridge the FFI boundary, avoiding hash function
mismatches between Lean and Rust.
- All serialized sections sorted by address for deterministic output.
- Name indices use DFS topological sort with address-based tie-breaking.
| The sharing vector is built incrementally: | ||
| - Each entry can only reference earlier entries (no forward references) | ||
| - Entries are sorted by profitability (most savings first) | ||
| - Root expressions are rewritten using all available share indices |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if someone does not follow these rules? Specifically
- what if the entries are not sorted by profitability? Is that really a problem?
- what if early entries have forward references?
- what if there's an expression in the sharing vector that is not used anywhere. Will that be type checked? Or perhaps it will just be garbage?
The second one can be mitigated by a simple check, and I see that indeed it might be a problem since it could lead to infinite expressions without using mutual blocks. However, the first one seems to be expensive to check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If entries aren't sorted, or aren't in topological order, then the constant is non-optimal or degenerate. That will result in a different hash for the expression, but might be fine semantically. Or it might result in exponential blow-up.
The practical implication is that if someone manually constructs a degenerate constant, they can roll a different hash. That doesn't affect proofs of typechecking at all, but can create confusion on proofs of evaluation, since now there can be multiple hashes for the normalized output.
That said, I don't think this is such a big deal, since the commitment feature does this already, so we can treat degenerate constants as a kind of ad-hoc commitment if the preimage is private. And if the pre-image is public its trivial to check that the constant is optimal by re-running the sharing analysis.
This suggests we might want to also be able to prove non-degeneracy for evaluation, and I think that's doable and a good idea, but not something that has to be an immediate priority
To your specific questions:
what if the entries are not sorted by profitability? Is that really a problem?
Shouldn't really be a problem other than rolling the hash
what if early entries have forward references?
could cause a blow-up on expansion, like a zip-bomb. We probably should have special detection for this
what if there's an expression in the sharing vector that is not used anywhere. Will that be type checked? Or perhaps it will just be garbage?
I think it's just garbage that rolls the hash and increases the size, otherwise neutral
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding my third question, I think indeed it should be treated as garbage. We can't even typecheck since there's no context to give meaning to its potential free variables. Plus, decompiling the constant will automatically remove these references
- Fix 219 clippy warnings across 21 Rust files (ptr_as_ptr,
unnecessary_unsafe, unused imports, redundant closures, etc.)
- Delete ixon_old.rs (2431 lines), IxonOld.lean (952 lines),
CompileMOld.lean (1161 lines), DecompileMOld.lean (704 lines)
- Migrate all dependents to new Ixon APIs: Store.lean, Serde.lean,
IxVM.lean, Prover.lean, Verifier.lean, mutual.rs, compile.rs
- Relocate ixon_legacy.rs FFI functions to proper modules:
sharing analysis → ixon/sharing.rs, env serde → ixon/serialize.rs,
compilation FFI → ffi/compile.rs
- Remove old-format FFI section from serialize.rs (~480 lines)
- Delete unused put_u32/get_u32 from metadata.rs
- Delete TODO:DELETEME block from address.rs
…n README Annotate all public types, fields, functions, and constants in the Ix pipeline (env, address, graph, condense, ground, mutual, store) with rustdoc and Lean doc comments. Add a Testing section to the README covering lake test suites and cargo test. Rename DecompileV2 to DecompileM.
Rewrite DecompileM to decompile Ixon format to Ix types (not Lean types), enabling cheap Blake3 hash comparison for roundtrip verification. Add arena-based metadata restoration (binder names, binder info, mdata), Syntax/DataValue blob deserialization, and chunked parallel decompilation. Extend Rust FFI to pass the full names table (1.3M entries including binder names and level params) through RawEnv, fixing lookup failures for non-constant-name addresses. Add standalone `lake test -- decompile` suite: 187K constants roundtrip with 0 errors and 0 mismatches.
arthurpaulino
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still reviewing the code, learning about the new Ixon. But I've already left some comments.
There's also #297, which avoids several unnecessary copies.
…e endpoints Unify CompileError, DecompileError, and SerializeError across Lean and Rust so constructor tag order matches exactly, enabling direct FFI roundtripping. Replace catch-all MissingName variant with specific error types (BadConstantFormat, BlobNotFound, BadBlobFormat). Change CompileError fields from Name/Expr to String for FFI compatibility and add UnknownUnivParam variant. Add new Rust FFI endpoints: rs_ser_env/rs_des_env for Ixon.Env serialization, and rs_decompile_env for full decompilation. Consolidate duplicated RawEnv→Env conversion into RawEnv.toEnv, include full names table for round-trip fidelity. Add roundtrip tests for all three error types (unit + property-based) and integration tests for Rust serialization and decompilation pipelines.
…nerators The staged diff covers 11 instances improved with real shrinking logic (sub-term extraction, index halving, array popping, variant simplification), plus the RawEnv fix for the missing names field. The Shrinkable Ix.Name instance was also moved earlier in Tests/Gen/Ix.lean so SyntaxPreresolved can delegate to it
The Lean compiler's `storeNameStrings` only stored string blobs but never populated a names map during compilation. Names were derived solely from `named` entries at aggregation time, producing ~239K names vs Rust's ~1.3M (which tracks every name encountered during compilation). Replace `storeNameStrings` with `compileName` that both stores string blobs and records each name component in a new `BlockState.blockNames` map. Add missing `compileName` calls for binder names, level params, MData keys, DataValue names, Syntax node kinds, ident values, and preresolved namespace/decl names. Merge per-block names during aggregation in both sequential and parallel compilation paths. Also fix `SerializeError.toString` to use `String.ofList` instead of the removed `List.asString`.
…ce checks Introduce Ix.Claim (claim types + serde) and Ix.Commit (compilation, commitment, and claim construction). The commitment pipeline compiles a definition, creates a blinded commitment via Comm, and includes a runtime alpha-invariance guard that recompiles under the commit name to verify the content address is stable. - Claim types: eval, check, reveal with tagged binary encoding (0xE4/E3/E6) - RevealConstantInfo: selective field revelation for all constant variants - compileDef: single-definition compilation with optional name parameter - commitDef: full pipeline with alpha-invariance recompilation check - openConstantInfo/openCommitment: extract revealed fields from compiled constants - Comm serde support in Ixon (tag 0xE5) on both Lean and Rust sides - Property-based and unit test suites for claims and commits - Migrate ZKVoting Prover/Verifier to new Commit API - Format byte counts with units (B/kB/MB) in compile test output
* cargo fmt * Clippy and fix Nix build * Move slow tests to separate binaries * chore: cargo-deny
- Add unified Tag4 reference table to docs/Ixon.md showing all flags 0x0-0xF
- Fix ExprMetaData serialization table (was completely wrong: Leaf/App missing,
tag numbers off by 2) and update to arena-based representation
- Add ConstantMeta serialization table (was missing entirely)
- Add module headers to CondenseM.lean (Tarjan's SCC algorithm),
Claim.lean (ZK claim types), and CanonM.lean (canonicalization)
- Document CondenseState fields and visit function with algorithm explanation
- Remove commented-out debug traces and dead counters in CanonM.lean
- Add doc comments to 14 public comparison/sorting functions in compile.rs
- Add module-level docs to src/ix.rs, nat.rs, sarray.rs, ctor.rs,
lean_env.rs, compile.rs (FFI), and proof.rs
- Fix duplicate comment and wrong range in expr.rs flag constants
- Replace panic!() with unreachable!() in FFI match arms (compile.rs)
- Delete empty todo.md
- Rename all mod.rs to modern x.rs convention (8 files)
…ming Test graph (9), ground (12), mutual (12), ixon::env (10), and strong_ordering (17) — 60 new tests covering the reference graph construction, groundedness checking, mutual block context indexing, Ixon environment CRUD, and comparison algebra. Also add compile phase timing to rs_tmp_decode_const_map.
Correctness: - Fix mutual block projection indexing for alpha-equivalent constants. Only push one representative per equivalence class into Muts array, matching the class-based indexing used by MutConst::ctx and projections. Previously, all constants were pushed flat, causing index misalignment when an equivalence class had >1 member. Fixed on both Rust and Lean sides. - Fix compare_const cross-kind ordering to use MutConstKind discriminant instead of always returning Lt. - Fix compare_const/compare_ctor cache key inversion by tracking swap state. - Fix check_decompile to return Err on mismatches instead of always Ok. - Fix CString::new().unwrap() panics by adding safe_cstring helper that filters null bytes instead of panicking. - Fix Box::leak memory leaks in decode_serialize_error by changing SerializeError fields from &'static str to String. - Fix buildSharingVec call sites in Tests/Ix/Sharing.lean after removing unused _topoOrder parameter. Performance: - Replace busy-wait polling (sleep 100us) with Condvar in compile_env work-stealing loop. - Eliminate double serialization in Lean compilation by adding blockBytes and blockAddr to BlockResult, computed once via BlockResult.mk'. - Deduplicate DefnInfo/ThmInfo/OpaqueInfo compilation into compile_single_def helper. Cleanup: - Remove debug leftovers: BLE' dbg_trace filter, ~90 lines of commented-out code in Common.lean, unused progress tracking variables. - Remove unused _topoOrder parameter from buildSharingVec (Lean). - Deduplicate u64ByteCount by referencing Ixon.u64ByteCount in CompileM. - Guard division by zero in SharingStats::Display. - Improve put_idx panic message with index size context. - Fix doc comment typo: Bytes1 -> Bytes2 in bytes2.rs. Tests: - Add test_compile_mutual_alpha_equivalent_defs (2 alpha-equiv defs → 1 Muts entry). - Add test_compile_mutual_alpha_equiv_with_different_third (2 classes → 2 Muts entries).
Address redundant field names, unnecessary let binding, needless pass-by-value, and rustfmt formatting across the compilation pipeline.
- Fix compare_const antisymmetry bug: cross-kind MutConst pairs always returned Lt regardless of order, violating total ordering. Now uses kind ordinals. Also fix cache key normalization that didn't flip results. - Rewrite unshare_expr from recursive to iterative stack-based traversal to prevent stack overflow on deep expressions. - Validate QuotKind deserialization tags instead of silently mapping invalid values to .ind. - Return CheckResult from check_decompile with match/mismatch/missing counts instead of always Ok(()). - Fix println to eprintln for diagnostic output in FFI compile. - Remove dead code: Tests/Ix.lean (empty suite), Tests/Ix/Common.lean, Tests/Ix/IR.lean, c_lean_cstr_to_nat C wrapper. - Clean up Tests/Main.lean commented-out entries, re-enable sharing-io. - Fix garbled comment in src/lean/ffi.rs. - Update README test suite documentation to match actual suites.
- Add ffi_io_guard to catch panics across extern "C" boundary and return IO errors instead of triggering undefined behavior - Return proper IO errors from compilation FFI instead of empty results - Add null-pointer guards on RustCompiledEnv FFI functions - Cap Vec pre-allocation during deserialization to prevent OOM from malformed input (capped_capacity helper) - Sort blobs, consts, named, comms, and names by address before serialization for deterministic output matching Lean - Serialize ReducibilityHints.regular as u32 instead of u64 to match its actual Lean type - Throw errors on missing name/parent addresses during deserialization instead of silently falling back to anonymous names - Rename RustCompiledEnv → RustBlockEnv to avoid confusion with the full compilation environment - Remove dead commented-out code and unused imports
Variable-length Tag0 encoding is 1 byte for typical heights (< 128) vs the previous fixed 4-byte u32 LE encoding.
- Remove all eprintln!/println! debug output from FFI functions (compile.rs, lean_env.rs) — errors propagated via return values - Make serialization fallible: Env::put, put_idx, metadata serialization now return Result instead of panicking on missing name indices - Fix unsafe memory access: use read_unaligned for u64 casts in ctor.rs and primitives.rs, add bounds check in rs_bytearray_to_u64_le - Replace get! with safe get? via Option.bind in CanonM.lean lookups - Add SAFETY comments to get! sites in CondenseM.lean - Strengthen test equality: ConstantInfo uses BEq (all fields), add BEq deriving; document size-only comparison limitations - Tighten test assertions: sharing tests, decompile success checks - Remove HashMap.find? shim, rename genString to genIxString, deduplicate genAddress, set panic=abort in Cargo profiles - Fix CompileM: revert incorrect find? → get? on Batteries.RBMap - Use correct error variant (UnknownUnivParam) for level comparison - Normalize compare_const cache to use ordered key pairs
Major rewrite of the Ixon serialization format optimized for space efficiency.
Now Ix + Lean in Lean environment compiles in 10.21 seconds with
serialized size of 1.6 GB
Key changes:
The old ixon module preserved as ixon_old.rs for migration reference.