Skip to content

Conversation

@johnchandlerburnham
Copy link
Member

Major rewrite of the Ixon serialization format optimized for space efficiency.
Now Ix + Lean in Lean environment compiles in 10.21 seconds with
serialized size of 1.6 GB

Key changes:

  • Split ixon module into separate files (constant, env, expr, univ, metadata, serialize, sharing, tag, comm, error)
  • New tag-based encoding (Tag0, Tag2, Tag4) for compact representation
  • Refs table: deduplicates constant references using indices instead of inline 32-byte addresses
  • Univs table: deduplicates universe terms per constant
  • Global name indexing: shares names across entire environment, metadata uses u64 indices instead of 32-byte addresses
  • Sharing analysis: identifies shared subexpressions within constants
  • Pre-order traversal indexing for expression metadata

The old ixon module preserved as ixon_old.rs for migration reference.

  Major rewrite of the Ixon serialization format optimized for space efficiency.
  Now Ix + Lean in Lean environment compiles in 10.21 seconds with
  serialized size of 1.6 GB

  Key changes:
  - Split ixon module into separate files (constant, env, expr, univ, metadata,
    serialize, sharing, tag, comm, error)
  - New tag-based encoding (Tag0, Tag2, Tag4) for compact representation
  - Refs table: deduplicates constant references using indices instead of
    inline 32-byte addresses
  - Univs table: deduplicates universe terms per constant
  - Global name indexing: shares names across entire environment, metadata
    uses u64 indices instead of 32-byte addresses
  - Sharing analysis: identifies shared subexpressions within constants
  - Pre-order traversal indexing for expression metadata

  The old ixon module preserved as ixon_old.rs for migration reference.
@johnchandlerburnham johnchandlerburnham changed the title WIP: Ixon serialization rewrite with 85% size reduction WIP: Ixon serialization rewrite with large speed and size reduction Jan 14, 2026
  Main fix: build_sharing_vec now re-sorts shared hashes in topological order
  (leaves first) instead of using gross-benefit order from decide_sharing.
  Previously, large compound expressions came first and couldn't reference
  their children via Share(idx), causing each entry to inline entire subtrees
  (30-35KB per entry instead of 3-13 bytes). This caused 1140x blow-up for
  pathological blocks like denote_blastDivSubtractShift_q.

  Also fixes hash-consed size calculation to include structural overhead for
  Inductive/Constructor/Recursor data structures, not just expressions. This
  makes the comparison fair - previously simple enums like SymbolKind showed
  Ixon as worse than hash-consing (1.91x) when it's actually 21x better.

  Results after fix:
  - Overall ratio: 0.071x (Ixon is 14x more efficient than hash-consing)
  - 0 blocks with ratio >= 1.0x (previously 4)
  - 11,722 blocks with 20x+ compression
  - Total serialized: 159MB vs 2.24GB hash-consed

  Other changes:
  - Remove early-break bug in decide_sharing that skipped profitable candidates
  - Add TRACK_HASH_CONSED_SIZE and ANALYZE_SHARING debug flags
  - Add analyze_sharing_stats() for debugging pathological cases
  - Add test_early_break_bug unit test
  - Add granular ratio distribution buckets in size analysis output
…tion,

  refactor FFI layer into focused modules, and migrate pipeline to Ix types

  Introduce a pure Lean compiler (CompileM) that produces bit-identical
  Ixon output to the existing Rust compiler, enabling cross-implementation
  verification. The compiler supports both sequential and wave-based
  parallel compilation with optional fail-fast mismatch detection against
  the Rust reference implementation.

  Lean-side additions:
  - Ix/CompileM.lean: ~2000-line pure Lean compiler with reader/state/except
    monad stack. Supports parallel compilation via wave-based scheduling with
    dedicated worker threads and channels.
  - Ix/DecompileV2.lean: Decompiler from Ixon format back to Lean expressions,
    resolving refs/univs tables and expanding sharing references.
  - Ix/ShardMap.lean: Concurrent hashmap with power-of-2 sharding, cache-line
    padding, double-checked locking, and bulk insert parallelism.
  - Ix/Sharing.lean: Pure Lean sharing analysis using alpha-invariant
    Merkle-tree hashing (blake3). Two-phase O(n) algorithm with DAG
    construction via post-order traversal and profitability heuristic.

  Lean-side migrations:
  - GraphM, CondenseM, Mutual: migrated from Lean.Name/Lean.Expr to
    Ix.Name/Ix.Expr throughout the compilation pipeline.
  - Ix/Ixon.lean: expanded with metadata types (ExprMetaData, ExprMetaArena,
    ConstantMeta, Named, Comm), full serialization for metadata, and
    array-based RawEnv types for FFI boundaries.
  - MutCtx definition moved from Common.lean to Environment.lean.

  Rust-side refactoring:
  - Split monolithic src/lean/ffi/ixon.rs (1959 lines) into ixon/ module
    with focused files: compare, constant, enums, env, expr, meta,
    serialize, sharing, univ.
  - Split src/lean/ffi/lean_env_builder.rs (2551 lines) into builder.rs,
    compile.rs, graph.rs, and ix/ submodule (address, constant, data, env,
    expr, level, name).
  - Added primitives.rs for basic Lean type FFI roundtrips.
  - Preserved legacy FFI in ixon_legacy.rs.
  - Updated compile.rs and decompile.rs for new metadata and canonicalization
    pipeline.

  Test infrastructure:
  - Reorganized test generators into Tests/Gen/{Basic,Ix,Ixon}.lean.
  - Reorganized FFI tests into Tests/FFI/{Basic,Ix,Ixon}.lean.
  - Added property-based roundtrip tests for all FFI types using SlimCheck.
  - Added cross-implementation serialization comparison tests (Lean vs Rust
    byte-level equality for univs, exprs, constants, and envs).
  - New domain tests: Tests/Ix/{CondenseM,GraphM,Sharing}.lean,
    Tests/ShardMap.lean.
  - Removed obsolete Tests/FFIConsistency.lean.

  Key design decisions:
  - Arena-based expression metadata separates semantic content (alpha-invariant
    constants) from non-semantic metadata (binder names, reducibility hints).
  - Array-based "Raw" types bridge the FFI boundary, avoiding hash function
    mismatches between Lean and Rust.
  - All serialized sections sorted by address for deterministic output.
  - Name indices use DFS topological sort with address-based tie-breaking.
The sharing vector is built incrementally:
- Each entry can only reference earlier entries (no forward references)
- Entries are sorted by profitability (most savings first)
- Root expressions are rewritten using all available share indices
Copy link
Member

@gabriel-barrett gabriel-barrett Feb 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if someone does not follow these rules? Specifically

  1. what if the entries are not sorted by profitability? Is that really a problem?
  2. what if early entries have forward references?
  3. what if there's an expression in the sharing vector that is not used anywhere. Will that be type checked? Or perhaps it will just be garbage?

The second one can be mitigated by a simple check, and I see that indeed it might be a problem since it could lead to infinite expressions without using mutual blocks. However, the first one seems to be expensive to check

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If entries aren't sorted, or aren't in topological order, then the constant is non-optimal or degenerate. That will result in a different hash for the expression, but might be fine semantically. Or it might result in exponential blow-up.

The practical implication is that if someone manually constructs a degenerate constant, they can roll a different hash. That doesn't affect proofs of typechecking at all, but can create confusion on proofs of evaluation, since now there can be multiple hashes for the normalized output.

That said, I don't think this is such a big deal, since the commitment feature does this already, so we can treat degenerate constants as a kind of ad-hoc commitment if the preimage is private. And if the pre-image is public its trivial to check that the constant is optimal by re-running the sharing analysis.

This suggests we might want to also be able to prove non-degeneracy for evaluation, and I think that's doable and a good idea, but not something that has to be an immediate priority

To your specific questions:

what if the entries are not sorted by profitability? Is that really a problem?

Shouldn't really be a problem other than rolling the hash

what if early entries have forward references?

could cause a blow-up on expansion, like a zip-bomb. We probably should have special detection for this

what if there's an expression in the sharing vector that is not used anywhere. Will that be type checked? Or perhaps it will just be garbage?

I think it's just garbage that rolls the hash and increases the size, otherwise neutral

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding my third question, I think indeed it should be treated as garbage. We can't even typecheck since there's no context to give meaning to its potential free variables. Plus, decompiling the constant will automatically remove these references

  - Fix 219 clippy warnings across 21 Rust files (ptr_as_ptr,
    unnecessary_unsafe, unused imports, redundant closures, etc.)
  - Delete ixon_old.rs (2431 lines), IxonOld.lean (952 lines),
    CompileMOld.lean (1161 lines), DecompileMOld.lean (704 lines)
  - Migrate all dependents to new Ixon APIs: Store.lean, Serde.lean,
    IxVM.lean, Prover.lean, Verifier.lean, mutual.rs, compile.rs
  - Relocate ixon_legacy.rs FFI functions to proper modules:
    sharing analysis → ixon/sharing.rs, env serde → ixon/serialize.rs,
    compilation FFI → ffi/compile.rs
  - Remove old-format FFI section from serialize.rs (~480 lines)
  - Delete unused put_u32/get_u32 from metadata.rs
  - Delete TODO:DELETEME block from address.rs
…n README

   Annotate all public types, fields, functions, and constants in the Ix
   pipeline (env, address, graph, condense, ground, mutual, store) with
   rustdoc and Lean doc comments. Add a Testing section to the README
   covering lake test suites and cargo test. Rename DecompileV2 to
   DecompileM.
   Rewrite DecompileM to decompile Ixon format to Ix types (not Lean types),
   enabling cheap Blake3 hash comparison for roundtrip verification. Add
   arena-based metadata restoration (binder names, binder info, mdata),
   Syntax/DataValue blob deserialization, and chunked parallel decompilation.

   Extend Rust FFI to pass the full names table (1.3M entries including
   binder names and level params) through RawEnv, fixing lookup failures
   for non-constant-name addresses.

   Add standalone `lake test -- decompile` suite: 187K constants roundtrip
   with 0 errors and 0 mismatches.
Copy link
Member

@arthurpaulino arthurpaulino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still reviewing the code, learning about the new Ixon. But I've already left some comments.
There's also #297, which avoids several unnecessary copies.

arthurpaulino and others added 10 commits February 2, 2026 23:35
…e endpoints

  Unify CompileError, DecompileError, and SerializeError across Lean and Rust
  so constructor tag order matches exactly, enabling direct FFI roundtripping.
  Replace catch-all MissingName variant with specific error types (BadConstantFormat,
  BlobNotFound, BadBlobFormat). Change CompileError fields from Name/Expr to String
  for FFI compatibility and add UnknownUnivParam variant.

  Add new Rust FFI endpoints: rs_ser_env/rs_des_env for Ixon.Env serialization,
  and rs_decompile_env for full decompilation. Consolidate duplicated RawEnv→Env
  conversion into RawEnv.toEnv, include full names table for round-trip fidelity.

  Add roundtrip tests for all three error types (unit + property-based) and
  integration tests for Rust serialization and decompilation pipelines.
…nerators

  The staged diff covers 11 instances improved with real shrinking logic (sub-term extraction, index halving, array popping,
  variant simplification), plus the RawEnv fix for the missing names field. The Shrinkable Ix.Name instance was also moved
  earlier in Tests/Gen/Ix.lean so SyntaxPreresolved can delegate to it
The Lean compiler's `storeNameStrings` only stored string blobs but
never populated a names map during compilation. Names were derived
solely from `named` entries at aggregation time, producing ~239K names
vs Rust's ~1.3M (which tracks every name encountered during compilation).

Replace `storeNameStrings` with `compileName` that both stores string
blobs and records each name component in a new `BlockState.blockNames`
map. Add missing `compileName` calls for binder names, level params,
MData keys, DataValue names, Syntax node kinds, ident values, and
preresolved namespace/decl names. Merge per-block names during
aggregation in both sequential and parallel compilation paths.

Also fix `SerializeError.toString` to use `String.ofList` instead of
the removed `List.asString`.
…ce checks

  Introduce Ix.Claim (claim types + serde) and Ix.Commit (compilation,
  commitment, and claim construction). The commitment pipeline compiles a
  definition, creates a blinded commitment via Comm, and includes a runtime
  alpha-invariance guard that recompiles under the commit name to verify the
  content address is stable.

  - Claim types: eval, check, reveal with tagged binary encoding (0xE4/E3/E6)
  - RevealConstantInfo: selective field revelation for all constant variants
  - compileDef: single-definition compilation with optional name parameter
  - commitDef: full pipeline with alpha-invariance recompilation check
  - openConstantInfo/openCommitment: extract revealed fields from compiled constants
  - Comm serde support in Ixon (tag 0xE5) on both Lean and Rust sides
  - Property-based and unit test suites for claims and commits
  - Migrate ZKVoting Prover/Verifier to new Commit API
  - Format byte counts with units (B/kB/MB) in compile test output
* cargo fmt

* Clippy and fix Nix build

* Move slow tests to separate binaries

* chore: cargo-deny
  - Add unified Tag4 reference table to docs/Ixon.md showing all flags 0x0-0xF
  - Fix ExprMetaData serialization table (was completely wrong: Leaf/App missing,
    tag numbers off by 2) and update to arena-based representation
  - Add ConstantMeta serialization table (was missing entirely)
  - Add module headers to CondenseM.lean (Tarjan's SCC algorithm),
    Claim.lean (ZK claim types), and CanonM.lean (canonicalization)
  - Document CondenseState fields and visit function with algorithm explanation
  - Remove commented-out debug traces and dead counters in CanonM.lean
  - Add doc comments to 14 public comparison/sorting functions in compile.rs
  - Add module-level docs to src/ix.rs, nat.rs, sarray.rs, ctor.rs,
    lean_env.rs, compile.rs (FFI), and proof.rs
  - Fix duplicate comment and wrong range in expr.rs flag constants
  - Replace panic!() with unreachable!() in FFI match arms (compile.rs)
  - Delete empty todo.md
  - Rename all mod.rs to modern x.rs convention (8 files)
…ming

  Test graph (9), ground (12), mutual (12), ixon::env (10), and
  strong_ordering (17) — 60 new tests covering the reference graph
  construction, groundedness checking, mutual block context indexing,
  Ixon environment CRUD, and comparison algebra. Also add compile
  phase timing to rs_tmp_decode_const_map.
@johnchandlerburnham johnchandlerburnham changed the title WIP: Ixon serialization rewrite with large speed and size reduction Ixon serialization rewrite with large speed and size reduction Feb 5, 2026
Correctness:
- Fix mutual block projection indexing for alpha-equivalent constants.
 Only push one representative per equivalence class into Muts array,
 matching the class-based indexing used by MutConst::ctx and projections.
 Previously, all constants were pushed flat, causing index misalignment
 when an equivalence class had >1 member. Fixed on both Rust and Lean sides.
- Fix compare_const cross-kind ordering to use MutConstKind discriminant
 instead of always returning Lt.
- Fix compare_const/compare_ctor cache key inversion by tracking swap state.
- Fix check_decompile to return Err on mismatches instead of always Ok.
- Fix CString::new().unwrap() panics by adding safe_cstring helper that
 filters null bytes instead of panicking.
- Fix Box::leak memory leaks in decode_serialize_error by changing
 SerializeError fields from &'static str to String.
- Fix buildSharingVec call sites in Tests/Ix/Sharing.lean after removing
 unused _topoOrder parameter.

Performance:
- Replace busy-wait polling (sleep 100us) with Condvar in compile_env
 work-stealing loop.
- Eliminate double serialization in Lean compilation by adding blockBytes
 and blockAddr to BlockResult, computed once via BlockResult.mk'.
- Deduplicate DefnInfo/ThmInfo/OpaqueInfo compilation into
 compile_single_def helper.

Cleanup:
- Remove debug leftovers: BLE' dbg_trace filter, ~90 lines of
 commented-out code in Common.lean, unused progress tracking variables.
- Remove unused _topoOrder parameter from buildSharingVec (Lean).
- Deduplicate u64ByteCount by referencing Ixon.u64ByteCount in CompileM.
- Guard division by zero in SharingStats::Display.
- Improve put_idx panic message with index size context.
- Fix doc comment typo: Bytes1 -> Bytes2 in bytes2.rs.

Tests:
- Add test_compile_mutual_alpha_equivalent_defs (2 alpha-equiv defs → 1 Muts entry).
- Add test_compile_mutual_alpha_equiv_with_different_third (2 classes → 2 Muts entries).
Address redundant field names, unnecessary let binding, needless
pass-by-value, and rustfmt formatting across the compilation pipeline.
arthurpaulino
arthurpaulino previously approved these changes Feb 6, 2026
- Fix compare_const antisymmetry bug: cross-kind MutConst pairs always
 returned Lt regardless of order, violating total ordering. Now uses
 kind ordinals. Also fix cache key normalization that didn't flip results.
- Rewrite unshare_expr from recursive to iterative stack-based traversal
 to prevent stack overflow on deep expressions.
- Validate QuotKind deserialization tags instead of silently mapping
 invalid values to .ind.
- Return CheckResult from check_decompile with match/mismatch/missing
 counts instead of always Ok(()).
- Fix println to eprintln for diagnostic output in FFI compile.
- Remove dead code: Tests/Ix.lean (empty suite), Tests/Ix/Common.lean,
 Tests/Ix/IR.lean, c_lean_cstr_to_nat C wrapper.
- Clean up Tests/Main.lean commented-out entries, re-enable sharing-io.
- Fix garbled comment in src/lean/ffi.rs.
- Update README test suite documentation to match actual suites.
- Add ffi_io_guard to catch panics across extern "C" boundary and
  return IO errors instead of triggering undefined behavior
- Return proper IO errors from compilation FFI instead of empty results
- Add null-pointer guards on RustCompiledEnv FFI functions
- Cap Vec pre-allocation during deserialization to prevent OOM from
  malformed input (capped_capacity helper)
- Sort blobs, consts, named, comms, and names by address before
  serialization for deterministic output matching Lean
- Serialize ReducibilityHints.regular as u32 instead of u64 to match
  its actual Lean type
- Throw errors on missing name/parent addresses during deserialization
  instead of silently falling back to anonymous names
- Rename RustCompiledEnv → RustBlockEnv to avoid confusion with the
  full compilation environment
- Remove dead commented-out code and unused imports
Variable-length Tag0 encoding is 1 byte for typical heights (< 128)
vs the previous fixed 4-byte u32 LE encoding.
- Remove all eprintln!/println! debug output from FFI functions
  (compile.rs, lean_env.rs) — errors propagated via return values
- Make serialization fallible: Env::put, put_idx, metadata serialization
  now return Result instead of panicking on missing name indices
- Fix unsafe memory access: use read_unaligned for u64 casts in
  ctor.rs and primitives.rs, add bounds check in rs_bytearray_to_u64_le
- Replace get! with safe get? via Option.bind in CanonM.lean lookups
- Add SAFETY comments to get! sites in CondenseM.lean
- Strengthen test equality: ConstantInfo uses BEq (all fields),
  add BEq deriving; document size-only comparison limitations
- Tighten test assertions: sharing tests, decompile success checks
- Remove HashMap.find? shim, rename genString to genIxString,
  deduplicate genAddress, set panic=abort in Cargo profiles
- Fix CompileM: revert incorrect find? → get? on Batteries.RBMap
- Use correct error variant (UnknownUnivParam) for level comparison
- Normalize compare_const cache to use ordered key pairs
@johnchandlerburnham johnchandlerburnham merged commit d8d7af5 into main Feb 6, 2026
19 checks passed
@johnchandlerburnham johnchandlerburnham deleted the jcb/lacon branch February 6, 2026 06:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants