Deep dive into LOOM's internal architecture, optimization pipeline, and implementation details.
- Overview
- Component Architecture
- Optimization Pipeline
- ISLE Integration
- Verification System
- Data Structures
- Performance Considerations
- Extension Points
LOOM is a WebAssembly optimizer built around three core principles:
- Correctness: Formal verification ensures optimizations preserve semantics
- Performance: Ultra-fast optimization (10-30 µs) with excellent results
- Maintainability: Declarative ISLE rules make optimizations easy to understand and extend
┌────────────────────────────────────────────────────────────────┐
│ User Input │
│ (WAT or WASM file) │
└──────────────────────┬─────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────┐
│ Parser Module │
│ ┌──────────────┐ ┌─────────────────┐ │
│ │ WAT Parser │──────▶ │ wasmparser │ │
│ │ (wat crate) │ │ Validation │ │
│ └──────────────┘ └─────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ AST Builder │ │
│ └──────────────┘ │
└──────────────────────────┬────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────┐
│ Internal Representation │
│ │
│ Module { │
│ types: Vec<FuncType>, │
│ functions: Vec<Function>, │
│ tables: Vec<Table>, │
│ memories: Vec<Memory>, │
│ globals: Vec<Global>, │
│ exports: Vec<Export>, │
│ } │
│ │
│ Function { │
│ signature: FuncType, │
│ locals: Vec<(count, ValueType)>, │
│ instructions: Vec<Instruction>, │
│ } │
└──────────────────────────┬────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────┐
│ 10-Phase CLI Optimization Pipeline │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Phase 1: Inline │ │
│ │ - Build call graph │ │
│ │ - Inline small, frequently-called functions │ │
│ │ - Substitute parameters with arguments │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Phase 2: Precompute │ │
│ │ - Global constant propagation │ │
│ │ - Replace immutable global.get with constants │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Phase 3: Constant Folding │ │
│ │ - Convert to ISLE terms │ │
│ │ - Apply pattern matching rules │ │
│ │ - Fold constants: 10 + 20 → 30 │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Phase 4: CSE (Common Subexpression Elimination) │ │
│ │ - Hash expressions for duplicate detection │ │
│ │ - Cache expensive computations in locals │ │
│ │ - Skip caching simple constants │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Phase 5: Advanced (Strength Reduction) │ │
│ │ - x * 8 → x << 3 (2-3x speedup) │ │
│ │ - x / 4 → x >> 2 (2-3x speedup) │ │
│ │ - x % 32 → x & 31 (2-3x speedup) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Phase 6: Branches │ │
│ │ - Simplify constant conditions │ │
│ │ - Remove redundant branches │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Phase 7: DCE (Dead Code Elimination) │ │
│ │ - Remove unreachable code after terminators │ │
│ │ - Clean up code following return/br/unreachable │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Phase 8: Merge-blocks │ │
│ │ - Merge consecutive blocks │ │
│ │ - Reduce control flow overhead │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Phase 9: Vacuum │ │
│ │ - Remove empty blocks │ │
│ │ - Final structural cleanup │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Phase 10: Simplify-locals │ │
│ │ - Eliminate unused local variables │ │
│ │ - Final cleanup pass │ │
│ └─────────────────────────────────────────────────────────┘ │
└──────────────────────────┬────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────┐
│ Verification (Optional) │
│ │
│ if --verify flag: │
│ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Z3 SMT Verification (if feature enabled) │ │
│ │ - Encode original function to SMT │ │
│ │ - Encode optimized function to SMT │ │
│ │ - Prove equivalence via SAT solver │ │
│ │ - Generate counterexample if failed │ │
│ └──────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────┐ │
│ │ ISLE Property-Based Verification │ │
│ │ - Check idempotence │ │
│ │ - Validate constant folding │ │
│ │ - Test algebraic properties │ │
│ └──────────────────────────────────────────────┘ │
└──────────────────────────┬────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────┐
│ Encoder Module │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ WAT Encoder (if --wat flag) │ │
│ │ - Convert AST to text format │ │
│ │ - Pretty-print with indentation │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ WASM Binary Encoder (default) │ │
│ │ - Encode to binary format │ │
│ │ - Ultra-fast: ~183 nanoseconds! │ │
│ └─────────────────────────────────────────────────────────┘ │
└──────────────────────────┬────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────┐
│ Output File │
│ (Optimized WASM/WAT) │
└────────────────────────────────────────────────────────────────┘
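The stages above can be condensed into a standalone sketch: a pipeline is just an ordered list of passes, each a function over an instruction list. The `Instr` type and the two toy passes below are simplified stand-ins for illustration, not LOOM's real IR or API.

```rust
// Simplified stand-in IR: not LOOM's real types, just enough to show the pipeline shape.
#[derive(Clone, Debug, PartialEq)]
enum Instr {
    I32Const(i32),
    I32Add,
    Nop,
}

// Each phase is a function from instruction list to instruction list.
type Pass = fn(Vec<Instr>) -> Vec<Instr>;

// Toy "constant folding" phase: folds adjacent `const a; const b; add`.
fn fold_constants(code: Vec<Instr>) -> Vec<Instr> {
    let mut out: Vec<Instr> = Vec::new();
    for instr in code {
        match (&instr, out.as_slice()) {
            (Instr::I32Add, [.., Instr::I32Const(_), Instr::I32Const(_)]) => {
                let b = match out.pop() { Some(Instr::I32Const(b)) => b, _ => unreachable!() };
                let a = match out.pop() { Some(Instr::I32Const(a)) => a, _ => unreachable!() };
                out.push(Instr::I32Const(a.wrapping_add(b)));
            }
            _ => out.push(instr),
        }
    }
    out
}

// Toy "vacuum" phase: drops Nops.
fn vacuum(code: Vec<Instr>) -> Vec<Instr> {
    code.into_iter().filter(|i| *i != Instr::Nop).collect()
}

// The pipeline is an ordered list of passes applied in sequence.
fn optimize(code: Vec<Instr>) -> Vec<Instr> {
    let pipeline: [Pass; 2] = [fold_constants, vacuum];
    pipeline.iter().fold(code, |c, pass| pass(c))
}

fn main() {
    let code = vec![Instr::I32Const(10), Instr::I32Const(20), Instr::I32Add, Instr::Nop];
    assert_eq!(optimize(code), vec![Instr::I32Const(30)]);
}
```

Because each phase is independent, reordering passes or adding a new one is a one-element change to the list — the property the 10-phase pipeline relies on.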
The core optimization library. Contains all optimization passes and the main pipeline.
Key Modules:
- parse: WAT/WASM parsing using the wat and wasmparser crates
- optimize: 10-phase optimization pipeline
- encode: WAT/WASM encoding using the wasmprinter and wasm-encoder crates
- verify: Z3 SMT-based formal verification
- component: WebAssembly Component Model support
- terms: Conversion between instructions and ISLE terms
File Structure:
loom-core/
├── src/
│ ├── lib.rs # Main module, optimization pipeline
│ └── verify.rs # Z3 verification module
├── tests/
│ └── optimization_tests.rs # Comprehensive test suite (20 tests)
└── benches/
└── optimization_benchmarks.rs # Criterion benchmarks
ISLE (Instruction Selection Lowering Expressions) integration. Defines term types and optimization rules.
Key Files:
- isle/wasm_terms.isle: Term type definitions (I32Add, I32Const, etc.)
- isle/wasm_opts.isle: Optimization rules (constant folding, algebraic simplification)
- src/lib.rs: Rust bindings and term conversion
Term System:
pub enum ValueData {
I32Const { val: Imm32 },
I32Add { lhs: Value, rhs: Value },
I32Sub { lhs: Value, rhs: Value },
LocalGet { idx: u32 },
LocalSet { idx: u32, value: Value },
// ... 40+ operations
}
Command-line interface. Provides user-friendly access to optimization and verification.
Commands:
loom optimize <input> [options]
--output, -o <file> # Output path
--wat # Output as WAT text
--stats # Show statistics
--verify # Run verification
The 10-phase pipeline is carefully ordered to maximize optimization opportunities:
Inlining runs first so that subsequent passes (constant folding, CSE, strength reduction) can optimise across the now-visible inlined code.
Problem: If CSE runs before constant folding:
i32.const 42
i32.const 42
i32.add
CSE would cache the constant 42 in a local:
i32.const 42
local.tee $0
local.get $0
i32.add
Now constant folding can't fire, because the pattern is no longer two adjacent constants!
Solution: Run constant folding first:
i32.const 42
i32.const 42
i32.add
→ i32.const 84 # Folded before CSE sees it
Constants might create optimization opportunities:
i32.const 2
i32.const 3
i32.add # Folds to: i32.const 5
i32.const 8
i32.mul # Now: 5 * 8 — both constants, so it folds again to 40
        # (with a non-constant operand, a power-of-2 multiply would
        #  instead be strength-reduced to a shift)
Replaces expensive operations with cheaper equivalents.
Implementation (optimize_advanced_instructions):
match (op, const_val) {
// Multiplication by power of 2 → Shift left
(Instruction::I32Mul, n) if n.is_power_of_two() => {
// x * 8 → x << 3
instructions.push(Instruction::I32Const(n.trailing_zeros() as i32));
instructions.push(Instruction::I32Shl);
}
// Division by power of 2 → Shift right
(Instruction::I32DivU, n) if n.is_power_of_two() => {
// x / 4 → x >> 2
instructions.push(Instruction::I32Const(n.trailing_zeros() as i32));
instructions.push(Instruction::I32ShrU);
}
// Modulo by power of 2 → Bitwise AND
(Instruction::I32RemU, n) if n.is_power_of_two() => {
// x % 32 → x & 31
instructions.push(Instruction::I32Const(n - 1));
instructions.push(Instruction::I32And);
}
    // Other operation/constant pairs are left unchanged
    _ => {}
}
Speedup: 2-3x for power-of-2 operations
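The identities behind these rewrites are easy to spot-check in plain Rust, independent of LOOM's IR. Note that the division and remainder forms rely on unsigned semantics, matching the i32.div_u and i32.rem_u cases above.

```rust
// Strength-reduction identities for power-of-two constants, checked on u32
// (WebAssembly's i32.div_u / i32.rem_u treat the bits as unsigned).
fn mul_pow2(x: u32, k: u32) -> u32 { x << k }             // x * 2^k
fn div_pow2(x: u32, k: u32) -> u32 { x >> k }             // x / 2^k (unsigned)
fn rem_pow2(x: u32, k: u32) -> u32 { x & ((1 << k) - 1) } // x % 2^k

fn main() {
    for &x in &[0u32, 1, 7, 42, 1_000_003, u32::MAX] {
        assert_eq!(x.wrapping_mul(8), mul_pow2(x, 3)); // x * 8  → x << 3
        assert_eq!(x / 4, div_pow2(x, 2));             // x / 4  → x >> 2
        assert_eq!(x % 32, rem_pow2(x, 5));            // x % 32 → x & 31
    }
}
```

The signed cases are deliberately absent: x / 4 → x >> 2 does not hold for negative signed values, which is why the implementation above matches only I32DivU and I32RemU.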
Caches duplicate expensive computations.
Implementation:
- Hash all expressions to detect duplicates
- Filter out simple constants (they should be folded, not cached)
- Allocate new locals for cached values
- Replace the first occurrence with local.tee $temp
- Replace duplicates with local.get $temp
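The duplicate-detection step can be sketched with a hash map keyed on expression windows. The types here (opcode strings, a fixed window size) are simplified stand-ins; LOOM hashes its real Instruction sequences.

```rust
use std::collections::HashMap;

// Detect repeated expression windows by hashing them.
// Returns (first occurrence, duplicate occurrence) index pairs —
// the candidates for the local.tee / local.get rewrite.
fn find_duplicate_windows(ops: &[&str], window: usize) -> Vec<(usize, usize)> {
    let mut seen: HashMap<Vec<&str>, usize> = HashMap::new();
    let mut duplicates = Vec::new();
    for start in 0..ops.len().saturating_sub(window - 1) {
        let key: Vec<&str> = ops[start..start + window].to_vec();
        match seen.get(&key) {
            Some(&first) => duplicates.push((first, start)),
            None => { seen.insert(key, start); }
        }
    }
    duplicates
}

fn main() {
    // local.get $x; i32.const 4; i32.mul — appears twice
    let ops = ["local.get", "i32.const", "i32.mul",
               "local.get", "i32.const", "i32.mul"];
    assert_eq!(find_duplicate_windows(&ops, 3), vec![(0, 3)]);
}
```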
Example:
# Before
local.get $x
i32.const 4
i32.mul
local.get $x
i32.const 4
i32.mul # Duplicate!
# After
local.get $x
i32.const 4
i32.mul
local.tee $temp
local.get $temp # Reuse
Key Design Decision: Don't cache constants!
let const_duplicates: Vec<_> = duplicates
.iter()
.filter(|(orig_pos, _dup_pos, _type)| {
// Skip simple constants - they're cheap and prevent constant folding
!matches!(func.instructions.get(*orig_pos),
Some(Instruction::I32Const(_)) | Some(Instruction::I64Const(_)))
})
    .collect();
Inlines small, frequently-called functions to reduce call overhead and enable cross-function optimizations.
Algorithm:
- Build call graph
- Identify inlining candidates (small functions, multiple call sites)
- For each call site:
- Substitute parameters with arguments
- Insert function body
- Update local indices
- Remove inlined functions if no remaining callers
Heuristics:
- Inline if: body_size < 20 instructions && call_count > 2
- Or: body_size < 5 instructions (always inline tiny functions)
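Those heuristics amount to a one-line predicate. The thresholds (20, 2, 5) come from the rules above; the function shape is illustrative, not LOOM's exact signature.

```rust
// Inlining decision based on the heuristics above: small and hot, or tiny.
fn should_inline(body_size: usize, call_count: usize) -> bool {
    (body_size < 20 && call_count > 2) || body_size < 5
}

fn main() {
    assert!(should_inline(3, 1));    // tiny function: always inline
    assert!(should_inline(15, 3));   // small and called often
    assert!(!should_inline(15, 2));  // small, but not hot enough
    assert!(!should_inline(50, 10)); // too large
}
```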
Instructions to Terms:
pub fn instructions_to_terms(instructions: &[Instruction]) -> Result<Vec<Value>> {
let mut stack: Vec<Value> = Vec::new();
for instr in instructions {
match instr {
Instruction::I32Const(val) => {
stack.push(iconst32(Imm32::from(*val)));
}
Instruction::I32Add => {
let rhs = stack.pop().ok_or(...)?;
let lhs = stack.pop().ok_or(...)?;
stack.push(iadd32(lhs, rhs));
}
// ...
}
}
Ok(stack)
}
Terms to Instructions:
pub fn terms_to_instructions(terms: &[Value]) -> Result<Vec<Instruction>> {
let mut instructions = Vec::new();
for term in terms {
term_to_instructions_recursive(term, &mut instructions)?;
}
Ok(instructions)
}
fn term_to_instructions_recursive(term: &Value, out: &mut Vec<Instruction>) -> Result<()> {
    match term.data() {
        ValueData::I32Const { val } => {
            out.push(Instruction::I32Const(val.value()));
        }
        ValueData::I32Add { lhs, rhs } => {
            // Emit operands first: WebAssembly is stack-based!
            term_to_instructions_recursive(lhs, out)?;
            term_to_instructions_recursive(rhs, out)?;
            out.push(Instruction::I32Add);
        }
        // ...
    }
    Ok(())
}
Constant Folding:
;; From wasm_opts.isle
(rule (simplify (I32Add (I32Const a) (I32Const b)))
(I32Const (iadd_imm32 a b)))
(rule (simplify (I32Mul (I32Const a) (I32Const b)))
      (I32Const (imul_imm32 a b)))
Algebraic Simplification:
;; x + 0 → x
(rule (simplify (I32Add x (I32Const (imm32_zero))))
x)
;; x * 1 → x
(rule (simplify (I32Mul x (I32Const (imm32_one))))
x)
;; x | 0 → x
(rule (simplify (I32Or x (I32Const (imm32_zero))))
      x)
Translation Validation Approach:
- Encode original function as SMT formula
- Encode optimized function as SMT formula
- Assert: original(inputs) ≠ optimized(inputs)
- Ask Z3: Is this SAT (satisfiable)?
- If UNSAT: Functions are equivalent! ✅
- If SAT: Found counterexample! ❌
- If UNKNOWN: Too complex or timeout
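Where Z3 is unavailable (or returns UNKNOWN), the same question can be approximated by differential testing: run the original and optimized versions on many inputs and compare outputs. This is a sketch with closures standing in for the two compiled functions, not LOOM's verifier; agreement over sampled inputs is evidence, not the proof the SMT query gives.

```rust
// Differential testing as a lightweight stand-in for SMT equivalence:
// evaluate both versions on many inputs; any mismatch is a counterexample.
fn find_counterexample<F, G>(original: F, optimized: G, inputs: &[i32]) -> Option<i32>
where
    F: Fn(i32) -> i32,
    G: Fn(i32) -> i32,
{
    inputs.iter().copied().find(|&x| original(x) != optimized(x))
}

fn main() {
    let inputs: Vec<i32> = (-1000..1000).chain([i32::MIN, i32::MAX]).collect();

    // x * 8 → x << 3: equivalent for all i32 under wrapping semantics
    let ok = find_counterexample(|x| x.wrapping_mul(8), |x| x.wrapping_shl(3), &inputs);
    assert_eq!(ok, None);

    // A wrong "optimization" (x * 8 → x << 2) is caught on the first input
    let bad = find_counterexample(|x| x.wrapping_mul(8), |x| x.wrapping_shl(2), &inputs);
    assert_eq!(bad, Some(-1000));
}
```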
Example Encoding:
fn encode_function_to_smt<'ctx>(ctx: &'ctx Context, func: &Function) -> Result<BV<'ctx>> {
let mut stack: Vec<BV<'ctx>> = Vec::new();
let mut locals: Vec<BV<'ctx>> = Vec::new();
// Parameters as symbolic inputs
for (idx, param_type) in func.signature.params.iter().enumerate() {
let width = match param_type {
ValueType::I32 => 32,
ValueType::I64 => 64,
_ => bail!("Unsupported type"),
};
let param = BV::new_const(ctx, format!("param{}", idx), width);
locals.push(param);
}
// Symbolically execute
for instr in &func.instructions {
match instr {
Instruction::I32Add => {
let rhs = stack.pop().unwrap();
let lhs = stack.pop().unwrap();
stack.push(lhs.bvadd(&rhs)); // SMT bit-vector addition
}
Instruction::I32Mul => {
let rhs = stack.pop().unwrap();
let lhs = stack.pop().unwrap();
stack.push(lhs.bvmul(&rhs)); // SMT bit-vector multiplication
}
// ...
}
}
Ok(stack.pop().unwrap())
}
Verification Query:
pub fn verify_optimization(original: &Module, optimized: &Module) -> Result<bool> {
let cfg = Config::new();
let ctx = Context::new(&cfg);
let solver = Solver::new(&ctx);
for (orig_func, opt_func) in original.functions.iter().zip(optimized.functions.iter()) {
let orig_formula = encode_function_to_smt(&ctx, orig_func)?;
let opt_formula = encode_function_to_smt(&ctx, opt_func)?;
// Assert: original ≠ optimized
solver.assert(&orig_formula._eq(&opt_formula).not());
match solver.check() {
SatResult::Unsat => continue, // Equivalent! ✅
SatResult::Sat => {
// Counterexample found!
let model = solver.get_model().expect("model available after SAT");
eprintln!("Counterexample: {}", model);
return Ok(false);
}
SatResult::Unknown => bail!("SMT solver timeout"),
}
}
Ok(true) // All functions verified!
}
pub struct Module {
pub types: Vec<FuncType>, // Function signatures
pub functions: Vec<Function>, // Function definitions
pub tables: Vec<Table>, // Indirect function tables
pub memories: Vec<Memory>, // Linear memory definitions
pub globals: Vec<Global>, // Global variables
pub exports: Vec<Export>, // Exported functions/memory
pub start: Option<u32>, // Start function
pub elements: Vec<Element>, // Table initializers
pub code: Vec<FunctionBody>, // Function bodies
pub data: Vec<Data>, // Data segments
}
pub enum Instruction {
// Constants
I32Const(i32),
I64Const(i64),
F32Const(f32),
F64Const(f64),
// Arithmetic
I32Add, I32Sub, I32Mul, I32DivS, I32DivU, I32RemS, I32RemU,
I64Add, I64Sub, I64Mul, I64DivS, I64DivU, I64RemS, I64RemU,
// Bitwise
I32And, I32Or, I32Xor, I32Shl, I32ShrS, I32ShrU, I32Rotl, I32Rotr,
I64And, I64Or, I64Xor, I64Shl, I64ShrS, I64ShrU, I64Rotl, I64Rotr,
// Comparison
I32Eqz, I32Eq, I32Ne, I32LtS, I32LtU, I32GtS, I32GtU, I32LeS, I32LeU, I32GeS, I32GeU,
I64Eqz, I64Eq, I64Ne, I64LtS, I64LtU, I64GtS, I64GtU, I64LeS, I64LeU, I64GeS, I64GeU,
// Local variables
LocalGet(u32),
LocalSet(u32),
LocalTee(u32),
// Global variables
GlobalGet(u32),
GlobalSet(u32),
// Memory
I32Load { offset: u32, align: u32 },
I64Load { offset: u32, align: u32 },
I32Store { offset: u32, align: u32 },
I64Store { offset: u32, align: u32 },
MemorySize,
MemoryGrow,
// Control flow
Block { block_type: BlockType, body: Vec<Instruction> },
Loop { block_type: BlockType, body: Vec<Instruction> },
If { block_type: BlockType, then_body: Vec<Instruction>, else_body: Vec<Instruction> },
Br(u32),
BrIf(u32),
BrTable { targets: Vec<u32>, default: u32 },
Return,
Call(u32),
CallIndirect { type_idx: u32 },
// Other
Drop,
Select,
Unreachable,
Nop,
End,
}
Benchmark Results: 10-30 µs for most modules
Key Optimizations:
- Minimal allocations: Reuse vectors where possible
- Single-pass parsing: No intermediate representations
- Lazy evaluation: Only convert to ISLE terms when needed
- Efficient encoding: wasm-encoder is extremely fast (183ns!)
- No file I/O in hot path: Read once, optimize in memory
Measured via profiling:
- Parsing: ~7 µs (fast!)
- ISLE conversion: ~2-3 µs per function
- Optimization passes: ~5-15 µs total
- Encoding: ~0.18 µs (183ns - negligible!)
Slowest phases:
- CSE (expression hashing)
- Function inlining (call graph analysis)
- Loop optimization (modified local tracking)
Tested on:
- Small modules (10 functions): <10 µs
- Medium modules (100 functions): ~50-100 µs
- Large modules (1000 functions): ~500 µs - 1ms
Linear scaling: O(n) where n = number of instructions
- Add to the pipeline (optimize_module in lib.rs):
pub fn optimize_module(module: &mut Module) -> Result<()> {
precompute(module)?;
// ... existing phases ...
my_new_optimization(module)?; // Add here!
vacuum(module)?;
Ok(())
}
- Implement the pass:
pub fn my_new_optimization(module: &mut Module) -> Result<()> {
for func in &mut module.functions {
func.instructions = optimize_instructions(&func.instructions);
}
Ok(())
}
fn optimize_instructions(instructions: &[Instruction]) -> Vec<Instruction> {
    // Your optimization logic here; returning the input unchanged is a valid no-op
    instructions.to_vec()
}
- Add rule to wasm_opts.isle:
;; New optimization: x - x → 0
(rule (simplify (I32Sub x x))
      (I32Const (imm32_zero)))
- Rebuild: the ISLE compiler regenerates the Rust code automatically
- Add to run_verification in loom-cli/src/main.rs:
fn run_verification(original: &Module, optimized: &Module) -> Result<()> {
// Z3 verification ...
// ISLE verification ...
// Your new check:
println!("🔍 Running custom verification...");
my_verification_check(original, optimized)?;
Ok(())
}
- More aggressive LICM:
  - Hoist pure arithmetic operations
  - Hoist global.get of immutable globals
  - Track memory dependencies
- Profile-guided optimization:
  - Collect runtime profiles
  - Inline hot functions
  - Specialize for common inputs
- SIMD optimizations:
  - Auto-vectorize loops
  - SIMD-specific patterns
- Inter-procedural analysis:
  - Global CSE across functions
  - Whole-program optimization
- Custom optimization passes:
  - User-defined ISLE rules
  - Plugin architecture
- Machine learning for optimization ordering:
  - Learn optimal phase ordering
  - Predict which optimizations will help
- Probabilistic verification:
  - Random testing with counterexample generation
  - Property-based testing framework
- Component Model integration:
  - Optimize at component boundaries
  - Cross-component inlining
For more information: