Deep dive into LOOM's internal architecture, optimization pipeline, and implementation details.
- Overview
- Component Architecture
- Optimization Pipeline
- ISLE Integration
- Verification System
- Data Structures
- Performance Considerations
- Extension Points
LOOM is a WebAssembly optimizer built around three core principles:
- Correctness: Formal verification ensures optimizations preserve semantics
- Performance: Ultra-fast optimization (10-30 µs) with excellent results
- Maintainability: Declarative ISLE rules make optimizations easy to understand and extend
┌────────────────────────────────────────────────────────────────┐
│ User Input │
│ (WAT or WASM file) │
└──────────────────────┬─────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────┐
│ Parser Module │
│ ┌──────────────┐ ┌─────────────────┐ │
│ │ WAT Parser │──────▶ │ wasmparser │ │
│ │ (wat crate) │ │ Validation │ │
│ └──────────────┘ └─────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ AST Builder │ │
│ └──────────────┘ │
└──────────────────────────┬────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────┐
│ Internal Representation │
│ │
│ Module { │
│ types: Vec<FuncType>, │
│ functions: Vec<Function>, │
│ tables: Vec<Table>, │
│ memories: Vec<Memory>, │
│ globals: Vec<Global>, │
│ exports: Vec<Export>, │
│ } │
│ │
│ Function { │
│ signature: FuncType, │
│ locals: Vec<(count, ValueType)>, │
│ instructions: Vec<Instruction>, │
│ } │
└──────────────────────────┬────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────┐
│ 10-Phase CLI Optimization Pipeline │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Phase 1: Inline │ │
│ │ - Build call graph │ │
│ │ - Inline small, frequently-called functions │ │
│ │ - Substitute parameters with arguments │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Phase 2: Precompute │ │
│ │ - Global constant propagation │ │
│ │ - Replace immutable global.get with constants │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Phase 3: Constant Folding │ │
│ │ - Convert to ISLE terms │ │
│ │ - Apply pattern matching rules │ │
│ │ - Fold constants: 10 + 20 → 30 │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Phase 4: CSE (Common Subexpression Elimination) │ │
│ │ - Hash expressions for duplicate detection │ │
│ │ - Cache expensive computations in locals │ │
│ │ - Skip caching simple constants │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Phase 5: Advanced (Strength Reduction) │ │
│ │ - x * 8 → x << 3 (2-3x speedup) │ │
│ │ - x / 4 → x >> 2 (2-3x speedup) │ │
│ │ - x % 32 → x & 31 (2-3x speedup) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Phase 6: Branches │ │
│ │ - Simplify constant conditions │ │
│ │ - Remove redundant branches │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Phase 7: DCE (Dead Code Elimination) │ │
│ │ - Remove unreachable code after terminators │ │
│ │ - Clean up code following return/br/unreachable │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Phase 8: Merge-blocks │ │
│ │ - Merge consecutive blocks │ │
│ │ - Reduce control flow overhead │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Phase 9: Vacuum │ │
│ │ - Remove empty blocks │ │
│ │ - Final structural cleanup │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Phase 10: Simplify-locals │ │
│ │ - Eliminate unused local variables │ │
│ │ - Final cleanup pass │ │
│ └─────────────────────────────────────────────────────────┘ │
└──────────────────────────┬────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────┐
│ Verification (Optional) │
│ │
│ if --verify flag: │
│ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Z3 SMT Verification (if feature enabled) │ │
│ │ - Encode original function to SMT │ │
│ │ - Encode optimized function to SMT │ │
│ │ - Prove equivalence via SAT solver │ │
│ │ - Generate counterexample if failed │ │
│ └──────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────┐ │
│ │ ISLE Property-Based Verification │ │
│ │ - Check idempotence │ │
│ │ - Validate constant folding │ │
│ │ - Test algebraic properties │ │
│ └──────────────────────────────────────────────┘ │
└──────────────────────────┬────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────┐
│ Encoder Module │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ WAT Encoder (if --wat flag) │ │
│ │ - Convert AST to text format │ │
│ │ - Pretty-print with indentation │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ WASM Binary Encoder (default) │ │
│ │ - Encode to binary format │ │
│ │ - Ultra-fast: ~183 nanoseconds! │ │
│ └─────────────────────────────────────────────────────────┘ │
└──────────────────────────┬────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────┐
│ Output File │
│ (Optimized WASM/WAT) │
└────────────────────────────────────────────────────────────────┘
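The stages above can be condensed into a standalone sketch: a pipeline is just an ordered list of passes, each a function over an instruction list. The `Instr` type and the two toy passes below are simplified stand-ins for illustration, not LOOM's real IR or API.

```rust
// Simplified stand-in IR: not LOOM's real types, just enough to show the pipeline shape.
#[derive(Clone, Debug, PartialEq)]
enum Instr {
    I32Const(i32),
    I32Add,
    Nop,
}

// Each phase is a function from instruction list to instruction list.
type Pass = fn(Vec<Instr>) -> Vec<Instr>;

// Toy "constant folding" phase: folds adjacent `const a; const b; add`.
fn fold_constants(code: Vec<Instr>) -> Vec<Instr> {
    let mut out: Vec<Instr> = Vec::new();
    for instr in code {
        match (&instr, out.as_slice()) {
            (Instr::I32Add, [.., Instr::I32Const(_), Instr::I32Const(_)]) => {
                let b = match out.pop() { Some(Instr::I32Const(b)) => b, _ => unreachable!() };
                let a = match out.pop() { Some(Instr::I32Const(a)) => a, _ => unreachable!() };
                out.push(Instr::I32Const(a.wrapping_add(b)));
            }
            _ => out.push(instr),
        }
    }
    out
}

// Toy "vacuum" phase: drops Nops.
fn vacuum(code: Vec<Instr>) -> Vec<Instr> {
    code.into_iter().filter(|i| *i != Instr::Nop).collect()
}

// The pipeline is an ordered list of passes applied in sequence.
fn optimize(code: Vec<Instr>) -> Vec<Instr> {
    let pipeline: [Pass; 2] = [fold_constants, vacuum];
    pipeline.iter().fold(code, |c, pass| pass(c))
}

fn main() {
    let code = vec![Instr::I32Const(10), Instr::I32Const(20), Instr::I32Add, Instr::Nop];
    assert_eq!(optimize(code), vec![Instr::I32Const(30)]);
}
```

Because each phase is independent, reordering passes or adding a new one is a one-element change to the list — the property the 10-phase pipeline relies on.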
The core optimization library. Contains all optimization passes and the main pipeline.
Key Modules:
- parse: WAT/WASM parsing using the wat and wasmparser crates
- optimize: 10-phase optimization pipeline
- encode: WAT/WASM encoding using the wasmprinter and wasm-encoder crates
- verify: Z3 SMT-based formal verification
- component: WebAssembly Component Model support
- terms: Conversion between instructions and ISLE terms
File Structure:
loom-core/
├── src/
│ ├── lib.rs # Main module, optimization pipeline
│ └── verify.rs # Z3 verification module
├── tests/
│ └── optimization_tests.rs # Comprehensive test suite (20 tests)
└── benches/
└── optimization_benchmarks.rs # Criterion benchmarks
ISLE (Instruction Selection Lowering Expressions) integration. Defines term types and optimization rules.
Key Files:
- isle/wasm_terms.isle: Term type definitions (I32Add, I32Const, etc.)
- isle/wasm_opts.isle: Optimization rules (constant folding, algebraic simplification)
- src/lib.rs: Rust bindings and term conversion
Term System:
pub enum ValueData {
I32Const { val: Imm32 },
I32Add { lhs: Value, rhs: Value },
I32Sub { lhs: Value, rhs: Value },
LocalGet { idx: u32 },
LocalSet { idx: u32, value: Value },
// ... 40+ operations
}
Command-line interface. Provides user-friendly access to optimization and verification.
Commands:
loom optimize <input> [options]
--output, -o <file> # Output path
--wat # Output as WAT text
--stats # Show statistics
--verify # Run verification
The 10-phase pipeline is carefully ordered to maximize optimization opportunities:
Inlining runs first so that subsequent passes (constant folding, CSE, strength reduction) can optimise across the now-visible inlined code.
Problem: If CSE runs before constant folding:
i32.const 42
i32.const 42
i32.add
CSE would cache the constant 42 in a local:
i32.const 42
local.tee $0
local.get $0
i32.add
Now constant folding can't fire, because the pattern is no longer two adjacent constants!
Solution: Run constant folding first:
i32.const 42
i32.const 42
i32.add
→ i32.const 84 # Folded before CSE sees it
Constants might create optimization opportunities:
i32.const 2
i32.const 3
i32.add # Folds to: i32.const 5
i32.const 8
i32.mul # Now: 5 * 8 — both constants, so it folds again to 40
        # (with a non-constant operand, a power-of-2 multiply would
        #  instead be strength-reduced to a shift)
Replaces expensive operations with cheaper equivalents.
Implementation (optimize_advanced_instructions):
match (op, const_val) {
// Multiplication by power of 2 → Shift left
(Instruction::I32Mul, n) if n.is_power_of_two() => {
// x * 8 → x << 3
instructions.push(Instruction::I32Const(n.trailing_zeros() as i32));
instructions.push(Instruction::I32Shl);
}
// Division by power of 2 → Shift right
(Instruction::I32DivU, n) if n.is_power_of_two() => {
// x / 4 → x >> 2
instructions.push(Instruction::I32Const(n.trailing_zeros() as i32));
instructions.push(Instruction::I32ShrU);
}
// Modulo by power of 2 → Bitwise AND
(Instruction::I32RemU, n) if n.is_power_of_two() => {
// x % 32 → x & 31
instructions.push(Instruction::I32Const(n - 1));
instructions.push(Instruction::I32And);
}
    // Other operation/constant pairs are left unchanged
    _ => {}
}
Speedup: 2-3x for power-of-2 operations
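The identities behind these rewrites are easy to spot-check in plain Rust, independent of LOOM's IR. Note that the division and remainder forms rely on unsigned semantics, matching the i32.div_u and i32.rem_u cases above.

```rust
// Strength-reduction identities for power-of-two constants, checked on u32
// (WebAssembly's i32.div_u / i32.rem_u treat the bits as unsigned).
fn mul_pow2(x: u32, k: u32) -> u32 { x << k }             // x * 2^k
fn div_pow2(x: u32, k: u32) -> u32 { x >> k }             // x / 2^k (unsigned)
fn rem_pow2(x: u32, k: u32) -> u32 { x & ((1 << k) - 1) } // x % 2^k

fn main() {
    for &x in &[0u32, 1, 7, 42, 1_000_003, u32::MAX] {
        assert_eq!(x.wrapping_mul(8), mul_pow2(x, 3)); // x * 8  → x << 3
        assert_eq!(x / 4, div_pow2(x, 2));             // x / 4  → x >> 2
        assert_eq!(x % 32, rem_pow2(x, 5));            // x % 32 → x & 31
    }
}
```

The signed cases are deliberately absent: x / 4 → x >> 2 does not hold for negative signed values, which is why the implementation above matches only I32DivU and I32RemU.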
Caches duplicate expensive computations.
Implementation:
- Hash all expressions to detect duplicates
- Filter out simple constants (they should be folded, not cached)
- Allocate new locals for cached values
- Replace the first occurrence with local.tee $temp
- Replace duplicates with local.get $temp
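The duplicate-detection step can be sketched with a hash map keyed on expression windows. The types here (opcode strings, a fixed window size) are simplified stand-ins; LOOM hashes its real Instruction sequences.

```rust
use std::collections::HashMap;

// Detect repeated expression windows by hashing them.
// Returns (first occurrence, duplicate occurrence) index pairs —
// the candidates for the local.tee / local.get rewrite.
fn find_duplicate_windows(ops: &[&str], window: usize) -> Vec<(usize, usize)> {
    let mut seen: HashMap<Vec<&str>, usize> = HashMap::new();
    let mut duplicates = Vec::new();
    for start in 0..ops.len().saturating_sub(window - 1) {
        let key: Vec<&str> = ops[start..start + window].to_vec();
        match seen.get(&key) {
            Some(&first) => duplicates.push((first, start)),
            None => { seen.insert(key, start); }
        }
    }
    duplicates
}

fn main() {
    // local.get $x; i32.const 4; i32.mul — appears twice
    let ops = ["local.get", "i32.const", "i32.mul",
               "local.get", "i32.const", "i32.mul"];
    assert_eq!(find_duplicate_windows(&ops, 3), vec![(0, 3)]);
}
```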
Example:
# Before
local.get $x
i32.const 4
i32.mul
local.get $x
i32.const 4
i32.mul # Duplicate!
# After
local.get $x
i32.const 4
i32.mul
local.tee $temp
local.get $temp # Reuse
Key Design Decision: Don't cache constants!
let const_duplicates: Vec<_> = duplicates
.iter()
.filter(|(orig_pos, _dup_pos, _type)| {
// Skip simple constants - they're cheap and prevent constant folding
!matches!(func.instructions.get(*orig_pos),
Some(Instruction::I32Const(_)) | Some(Instruction::I64Const(_)))
})
    .collect();
Inlines small, frequently-called functions to reduce call overhead and enable cross-function optimizations.
Algorithm:
- Build call graph
- Identify inlining candidates (small functions, multiple call sites)
- For each call site:
- Substitute parameters with arguments
- Insert function body
- Update local indices
- Remove inlined functions if no remaining callers
Heuristics:
- Inline if: body_size < 20 instructions && call_count > 2
- Or: body_size < 5 instructions (always inline tiny functions)
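Those heuristics amount to a one-line predicate. The thresholds (20, 2, 5) come from the rules above; the function shape is illustrative, not LOOM's exact signature.

```rust
// Inlining decision based on the heuristics above: small and hot, or tiny.
fn should_inline(body_size: usize, call_count: usize) -> bool {
    (body_size < 20 && call_count > 2) || body_size < 5
}

fn main() {
    assert!(should_inline(3, 1));    // tiny function: always inline
    assert!(should_inline(15, 3));   // small and called often
    assert!(!should_inline(15, 2));  // small, but not hot enough
    assert!(!should_inline(50, 10)); // too large
}
```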
Instructions to Terms:
pub fn instructions_to_terms(instructions: &[Instruction]) -> Result<Vec<Value>> {
let mut stack: Vec<Value> = Vec::new();
for instr in instructions {
match instr {
Instruction::I32Const(val) => {
stack.push(iconst32(Imm32::from(*val)));
}
Instruction::I32Add => {
let rhs = stack.pop().ok_or(...)?;
let lhs = stack.pop().ok_or(...)?;
stack.push(iadd32(lhs, rhs));
}
// ...
}
}
Ok(stack)
}
Terms to Instructions:
pub fn terms_to_instructions(terms: &[Value]) -> Result<Vec<Instruction>> {
let mut instructions = Vec::new();
for term in terms {
term_to_instructions_recursive(term, &mut instructions)?;
}
Ok(instructions)
}
fn term_to_instructions_recursive(term: &Value, out: &mut Vec<Instruction>) -> Result<()> {
    match term.data() {
        ValueData::I32Const { val } => {
            out.push(Instruction::I32Const(val.value()));
        }
        ValueData::I32Add { lhs, rhs } => {
            // Emit operands first: WebAssembly is stack-based!
            term_to_instructions_recursive(lhs, out)?;
            term_to_instructions_recursive(rhs, out)?;
            out.push(Instruction::I32Add);
        }
        // ...
    }
    Ok(())
}
Constant Folding:
;; From wasm_opts.isle
(rule (simplify (I32Add (I32Const a) (I32Const b)))
(I32Const (iadd_imm32 a b)))
(rule (simplify (I32Mul (I32Const a) (I32Const b)))
      (I32Const (imul_imm32 a b)))
Algebraic Simplification:
;; x + 0 → x
(rule (simplify (I32Add x (I32Const (imm32_zero))))
x)
;; x * 1 → x
(rule (simplify (I32Mul x (I32Const (imm32_one))))
x)
;; x | 0 → x
(rule (simplify (I32Or x (I32Const (imm32_zero))))
      x)
Translation Validation Approach:
- Encode original function as SMT formula
- Encode optimized function as SMT formula
- Assert: original(inputs) ≠ optimized(inputs)
- Ask Z3: Is this SAT (satisfiable)?
- If UNSAT: Functions are equivalent! ✅
- If SAT: Found counterexample! ❌
- If UNKNOWN: Too complex or timeout
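Where Z3 is unavailable (or returns UNKNOWN), the same question can be approximated by differential testing: run the original and optimized versions on many inputs and compare outputs. This is a sketch with closures standing in for the two compiled functions, not LOOM's verifier; agreement over sampled inputs is evidence, not the proof the SMT query gives.

```rust
// Differential testing as a lightweight stand-in for SMT equivalence:
// evaluate both versions on many inputs; any mismatch is a counterexample.
fn find_counterexample<F, G>(original: F, optimized: G, inputs: &[i32]) -> Option<i32>
where
    F: Fn(i32) -> i32,
    G: Fn(i32) -> i32,
{
    inputs.iter().copied().find(|&x| original(x) != optimized(x))
}

fn main() {
    let inputs: Vec<i32> = (-1000..1000).chain([i32::MIN, i32::MAX]).collect();

    // x * 8 → x << 3: equivalent for all i32 under wrapping semantics
    let ok = find_counterexample(|x| x.wrapping_mul(8), |x| x.wrapping_shl(3), &inputs);
    assert_eq!(ok, None);

    // A wrong "optimization" (x * 8 → x << 2) is caught on the first input
    let bad = find_counterexample(|x| x.wrapping_mul(8), |x| x.wrapping_shl(2), &inputs);
    assert_eq!(bad, Some(-1000));
}
```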
Example Encoding:
fn encode_function_to_smt<'ctx>(ctx: &'ctx Context, func: &Function) -> Result<BV<'ctx>> {
let mut stack: Vec<BV<'ctx>> = Vec::new();
let mut locals: Vec<BV<'ctx>> = Vec::new();
// Parameters as symbolic inputs
for (idx, param_type) in func.signature.params.iter().enumerate() {
let width = match param_type {
ValueType::I32 => 32,
ValueType::I64 => 64,
_ => bail!("Unsupported type"),
};
let param = BV::new_const(ctx, format!("param{}", idx), width);
locals.push(param);
}
// Symbolically execute
for instr in &func.instructions {
match instr {
Instruction::I32Add => {
let rhs = stack.pop().unwrap();
let lhs = stack.pop().unwrap();
stack.push(lhs.bvadd(&rhs)); // SMT bit-vector addition
}
Instruction::I32Mul => {
let rhs = stack.pop().unwrap();
let lhs = stack.pop().unwrap();
stack.push(lhs.bvmul(&rhs)); // SMT bit-vector multiplication
}
// ...
}
}
Ok(stack.pop().unwrap())
}
Verification Query:
pub fn verify_optimization(original: &Module, optimized: &Module) -> Result<bool> {
let cfg = Config::new();
let ctx = Context::new(&cfg);
let solver = Solver::new(&ctx);
for (orig_func, opt_func) in original.functions.iter().zip(optimized.functions.iter()) {
let orig_formula = encode_function_to_smt(&ctx, orig_func)?;
let opt_formula = encode_function_to_smt(&ctx, opt_func)?;
// Assert: original ≠ optimized
solver.assert(&orig_formula._eq(&opt_formula).not());
match solver.check() {
SatResult::Unsat => continue, // Equivalent! ✅
SatResult::Sat => {
// Counterexample found!
let model = solver.get_model().expect("model available after SAT");
eprintln!("Counterexample: {}", model);
return Ok(false);
}
SatResult::Unknown => bail!("SMT solver timeout"),
}
}
Ok(true) // All functions verified!
}
pub struct Module {
pub types: Vec<FuncType>, // Function signatures
pub functions: Vec<Function>, // Function definitions
pub tables: Vec<Table>, // Indirect function tables
pub memories: Vec<Memory>, // Linear memory definitions
pub globals: Vec<Global>, // Global variables
pub exports: Vec<Export>, // Exported functions/memory
pub start: Option<u32>, // Start function
pub elements: Vec<Element>, // Table initializers
pub code: Vec<FunctionBody>, // Function bodies
pub data: Vec<Data>, // Data segments
}
pub enum Instruction {
// Constants
I32Const(i32),
I64Const(i64),
F32Const(f32),
F64Const(f64),
// Arithmetic
I32Add, I32Sub, I32Mul, I32DivS, I32DivU, I32RemS, I32RemU,
I64Add, I64Sub, I64Mul, I64DivS, I64DivU, I64RemS, I64RemU,
// Bitwise
I32And, I32Or, I32Xor, I32Shl, I32ShrS, I32ShrU, I32Rotl, I32Rotr,
I64And, I64Or, I64Xor, I64Shl, I64ShrS, I64ShrU, I64Rotl, I64Rotr,
// Comparison
I32Eqz, I32Eq, I32Ne, I32LtS, I32LtU, I32GtS, I32GtU, I32LeS, I32LeU, I32GeS, I32GeU,
I64Eqz, I64Eq, I64Ne, I64LtS, I64LtU, I64GtS, I64GtU, I64LeS, I64LeU, I64GeS, I64GeU,
// Local variables
LocalGet(u32),
LocalSet(u32),
LocalTee(u32),
// Global variables
GlobalGet(u32),
GlobalSet(u32),
// Memory
I32Load { offset: u32, align: u32 },
I64Load { offset: u32, align: u32 },
I32Store { offset: u32, align: u32 },
I64Store { offset: u32, align: u32 },
MemorySize,
MemoryGrow,
// Control flow
Block { block_type: BlockType, body: Vec<Instruction> },
Loop { block_type: BlockType, body: Vec<Instruction> },
If { block_type: BlockType, then_body: Vec<Instruction>, else_body: Vec<Instruction> },
Br(u32),
BrIf(u32),
BrTable { targets: Vec<u32>, default: u32 },
Return,
Call(u32),
CallIndirect { type_idx: u32 },
// Other
Drop,
Select,
Unreachable,
Nop,
End,
}
Benchmark Results: 10-30 µs for most modules
Key Optimizations:
- Minimal allocations: Reuse vectors where possible
- Single-pass parsing: No intermediate representations
- Lazy evaluation: Only convert to ISLE terms when needed
- Efficient encoding: wasm-encoder is extremely fast (183ns!)
- No file I/O in hot path: Read once, optimize in memory
Measured via profiling:
- Parsing: ~7 µs (fast!)
- ISLE conversion: ~2-3 µs per function
- Optimization passes: ~5-15 µs total
- Encoding: ~0.18 µs (183ns - negligible!)
Slowest phases:
- CSE (expression hashing)
- Function inlining (call graph analysis)
- Loop optimization (modified local tracking)
Tested on:
- Small modules (10 functions): <10 µs
- Medium modules (100 functions): ~50-100 µs
- Large modules (1000 functions): ~500 µs - 1ms
Linear scaling: O(n) where n = number of instructions
- Add to the pipeline (optimize_module in lib.rs):
pub fn optimize_module(module: &mut Module) -> Result<()> {
precompute(module)?;
// ... existing phases ...
my_new_optimization(module)?; // Add here!
vacuum(module)?;
Ok(())
}
- Implement the pass:
pub fn my_new_optimization(module: &mut Module) -> Result<()> {
for func in &mut module.functions {
func.instructions = optimize_instructions(&func.instructions);
}
Ok(())
}
fn optimize_instructions(instructions: &[Instruction]) -> Vec<Instruction> {
    // Your optimization logic here; returning the input unchanged is a valid no-op
    instructions.to_vec()
}
- Add rule to wasm_opts.isle:
;; New optimization: x - x → 0
(rule (simplify (I32Sub x x))
      (I32Const (imm32_zero)))
- Rebuild: the ISLE compiler regenerates the Rust code automatically
- Add to run_verification in loom-cli/src/main.rs:
fn run_verification(original: &Module, optimized: &Module) -> Result<()> {
// Z3 verification ...
// ISLE verification ...
// Your new check:
println!("🔍 Running custom verification...");
my_verification_check(original, optimized)?;
Ok(())
}
- More aggressive LICM:
  - Hoist pure arithmetic operations
  - Hoist global.get of immutable globals
  - Track memory dependencies
- Profile-guided optimization:
  - Collect runtime profiles
  - Inline hot functions
  - Specialize for common inputs
- SIMD optimizations:
  - Auto-vectorize loops
  - SIMD-specific patterns
- Inter-procedural analysis:
  - Global CSE across functions
  - Whole-program optimization
- Custom optimization passes:
  - User-defined ISLE rules
  - Plugin architecture
- Machine learning for optimization ordering:
  - Learn optimal phase ordering
  - Predict which optimizations will help
- Probabilistic verification:
  - Random testing with counterexample generation
  - Property-based testing framework
- Component Model integration:
  - Optimize at component boundaries
  - Cross-component inlining
For more information: