Actionable guidelines for the iceberg-rust project, optimized for AI coding assistants.
- Project Architecture
- Deep vs Shallow Modules
- LSP-Based Codebase Navigation
- Functional Programming Patterns
- Trait Design Patterns
- Builder Pattern & Configuration
- Error Handling
- Module Organization
- Complexity Management
- Quick Reference: Decision Trees
- Critical Metrics
- Key Takeaways
Layered Design:
datafusion_iceberg (query engine integration)
↓
iceberg-rust (table operations, catalogs)
↓
iceberg-rust-spec (pure specification types)
Core Philosophy: Deep modules with simple interfaces (John Ousterhout's "A Philosophy of Software Design")
Deep Modules = Powerful functionality + Simple interface
- Best modules hide significant complexity behind clean APIs
- Goal: Minimize interface size relative to implementation size (1:10+ ratio ideal)
- Example: Catalog trait has ~20 methods (~410 interface lines) hiding 6 implementations with 5000+ lines (1:12 line ratio)
Shallow Modules to Avoid:
- Many small methods that just wrap other calls
- Interfaces that expose internal complexity
- Documentation longer than implementation
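A minimal sketch of what "deep" means in practice, assuming a hypothetical `SnapshotLog` type (not part of iceberg-rust): two public methods hide the storage layout, ordering, and range lookup. A shallow version would instead expose the map itself, or add one-line wrappers around each `BTreeMap` call.

```rust
use std::collections::BTreeMap;

/// Hypothetical deep module: a two-method interface over a sorted index.
#[derive(Debug, Default)]
pub struct SnapshotLog {
    // Internal detail: snapshot ids keyed by commit timestamp, kept sorted.
    entries: BTreeMap<i64, u64>,
}

impl SnapshotLog {
    /// Records that `snapshot_id` became current at `timestamp_ms`.
    pub fn record(&mut self, timestamp_ms: i64, snapshot_id: u64) {
        self.entries.insert(timestamp_ms, snapshot_id);
    }

    /// Returns the snapshot that was current at `timestamp_ms`, if any.
    pub fn snapshot_as_of(&self, timestamp_ms: i64) -> Option<u64> {
        // Latest entry whose timestamp is <= the requested one.
        self.entries
            .range(..=timestamp_ms)
            .next_back()
            .map(|(_, id)| *id)
    }
}
```

Callers never learn that a `BTreeMap` is involved, so the storage can change without touching the interface.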
IMPORTANT: When an LSP (Language Server Protocol) MCP server is available (such as rust-analyzer), ALWAYS prefer LSP tools over text-based search for code navigation and analysis.
Use LSP tools for:
- Finding definitions: `get_symbol_definitions` instead of grepping for function/type names
- Finding references: `get_symbol_references` instead of searching for usage
- Type information: `get_hover` for accurate type and documentation
- Code structure: `get_symbols` for understanding module organization
- Implementations: `get_implementations` for finding trait implementations
- Call hierarchy: `get_call_hierarchy` for understanding call relationships
- Diagnostics: `get_diagnostics` for compiler errors and warnings
- Completions: `get_completions` for valid code suggestions
Need to understand code structure? → get_symbols
Need to find where something is defined? → get_symbol_definitions
Need to find all usages? → get_symbol_references
Need to understand types? → get_hover
Need to find trait impls? → get_implementations
Searching for text/patterns? → Grep/text search
- Use Iterator Methods: `map`, `filter`, `flat_map`, `fold` over `for` loops
- Lazy When Possible: Return `impl Iterator` for large transformations
- Combinators: `ok_or`, `and_then`, `unwrap_or_default` for `Option`/`Result`
- Strategic collect(): Only use `.collect::<Vec<_>>()` when needed
- Chain Iterators: Use `.chain()` instead of extending vecs
Exceptions, where iterator chains are not the better choice:
- Complex state machines (use explicit loops)
- Performance-critical hot paths needing specific optimizations
- When mutation in place is clearer
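The iterator guidelines above could look roughly like this. `DataFile` and both functions are hypothetical stand-ins for illustration, not iceberg-rust types, and the `String` error is a simplification.

```rust
/// Hypothetical stand-in for a data file entry; not an iceberg-rust type.
#[derive(Debug, Clone)]
pub struct DataFile {
    pub path: String,
    pub record_count: Option<u64>,
}

/// Lazily yields the paths of non-empty files from two sources.
/// `.chain()` avoids building an intermediate Vec, and the caller
/// decides whether to collect.
pub fn non_empty_paths<'a>(
    existing: &'a [DataFile],
    added: &'a [DataFile],
) -> impl Iterator<Item = &'a str> + 'a {
    existing
        .iter()
        .chain(added.iter())
        // A missing count is treated as zero via `unwrap_or_default`.
        .filter(|f| f.record_count.unwrap_or_default() > 0)
        .map(|f| f.path.as_str())
}

/// Sums record counts with `try_fold`, stopping at the first file
/// that has no count.
pub fn total_records(files: &[DataFile]) -> Result<u64, String> {
    files.iter().try_fold(0u64, |acc, f| {
        f.record_count
            .ok_or_else(|| format!("missing record count for {}", f.path))
            .map(|n| acc + n)
    })
}
```

Returning `impl Iterator` keeps the transformation lazy; `.collect::<Vec<_>>()` appears only at the call site that actually needs a vector.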
Decision Tree:
Is it used by 3+ types? → YES → Consider trait
↓ NO
Does it hide significant complexity? → YES → Consider trait
↓ NO
Would From/Into/standard trait work? → YES → Use standard trait
↓ NO
→ Don't create trait, use generic functions or enum
- Prefer Standard Traits: Use `From`, `TryFrom`, `Into`, `Display`, `Debug`, `Error` over custom traits
- Interface/Implementation Ratio: Aim for 1:10+ lines (interface:implementation)
- Required Bounds: Always include `Send + Sync + Debug` for shared types
- Async Traits: Use `#[async_trait]` for I/O operations
- Arc Receivers: Use `Arc<Self>` for async trait methods needing shared ownership
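As a sketch of "standard traits first", the hypothetical `FileFormat` enum below (not the iceberg-rust definition) gets parsing and formatting from `TryFrom` and `Display` rather than from a custom trait. The `String` error type keeps the example std-only; real code would use a proper error enum. `assert_shared` is an illustrative helper for checking the `Send + Sync + Debug` bounds at compile time.

```rust
use std::convert::TryFrom;
use std::fmt;

/// Hypothetical enum for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FileFormat {
    Parquet,
    Avro,
    Orc,
}

impl fmt::Display for FileFormat {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let name = match self {
            FileFormat::Parquet => "parquet",
            FileFormat::Avro => "avro",
            FileFormat::Orc => "orc",
        };
        f.write_str(name)
    }
}

impl TryFrom<&str> for FileFormat {
    // Simplification: a String error instead of a dedicated error type.
    type Error = String;

    fn try_from(value: &str) -> Result<Self, Self::Error> {
        // Matching is case-insensitive for ASCII input.
        match value.to_ascii_lowercase().as_str() {
            "parquet" => Ok(FileFormat::Parquet),
            "avro" => Ok(FileFormat::Avro),
            "orc" => Ok(FileFormat::Orc),
            other => Err(format!("unsupported file format: {}", other)),
        }
    }
}

/// Compiles only if `T` satisfies the bounds required for shared types.
pub fn assert_shared<T: Send + Sync + fmt::Debug>() {}
```

Because these are standard traits, callers get `to_string()` and `try_into()` without learning any project-specific API.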
Every public trait method must document:
- Summary: One-line behavior description
- Arguments: Each parameter explained
- Returns: Success case
- Errors: All failure modes
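One possible shape for those four sections, using conventional rustdoc headings. `PropertyStore` and `InMemoryProperties` are hypothetical names invented for this sketch, and the `String` error is a placeholder.

```rust
use std::collections::HashMap;
use std::fmt::Debug;

/// Hypothetical trait showing the four documentation sections.
pub trait PropertyStore: Send + Sync + Debug {
    /// Sets a table property, returning the value it replaced.
    ///
    /// # Arguments
    /// * `key` - Property name; must be non-empty.
    /// * `value` - New value to store under `key`.
    ///
    /// # Returns
    /// The previous value for `key`, or `None` if it was unset.
    ///
    /// # Errors
    /// Returns an error message if `key` is empty.
    fn set_property(&mut self, key: &str, value: &str) -> Result<Option<String>, String>;
}

/// Minimal in-memory implementation used to exercise the trait.
#[derive(Debug, Default)]
pub struct InMemoryProperties {
    values: HashMap<String, String>,
}

impl PropertyStore for InMemoryProperties {
    fn set_property(&mut self, key: &str, value: &str) -> Result<Option<String>, String> {
        if key.is_empty() {
            return Err("property key must not be empty".to_string());
        }
        // `insert` returns the previous value, which is what the docs promise.
        Ok(self.values.insert(key.to_string(), value.to_string()))
    }
}
```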
Decision Tree:
5+ fields OR complex optional config OR needs async setup
→ derive_builder
Otherwise
→ Regular struct with new()
- Use derive_builder: Don't hand-roll builders
- Ergonomics: Use `setter(into)` for `String` and common types
- Optional is Explicit: Use `Option<T>` + `strip_option` for clarity
- Required Fields: No defaults - force user to provide
- Validation Separate: Use `TryInto` for validation logic
- Custom build(): Add async `build()` taking external dependencies
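The "Validation Separate" point can be sketched without the `derive_builder` crate, which is left out here so the example stays std-only. `RawWriteOptions`, `WriteOptions`, and the `"zstd"` fallback are all hypothetical; the raw struct stands in for whatever a generated builder would collect.

```rust
use std::convert::TryFrom;

/// Unvalidated options, as a builder might hand them over (hypothetical).
#[derive(Debug, Clone, Default)]
pub struct RawWriteOptions {
    pub target_file_size_mb: Option<u64>,
    pub compression: Option<String>,
}

/// Validated options; the conversion below is the intended way to build one.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct WriteOptions {
    target_file_size_mb: u64,
    compression: String,
}

impl TryFrom<RawWriteOptions> for WriteOptions {
    // Simplification: a String error instead of a dedicated error type.
    type Error = String;

    fn try_from(raw: RawWriteOptions) -> Result<Self, Self::Error> {
        // Required field: no silent default, the caller must provide it.
        let target_file_size_mb = raw
            .target_file_size_mb
            .ok_or_else(|| "target_file_size_mb is required".to_string())?;
        if target_file_size_mb == 0 {
            return Err("target_file_size_mb must be greater than zero".to_string());
        }
        // Optional field: an explicit, visible fallback when unset.
        let compression = raw.compression.unwrap_or_else(|| "zstd".to_string());
        Ok(WriteOptions {
            target_file_size_mb,
            compression,
        })
    }
}
```

All validation rules live in one `try_from`, so setters stay trivial and every construction path goes through the same checks.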
- Use thiserror: Always derive `Error` trait, never hand-roll
- Contextual Messages: Include what failed and why (column name + schema name)
- Transparent Wrapping: Use `#[from]` for automatic conversion from external errors
- Box Large Errors: `Box<T>` for errors >128 bytes to reduce enum size
- Bidirectional When Needed: Implement `From<Error>` for external types (e.g., `ArrowError`)
- Group Related Failures: One variant with parameter (e.g., `NotFound(String)`) vs many variants
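Why boxing matters can be checked with `std::mem::size_of`. The sketch below uses hypothetical types and omits the `thiserror` derive so that it compiles with the standard library alone; it shows the size effect only, not a complete error type.

```rust
use std::mem::size_of;

/// Hypothetical large payload carried by a single error variant.
#[derive(Debug)]
pub struct SchemaMismatch {
    pub expected: [u8; 256],
    pub found: [u8; 256],
}

/// Every value of this enum is at least as large as its biggest variant.
#[derive(Debug)]
pub enum UnboxedError {
    NotFound(String),
    SchemaMismatch(SchemaMismatch),
}

/// Boxing the large variant keeps the enum, and every `Result`
/// that carries it, small.
#[derive(Debug)]
pub enum BoxedError {
    NotFound(String),
    SchemaMismatch(Box<SchemaMismatch>),
}

/// Returns `(unboxed, boxed)` sizes in bytes for comparison.
pub fn error_sizes() -> (usize, usize) {
    (size_of::<UnboxedError>(), size_of::<BoxedError>())
}
```

The cost is one heap allocation on the error path, which is usually acceptable because errors are rare while `Result` values are moved around constantly.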
Dependency Graph:
datafusion_iceberg → iceberg-rust → iceberg-rust-spec
(DataFusion) (async, I/O) (pure data, serde)
- Spec layer: No external dependencies except serde/uuid
- Implementation layer: Adds object_store, async, catalogs
- Integration layer: Adds datafusion-specific code
- Is this type part of public API? → Re-export
- Is this complexity internal? → `pub(crate)` or private mod
- Can spec types be separate? → Move to iceberg-rust-spec
- Does this add dependencies? → Check layer appropriateness
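The first two questions can be illustrated with a compressed, single-file layout. The module and type names below are hypothetical and do not mirror the actual iceberg-rust module tree.

```rust
/// Hypothetical layout, squeezed into one file for illustration.
pub mod table {
    // Internal complexity: a private module, invisible outside `table`.
    mod parsing {
        pub(crate) fn split_identifier(raw: &str) -> Vec<String> {
            raw.split('.').map(str::to_string).collect()
        }
    }

    mod identifier {
        use super::parsing;

        #[derive(Debug, Clone, PartialEq, Eq)]
        pub struct TableIdentifier {
            pub namespace: Vec<String>,
            pub name: String,
        }

        impl TableIdentifier {
            /// Parses `namespace.parts.name`; returns `None` when the
            /// final segment (the table name) is empty.
            pub fn parse(raw: &str) -> Option<Self> {
                let mut parts = parsing::split_identifier(raw);
                let name = parts.pop()?;
                if name.is_empty() {
                    return None;
                }
                Some(TableIdentifier {
                    namespace: parts,
                    name,
                })
            }
        }
    }

    // Public API type: re-exported so callers never name `identifier`.
    pub use self::identifier::TableIdentifier;
}
```

Callers write `table::TableIdentifier`, and the `parsing` and `identifier` modules can be reorganized without breaking them.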
Ask these questions:
- Can it be hidden behind existing interface?
- Does it reduce complexity elsewhere?
- Is the interface/implementation ratio maintained?
- Can derive macros handle it? (Builder, Getters, Error)
External library error? → #[error(transparent)] + #[from]
Domain-specific error? → Custom variant with context
Multiple failure modes? → Single variant with String parameter
Size > 128 bytes? → Box<T>
I/O operation (network, disk)? → async
CPU-bound computation? → sync
Called from async context? → async
Part of Catalog/trait? → async
5+ fields OR complex optional config OR needs async setup
→ derive_builder
Otherwise
→ Regular struct with new()
These metrics indicate deep modules (desirable):
- Catalog trait: ~410 lines interface → ~5000+ lines across implementations (1:12 ratio) ✓
- Transaction: ~100 public lines → ~500 implementation lines (1:5 ratio) ✓
- Error handling: 107 lines covering all codebase errors (centralized) ✓
Target: Interface/Implementation ratio of 1:10+ for major abstractions
- Deep Modules: Hide complexity behind simple interfaces (Catalog trait: 20 methods, 6 implementations)
- Standard Traits First: Prefer `From`/`TryFrom`/`Error` over custom traits
- Builder Pattern: Use `derive_builder` for 5+ fields or complex config
- Functional Style: Iterator chains over loops, combinators over match
- Error Context: Use `thiserror` with contextual messages
- Async for I/O: All catalog/storage operations async with `Arc<Self>`
- Layered Architecture: Keep spec pure, implementation separate, integration isolated
- Document Everything: Public APIs need comprehensive docs
- Pull Complexity Down: Make user's life easy, hide complexity in implementation