Instructions for AI agents working on this codebase.
Unirust is a distributed temporal entity resolution engine. The primary function is to ingest records from multiple source systems and cluster them into unified entities while respecting temporal constraints.
Every ingested record MUST go through entity resolution. Never skip or bypass these entry points:

- `linker.link_records_batch_parallel()` - batch linking with parallel extraction
- `partitioned.process_batch_optimized()` - optimized partition processing
- `partitioned.ingest_batch()` - distributed batch processing
Any optimization that skips entity resolution is incorrect and breaks the core value proposition.
- Unit tests: May use in-memory `Store::new()`
- Integration tests: Must use `PersistentStore`
- Examples: Must demonstrate sharded/distributed mode
- Benchmarks: Should test both modes but focus on persistent
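The unit-test convention above (in-memory store inside a `#[cfg(test)]` module in the same source file) can be sketched as follows. `Store::new()` is the constructor named in this doc; the `Store` body here is an invented stand-in, not the real implementation.

```rust
// Stand-in for the real in-memory store in store.rs.
struct Store {
    records: Vec<String>,
}

impl Store {
    fn new() -> Self {
        Store { records: Vec::new() }
    }
    fn insert(&mut self, record: &str) {
        self.records.push(record.to_string());
    }
    fn len(&self) -> usize {
        self.records.len()
    }
}

// Unit tests live next to the code, per the convention above.
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn insert_is_visible() {
        // In-memory store: fast, isolated, no disk involved.
        let mut store = Store::new();
        store.insert("rec-1");
        assert_eq!(store.len(), 1);
    }
}

fn main() {
    let mut store = Store::new();
    store.insert("rec-1");
    assert_eq!(store.len(), 1);
}
```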
- No JSON for data storage or WAL
- Use protobuf/bincode for serialization
- JSON is only acceptable for:
  - Ontology configuration files (external input)
  - Graph visualization exports (external output)
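As a rough illustration of why the data path avoids JSON: a fixed-width binary layout is smaller and cheaper to parse than JSON text. The `Record` fields and the hand-rolled encoding below are invented for this sketch; the codebase uses protobuf/bincode, not this code.

```rust
// Hypothetical record shape for illustration only.
struct Record {
    id: u64,
    valid_from: u64,
    valid_to: u64,
}

// Fixed-width binary layout: three u64s, 24 bytes, no parsing ambiguity.
fn to_binary(r: &Record) -> Vec<u8> {
    let mut buf = Vec::with_capacity(24);
    buf.extend_from_slice(&r.id.to_le_bytes());
    buf.extend_from_slice(&r.valid_from.to_le_bytes());
    buf.extend_from_slice(&r.valid_to.to_le_bytes());
    buf
}

// Equivalent JSON text: field names repeated per record, numbers as strings.
fn to_json(r: &Record) -> String {
    format!(
        "{{\"id\":{},\"valid_from\":{},\"valid_to\":{}}}",
        r.id, r.valid_from, r.valid_to
    )
}

fn main() {
    let r = Record { id: 42, valid_from: 1_700_000_000, valid_to: 1_800_000_000 };
    let bin = to_binary(&r);
    let json = to_json(&r);
    assert_eq!(bin.len(), 24);
    assert!(json.len() > bin.len()); // JSON pays for field names on every record
}
```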
```
src/
├── lib.rs            # Public API (Unirust struct)
├── linker.rs         # Core entity resolution
├── dsu.rs            # Disjoint Set Union
├── store.rs          # In-memory store
├── persistence.rs    # RocksDB store
├── distributed.rs    # gRPC services
├── partitioned.rs    # Parallel processing
├── ontology.rs       # Matching rules
├── conflicts.rs      # Conflict detection
└── bin/
    ├── unirust_router.rs    # Router binary
    ├── unirust_shard.rs     # Shard binary
    └── unirust_loadtest.rs  # Load testing
```
Ingest path:

1. `distributed.rs:ShardNode::ingest_records()` - gRPC entry
2. `distributed.rs:dispatch_ingest_partitioned()` - routes to partitioned processing
3. `partitioned.rs:ParallelPartitionedUnirust::ingest_batch_with_partitions()` - parallel partition dispatch
4. `partitioned.rs:Partition::process_batch_optimized()` - hot path: batch insert → parallel extract → sequential link
5. `linker.rs:link_records_batch_parallel()` - parallel key extraction, sequential DSU merges
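The chain above ends in "parallel key extraction, sequential DSU merges": extraction is pure per-record work, while merges mutate shared cluster state. A minimal sketch of that two-phase shape, with an invented `extract_key` and a plain map standing in for the real DSU:

```rust
use std::collections::HashMap;
use std::thread;

// Stand-in for the real key extraction in linker.rs.
fn extract_key(record: &str) -> String {
    record.to_lowercase()
}

fn main() {
    let batch = vec!["Alice", "ALICE", "Bob"];

    // Phase 1: parallel extraction (no shared mutable state).
    let handles: Vec<_> = batch
        .into_iter()
        .map(|r| thread::spawn(move || (r, extract_key(r))))
        .collect();
    let extracted: Vec<(&str, String)> =
        handles.into_iter().map(|h| h.join().unwrap()).collect();

    // Phase 2: sequential merges into shared state (here: key -> records).
    let mut clusters: HashMap<String, Vec<&str>> = HashMap::new();
    for (record, key) in extracted {
        clusters.entry(key).or_default().push(record);
    }

    assert_eq!(clusters["alice"], ["Alice", "ALICE"]);
    assert_eq!(clusters["bob"], ["Bob"]);
}
```

The design point is that only phase 1 scales with cores; keeping phase 2 sequential avoids lock contention on the DSU.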
Query path:

1. `distributed.rs:RouterService::query_entities()` - gRPC entry
2. `lib.rs:Unirust::query_master_entities()` - query execution
3. `query.rs` - query planning and execution
Unit tests:

- Located in each source file as `#[cfg(test)]` modules
- May use in-memory stores
- Fast, isolated tests

Integration tests:

- Located in the `tests/` directory
- Must use `PersistentStore` with `tempfile`
- Test distributed scenarios (router + shards)

Load tests:

- Use the `unirust_loadtest` binary
- Standard command: `./target/release/unirust_loadtest -r http://127.0.0.1:50060 -c 10000000 --streams 16 --batch 5000`
- Baseline with 5 shards, 10% overlap: ~410K rec/sec, ~12ms batch latency
After any change, verify performance with loadtest. Current baseline with 5 shards:
- ~410K records/second (10% overlap)
- ~12ms batch latency
Hot paths:

- `partitioned.rs:process_batch_optimized()` - batch insert + parallel extract + sequential link
- `linker.rs:link_records_batch_parallel()` - parallel extraction, sequential DSU
- `linker.rs:link_extracted_record()` - DSU merges with temporal guards
- `dsu.rs:find()` - path compression with root cache
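`dsu.rs:find()` is hot because path compression keeps trees flat across millions of merges. A minimal DSU sketch of that shape (the real implementation also keeps a root cache and more bookkeeping, not shown here):

```rust
// Minimal disjoint set union with path compression; an illustration of the
// dsu.rs:find() hot path, not the actual code.
struct Dsu {
    parent: Vec<usize>,
}

impl Dsu {
    fn new(n: usize) -> Self {
        Dsu { parent: (0..n).collect() }
    }

    fn find(&mut self, x: usize) -> usize {
        let parent = self.parent[x];
        if parent == x {
            return x;
        }
        let root = self.find(parent);
        self.parent[x] = root; // path compression: point x straight at its root
        root
    }

    fn union(&mut self, a: usize, b: usize) {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra != rb {
            self.parent[ra] = rb;
        }
    }
}

fn main() {
    let mut dsu = Dsu::new(4);
    dsu.union(0, 1);
    dsu.union(1, 2);
    assert_eq!(dsu.find(0), dsu.find(2)); // 0, 1, 2 share one cluster root
    assert_ne!(dsu.find(0), dsu.find(3)); // 3 stays its own cluster
}
```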
Avoid:

- Unnecessary cloning of large structures
- Lock contention in hot paths
- JSON serialization in data path
- Unbounded allocations
- Update ontology if new matching rules needed
- Add to `lib.rs` public API
- Add unit tests
- Add integration test in `tests/`
- Run `cargo test`, `cargo clippy`, `cargo fmt`
- Changes to `linker.rs` require careful review
- Must maintain temporal guard semantics
- Must not break cluster correctness
- Add regression tests for edge cases
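This doc does not spell out the temporal guard semantics in `linker.rs`; the core idea is that merges must respect record validity intervals. A hedged sketch using half-open intervals, where `merge_allowed` is a hypothetical name, not the real API:

```rust
// Half-open validity interval [start, end).
#[derive(Clone, Copy)]
struct Interval {
    start: u64,
    end: u64,
}

fn overlaps(a: Interval, b: Interval) -> bool {
    a.start < b.end && b.start < a.end
}

// Hypothetical guard: refuse a merge when two conflicting claims overlap in time.
fn merge_allowed(a: Interval, b: Interval, conflicting: bool) -> bool {
    !(conflicting && overlaps(a, b))
}

fn main() {
    let a = Interval { start: 0, end: 10 };
    let b = Interval { start: 5, end: 15 };
    let c = Interval { start: 10, end: 20 };
    assert!(overlaps(a, b));
    assert!(!overlaps(a, c)); // touching at 10 is not an overlap (half-open)
    assert!(!merge_allowed(a, b, true)); // conflicting overlapping claims: block
    assert!(merge_allowed(a, c, true)); // disjoint in time: allowed
}
```

Regression tests for guard changes should target exactly these boundary cases (touching endpoints, identical intervals, fully contained intervals).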
- Update `proto/unirust.proto`
- Regenerate with `cargo build`
- Implement in `distributed.rs`
- Add integration test
```bash
# Development
cargo test                       # Run all tests
cargo clippy --all-targets       # Lint
cargo fmt                        # Format

# Benchmarks
cargo bench --bench bench_quick  # Fast (~30s)
cargo bench --bench bench_micro  # Component benchmarks

# Start cluster (recommended)
SHARDS=5 ./scripts/cluster.sh start

# Load test (requires running cluster)
./target/release/unirust_loadtest \
  --router http://127.0.0.1:50060 \
  --count 10000000 \
  --streams 16 \
  --batch 5000

# Stop cluster
./scripts/cluster.sh stop
```

- Use `Result<T, UniError>` for fallible operations
- Prefer `&str` over `String` for parameters
- Use `#[inline]` for small hot functions
- Avoid `unwrap()` in library code
- Comments explain "why", code explains "what"
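A small sketch tying the style rules together. `UniError`'s real variants are not documented here, so the enum below is a placeholder:

```rust
// Placeholder for the real error type; variants are invented for this sketch.
#[derive(Debug)]
enum UniError {
    InvalidRecord(String),
}

// &str parameter, Result return, #[inline] on a small hot function;
// no unwrap() inside the library-style code itself.
#[inline]
fn parse_record_id(raw: &str) -> Result<u64, UniError> {
    raw.trim()
        .parse::<u64>()
        // why: keep the offending input so callers can report it
        .map_err(|_| UniError::InvalidRecord(raw.to_string()))
}

fn main() {
    // unwrap() is acceptable in binaries and tests, just not in library code.
    assert_eq!(parse_record_id(" 42 ").unwrap(), 42);
    assert!(parse_record_id("abc").is_err());
}
```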