KaflowSQL Performance: Sub-millisecond joins, 100K+ events/second, minimal operational overhead
| Scenario | Events/Second | Join Hit Rate | Latency P99 | Memory Usage |
|---|---|---|---|---|
| Simple Enrichment | 150K+ | 95% | < 1ms | 500MB |
| Multi-topic Join | 100K+ | 85% | < 2ms | 1.2GB |
| High-dimensional Join | 80K+ | 90% | < 3ms | 2.5GB |
| IoT High-Volume | 250K+ | 75% | < 0.5ms | 800MB |
Tested on a Mac M3 Max (8-core CPU, 32GB RAM); production servers typically perform 20-30% better.
```
# Transaction enrichment pipeline
Volume: 500K transactions/minute
Join Rate: 98% (user profiles + merchant data)
P99 Latency: < 2ms
Memory: 1.8GB (10M user profiles, 50K merchants)
Deployment: Single container per region
Recovery Time: < 30 seconds
```

```
# Real-time personalization
Volume: 1M+ page views/hour
Join Rate: 92% (user profiles + product catalog)
P99 Latency: < 1.5ms
Memory: 2.2GB (5M users, 100K products)
State Size: 15GB on disk (RocksDB)
Cache Hit Rate: 94%
```

```
# Sensor data enrichment
Volume: 2M+ sensor readings/hour
Join Rate: 88% (device metadata + config)
P99 Latency: < 0.8ms
Memory: 600MB (50K devices)
TTL Strategy: 5-minute rolling window
Throughput Scaling: Linear up to 8 cores
```

Per-event latency breaks down as follows:

```
Total Processing Time (P99): < 2ms
├── Kafka Consume: 0.2ms
├── Deserialization: 0.1ms
├── Hash Lookup: 0.05ms
├── Join Processing: 0.3ms
├── SQL Execution: 0.5ms
├── Serialization: 0.15ms
└── Kafka Produce: 0.7ms
```
Memory usage breaks down as:

- Hot State (Hash Tables): 60% of total memory
- RocksDB Cache: 25% of total memory
- Processing Buffers: 10% of total memory
- Go Runtime: 5% of total memory
Throughput scales with CPU cores:

```
1 CPU Core:   ~30K events/second
2 CPU Cores:  ~55K events/second
4 CPU Cores:  ~100K events/second
8 CPU Cores:  ~180K events/second
16 CPU Cores: ~250K events/second (diminishing returns)
```
KaflowSQL uses 256 internal shards for concurrent processing:
```go
// Simplified sharding logic
shard := xxhash64(joinKey) % 256
processor := shardProcessors[shard]
```

Benefits (see the sketch after this list):
- Parallel Processing: 256 concurrent join operations
- Memory Locality: Related data co-located in same shard
- Lock-Free Design: Per-shard processing eliminates contention
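
A minimal, runnable sketch of this dispatch pattern; the `Event` type, channel sizes, and worker setup are illustrative assumptions, not KaflowSQL's internals:

```go
package main

import (
	"fmt"

	"github.com/cespare/xxhash/v2"
)

const numShards = 256

// Event stands in for KaflowSQL's internal event type (assumed).
type Event struct {
	JoinKey string
	Payload map[string]any
}

func main() {
	// One buffered channel per shard, each drained by its own goroutine:
	// events sharing a join key always land on the same worker, so no
	// locking is needed around per-key join state.
	shards := make([]chan Event, numShards)
	for i := range shards {
		shards[i] = make(chan Event, 1024)
		go func(in <-chan Event) {
			for ev := range in {
				_ = ev // join/lookup logic would run here
			}
		}(shards[i])
	}

	ev := Event{JoinKey: "user-42"}
	shard := xxhash.Sum64String(ev.JoinKey) % numShards
	shards[shard] <- ev
	fmt.Println("event routed to shard", shard)
}
```

Pinning each key to one shard is what makes per-shard processing lock-free: only a single goroutine ever touches a given key's join state.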
Events are stored as pointers, not copied:
```go
type JoinRow struct {
    LeftPtr  *Event // Pointer to original event
    RightPtr *Event // Pointer to joined event
    Presence uint8  // Bitmask for available fields
}
```

Performance Impact:
- 50% less memory allocation
- 30% faster processing due to cache efficiency
- Lower GC pressure: fewer temporary objects are allocated
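
To make the `Presence` field concrete, here is a small sketch of how such a bitmask can be checked; the bit assignments are hypothetical, not KaflowSQL's actual layout:

```go
package main

import "fmt"

// Event stands in for KaflowSQL's internal event type (assumed).
type Event struct {
	Fields map[string]any
}

// Hypothetical bit assignments for JoinRow.Presence.
const (
	hasLeft  uint8 = 1 << 0
	hasRight uint8 = 1 << 1
)

type JoinRow struct {
	LeftPtr  *Event
	RightPtr *Event
	Presence uint8
}

// Complete reports whether both sides of the join have arrived by
// testing the bitmask alone, without touching either event's memory.
func (r JoinRow) Complete() bool {
	return r.Presence&(hasLeft|hasRight) == (hasLeft | hasRight)
}

func main() {
	row := JoinRow{
		LeftPtr:  &Event{Fields: map[string]any{"user_id": 42}},
		Presence: hasLeft,
	}
	fmt.Println(row.Complete()) // false: right side not seen yet

	row.RightPtr = &Event{Fields: map[string]any{"name": "Ada"}}
	row.Presence |= hasRight
	fmt.Println(row.Complete()) // true
}
```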
State persistence is backed by RocksDB (a configuration sketch follows the list):

- LSM Trees: Optimized for write-heavy workloads
- Block Cache: Configurable memory cache (default 25% of heap)
- Compression: LZ4 compression for 60% space savings
- Write Batching: Micro-batches for optimal throughput
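
These map naturally onto RocksDB's Go bindings. As a rough sketch (assuming the `github.com/linxGnu/grocksdb` package, which KaflowSQL may or may not use, with sizes matching the tuning example later in this guide):

```go
package main

import "github.com/linxGnu/grocksdb"

func main() {
	opts := grocksdb.NewDefaultOptions()
	opts.SetCreateIfMissing(true)

	// LZ4: fast compression with large space savings.
	opts.SetCompression(grocksdb.LZ4Compression)

	// Write batching: large memtables absorb write-heavy workloads.
	opts.SetWriteBufferSize(64 << 20) // 64MB
	opts.SetMaxWriteBufferNumber(3)

	// Block cache keeps hot blocks resident in memory.
	bbto := grocksdb.NewDefaultBlockBasedTableOptions()
	bbto.SetBlockSize(64 << 10)                         // 64KB blocks
	bbto.SetBlockCache(grocksdb.NewLRUCache(256 << 20)) // 256MB cache
	opts.SetBlockBasedTableFactory(bbto)

	db, err := grocksdb.OpenDb(opts, "/tmp/kaflowsql-state") // path illustrative
	if err != nil {
		panic(err)
	}
	defer db.Close()
}
```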
SQL execution runs on an embedded DuckDB engine:

- Vectorized Execution: SIMD optimizations for analytics
- Columnar Storage: Efficient for aggregations and analytics
- JIT Compilation: Runtime query optimization
- Memory Mapping: Zero-copy data access
```yaml
# Aggressive TTL for high-volume scenarios
state:
  events:
    default_ttl: 5m         # Short retention
    overrides:
      high_volume_topic: 2m # Even shorter for specific topics
  dimensions:
    - static_lookups        # Never expire reference data
```

```yaml
# Immediate emission for high throughput
emission:
  type: immediate # Don't wait for complete joins
```

```yaml
# Smart emission for data quality
emission:
  type: smart
  require: [critical_field] # Only wait for essential fields
```

```yaml
kafka:
  consumer:
    fetchMinBytes: 1048576 # 1MB fetch batches
    fetchMaxWait: 100      # 100ms max wait
    maxPollRecords: 10000  # Large poll batches
  producer:
    batchSize: 1048576     # 1MB producer batches
    lingerMs: 5            # Small linger for latency
    compression: "lz4"     # Fast compression
```

```yaml
state:
  rocksdb:
    writeBufferSize: 67108864 # 64MB write buffer
    maxWriteBufferNumber: 3   # 3 write buffers
    blockSize: 65536          # 64KB blocks
    blockCache: 268435456     # 256MB block cache
```

```yaml
# Minimize processing latency
name: low_latency_pipeline
window: 30s       # Short global window
emission:
  type: immediate # Don't wait for complete data
state:
  events:
    default_ttl: 30s # Minimal state retention
```

Recommended hardware:

- CPU: Prefer high single-thread performance over core count
- Memory: 32GB+ for production workloads
- Storage: NVMe SSD for RocksDB (3000+ IOPS)
- Network: Low-latency connection to Kafka (< 1ms RTT)
Key metrics exposed by KaflowSQL:

- `kaflowsql_events_processed_total`: Events processed counter
- `kaflowsql_events_per_second`: Current processing rate
- `kaflowsql_join_success_rate`: Percentage of successful joins
- `kaflowsql_processing_duration_histogram`: End-to-end latency
- `kaflowsql_join_lookup_duration`: State lookup latency
- `kaflowsql_sql_execution_duration`: DuckDB query execution time
- `kaflowsql_memory_usage_bytes`: Total memory consumption
- `kaflowsql_state_size_bytes`: In-memory state size
- `kaflowsql_rocksdb_cache_usage`: RocksDB cache utilization
- `kaflowsql_state_entries_total`: Total state entries
- `kaflowsql_state_evictions_total`: TTL-based evictions
- `kaflowsql_join_hit_rate`: State lookup success rate
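
All of these are exposed in Prometheus format on the metrics endpoint (port 8080 in the examples in this guide), so a minimal scrape config is enough to collect them; the job name here is an assumption:

```yaml
scrape_configs:
  - job_name: kaflowsql             # illustrative job name
    static_configs:
      - targets: ["localhost:8080"] # KaflowSQL metrics endpoint
```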
Useful PromQL queries:

```promql
# Processing rate
rate(kaflowsql_events_processed_total[5m])

# Join success rate
kaflowsql_join_success_rate * 100

# P99 latency
histogram_quantile(0.99, kaflowsql_processing_duration_histogram)

# Memory usage in GB
kaflowsql_memory_usage_bytes / (1024^3)
```
```yaml
# Performance degradation alerts
- alert: HighLatency
  expr: histogram_quantile(0.99, kaflowsql_processing_duration_histogram) > 0.005
  for: 2m

- alert: LowThroughput
  expr: rate(kaflowsql_events_processed_total[5m]) < 10000
  for: 5m

- alert: LowJoinRate
  expr: kaflowsql_join_success_rate < 0.8
  for: 1m
```

High latency symptoms:

- P99 latency > 5ms
- Increasing processing delays
- Kafka consumer lag growing
Diagnosis:

```bash
# Check RocksDB statistics
curl http://localhost:8080/metrics | grep rocksdb

# Monitor memory usage
curl http://localhost:8080/metrics | grep memory

# Check join hit rates
curl http://localhost:8080/metrics | grep join_hit_rate
```

Solutions (see the sketch after this list):

- Increase RocksDB cache: More memory for hot data
- Optimize TTL settings: Reduce state retention
- Scale horizontally: Deploy multiple instances
- Tune Kafka consumers: Larger fetch batches
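
As a sketch, the first two solutions map onto the configuration keys shown in the tuning examples above (the sizes here are illustrative):

```yaml
state:
  rocksdb:
    blockCache: 536870912 # 512MB: more room for hot data
  events:
    default_ttl: 2m       # Shorter retention shrinks the hot state
```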
Low throughput symptoms:

- Processing rate below expected throughput
- CPU utilization < 70%
- Memory usage stable but low performance
Diagnosis:

```bash
# Check consumer lag
kafka-consumer-groups --bootstrap-server localhost:9092 \
  --group kaflowsql --describe

# Monitor CPU per shard
curl http://localhost:8080/debug/pprof/profile
```

Solutions:

- Increase consumer parallelism: More Kafka partitions (example after this list)
- Optimize SQL queries: Simpler joins and filters
- Adjust emission strategy: Use immediate emission
- Tune batch sizes: Larger processing batches
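
For the consumer-parallelism route, partitions can be added with the stock Kafka CLI (topic name illustrative; partition counts can only be increased):

```bash
# Raise the partition count so more consumers can read in parallel
kafka-topics --bootstrap-server localhost:9092 \
  --alter --topic user_events --partitions 16
```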
Memory issue symptoms:

- Memory usage growing continuously
- OOM kills in containerized environments
- Slow state lookups
Diagnosis:

```bash
# Check state size growth
curl http://localhost:8080/metrics | grep state_size

# Monitor TTL effectiveness
curl http://localhost:8080/metrics | grep evictions
```

Solutions:

- Reduce TTL values: More aggressive state cleanup
- Optimize join keys: Better data distribution
- Use dimensions wisely: Only for truly static data
- Increase memory limits: Scale container resources
To benchmark a deployment:

```bash
# Use built-in data generator
./fakegen \
  --topics user_events,user_profiles \
  --rate 50000 \
  --duration 10m \
  --brokers localhost:9092
```

```bash
# Real-time metrics
watch -n 1 'curl -s http://localhost:8080/metrics | grep -E "(events_per_second|join_hit_rate|processing_duration)"'
```

```bash
# Export metrics for analysis
curl http://localhost:8080/metrics > benchmark_results.txt

# Generate performance report
python scripts/analyze_performance.py benchmark_results.txt
```

Benchmark scenarios:

- Baseline Performance: Single-topic, no joins
- Join Performance: 2-topic, simple joins
- Multi-Join Performance: 3+ topics, complex queries
- High-Volume Testing: Maximum sustainable throughput
- Latency Testing: P99 latency under load
- Memory Stress Testing: Large state scenarios
- Recovery Testing: Performance after restart
- Long-Running Testing: 24+ hour stability
Best practices:

- Use immediate emission for maximum throughput
- Use smart emission for data quality requirements
- Set TTL based on business requirements, not technical limits
- Configure RocksDB cache to 25-50% of available memory
- Deploy one instance per Kafka partition for optimal parallelism
- Use NVMe storage for RocksDB data
- Monitor join hit rates to validate pipeline effectiveness
- Set up comprehensive alerting on key performance metrics
- Profile regularly using built-in pprof endpoints
- Test performance before production deployment
- Monitor state growth and adjust TTL accordingly
- Scale horizontally rather than vertically when possible
Need help optimizing your KaflowSQL deployment? Check our troubleshooting guide or join the discussion.