Semantic search over 100M NYC taxi records with <100ms P99 latency on a $45/month Azure budget.
Challenge: a pure-Python API (GIL-bound) tops out at ~50 qps. Production needs 500+ qps.
Solution:
- C# API: High-concurrency orchestration
- Python sidecar: ML inference only
- gRPC: 5ms inter-service latency
Result: 10x throughput vs pure Python.
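The C#↔Python split above implies a small RPC contract between the orchestrator and the inference sidecar. A minimal sketch of what that contract might look like (service, method, and message names are assumptions, not from the source):

```protobuf
syntax = "proto3";

// Hypothetical contract between the C# API and the Python ML sidecar.
service Embedder {
  // One round trip per query keeps the ~5ms inter-service budget to a single hop.
  rpc Embed (EmbedRequest) returns (EmbedReply);
}

message EmbedRequest {
  string text = 1;             // raw query text from the C# API
}

message EmbedReply {
  repeated float vector = 1;   // 384-dim embedding from the Python model
}
```

Keeping the sidecar stateless (text in, vector out) lets the C# side own batching, retries, and caching.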
Math: 100M vectors × 384 floats × 4 bytes ≈ 154GB of raw embeddings
Optimization:
- IVF: Cluster vectors into 100 Voronoi cells; search only the nearest cells
- PQ: Quantize each 384-dim float vector to an 8-byte code
- Result: 4.8GB index (97% compression)
Trade-off: 5% recall loss acceptable for semantic search.
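The back-of-envelope math above can be checked directly. This sketch reproduces the raw-size and PQ-code arithmetic; the gap between 0.8GB of codes and the 4.8GB index is IDs, IVF centroids, and index overhead (my reading, not stated in the source):

```python
# Memory arithmetic for the index, using the numbers from the text.
N = 100_000_000           # vectors (one per taxi record)
DIMS, F32 = 384, 4        # embedding width, bytes per float32
PQ_CODE = 8               # bytes per vector after product quantization

raw_bytes = N * DIMS * F32    # full-precision embeddings
pq_bytes = N * PQ_CODE        # PQ codes only

print(raw_bytes / 1e9)        # ~154 GB raw
print(pq_bytes / 1e9)         # 0.8 GB of codes; IDs + centroids make up the rest
```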
Caching: write-behind on a background task, so the response never waits on the cache:

var results = await Search();
_ = Task.Run(() => cache.Set(key, results)); // Fire-and-forget: cache write never blocks the response
return results;

Impact: 85% cache hit rate, 0ms cache write penalty.
- P99: 425ms (cold) → 3ms (cached)
- Throughput: 500 qps sustained
- Cost: $45/mo vs $900/mo Pinecone (95% savings)
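The fire-and-forget cache write shown in the C# snippet has a direct analogue on the Python side. A minimal asyncio sketch (the in-memory dict and function names are illustrative stand-ins, not the real Redis client or search call):

```python
import asyncio

cache: dict = {}  # stand-in for Redis; names here are illustrative

async def search_backend(query: str) -> list:
    await asyncio.sleep(0)  # placeholder for the actual vector search
    return [f"result for {query}"]

async def write_cache(key: str, value: list) -> None:
    cache[key] = value  # real code would set a TTL on the Redis key

async def cached_search(query: str) -> list:
    if query in cache:                    # hit: the ~0ms path
        return cache[query]
    results = await search_backend(query)
    # Fire-and-forget: schedule the write, return without awaiting it.
    asyncio.create_task(write_cache(query, results))
    return results
```

The pattern is the same as the C# version: return first, persist later, and accept that a crashed write only costs a future cache miss.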
Lessons:
- GPU from Day 1: would have saved weeks of debugging CPU bottlenecks
- Chaos testing early: a thread-safety bug surfaced in production that earlier chaos testing would have caught
- Resilience > Accuracy: circuit breakers matter more than chasing 99% recall
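The "circuit breakers over recall" lesson can be sketched in a few lines. This is a generic minimal breaker, not the project's actual implementation: after N consecutive failures it trips open and fails fast to a fallback (e.g. a stale cached answer) until a cooldown expires:

```python
import time

class CircuitBreaker:
    """Minimal sketch: open after max_failures in a row, retry after reset_after seconds."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the breaker tripped

    def call(self, fn, *args, fallback=None):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback            # open: fail fast, serve the degraded answer
            self.opened_at = None          # half-open: allow one trial call
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback
        self.failures = 0                  # success closes the breaker
        return result
```

Wrapping the gRPC call to the sidecar this way means a dying model server degrades search quality instead of taking down the whole API.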