This document records the design decisions made during implementation and questions that arose, along with the chosen solutions.
Question: Which programming language to use for the implementation?
Decision: Rust
Rationale:
- Performance: Rust provides near-C performance without garbage collection overhead, crucial for a high-throughput caching service that Xcode will call thousands of times per build.
- Safety: Memory safety without GC prevents common bugs in long-running services.
- Ecosystem: Excellent gRPC support via `tonic`, async runtime via `tokio`, and mature SQLite bindings via `rusqlite`.
- Distribution: Single static binary makes deployment trivial on macOS.
- BLAKE3: Native Rust implementation available for fast hashing.
Alternatives considered:
- Go: Good performance but GC pauses could affect latency under high load
- C++: Too much manual memory management for this use case
- Python: Too slow for production use
Question: Which hash algorithm to use for content-addressable storage?
Decision: BLAKE3 with domain separation
Rationale:
- BLAKE3 is one of the fastest cryptographic hash functions available
- Used by LLVM's ActionCache according to their source code
- 256-bit output is sufficient for collision resistance
- Parallelizable, which helps with large files
Implementation:
- Blobs: `BLAKE3("cas-build-cache:blob:v1:" || data)`
- Objects: `BLAKE3("cas-build-cache:object:v1:" || data || num_refs || ref1 || ref2 || ...)`
The domain separation ensures that a blob and an object containing identical bytes produce different hashes, preventing cross-type collisions.
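As a sketch, the scheme maps onto the `blake3` crate like this. The function names are illustrative, and the u64 little-endian encoding of `num_refs` is an assumption (the scheme above doesn't fix an encoding):

```rust
// Sketch of the domain-separated hashing scheme with the `blake3` crate.
// Function names are illustrative; the num_refs encoding is an assumption.
fn blob_id(data: &[u8]) -> blake3::Hash {
    let mut hasher = blake3::Hasher::new();
    hasher.update(b"cas-build-cache:blob:v1:"); // domain-separation prefix
    hasher.update(data);
    hasher.finalize()
}

fn object_id(data: &[u8], refs: &[blake3::Hash]) -> blake3::Hash {
    let mut hasher = blake3::Hasher::new();
    hasher.update(b"cas-build-cache:object:v1:");
    hasher.update(data);
    hasher.update(&(refs.len() as u64).to_le_bytes()); // num_refs
    for r in refs {
        hasher.update(r.as_bytes()); // 32-byte raw hash of each reference
    }
    hasher.finalize()
}
```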
Question: How to structure the storage backends?
Decision: Trait-based abstraction with multiple backend implementations
Rationale:
- Allows swapping backends without changing the service layer
- Enables in-memory backend for testing
- Future-proofs for Redis, S3, or other backends
Backend choices:
- File-based CAS (default): Best for large blobs, leverages OS page cache
- SQLite KV (default): Excellent for small entries, atomic transactions
- SQLite CAS (optional): Single-file storage, easier backup
- In-memory (for testing): Fast, no persistence
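A minimal sketch of the abstraction, with trait and method names that are illustrative rather than the real ones:

```rust
use std::io;

// Content-addressable storage: keyed by the 32-byte BLAKE3 id of the content.
pub trait CasBackend: Send + Sync {
    fn put(&self, id: &[u8; 32], data: &[u8]) -> io::Result<()>;
    fn get(&self, id: &[u8; 32]) -> io::Result<Option<Vec<u8>>>;
}

// Key-value storage for the action cache: arbitrary keys chosen by the client.
pub trait KvBackend: Send + Sync {
    fn put(&self, key: &[u8], value: &[u8]) -> io::Result<()>;
    fn get(&self, key: &[u8]) -> io::Result<Option<Vec<u8>>>;
}

// The service layer holds trait objects, so file, SQLite, or in-memory
// backends can be swapped at startup without touching service code.
pub struct Service {
    pub cas: Box<dyn CasBackend>,
    pub kv: Box<dyn KvBackend>,
}
```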
Question: How to organize files in the filesystem?
Decision: Multi-level directory hierarchy (configurable 1-3 levels)
Structure:
cas/
  aa/                      <- first byte of hash
    bb/                    <- second byte of hash
      aabbcc...full_hash.blob
      aabbdd...full_hash.obj
Rationale:
- Prevents millions of files in a single directory (macOS/HFS+ performance issue)
- 256 subdirectories per level is manageable
- 2 levels (65,536 directories) is the default, a good balance for most caches
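For illustration, a hypothetical helper that maps a hex-encoded hash to its sharded path:

```rust
use std::path::{Path, PathBuf};

// Hypothetical helper: maps a hex-encoded hash to its on-disk location,
// using one byte (two hex characters) of the hash per directory level.
fn blob_path(root: &Path, hex_hash: &str, levels: usize) -> PathBuf {
    let mut path = root.to_path_buf();
    for level in 0..levels.clamp(1, 3) {
        path.push(&hex_hash[level * 2..level * 2 + 2]);
    }
    path.push(format!("{hex_hash}.blob"));
    path
}

// With the 2-level default:
// blob_path(Path::new("cas"), "aabbcc...", 2) == cas/aa/bb/aabbcc....blob
```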
Question: How to handle when the same action key gets different values?
Decision: Lenient mode by default, with optional strict mode
Options:
- Strict: Reject new value, keep old one, return error
- Lenient (default): Keep first value, log warning, no error
- Overwrite: Last write wins (dangerous for builds)
Rationale: Strict mode can cause build failures for legitimately non-deterministic actions (timestamps, build IDs, etc.). Lenient mode is safer for production while still providing the same caching benefit.
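Sketched as an enum with a resolver (names are illustrative):

```rust
// The three policies, sketched; names are illustrative.
pub enum ConflictMode {
    Strict,    // reject the new value, keep the old one, return an error
    Lenient,   // keep the first value, log a warning, no error (default)
    Overwrite, // last write wins
}

// Decide which value to keep when a key is written twice.
fn resolve(mode: &ConflictMode, key: &[u8], old: Vec<u8>, new: Vec<u8>) -> Result<Vec<u8>, String> {
    if old == new {
        return Ok(old); // identical values are never a conflict
    }
    match mode {
        ConflictMode::Strict => Err(format!("conflicting value for key {key:?}")),
        ConflictMode::Lenient => {
            eprintln!("warning: conflicting value for key {key:?}; keeping first value");
            Ok(old)
        }
        ConflictMode::Overwrite => Ok(new),
    }
}
```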
Question: How to handle cache eviction?
Decision: LRU-based eviction with optional TTL, using high/low water marks
Implementation:
- Check cache size periodically (configurable interval, default 5 minutes)
- If size > high_water_mark (95%), start evicting
- Evict KV entries first (action cache), then CAS entries
- Stop when size < low_water_mark (85%)
- Optionally apply TTL-based eviction for old entries
Rationale:
- Evicting KV entries first makes CAS entries "orphaned" and thus safe to evict
- Water marks prevent constant eviction churn
- LRU is simple and effective; reachability-based GC is more complex and requires parsing all KV values to find CAS references
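A sketch of one eviction pass with water marks; `total_size`, `evict_oldest_kv`, and `evict_oldest_cas` are hypothetical helpers that return the number of bytes freed (0 when nothing is left to evict):

```rust
// Hypothetical backend accessors; placeholders for the sketch.
fn total_size() -> u64 { todo!() }
fn evict_oldest_kv() -> u64 { todo!() }
fn evict_oldest_cas() -> u64 { todo!() }

fn eviction_pass(max_bytes: u64) {
    let high = max_bytes * 95 / 100; // high water mark: start evicting above this
    let low = max_bytes * 85 / 100;  // low water mark: stop evicting below this
    let mut size = total_size();
    if size <= high {
        return; // below the high water mark: nothing to do
    }
    while size > low {
        // KV entries first: removing an action-cache entry orphans the CAS
        // entries it references, which makes them safe to evict afterwards.
        let mut freed = evict_oldest_kv();
        if freed == 0 {
            freed = evict_oldest_cas();
        }
        if freed == 0 {
            break; // nothing evictable left
        }
        size = size.saturating_sub(freed);
    }
}
```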
Question: How to implement the write_to_disk optimization in proto?
Decision:
- Accept `file_path` in requests (read the file, never store the path)
- Optionally return `file_path` in responses when the client sets `write_to_disk=true`
- Export files to a dedicated directory with unique names
- Client is responsible for moving or deleting the exported file
Rationale: The proto spec notes this is optional ("the service can still choose to return bytes"). For local Unix socket access, `file_path` can be a significant performance win for large blobs by avoiding gRPC message serialization.
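A sketch of the export step; the directory layout and naming scheme here are assumptions:

```rust
use std::io::Write;
use std::path::{Path, PathBuf};
use std::time::{SystemTime, UNIX_EPOCH};

// Sketch: write the blob into the export directory under a unique name and
// return the path; the client then owns (moves or deletes) the file.
fn export_blob(export_dir: &Path, hex_hash: &str, data: &[u8]) -> std::io::Result<PathBuf> {
    std::fs::create_dir_all(export_dir)?;
    // A unique name avoids clashes between concurrent exports of the same
    // blob; a nanosecond timestamp keeps the sketch short.
    let nanos = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_nanos();
    let path = export_dir.join(format!("{hex_hash}-{nanos}.blob"));
    let mut file = std::fs::File::create(&path)?;
    file.write_all(data)?;
    Ok(path)
}
```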
Question: What permissions should the Unix socket have?
Decision: Mode 0600 (owner read/write only)
Rationale: Build caches may contain proprietary compiled code. Restricting access to the owner prevents other users on shared systems from accessing cached artifacts.
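For example, using the standard library (the bind-then-chmod sequence leaves a brief window; a restrictive umask before binding closes it):

```rust
use std::fs;
use std::os::unix::fs::PermissionsExt;
use std::path::Path;

// Restrict the bound socket file to its owner (mode 0600).
fn restrict_socket(socket_path: &Path) -> std::io::Result<()> {
    fs::set_permissions(socket_path, fs::Permissions::from_mode(0o600))
}
```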
Question: How to handle configuration?
Decision: TOML file with sensible defaults for zero-configuration local use
Default locations checked:
- `~/.config/cas-build-cache/config.toml`
- `/etc/cas-build-cache/config.toml`
- `./config.toml`
Defaults:
- Socket: `~/.local/state/cas-build-cache/cache.sock`
- Storage: `~/Library/Caches/cas-build-cache/` (macOS standard)
- Max size: 50 GB
- CAS backend: File
- KV backend: SQLite
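For illustration, a config.toml expressing these defaults might look like the following; the key names are assumptions, not the real schema:

```toml
# Hypothetical config.toml mirroring the defaults above.
socket = "~/.local/state/cas-build-cache/cache.sock"

[storage]
path = "~/Library/Caches/cas-build-cache/"
max_size_gb = 50
cas_backend = "file"   # or "sqlite"
kv_backend = "sqlite"
```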
During implementation, the following requirements were added based on the spec analysis:
- Tagging system: Added a `tag` field to all entries to support batch operations (e.g., delete all entries for a specific project or user)
- Verification on read: Optional hash verification when loading entries, useful for detecting corruption
- Export directory cleanup: The spec mentions cleaning up abandoned exports; implemented via TTL-based cleanup
- Concurrent access: SQLite WAL mode and proper locking for concurrent builds (see the sketch at the end of this document)
- Corruption handling: Delete corrupt entries on read and return "not found" (treat it as a cache miss and let the build system recompute)
The following are known limitations and candidates for future work:
- Reachability-based GC: Currently using simple LRU. Full reachability tracking would require parsing KV values to identify which are CAS IDs; the protocol doesn't define a standard format for this.
- Remote backend: The current implementation is local-only. A Redis or S3 backend would enable distributed caching.
- Authentication: No authentication is implemented. For shared caches, this might be needed.
- Metrics/Observability: Basic logging is implemented; Prometheus metrics could be added for production monitoring.
- Prefetching: The spec mentions Tuist is exploring prefetching artifacts before they're needed; this could be a future optimization.
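Referenced from the "Concurrent access" item above, a sketch of enabling WAL mode with the `rusqlite` crate (the busy_timeout value is an assumption):

```rust
use rusqlite::Connection;

// Sketch: open the KV database with WAL mode so concurrent builds can read
// while one writer commits; busy_timeout makes writers wait on the lock
// instead of failing immediately.
fn open_kv(path: &str) -> rusqlite::Result<Connection> {
    let conn = Connection::open(path)?;
    conn.pragma_update(None, "journal_mode", "WAL")?;
    conn.pragma_update(None, "busy_timeout", 5000)?;
    Ok(conn)
}
```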