CAS Build Cache - Design Decisions and Questions

This document records the design decisions made during implementation and questions that arose, along with the chosen solutions.

Language Choice

Question: Which programming language to use for the implementation?

Decision: Rust

Rationale:

  1. Performance: Rust provides near-C performance without garbage collection overhead, crucial for a high-throughput caching service that Xcode will call thousands of times per build.
  2. Safety: Memory safety without GC prevents common bugs in long-running services.
  3. Ecosystem: Excellent gRPC support via tonic, async runtime via tokio, and mature SQLite bindings via rusqlite.
  4. Distribution: Single static binary makes deployment trivial on macOS.
  5. BLAKE3: Native Rust implementation available for fast hashing.

Alternatives considered:

  • Go: Good performance but GC pauses could affect latency under high load
  • C++: Too much manual memory management for this use case
  • Python: Too slow for production use

Hashing Algorithm

Question: Which hash algorithm to use for content-addressable storage?

Decision: BLAKE3 with domain separation

Rationale:

  1. BLAKE3 is one of the fastest cryptographic hash functions available
  2. Used by LLVM's ActionCache according to their source code
  3. 256-bit output is sufficient for collision resistance
  4. Parallelizable, which helps with large files

Implementation:

  • Blobs: BLAKE3("cas-build-cache:blob:v1:" || data)
  • Objects: BLAKE3("cas-build-cache:object:v1:" || data || num_refs || ref1 || ref2 || ...)

The domain separation ensures blobs and objects with identical data produce different hashes, preventing accidental collisions.
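As a sketch, the ID computation looks like the following. The function names are illustrative, and `DefaultHasher` from the standard library stands in for BLAKE3 (the real implementation would use the `blake3` crate), but the domain prefixes and the `data || num_refs || ref1 || ...` layout mirror the scheme above:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in hash; the real service uses BLAKE3 (blake3 crate).
fn hash_bytes(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish()
}

// Blob ID: hash over the blob domain prefix plus the raw data.
fn blob_id(data: &[u8]) -> u64 {
    let mut input = b"cas-build-cache:blob:v1:".to_vec();
    input.extend_from_slice(data);
    hash_bytes(&input)
}

// Object ID: object domain prefix, data, then the ref count and each ref.
fn object_id(data: &[u8], refs: &[u64]) -> u64 {
    let mut input = b"cas-build-cache:object:v1:".to_vec();
    input.extend_from_slice(data);
    input.extend_from_slice(&(refs.len() as u64).to_le_bytes());
    for r in refs {
        input.extend_from_slice(&r.to_le_bytes());
    }
    hash_bytes(&input)
}

fn main() {
    let data = b"same bytes";
    // Same data, different domains => different IDs.
    assert_ne!(blob_id(data), object_id(data, &[]));
}
```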


Storage Architecture

Question: How to structure the storage backends?

Decision: Trait-based abstraction with multiple backend implementations

Rationale:

  1. Allows swapping backends without changing the service layer
  2. Enables in-memory backend for testing
  3. Future-proofs for Redis, S3, or other backends

Backend choices:

  • File-based CAS (default): Best for large blobs, leverages OS page cache
  • SQLite KV (default): Excellent for small entries, atomic transactions
  • SQLite CAS (optional): Single-file storage, easier backup
  • In-memory (for testing): Fast, no persistence
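A minimal sketch of the trait-based abstraction, shown with the in-memory backend. The trait and method names here are illustrative; the real service layer would use async methods (tokio) and content-hash IDs rather than arbitrary strings:

```rust
use std::collections::HashMap;

// Illustrative backend trait; swapping implementations leaves the
// service layer unchanged.
trait CasBackend {
    fn put(&mut self, id: String, data: Vec<u8>);
    fn get(&self, id: &str) -> Option<&[u8]>;
}

// In-memory backend, as used for testing: fast, no persistence.
struct MemoryCas {
    entries: HashMap<String, Vec<u8>>,
}

impl MemoryCas {
    fn new() -> Self {
        MemoryCas { entries: HashMap::new() }
    }
}

impl CasBackend for MemoryCas {
    fn put(&mut self, id: String, data: Vec<u8>) {
        self.entries.insert(id, data);
    }
    fn get(&self, id: &str) -> Option<&[u8]> {
        self.entries.get(id).map(|v| v.as_slice())
    }
}

fn main() {
    let mut cas = MemoryCas::new();
    cas.put("aabbcc".to_string(), b"payload".to_vec());
    assert_eq!(cas.get("aabbcc"), Some(&b"payload"[..]));
    assert_eq!(cas.get("missing"), None);
}
```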

File-based CAS Structure

Question: How to organize files in the filesystem?

Decision: Multi-level directory hierarchy (configurable 1-3 levels)

Structure:

cas/
  aa/          <- first byte of hash
    bb/        <- second byte of hash
      aabbcc...full_hash.blob
      aabbdd...full_hash.obj

Rationale:

  1. Prevents millions of files in a single directory (macOS/HFS+ performance issue)
  2. 256 subdirectories per level is manageable
  3. 2 levels (65536 directories) is the default - good balance for most caches
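The sharded path falls directly out of the hex-encoded hash, with each directory level consuming one byte (two hex characters). A sketch, with an assumed function name and signature:

```rust
use std::path::PathBuf;

// Build a sharded path like cas/aa/bb/<full_hash>.blob for depth 2.
// `depth` is the configurable number of directory levels (1-3).
fn blob_path(root: &str, hash_hex: &str, depth: usize, ext: &str) -> PathBuf {
    let mut path = PathBuf::from(root);
    for level in 0..depth {
        // Each level uses the next byte (two hex chars) of the hash.
        path.push(&hash_hex[level * 2..level * 2 + 2]);
    }
    path.push(format!("{hash_hex}.{ext}"));
    path
}

fn main() {
    let p = blob_path("cas", "aabbccdd", 2, "blob");
    assert_eq!(p.to_str().unwrap(), "cas/aa/bb/aabbccdd.blob");
}
```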

KV Value Conflict Handling (Poisoning)

Question: How to handle when the same action key gets different values?

Decision: Lenient mode by default, with optional strict mode

Options:

  1. Strict: Reject new value, keep old one, return error
  2. Lenient (default): Keep first value, log warning, no error
  3. Overwrite: Last write wins (dangerous for builds)

Rationale: Strict mode can cause build failures for legitimately non-deterministic actions (timestamps, build IDs, etc.). Lenient mode is safer for production: the first stored value keeps being served, so the caching benefit is preserved.
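The three modes can be sketched as follows; the names and signature are illustrative, and a plain `HashMap` stands in for the SQLite KV backend:

```rust
use std::collections::HashMap;

// Illustrative conflict modes for the KV (action cache) store.
#[derive(Clone, Copy)]
enum ConflictMode {
    Strict,    // reject the new value and return an error
    Lenient,   // keep the first value, log a warning, no error
    Overwrite, // last write wins (dangerous for builds)
}

// Hypothetical put-with-conflict-handling over a plain map.
fn kv_put(
    store: &mut HashMap<String, Vec<u8>>,
    mode: ConflictMode,
    key: String,
    value: Vec<u8>,
) -> Result<(), String> {
    // Does an entry with a *different* value already exist?
    let conflict = store.get(&key).map_or(false, |old| *old != value);
    if !conflict {
        store.insert(key, value); // no entry yet, or identical value
        return Ok(());
    }
    match mode {
        ConflictMode::Strict => Err(format!("value conflict for key {key}")),
        ConflictMode::Lenient => {
            eprintln!("warning: conflicting value for {key}; keeping first");
            Ok(())
        }
        ConflictMode::Overwrite => {
            store.insert(key, value);
            Ok(())
        }
    }
}

fn main() {
    let mut kv = HashMap::new();
    kv_put(&mut kv, ConflictMode::Lenient, "k".into(), b"a".to_vec()).unwrap();
    // Lenient: the conflicting second write succeeds but is ignored.
    kv_put(&mut kv, ConflictMode::Lenient, "k".into(), b"b".to_vec()).unwrap();
    assert_eq!(kv["k"], b"a".to_vec());
}
```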


Eviction Strategy

Question: How to handle cache eviction?

Decision: LRU-based eviction with optional TTL, using high/low water marks

Implementation:

  1. Check cache size periodically (configurable interval, default 5 minutes)
  2. If size > high_water_mark (95%), start evicting
  3. Evict KV entries first (action cache), then CAS entries
  4. Stop when size < low_water_mark (85%)
  5. Optionally apply TTL-based eviction for old entries

Rationale:

  • Evicting KV entries first makes CAS entries "orphaned" and thus safe to evict
  • Water marks prevent constant eviction churn
  • LRU is simple and effective; reachability-based GC is more complex and requires parsing all KV values to find CAS references
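A single eviction pass with water marks can be sketched like this; entries are modeled as (id, last_access, size) tuples, whereas the real implementation reads these from the storage backends:

```rust
// Illustrative LRU eviction pass between high and low water marks.
fn evict(entries: &mut Vec<(String, u64, u64)>, max_bytes: u64) -> Vec<String> {
    let high = max_bytes * 95 / 100; // high water mark: start evicting
    let low = max_bytes * 85 / 100;  // low water mark: stop evicting
    let mut total: u64 = entries.iter().map(|e| e.2).sum();
    let mut evicted = Vec::new();
    if total <= high {
        return evicted; // below the high water mark: nothing to do
    }
    // Least recently used first.
    entries.sort_by_key(|e| e.1);
    while total > low && !entries.is_empty() {
        let (id, _, size) = entries.remove(0);
        total -= size;
        evicted.push(id);
    }
    evicted
}

fn main() {
    let mut entries = vec![
        ("old".to_string(), 1, 50), // accessed longest ago, 50 bytes
        ("mid".to_string(), 2, 30),
        ("new".to_string(), 3, 20),
    ];
    // 100 bytes used against a 100-byte cap: over 95%, evict to under 85%.
    let gone = evict(&mut entries, 100);
    assert_eq!(gone, vec!["old".to_string()]);
}
```

The gap between the two marks is what prevents churn: one pass frees a meaningful amount of space instead of evicting a single entry on every check.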

write_to_disk Optimization

Question: How to implement the write_to_disk optimization in proto?

Decision:

  1. Accept file_path in requests (read file, never store path)
  2. Optionally return file_path in responses when client sets write_to_disk=true
  3. Export files to a dedicated directory with unique names
  4. Client is responsible for moving/deleting the exported file

Rationale: The proto spec notes this is optional ("the service can still choose to return bytes"). For local Unix socket access, file_path can be a significant performance win for large blobs by avoiding gRPC message serialization.
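One way the service might choose between inline bytes and an exported path is sketched below; the response type, the size threshold, and the file-naming scheme are all assumptions, not the actual proto definitions:

```rust
use std::path::PathBuf;

// Hypothetical response shape: either inline bytes or an exported path.
enum LoadResponse {
    Bytes(Vec<u8>),
    FilePath(PathBuf),
}

// The service may honor write_to_disk only for large blobs and still
// return bytes inline for small ones, as the proto allows.
fn respond(data: Vec<u8>, write_to_disk: bool, export_dir: &str, id: &str) -> LoadResponse {
    const INLINE_LIMIT: usize = 64 * 1024; // assumed cutoff, not from the spec
    if write_to_disk && data.len() > INLINE_LIMIT {
        // A real implementation would write `data` to this path under a
        // unique name; the client then moves or deletes the file.
        LoadResponse::FilePath(PathBuf::from(export_dir).join(format!("{id}.export")))
    } else {
        LoadResponse::Bytes(data)
    }
}

fn main() {
    let small = respond(vec![0u8; 16], true, "/tmp/exports", "aabb");
    assert!(matches!(small, LoadResponse::Bytes(_)));
    let big = respond(vec![0u8; 128 * 1024], true, "/tmp/exports", "aabb");
    assert!(matches!(big, LoadResponse::FilePath(_)));
}
```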


Socket Permissions

Question: What permissions should the Unix socket have?

Decision: Mode 0600 (owner read/write only)

Rationale: Build caches may contain proprietary compiled code. Restricting access to the owner prevents other users on shared systems from accessing cached artifacts.
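Restricting the socket needs nothing beyond the standard library; the sketch below demonstrates the call on an ordinary file, since the same `set_permissions` applies to the socket path once it is bound:

```rust
use std::fs;
use std::os::unix::fs::PermissionsExt;
use std::path::Path;

// After binding the Unix socket, restrict the socket file to the
// owning user (mode 0600).
fn restrict_to_owner(path: &Path) -> std::io::Result<()> {
    fs::set_permissions(path, fs::Permissions::from_mode(0o600))
}

fn main() {
    let path = std::env::temp_dir().join("cache.sock.demo");
    fs::write(&path, b"").unwrap();
    restrict_to_owner(&path).unwrap();
    let mode = fs::metadata(&path).unwrap().permissions().mode() & 0o777;
    assert_eq!(mode, 0o600);
    fs::remove_file(&path).unwrap();
}
```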


Configuration

Question: How to handle configuration?

Decision: TOML file with sensible defaults for zero-configuration local use

Default locations checked:

  1. ~/.config/cas-build-cache/config.toml
  2. /etc/cas-build-cache/config.toml
  3. ./config.toml

Defaults:

  • Socket: ~/.local/state/cas-build-cache/cache.sock
  • Storage: ~/Library/Caches/cas-build-cache/ (macOS standard)
  • Max size: 50GB
  • CAS backend: File
  • KV backend: SQLite

Additional Requirements Identified

During implementation, the following requirements were added based on the spec analysis:

  1. Tagging system: Added tag field to all entries to support batch operations (e.g., delete all entries for a specific project or user)

  2. Verification on read: Optional hash verification when loading entries, useful for detecting corruption

  3. Export directory cleanup: The spec mentions cleaning up abandoned exports; implemented via TTL-based cleanup

  4. Concurrent access: SQLite WAL mode and proper locking for concurrent builds

  5. Corruption handling: Delete corrupt entries on read and return "not found" (treat as cache miss, let build system recompute)
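Verification on read (item 2) and corruption handling (item 5) combine naturally into a sketch like this, with `DefaultHasher` again standing in for BLAKE3 and the function names illustrative:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in for BLAKE3 in this sketch.
fn content_hash(data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

// Verify-on-read: recompute the hash and treat a mismatch as a miss.
// A real backend would also delete the corrupt entry so the build
// system recomputes and re-stores it.
fn load_verified(expected: u64, data: Vec<u8>) -> Option<Vec<u8>> {
    if content_hash(&data) == expected {
        Some(data)
    } else {
        None // corrupt: report "not found"
    }
}

fn main() {
    let data = b"artifact".to_vec();
    let id = content_hash(&data);
    assert!(load_verified(id, data.clone()).is_some());
    let mut corrupt = data;
    corrupt[0] ^= 0xff; // flip bits to simulate on-disk corruption
    assert!(load_verified(id, corrupt).is_none());
}
```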


Open Questions (For Future Work)

  1. Reachability-based GC: Currently using simple LRU. Full reachability tracking would require parsing KV values to identify which are CAS IDs - the protocol doesn't define a standard format for this.

  2. Remote backend: The current implementation is local-only. A Redis or S3 backend would enable distributed caching.

  3. Authentication: No authentication is implemented. For shared caches, this might be needed.

  4. Metrics/Observability: Basic logging is implemented; Prometheus metrics could be added for production monitoring.

  5. Prefetching: The spec mentions Tuist is exploring prefetching artifacts before they're needed - this could be a future optimization.