CAS Build Cache - Design Decisions and Questions

This document records the design decisions made during implementation and questions that arose, along with the chosen solutions.

Language Choice

Question: Which programming language to use for the implementation?

Decision: Rust

Rationale:

  1. Performance: Rust provides near-C performance without garbage collection overhead, crucial for a high-throughput caching service that Xcode will call thousands of times per build.
  2. Safety: Memory safety without GC prevents common bugs in long-running services.
  3. Ecosystem: Excellent gRPC support via tonic, async runtime via tokio, and mature SQLite bindings via rusqlite.
  4. Distribution: Single static binary makes deployment trivial on macOS.
  5. BLAKE3: Native Rust implementation available for fast hashing.

Alternatives considered:

  • Go: Good performance but GC pauses could affect latency under high load
  • C++: Too much manual memory management for this use case
  • Python: Too slow for production use

Hashing Algorithm

Question: Which hash algorithm to use for content-addressable storage?

Decision: BLAKE3 with domain separation

Rationale:

  1. BLAKE3 is one of the fastest cryptographic hash functions available
  2. Used by LLVM's ActionCache according to their source code
  3. 256-bit output is sufficient for collision resistance
  4. Parallelizable, which helps with large files

Implementation:

  • Blobs: BLAKE3("cas-build-cache:blob:v1:" || data)
  • Objects: BLAKE3("cas-build-cache:object:v1:" || data || num_refs || ref1 || ref2 || ...)

The domain separation ensures blobs and objects with identical data produce different hashes, preventing accidental collisions.
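As a sketch, the ID computation looks like the following. The function names are illustrative, and `DefaultHasher` from the standard library stands in for BLAKE3 (the real implementation would use the `blake3` crate), but the domain prefixes and the `data || num_refs || ref1 || ...` layout mirror the scheme above:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in hash; the real service uses BLAKE3 (blake3 crate).
fn hash_bytes(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish()
}

// Blob ID: hash over the blob domain prefix plus the raw data.
fn blob_id(data: &[u8]) -> u64 {
    let mut input = b"cas-build-cache:blob:v1:".to_vec();
    input.extend_from_slice(data);
    hash_bytes(&input)
}

// Object ID: object domain prefix, data, then the ref count and each ref.
fn object_id(data: &[u8], refs: &[u64]) -> u64 {
    let mut input = b"cas-build-cache:object:v1:".to_vec();
    input.extend_from_slice(data);
    input.extend_from_slice(&(refs.len() as u64).to_le_bytes());
    for r in refs {
        input.extend_from_slice(&r.to_le_bytes());
    }
    hash_bytes(&input)
}

fn main() {
    let data = b"same bytes";
    // Same data, different domains => different IDs.
    assert_ne!(blob_id(data), object_id(data, &[]));
}
```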


Storage Architecture

Question: How to structure the storage backends?

Decision: Trait-based abstraction with multiple backend implementations

Rationale:

  1. Allows swapping backends without changing the service layer
  2. Enables in-memory backend for testing
  3. Future-proofs for Redis, S3, or other backends

Backend choices:

  • File-based CAS (default): Best for large blobs, leverages OS page cache
  • SQLite KV (default): Excellent for small entries, atomic transactions
  • SQLite CAS (optional): Single-file storage, easier backup
  • In-memory (for testing): Fast, no persistence
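A minimal sketch of the trait-based abstraction, shown with the in-memory backend. The trait and method names here are illustrative; the real service layer would use async methods (tokio) and content-hash IDs rather than arbitrary strings:

```rust
use std::collections::HashMap;

// Illustrative backend trait; swapping implementations leaves the
// service layer unchanged.
trait CasBackend {
    fn put(&mut self, id: String, data: Vec<u8>);
    fn get(&self, id: &str) -> Option<&[u8]>;
}

// In-memory backend, as used for testing: fast, no persistence.
struct MemoryCas {
    entries: HashMap<String, Vec<u8>>,
}

impl MemoryCas {
    fn new() -> Self {
        MemoryCas { entries: HashMap::new() }
    }
}

impl CasBackend for MemoryCas {
    fn put(&mut self, id: String, data: Vec<u8>) {
        self.entries.insert(id, data);
    }
    fn get(&self, id: &str) -> Option<&[u8]> {
        self.entries.get(id).map(|v| v.as_slice())
    }
}

fn main() {
    let mut cas = MemoryCas::new();
    cas.put("aabbcc".to_string(), b"payload".to_vec());
    assert_eq!(cas.get("aabbcc"), Some(&b"payload"[..]));
    assert_eq!(cas.get("missing"), None);
}
```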

File-based CAS Structure

Question: How to organize files in the filesystem?

Decision: Multi-level directory hierarchy (configurable 1-3 levels)

Structure:

cas/
  aa/          <- first byte of hash
    bb/        <- second byte of hash
      aabbcc...full_hash.blob
      aabbdd...full_hash.obj

Rationale:

  1. Prevents millions of files in a single directory (macOS/HFS+ performance issue)
  2. 256 subdirectories per level is manageable
  3. 2 levels (65536 directories) is the default - good balance for most caches
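The sharded path falls directly out of the hex-encoded hash, with each directory level consuming one byte (two hex characters). A sketch, with an assumed function name and signature:

```rust
use std::path::PathBuf;

// Build a sharded path like cas/aa/bb/<full_hash>.blob for depth 2.
// `depth` is the configurable number of directory levels (1-3).
fn blob_path(root: &str, hash_hex: &str, depth: usize, ext: &str) -> PathBuf {
    let mut path = PathBuf::from(root);
    for level in 0..depth {
        // Each level uses the next byte (two hex chars) of the hash.
        path.push(&hash_hex[level * 2..level * 2 + 2]);
    }
    path.push(format!("{hash_hex}.{ext}"));
    path
}

fn main() {
    let p = blob_path("cas", "aabbccdd", 2, "blob");
    assert_eq!(p.to_str().unwrap(), "cas/aa/bb/aabbccdd.blob");
}
```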

KV Value Conflict Handling (Poisoning)

Question: How to handle when the same action key gets different values?

Decision: Lenient mode by default, with optional strict mode

Options:

  1. Strict: Reject new value, keep old one, return error
  2. Lenient (default): Keep first value, log warning, no error
  3. Overwrite: Last write wins (dangerous for builds)

Rationale: Strict mode can cause build failures for legitimately non-deterministic actions (timestamps, build IDs, etc.). Lenient mode is safer for production: the first stored value keeps being served, so the caching benefit is preserved.
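The three modes can be sketched as follows; the names and signature are illustrative, and a plain `HashMap` stands in for the SQLite KV backend:

```rust
use std::collections::HashMap;

// Illustrative conflict modes for the KV (action cache) store.
#[derive(Clone, Copy)]
enum ConflictMode {
    Strict,    // reject the new value and return an error
    Lenient,   // keep the first value, log a warning, no error
    Overwrite, // last write wins (dangerous for builds)
}

// Hypothetical put-with-conflict-handling over a plain map.
fn kv_put(
    store: &mut HashMap<String, Vec<u8>>,
    mode: ConflictMode,
    key: String,
    value: Vec<u8>,
) -> Result<(), String> {
    // Does an entry with a *different* value already exist?
    let conflict = store.get(&key).map_or(false, |old| *old != value);
    if !conflict {
        store.insert(key, value); // no entry yet, or identical value
        return Ok(());
    }
    match mode {
        ConflictMode::Strict => Err(format!("value conflict for key {key}")),
        ConflictMode::Lenient => {
            eprintln!("warning: conflicting value for {key}; keeping first");
            Ok(())
        }
        ConflictMode::Overwrite => {
            store.insert(key, value);
            Ok(())
        }
    }
}

fn main() {
    let mut kv = HashMap::new();
    kv_put(&mut kv, ConflictMode::Lenient, "k".into(), b"a".to_vec()).unwrap();
    // Lenient: the conflicting second write succeeds but is ignored.
    kv_put(&mut kv, ConflictMode::Lenient, "k".into(), b"b".to_vec()).unwrap();
    assert_eq!(kv["k"], b"a".to_vec());
}
```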


Eviction Strategy

Question: How to handle cache eviction?

Decision: LRU-based eviction with optional TTL, using high/low water marks

Implementation:

  1. Check cache size periodically (configurable interval, default 5 minutes)
  2. If size > high_water_mark (95%), start evicting
  3. Evict KV entries first (action cache), then CAS entries
  4. Stop when size < low_water_mark (85%)
  5. Optionally apply TTL-based eviction for old entries

Rationale:

  • Evicting KV entries first makes CAS entries "orphaned" and thus safe to evict
  • Water marks prevent constant eviction churn
  • LRU is simple and effective; reachability-based GC is more complex and requires parsing all KV values to find CAS references
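A single eviction pass with water marks can be sketched like this; entries are modeled as (id, last_access, size) tuples, whereas the real implementation reads these from the storage backends:

```rust
// Illustrative LRU eviction pass between high and low water marks.
fn evict(entries: &mut Vec<(String, u64, u64)>, max_bytes: u64) -> Vec<String> {
    let high = max_bytes * 95 / 100; // high water mark: start evicting
    let low = max_bytes * 85 / 100;  // low water mark: stop evicting
    let mut total: u64 = entries.iter().map(|e| e.2).sum();
    let mut evicted = Vec::new();
    if total <= high {
        return evicted; // below the high water mark: nothing to do
    }
    // Least recently used first.
    entries.sort_by_key(|e| e.1);
    while total > low && !entries.is_empty() {
        let (id, _, size) = entries.remove(0);
        total -= size;
        evicted.push(id);
    }
    evicted
}

fn main() {
    let mut entries = vec![
        ("old".to_string(), 1, 50), // accessed longest ago, 50 bytes
        ("mid".to_string(), 2, 30),
        ("new".to_string(), 3, 20),
    ];
    // 100 bytes used against a 100-byte cap: over 95%, evict to under 85%.
    let gone = evict(&mut entries, 100);
    assert_eq!(gone, vec!["old".to_string()]);
}
```

The gap between the two marks is what prevents churn: one pass frees a meaningful amount of space instead of evicting a single entry on every check.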

write_to_disk Optimization

Question: How to implement the write_to_disk optimization in proto?

Decision:

  1. Accept file_path in requests (read file, never store path)
  2. Optionally return file_path in responses when client sets write_to_disk=true
  3. Export files to a dedicated directory with unique names
  4. Client is responsible for moving/deleting the exported file

Rationale: The proto spec notes this is optional ("the service can still choose to return bytes"). For local Unix socket access, file_path can be a significant performance win for large blobs by avoiding gRPC message serialization.
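One way the service might choose between inline bytes and an exported path is sketched below; the response type, the size threshold, and the file-naming scheme are all assumptions, not the actual proto definitions:

```rust
use std::path::PathBuf;

// Hypothetical response shape: either inline bytes or an exported path.
enum LoadResponse {
    Bytes(Vec<u8>),
    FilePath(PathBuf),
}

// The service may honor write_to_disk only for large blobs and still
// return bytes inline for small ones, as the proto allows.
fn respond(data: Vec<u8>, write_to_disk: bool, export_dir: &str, id: &str) -> LoadResponse {
    const INLINE_LIMIT: usize = 64 * 1024; // assumed cutoff, not from the spec
    if write_to_disk && data.len() > INLINE_LIMIT {
        // A real implementation would write `data` to this path under a
        // unique name; the client then moves or deletes the file.
        LoadResponse::FilePath(PathBuf::from(export_dir).join(format!("{id}.export")))
    } else {
        LoadResponse::Bytes(data)
    }
}

fn main() {
    let small = respond(vec![0u8; 16], true, "/tmp/exports", "aabb");
    assert!(matches!(small, LoadResponse::Bytes(_)));
    let big = respond(vec![0u8; 128 * 1024], true, "/tmp/exports", "aabb");
    assert!(matches!(big, LoadResponse::FilePath(_)));
}
```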


Socket Permissions

Question: What permissions should the Unix socket have?

Decision: Mode 0600 (owner read/write only)

Rationale: Build caches may contain proprietary compiled code. Restricting access to the owner prevents other users on shared systems from accessing cached artifacts.
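Restricting the socket needs nothing beyond the standard library; the sketch below demonstrates the call on an ordinary file, since the same `set_permissions` applies to the socket path once it is bound:

```rust
use std::fs;
use std::os::unix::fs::PermissionsExt;
use std::path::Path;

// After binding the Unix socket, restrict the socket file to the
// owning user (mode 0600).
fn restrict_to_owner(path: &Path) -> std::io::Result<()> {
    fs::set_permissions(path, fs::Permissions::from_mode(0o600))
}

fn main() {
    let path = std::env::temp_dir().join("cache.sock.demo");
    fs::write(&path, b"").unwrap();
    restrict_to_owner(&path).unwrap();
    let mode = fs::metadata(&path).unwrap().permissions().mode() & 0o777;
    assert_eq!(mode, 0o600);
    fs::remove_file(&path).unwrap();
}
```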


Configuration

Question: How to handle configuration?

Decision: TOML file with sensible defaults for zero-configuration local use

Default locations checked:

  1. ~/.config/cas-build-cache/config.toml
  2. /etc/cas-build-cache/config.toml
  3. ./config.toml

Defaults:

  • Socket: ~/.local/state/cas-build-cache/cache.sock
  • Storage: ~/Library/Caches/cas-build-cache/ (macOS standard)
  • Max size: 50GB
  • CAS backend: File
  • KV backend: SQLite

Additional Requirements Identified

During implementation, the following requirements were added based on the spec analysis:

  1. Tagging system: Added tag field to all entries to support batch operations (e.g., delete all entries for a specific project or user)

  2. Verification on read: Optional hash verification when loading entries, useful for detecting corruption

  3. Export directory cleanup: The spec mentions cleaning up abandoned exports; implemented via TTL-based cleanup

  4. Concurrent access: SQLite WAL mode and proper locking for concurrent builds

  5. Corruption handling: Delete corrupt entries on read and return "not found" (treat as cache miss, let build system recompute)
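Verification on read (item 2) and corruption handling (item 5) combine naturally into a sketch like this, with `DefaultHasher` again standing in for BLAKE3 and the function names illustrative:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in for BLAKE3 in this sketch.
fn content_hash(data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

// Verify-on-read: recompute the hash and treat a mismatch as a miss.
// A real backend would also delete the corrupt entry so the build
// system recomputes and re-stores it.
fn load_verified(expected: u64, data: Vec<u8>) -> Option<Vec<u8>> {
    if content_hash(&data) == expected {
        Some(data)
    } else {
        None // corrupt: report "not found"
    }
}

fn main() {
    let data = b"artifact".to_vec();
    let id = content_hash(&data);
    assert!(load_verified(id, data.clone()).is_some());
    let mut corrupt = data;
    corrupt[0] ^= 0xff; // flip bits to simulate on-disk corruption
    assert!(load_verified(id, corrupt).is_none());
}
```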


Open Questions (For Future Work)

  1. Reachability-based GC: Currently using simple LRU. Full reachability tracking would require parsing KV values to identify which are CAS IDs - the protocol doesn't define a standard format for this.

  2. Remote backend: The current implementation is local-only. A Redis or S3 backend would enable distributed caching.

  3. Authentication: No authentication is implemented. For shared caches, this might be needed.

  4. Metrics/Observability: Basic logging is implemented; Prometheus metrics could be added for production monitoring.

  5. Prefetching: The spec mentions Tuist is exploring prefetching artifacts before they're needed - this could be a future optimization.