Production hardening, C2SP Phase 0, and fly.io deployment #4

Open
wolfv wants to merge 3 commits into main from production-hardening-phase0

wolfv (Member) commented Jul 3, 2026

Summary

Full production-readiness pass plus Phase 0 protocol conformance work, verified end-to-end against a live fly.io deployment.

Security & correctness fixes

  • Rate limiter was completely broken: the key extractor required ConnectInfo which was never installed, so every request to every endpoint returned 500. Also fixes the token-replenish semantics (per_second(n) means one token per n seconds), adds SmartIpKeyExtractor for proxy deployments, the required cleanup task, and RATE_LIMIT_* env overrides.
  • Witness split-view TOCTOU: state updates are now compare-and-swap under a row lock, so two concurrent add-checkpoint requests can never get the witness to cosign two different roots at the same tree size.
  • External witness cosignatures are now verified against pinned keys (EXTERNAL_WITNESSES=name=url=vkey) before counting toward the publication quorum.
  • Vindex WAL crash recovery: CRC32-checksummed entries (v3 format, v2 still readable), torn-tail truncation (previously a torn write bricked startup permanently), idempotent indexing, and an integration cycle that aborts before entries are marked integrated if the vindex fails.
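
The replenish semantics from the first bullet can be illustrated with a minimal token bucket. This is a plain-Python sketch, not the Rust limiter the PR configures; the `TokenBucket` class and its `period_s` and `burst` parameters are hypothetical names chosen for illustration. The point is that the configured number is the interval per token, not a rate.

```python
class TokenBucket:
    """Token bucket that regains one token every `period_s` seconds (hypothetical sketch)."""

    def __init__(self, period_s: float, burst: int) -> None:
        self.period_s = period_s    # seconds needed to replenish ONE token
        self.burst = burst          # maximum number of tokens the bucket holds
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # timestamp of the last refill, in seconds

    def allow(self, now: float) -> bool:
        """Refill for the elapsed time, then try to spend one token."""
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed / self.period_s)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With `period_s=2.0` and `burst=1`, a request at t=0 is allowed, one at t=1 is rejected, and one at t=2 is allowed again. Under this reading, a value intended as "n requests per second" throttles far harder than intended.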

C2SP Phase 0

  • cosignature/v1 witness signatures per c2sp.org/tlog-cosignature: timestamped 76-byte blobs, alg-0x04 key IDs, parsing pinned against the spec's example vector. The witness-conformance suite passes 28/28.
  • Quorum publishing (WITNESS_QUORUM): one unavailable witness no longer halts the log.
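
Quorum publishing combined with pinned-key verification can be sketched as a simple count. This is an illustrative Python sketch under stated assumptions: `quorum_reached`, its arguments, and the `verify` callable are hypothetical, and a real verifier would check an Ed25519 signature over the cosignature/v1 message rather than call a stand-in.

```python
from typing import Callable, Dict


def quorum_reached(
    cosignatures: Dict[str, bytes],
    pinned_keys: Dict[str, bytes],
    verify: Callable[[bytes, bytes], bool],
    quorum: int,
) -> bool:
    """Count only cosignatures that verify against the key pinned for that witness."""
    valid = 0
    for name, signature in cosignatures.items():
        key = pinned_keys.get(name)
        if key is None:
            continue  # witness is not pinned, so it never counts
        if verify(key, signature):
            valid += 1
    return valid >= quorum
```

A witness that is down simply contributes no entry to `cosignatures`, so publication proceeds as long as the remaining valid cosignatures still reach `quorum`.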

Vindex memory fix

  • Periodic snapshots truncate the WAL (VINDEX_SNAPSHOT_INTERVAL, default 100k entries), bounding WAL growth and startup replay. If the vindex state is missing, corrupt, or behind the log, it is rebuilt automatically from entry bundles in tile storage.

Ops

  • SIGTERM handling, fail-fast worker supervision, atomic filesystem tile writes, request timeouts, non-root containers, docker-compose fixes, and overflow-checks enabled in release builds.

Deployment & benchmarking

  • fly.toml for the staging deployment (deployed to conda-transparency-log.fly.dev with old data cleared)
  • scripts/bench.py end-to-end soak test; docs/REMAINING_WORK.md with the open design items (vindex root anchoring, ingest validation/dedup, CEP text fixes) and benchmark results

Test plan

  • 149 workspace tests pass, zero warnings
  • witness-conformance: 28/28
  • Live bench on fly.io: 10k entries at concurrency 96 — zero errors, all writes integrated and checkpointed, 500/500 sampled vindex lookups verified at their assigned indices

🤖 Generated with Claude Code

wolfv and others added 3 commits July 3, 2026 10:14
Security and correctness:
- Fix rate limiter: install ConnectInfo (every request previously failed
  with 500), use SmartIpKeyExtractor behind proxies, spawn the cleanup
  task, keep Retry-After headers, and configure the token replenish
  interval correctly (per_second(n) means one token per n seconds, not
  n/s) with RATE_LIMIT_* env overrides
- Witness: compare-and-swap state updates so concurrent requests can
  never cosign two different roots at the same size (split-view TOCTOU);
  reject sizes above i64::MAX that would defeat rollback protection
- Verify external witness cosignatures against pinned keys before they
  count toward publication (EXTERNAL_WITNESSES=name=url=vkey)
- Vindex WAL: CRC32-checksummed v3 format, torn-tail truncation on
  recovery, single-write entries, idempotent index_entry, and vindex
  failures abort the integration cycle before entries are marked
  integrated
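
The compare-and-swap idea behind the witness fix can be sketched with a conditional UPDATE. This is a Python/SQLite illustration, not the PR's code: the `witness_state` table, its columns, and the `cas_update` helper are hypothetical, and SQLite has no row locks, so the sketch relies on the single UPDATE statement being atomic.

```python
import sqlite3


def init_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical witness state table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE witness_state ("
        "origin TEXT PRIMARY KEY, size INTEGER NOT NULL, root BLOB NOT NULL)"
    )
    return conn


def cas_update(conn, origin, old_size, old_root, new_size, new_root) -> bool:
    """Advance the state only if it still matches what the caller verified against."""
    with conn:  # one transaction: commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE witness_state SET size = ?, root = ? "
            "WHERE origin = ? AND size = ? AND root = ?",
            (new_size, new_root, origin, old_size, old_root),
        )
    return cur.rowcount == 1  # False means another request won the race
```

Two requests that both verified against size 5 and try to move to size 8 with different roots cannot both succeed: the second UPDATE matches no row, so the caller must refuse to cosign.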

C2SP conformance (Phase 0):
- cosignature/v1 witness signatures (timestamped 76-byte blobs, alg-0x04
  key IDs) per c2sp.org/tlog-cosignature, pinned against the spec's
  example vector; witness-conformance suite passes 28/28
- Quorum-based checkpoint publishing (WITNESS_QUORUM)

Vindex memory:
- Periodic snapshots (CRC'd, atomic rename) truncate the WAL and bound
  startup replay (VINDEX_SNAPSHOT_INTERVAL); missing/corrupt/behind
  state auto-rebuilds from entry bundles in tile storage
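
Checksummed WAL entries with torn-tail truncation can be sketched as follows. The record layout here (little-endian length, CRC32, payload) is an assumption for illustration and is not the v3 format from this commit; `encode_entry` and `recover` are hypothetical names.

```python
import struct
import zlib

# Hypothetical record header: payload length, then CRC32 of the payload.
HEADER = struct.Struct("<II")


def encode_entry(payload: bytes) -> bytes:
    """Frame one payload as header + payload."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def recover(wal: bytes):
    """Return (valid payloads, byte offset where the valid prefix ends)."""
    entries, offset = [], 0
    while offset + HEADER.size <= len(wal):
        length, crc = HEADER.unpack_from(wal, offset)
        start = offset + HEADER.size
        payload = wal[start:start + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # torn or corrupt record: stop replaying here
        entries.append(payload)
        offset = start + length
    return entries, offset
```

On startup the file would be truncated to the returned offset, so a partially written final record is discarded instead of blocking recovery. This sketch treats the first bad record as the end of the log; a production implementation may want to distinguish a torn tail from corruption in the middle.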

Ops:
- SIGTERM-aware graceful shutdown, fail-fast worker supervision,
  atomic filesystem tile writes (temp+rename), request timeouts,
  witness rate limiting, non-root containers, docker-compose fixes
  (vindex WAL path, secrets via environment), strict tile-path digits,
  release overflow-checks
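
The temp+rename pattern for atomic tile writes looks roughly like this. It is a generic Python sketch of the technique, not the PR's Rust implementation; `atomic_write` is a hypothetical helper name.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Write `data` so readers see the old file or the new one, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make the bytes durable before the rename
        os.replace(tmp, path)  # atomic replacement on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)  # do not leave stray temp files behind
        raise
```

A crash before `os.replace` leaves at most a stray temp file, and the destination path still holds the previous complete tile.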

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- fly.toml: single machine + volume (SQLite state, vindex WAL/snapshot)
  with tiles on S3, vindex enabled, quorum-ready config
- scripts/bench.py: end-to-end soak test measuring write throughput and
  latency, integration and checkpoint lag, read-path latencies, and
  per-entry vindex correctness; caught the rate-limiter replenish bug
  on its first run
- docs/REMAINING_WORK.md: vindex root anchoring design, ingest
  validation/dedup plan, CEP text fixes, scale limits, benchmark
  results from the fly.io staging deployment

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Building a log from existing data through POST /add pays incremental
integration costs (batch-by-batch acks, each partial tile rewritten up
to 256 times) — measured at ~36 entries/s against S3. The importer
builds the tree in one pass instead:

- Streams pre-normalized JSONL entries in tile-aligned chunks through
  the same integrate() tree builder as the live path, so the resulting
  tree is byte-identical to incremental integration (tested)
- Uploads tiles and entry bundles concurrently (200k entries -> 1,571
  objects at ~5,500 entries/s locally)
- Builds the vindex in the same pass, finishing with a snapshot
- Commits the database state only after all objects are durable, and
  refuses non-empty logs (never fork); --resume skips existing objects
- Signs the initial checkpoint; the live server continues incrementally
  from the imported state (verified end-to-end with real conda repodata)
- Optional --epoch-note marker entry records what the bootstrap covers
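
Tile-aligned chunking can be sketched as below. This is a Python illustration, not the importer's code: it assumes 256 entries per full tile (consistent with the "up to 256 times" figure above), and `tile_aligned_chunks` and `tiles_per_chunk` are hypothetical names.

```python
from itertools import islice
from typing import Iterable, Iterator, List

TILE_WIDTH = 256  # assumed number of entries in a full tile


def tile_aligned_chunks(
    entries: Iterable[bytes], tiles_per_chunk: int = 4
) -> Iterator[List[bytes]]:
    """Yield chunks whose size is a multiple of TILE_WIDTH, except possibly the last."""
    size = TILE_WIDTH * tiles_per_chunk
    it = iter(entries)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk
```

Because every chunk but the last ends on a tile boundary, each full tile is produced once rather than rewritten as a growing partial tile.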

conda-log-ingest gains --jsonl-out to convert conda repodata into the
importer's input format, and --api-key for authenticated submission.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>