Production hardening, C2SP Phase 0, and fly.io deployment #4
Open
wolfv wants to merge 3 commits into
Security and correctness:
- Fix rate limiter: install ConnectInfo (every request previously failed with 500), use SmartIpKeyExtractor behind proxies, spawn the cleanup task, keep Retry-After headers, and configure the token replenish interval correctly (per_second(n) means one token per n seconds, not n/s) with RATE_LIMIT_* env overrides
- Witness: compare-and-swap state updates so concurrent requests can never cosign two different roots at the same size (split-view TOCTOU); reject sizes above i64::MAX that would defeat rollback protection
- Verify external witness cosignatures against pinned keys before they count toward publication (EXTERNAL_WITNESSES=name=url=vkey)
- Vindex WAL: CRC32-checksummed v3 format, torn-tail truncation on recovery, single-write entries, idempotent index_entry, and vindex failures abort the integration cycle before entries are marked integrated

C2SP conformance (Phase 0):
- cosignature/v1 witness signatures (timestamped 76-byte blobs, alg-0x04 key IDs) per c2sp.org/tlog-cosignature, pinned against the spec's example vector; witness-conformance suite passes 28/28
- Quorum-based checkpoint publishing (WITNESS_QUORUM)

Vindex memory:
- Periodic snapshots (CRC'd, atomic rename) truncate the WAL and bound startup replay (VINDEX_SNAPSHOT_INTERVAL); missing/corrupt/behind state auto-rebuilds from entry bundles in tile storage

Ops:
- SIGTERM-aware graceful shutdown, fail-fast worker supervision, atomic filesystem tile writes (temp+rename), request timeouts, witness rate limiting, non-root containers, docker-compose fixes (vindex WAL path, secrets via environment), strict tile-path digits, release overflow-checks

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
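For reviewers, the witness compare-and-swap described in this commit can be pictured roughly as follows. This is a minimal std-only sketch, not the code in this PR: the names Witness, Cosigned, and UpdateError are hypothetical, a Mutex stands in for whatever store actually holds the witness state, and consistency-proof and signature checks are left out. The i64::MAX bound is shown under the assumption that the size is persisted as a signed 64-bit integer.

```rust
use std::sync::Mutex;

/// Hypothetical witness state: the latest cosigned (size, root) pair.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Cosigned {
    pub size: u64,
    pub root: [u8; 32],
}

#[derive(Debug, PartialEq, Eq)]
pub enum UpdateError {
    /// Size does not fit in a signed 64-bit integer (assumed storage type).
    SizeTooLarge,
    /// The caller's old size is stale; it should re-read and retry.
    Conflict { current: u64 },
    /// The new size is smaller than what was already cosigned.
    Rollback,
    /// Same size but a different root: a split view.
    RootMismatch,
}

pub struct Witness {
    state: Mutex<Cosigned>,
}

impl Witness {
    /// Start from an empty placeholder state (size 0, all-zero root).
    pub fn new() -> Self {
        Witness {
            state: Mutex::new(Cosigned { size: 0, root: [0u8; 32] }),
        }
    }

    /// Check and swap under one lock, so two concurrent callers that both
    /// read the same old size cannot both succeed with different roots.
    pub fn update(&self, old_size: u64, new: Cosigned) -> Result<(), UpdateError> {
        if new.size > i64::MAX as u64 {
            return Err(UpdateError::SizeTooLarge);
        }
        let mut cur = self.state.lock().unwrap();
        if cur.size != old_size {
            return Err(UpdateError::Conflict { current: cur.size });
        }
        if new.size < cur.size {
            return Err(UpdateError::Rollback);
        }
        if new.size == cur.size && new.root != cur.root {
            return Err(UpdateError::RootMismatch);
        }
        *cur = new;
        Ok(())
    }
}
```

The point of the sketch is that the comparison against the stored size and the write happen in one critical section; the loser of a race gets Conflict with the current size instead of a second cosignature.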
- fly.toml: single machine + volume (SQLite state, vindex WAL/snapshot) with tiles on S3, vindex enabled, quorum-ready config
- scripts/bench.py: end-to-end soak test measuring write throughput and latency, integration and checkpoint lag, read-path latencies, and per-entry vindex correctness; caught the rate-limiter replenish bug on its first run
- docs/REMAINING_WORK.md: vindex root anchoring design, ingest validation/dedup plan, CEP text fixes, scale limits, benchmark results from the fly.io staging deployment

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
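The replenish bug that the soak test caught comes down to interval semantics: a value meant as "n requests per second" was being read as "one token every n seconds". The following is a small std-only token bucket that makes the difference visible. It is an illustrative sketch only; TokenBucket and interval_for_rate are hypothetical names and this is not the rate-limiting library's implementation.

```rust
use std::time::Duration;

/// Minimal token bucket. `replenish_every` is the time needed to earn ONE token.
pub struct TokenBucket {
    burst: u32,
    tokens: u32,
    replenish_every: Duration,
    /// Time accumulated toward the next token.
    carry: Duration,
}

impl TokenBucket {
    /// Start with a full bucket of `burst` tokens.
    pub fn new(replenish_every: Duration, burst: u32) -> Self {
        TokenBucket { burst, tokens: burst, replenish_every, carry: Duration::ZERO }
    }

    /// Simulate the passage of time and add any tokens earned, up to `burst`.
    pub fn advance(&mut self, elapsed: Duration) {
        self.carry += elapsed;
        while self.tokens < self.burst && self.carry >= self.replenish_every {
            self.carry -= self.replenish_every;
            self.tokens += 1;
        }
        if self.tokens == self.burst {
            // A full bucket does not bank extra time.
            self.carry = Duration::ZERO;
        }
    }

    /// Take one token if available.
    pub fn try_acquire(&mut self) -> bool {
        if self.tokens == 0 {
            return false;
        }
        self.tokens -= 1;
        true
    }
}

/// Replenish interval that yields `requests_per_second` tokens each second.
pub fn interval_for_rate(requests_per_second: u32) -> Duration {
    assert!(requests_per_second > 0, "rate must be positive");
    Duration::from_secs(1) / requests_per_second
}
```

With a burst of 10, a bucket built with a 10 s interval is still empty one second after being drained, while one built with interval_for_rate(10) (100 ms per token) has refilled completely. Under sustained load the first configuration looks like a hard failure, which is presumably why a soak test surfaces it immediately.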
Building a log from existing data through POST /add pays incremental integration costs (batch-by-batch acks, each partial tile rewritten up to 256 times), measured at ~36 entries/s against S3. The importer builds the tree in one pass instead:
- Streams pre-normalized JSONL entries in tile-aligned chunks through the same integrate() tree builder as the live path, so the resulting tree is byte-identical to incremental integration (tested)
- Uploads tiles and entry bundles concurrently (200k entries -> 1,571 objects at ~5,500 entries/s locally)
- Builds the vindex in the same pass, finishing with a snapshot
- Commits the database state only after all objects are durable, and refuses non-empty logs (never fork); --resume skips existing objects
- Signs the initial checkpoint; the live server continues incrementally from the imported state (verified end-to-end with real conda repodata)
- Optional --epoch-note marker entry records what the bootstrap covers

conda-log-ingest gains --jsonl-out to convert conda repodata into the importer's input format, and --api-key for authenticated submission.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
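As a rough illustration of the tile arithmetic behind the one-pass import, the sketch below counts hash tiles per level and splits an entry range into tile-aligned chunks. It assumes the C2SP tlog-tiles layout with 256 entries per tile; the function names and the chunking policy are hypothetical and are not taken from the importer.

```rust
/// Entries (and hashes) per full tile, assuming the C2SP tlog-tiles layout.
pub const TILE_WIDTH: u64 = 256;

/// Ceiling division by the tile width, without overflow.
fn tiles_needed(items: u64) -> u64 {
    items / TILE_WIDTH + u64::from(items % TILE_WIDTH != 0)
}

/// Number of hash tiles (full plus at most one partial) at each tile level,
/// starting from level 0.
pub fn tiles_per_level(tree_size: u64) -> Vec<u64> {
    let mut counts = Vec::new();
    let mut hashes = tree_size;
    while hashes > 0 {
        counts.push(tiles_needed(hashes));
        // Each full tile contributes one hash to the level above.
        hashes /= TILE_WIDTH;
    }
    counts
}

/// Number of entry bundles (one per level-0 tile).
pub fn entry_bundles(tree_size: u64) -> u64 {
    tiles_needed(tree_size)
}

/// Split `[0, tree_size)` into half-open chunks of `tiles_per_chunk` full
/// tiles, so every boundary except possibly the last is a multiple of 256.
pub fn tile_aligned_chunks(tree_size: u64, tiles_per_chunk: u64) -> Vec<(u64, u64)> {
    assert!(tiles_per_chunk > 0, "chunk must span at least one tile");
    let step = tiles_per_chunk * TILE_WIDTH;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < tree_size {
        let end = (start + step).min(tree_size);
        chunks.push((start, end));
        start = end;
    }
    chunks
}
```

For 200,000 entries this gives 782 + 4 + 1 = 787 hash tiles and 782 entry bundles, 1,569 objects in total, which is close to the 1,571 reported above; the sketch does not model the remaining objects. Because chunk boundaries fall on tile boundaries, only the final chunk can produce a partial tile, so each object is written once rather than rewritten as entries trickle in.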
Summary
Full production-readiness pass plus Phase 0 protocol conformance work, verified end-to-end against a live fly.io deployment.
Security & correctness fixes
- Rate limiter: it relied on ConnectInfo, which was never installed, so every request to every endpoint returned 500. Also fixes the token-replenish semantics (per_second(n) means one token per n seconds), adds SmartIpKeyExtractor for proxy deployments, the required cleanup task, and RATE_LIMIT_* env overrides.
- Witness: state updates are compare-and-swap, so concurrent add-checkpoint requests can never get the witness to cosign two different roots at the same tree size.
- External witness cosignatures are verified against pinned keys (EXTERNAL_WITNESSES=name=url=vkey) before counting toward the publication quorum.
C2SP Phase 0
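- cosignature/v1 witness signatures per c2sp.org/tlog-cosignature; the witness-conformance suite passes 28/28.
- Quorum-based checkpoint publishing (WITNESS_QUORUM): one unavailable witness no longer halts the log.
Vindex memory fix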
- Periodic snapshots (VINDEX_SNAPSHOT_INTERVAL, default 100k entries) truncate the WAL, bounding WAL growth and startup replay. Missing/corrupt/behind state auto-rebuilds from entry bundles in tile storage.
Ops
- SIGTERM handling, fail-fast worker supervision, atomic fs tile writes, request timeouts, non-root containers, docker-compose fixes, overflow-checks in release.
Deployment & benchmarking
- fly.toml for the staging deployment (deployed to conda-transparency-log.fly.dev with old data cleared)
- scripts/bench.py end-to-end soak test; docs/REMAINING_WORK.md with the open design items (vindex root anchoring, ingest validation/dedup, CEP text fixes) and benchmark results
Test plan
🤖 Generated with Claude Code