Skip to content

feat(lake): content-stable module archives in Lake's artifact cache#13992

Draft
marcelolynch wants to merge 12 commits into
leanprover:masterfrom
marcelolynch:lake-detraced-ltar-cache
Draft

feat(lake): content-stable module archives in Lake's artifact cache#13992
marcelolynch wants to merge 12 commits into
leanprover:masterfrom
marcelolynch:lake-detraced-ltar-cache

Conversation

@marcelolynch

@marcelolynch marcelolynch commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

This PR makes Lake's module archives (.ltar) content-stable and records each module's output content hashes in cache mappings alongside the archive reference. Previously, the trace file packed into an archive carried the build's input hash, input tree, and log (including absolute paths from the lean invocation), so the same outputs produced different archive bytes for every input revision and machine.

Byte-identical module outputs now produce a byte-identical archive independent of the inputs, checkout path, or machine that produced them, so input-only changes (e.g., comment edits in an imported module) upload no new archive bytes and identical outputs deduplicate across revisions on cache services.

Consumers verify unpacked archives against the hashes recorded in their cache mapping entry, recovering from mismatches by rebuilding, and restore outputs directly from the local artifact cache without fetching or unpacking the archive when they are already present.

The archive now packs a canonical input-free trace (depHash := .nil, no inputs, empty log), swapped in only for the duration of the pack; consumers stamp the input hash from the cache mapping into the trace after unpacking (marked synthetic), preserving the invariant that the on-disk trace carries the real input hash.


Implementation details

PUSH  (lake build -o out.jsonl, artifact cache writable)
  ═══════════════════════════════════════════════════════════════════════

   lean compile ──► outputs: .olean[.server/.private] .ilean .ir .c [.bc]
        │
        ▼
   trace file = real metadata {depHash=H(inputs), inputs, log, outputs}
        │
        ▼  packLtar
   ┌───────────────────────────────────────────────┐
   │ 1. swap in canonical trace                    │
   │    {depHash=nil, inputs=[], log={}, outputs}  │   archive bytes =
   │ 2. leantar pack ───► M.ltar                   │   f(outputs only)
   │ 3. restore real trace on disk                 │
   └───────────────────────────────────────────────┘
        │
        ▼
   local CAS:  artifacts/<C>.ltar            C = content hash of bundle
        │
        ▼  trackOutputsIfEnabled
   mapping entry   [ H ,  "<C>.ltar" ,  {o,i,c,r,…} ]
                     │        │              │
                inputHash   data        recorded output hashes
                          (upload       (metadata only —
                           target)       never uploaded)
        │
        ▼  lake cache stage / put
   1. upload bundles   (collected from the data field only)
   2. upload mapping   (after artifacts: mapping present ⇒ blobs present)


  LOOKUP  (fetchFromCache?, consumer)
  ═══════════════════════════════════════════════════════════════════════

   inputHash H ──► mapping entry for H?
        │
        ├─ none ───────────────────────────► local trace up-to-date?
        │                                      yes: replay · no: BUILD
        ├─ full descrs (local entry)
        │      └─► resolve each from CAS ──► hit: DONE · miss: BUILD
        │
        └─ bundle ref "<C>.ltar" + recorded outputs
               │
               ▼  (A)  all recorded outputs in LOCAL CAS?
               ├─ yes ──► serve from CAS        (no fetch, no unpack)
               │
               └─ no
                    ▼  resolve bundle: local CAS, else download <C>.ltar
                    │                  (bytes hash-verified against C)
                    ▼  unpack into build dir
                    ▼  read unpacked trace
                    verify  trace outputs == recorded outputs ?
                    ├─ no ──► warn "cache integrity error"
                    │         discard outputs + trace ──► BUILD
                    │         (rebuild overwrites the entry: self-heals)
                    └─ yes ─► stamp H into trace (synthetic)
                              cache unpacked outputs individually
                              (next lookup takes the (A) fast path)

   Old entries (no recorded outputs): same bundle path, verification skipped.
   Old bundles (real depHash inside): stamp is a no-op pass-through.

Cache mapping lines gain an optional third element carrying the module's output descriptions ([inputHash, "<hash>.ltar", {outputs}]), and locally cached output entries gain a corresponding outputs field. Only the bundle remains an upload target; the recorded hashes are descriptive metadata. The recorded outputs serve two purposes on the consumer side:

  • Integrity: after unpacking a bundle, the outputs declared by its embedded trace are checked against the mapping entry's record, so a mapping entry whose archive reference and recorded outputs disagree (e.g., wired to another module's bundle, or corrupted) is rejected rather than silently trusted. Verification runs before the input hash is stamped, a rejected unpack is discarded and treated as a cache miss with a warning, and the subsequent rebuild overwrites the offending mapping entry, so a corrupt cache entry self-heals instead of failing the build; --wfail restores strict failure for CI.
  • Reuse: when the mapping entry records the output hashes and all of them are already present in the local artifact cache, the outputs are restored directly from it (local-only lookup, since individual outputs are not uploaded alongside bundles) and the bundle is neither downloaded nor unpacked; the bundle remains the fallback whenever any output is missing.

Compatibility

The change is compatible in both directions: older Lake versions ignore the extra mapping element and the outputs field, mappings and archives produced by older Lake versions are consumed as before (archives with an embedded input hash pass through unchanged, mapping entries without recorded outputs skip verification), and mixed-version cache sharing is unaffected since module input hashes already incorporate the toolchain.

Validation

A/B results on Batteries (187 modules, 306 MB of artifacts), feature branch vs. a stage1 built from the merge-base — same compiler, only Lake differs:

  │                    Scenario                    │   Feature    │   Baseline    │            Verdict             │
  ├────────────────────────────────────────────────┼──────────────┼───────────────┼────────────────────────────────┤
  │ Pack-only lake build -o (187 packs)            │ 2.22 s       │ 2.22 s        │ trace swap ×187: free          │
  ├────────────────────────────────────────────────┼──────────────┼───────────────┼────────────────────────────────┤
  │ Cold consume (every added check runs:          │ 0.55 s / 187 │ 0.54 s / 187  │ added cost ≈ 50 µs/module,     │
  │ probe-miss, verify, stamp)                     │  spawns      │ spawns        │ below noise                    │
  ├────────────────────────────────────────────────┼──────────────┼───────────────┼────────────────────────────────┤
  │ Adversarial worst case (cache warm except .c,  │ 0.66 s       │ —             │ identical to plain cold        │
  │ probe pays max stats then full fallback)       │              │               │ consume in the same session    │
  ├────────────────────────────────────────────────┼──────────────┼───────────────┼────────────────────────────────┤
  │ Warm consume                                   │ 0.13 s / 0   │ 0.44 s / 187  │ restore now equals no-op time  │
  │                                                │ spawns       │ spawns        │                                │
  ├────────────────────────────────────────────────┼──────────────┼───────────────┼────────────────────────────────┤
  │ No-op build                                    │ 0.131 s      │ 0.132 s       │ zero steady-state cost         │
  └────────────────────────────────────────────────┴──────────────┴───────────────┴────────────────────────────────┘

Covered by the new tests/lake/tests/ltarCache test: mapping entry shape, archive byte-stability across an input-only edit, bundle-only staging, fresh consume with verification, rejection of and self-healing from a corrupted mapping entry, backward compatibility with two-element mappings, and the no-unpack restore from a warm artifact cache.

Closes #13996
Closes #13997

marcelolynch and others added 6 commits June 9, 2026 22:19
Pack an input-free trace (`depHash := .nil`, no inputs/log) into each module's
`.ltar`, so the archive's content hash is a pure function of the module's
outputs. Byte-identical outputs then dedup across revisions instead of churning
with the input hash (which cascades to ~all modules every commit).

The input->output binding the bundle used to carry moves into the cache mapping
as an additive `outputs` field (the per-input "receipt"): `data` stays the bare
bundle reference, and the output content hashes ride alongside it (a third JSONL
element / optional `outputs` key). On consume, the bundle is unpacked, the
looked-up input hash is stamped into the nil-depHash trace, and the unpacked
outputs are verified against the receipt's record before use.

Backward/forward compatible: old readers read `data` as before and ignore the
new field; new readers tolerate old receipts (no `outputs` -> no verification).
Staging stays bundle-only because `collectOutputDescrs` walks only `data`.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add `tests/lake/tests/ltarCache` exercising the artifact-cache behavior:

- a cosmetic source edit (changes the input hash, not the outputs) leaves every
  bundle's content hash unchanged — the content-stability property the feature
  exists for (stock Lake, which embeds the input hash in the bundle, would churn
  here, so this would fail without the change);
- the emitted mapping pairs a bundle reference with the recorded output hashes,
  and staging from it collects only bundles (not individual artifacts);
- consuming a cache that holds only bundles + receipts fetches, unpacks, and
  verifies the bundle against the recorded outputs;
- a receipt whose recorded outputs disagree with the bundle is rejected with a
  cache integrity error;
- an older receipt (bundle reference only, no recorded outputs) is consumed
  without verification and without error.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Verify the unpacked bundle against the receipt before stamping the input hash into the trace, so a rejected unpack leaves no trace claiming the input produced those outputs. Treat integrity failures as cache misses (with a warning) instead of fatal errors, letting the build recover by compiling from source and overwriting the offending receipt; `--wfail` restores strict failure for CI. Also warn on ill-formed recorded outputs instead of silently skipping verification, preserve the original trace when re-unpacking a local archive, restore the missing-trace state when packing started without one, and compare output descriptions structurally via `BEq`.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
When a cache mapping pairs an archive bundle with recorded output hashes, resolve the individual artifacts from the local artifact cache first and fall back to the bundle only when some are missing. A consumer with a warm cache then restores a revision's outputs without downloading or unpacking any bundles.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Replace the ad-hoc 'receipt' jargon in module build comments and tests with the mapping-entry terminology the cache layer already uses (`CacheMap.Entry`, mapping lines), rename `receiptDescrs` accordingly, and expand the local-resolution comment with its rationale: a warm artifact cache satisfies a bundle entry without fetching or unpacking, and trusting the recorded outputs is equivalent to resolving an entry that holds only output descriptions.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The restructured insertion is not a pure refactor: it also fixes the pre-existing-archive path, which previously dropped the platform-independence flag. Record that in the comment so the intent survives review.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@marcelolynch

Copy link
Copy Markdown
Contributor Author

Note that the new metadata is included in the mappings to avoid more roundtrips to the cache: given the bounded amount of outputs, this is fine and it doesn't grow a lot (plus there's compression later). If our processes had a huge amount of outputs it might be a better idea to store this separately, but this change can be done eventually without much hassle.

@marcelolynch

marcelolynch commented Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

And some experiments with mathlib (small N=18, arguably) gave me these projections:

Measured over 18 same-toolchain daily pairs across 3 toolchain windows (rc2 / v4.30.0 / rc1); compression 0.063 (bundle ≈51 KB/module); per-commit figures modeled from the depHash cascade (mean 97 MB/commit, ~950 commits/mo).

  ┌────────────────────────────────────────┬────────────┬─────────────────────┬───────────────────────┐
  │                 Metric                 │   Base     │ De-traced           │         Gain          │
  ├────────────────────────────────────────┼────────────┼─────────────────────┼───────────────────────┤
  │ Cache carryover (daily)                │ ~4%        │ ~92%                │ —                     │
  ├────────────────────────────────────────┼────────────┼─────────────────────┼───────────────────────┤
  │ New artifacts / day                    │ ~7,800     │ ~640                │ ~12× fewer            │
  ├────────────────────────────────────────┼────────────┼─────────────────────┼───────────────────────┤
  │ depHash vs output churn /day           │ 90.6%      │ 15.0%               │ —                     │
  ├────────────────────────────────────────┼────────────┼─────────────────────┼───────────────────────┤
  │ Upload — daily cadence                 │ 10.5 GB/mo │ 2.7 GB/mo           │ 7.8 GB/mo (74%, 4.0×) │
  ├────────────────────────────────────────┼────────────┼─────────────────────┼───────────────────────┤
  │ Upload — per-commit cadence            │ ~92 GB/mo  │ ~2.8 GB/mo          │ ~89 GB/mo (~97%)      │
  ├────────────────────────────────────────┼────────────┼─────────────────────┼───────────────────────┤
  │ Storage (30-day retention, daily)      │ ~10.5 GB   │ ~2.8 GB             │ 3.8×                  │
  ├────────────────────────────────────────┼────────────┼─────────────────────┼───────────────────────┤
  │ Storage (30-day retention, per-commit) │ ~92 GB     │ ~2.8 GB             │ ~30×                  │
  ├────────────────────────────────────────┼────────────┼─────────────────────┼───────────────────────┤
  │ Download / warm consumer / daily sync  │ 351 MB     │ 87 MB               │ 4.0×                  │
  └────────────────────────────────────────┴────────────┴─────────────────────┴───────────────────────┘

  Monthly upload — per-commit cadence (~950 builds/mo)
    stock  ~92 GB
    new    ~2.8 GB     ~33×

  Daily catch-up download (per warm consumer)
    stock  351 MB
    new    87 MB         4.0×

marcelolynch and others added 3 commits June 10, 2026 08:46
Remove redundant phrasing, markdown emphasis in line comments, label prefixes, and change-history narration from the comments added on this branch; reword the test narration to plain English.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The field docstring should say when `outputs?` is absent in real data; the constraint that additions to the mapping line must stay trailing and optional belongs where the line is written.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The uniform-insertion remark in trackOutputsIfEnabled described what the single insert call site already shows, and the log-replay note in mkLtarMetadata framed behavior as precedent-matching rather than stating it.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@github-actions github-actions Bot added the toolchain-available A toolchain is available for this PR, at leanprover/lean4-pr-releases:pr-release-NNNN label Jun 10, 2026
@mathlib-lean-pr-testing

Copy link
Copy Markdown

Mathlib CI status (docs):

  • ❗ Batteries/Mathlib CI will not be attempted unless your PR branches off the nightly-with-mathlib branch. Try git rebase d07c9faf4b15d3961b105b7c639dc1664aca6f4e --onto d2f48db30713019241520218d0965227bbb64eed. You can force Mathlib CI using the force-mathlib-ci label. (2026-06-10 16:42:18)

@leanprover-bot

Copy link
Copy Markdown
Collaborator

Reference manual CI status:

  • ❗ Reference manual CI will not be attempted unless your PR branches off the nightly-with-manual branch. Try git rebase d07c9faf4b15d3961b105b7c639dc1664aca6f4e --onto 803553a556fd82fa1060efb0c43eda542130cb16. You can force reference manual CI using the force-manual-ci label. (2026-06-10 16:42:19)

marcelolynch and others added 3 commits June 10, 2026 10:33
When a mapping entry's recorded outputs are all served from the local artifact cache, also attach the locally cached bundle so that a subsequent mapping-producing build rehashes it instead of repacking every module. On a 100-module project this turns the first `-o` build after a no-unpack restore from one leantar invocation per module (~2.5s) into none (no-op time).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
`lake cache unstage` copies artifacts into the local cache unconditionally, and artifacts cached by builds are read-only, so unstaging over a cache that already holds them failed with a permission error. Skip files that already exist — content-addressed names guarantee identical contents — in both `stage` and `unstage`.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Assert that an unpacked archive's trace is stamped with the mapping's input hash and marked synthetic, and that a mapping entry with an ill-formed third element is reported and consumed via its bundle while an explicit null is treated as absent.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

toolchain-available A toolchain is available for this PR, at leanprover/lean4-pr-releases:pr-release-NNNN

Projects

None yet

2 participants