feat(lake): content-stable module archives in Lake's artifact cache#13992
Draft
marcelolynch wants to merge 12 commits into
Pack an input-free trace (`depHash := .nil`, no inputs/log) into each module's `.ltar`, so the archive's content hash is a pure function of the module's outputs. Byte-identical outputs then dedup across revisions instead of churning with the input hash (which cascades to ~all modules every commit).

The input->output binding the bundle used to carry moves into the cache mapping as an additive `outputs` field (the per-input "receipt"): `data` stays the bare bundle reference, and the output content hashes ride alongside it (a third JSONL element / optional `outputs` key). On consume, the bundle is unpacked, the looked-up input hash is stamped into the nil-depHash trace, and the unpacked outputs are verified against the receipt's record before use.

Backward/forward compatible: old readers read `data` as before and ignore the new field; new readers tolerate old receipts (no `outputs` -> no verification). Staging stays bundle-only because `collectOutputDescrs` walks only `data`.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
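The content-stability property described in this commit can be illustrated with a toy model. This is a minimal sketch, not Lake's or leantar's actual archive format: `archive_hash`, `CANONICAL_TRACE`, and the dictionary-shaped trace are hypothetical names chosen for illustration, and SHA-256 stands in for whatever hash the cache really uses.

```python
import hashlib
import json


def archive_hash(outputs: dict[str, bytes], trace: dict) -> str:
    """Hash a toy 'archive': the serialized trace plus the outputs in sorted order."""
    h = hashlib.sha256()
    h.update(json.dumps(trace, sort_keys=True).encode())
    for name in sorted(outputs):
        h.update(name.encode())
        h.update(hashlib.sha256(outputs[name]).digest())
    return h.hexdigest()


# Hypothetical canonical trace: nil dep hash, no inputs, empty log.
CANONICAL_TRACE = {"depHash": None, "inputs": [], "log": []}

outputs = {"Foo.olean": b"compiled bytes", "Foo.ilean": b"index bytes"}

# Embedding the input hash: identical outputs, different archive per revision.
old_a = archive_hash(outputs, {"depHash": "rev-a", "inputs": ["Bar"], "log": []})
old_b = archive_hash(outputs, {"depHash": "rev-b", "inputs": ["Bar"], "log": []})

# Input-free trace: the archive hash depends on the outputs only.
new_a = archive_hash(outputs, CANONICAL_TRACE)
new_b = archive_hash(outputs, CANONICAL_TRACE)

# A genuine output change still produces a different archive.
changed = archive_hash({**outputs, "Foo.olean": b"other bytes"}, CANONICAL_TRACE)
```

In this model the two "old" hashes differ even though the outputs are identical, while the two "new" hashes coincide, which is what lets a cache service deduplicate archives across revisions.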
Add `tests/lake/tests/ltarCache` exercising the artifact-cache behavior:

- a cosmetic source edit (changes the input hash, not the outputs) leaves every bundle's content hash unchanged. This is the content-stability property the feature exists for (stock Lake, which embeds the input hash in the bundle, would churn here, so this would fail without the change);
- the emitted mapping pairs a bundle reference with the recorded output hashes, and staging from it collects only bundles (not individual artifacts);
- consuming a cache that holds only bundles + receipts fetches, unpacks, and verifies the bundle against the recorded outputs;
- a receipt whose recorded outputs disagree with the bundle is rejected with a cache integrity error;
- an older receipt (bundle reference only, no recorded outputs) is consumed without verification and without error.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Verify the unpacked bundle against the receipt before stamping the input hash into the trace, so a rejected unpack leaves no trace claiming the input produced those outputs. Treat integrity failures as cache misses (with a warning) instead of fatal errors, letting the build recover by compiling from source and overwriting the offending receipt; `--wfail` restores strict failure for CI.

Also warn on ill-formed recorded outputs instead of silently skipping verification, preserve the original trace when re-unpacking a local archive, restore the missing-trace state when packing started without one, and compare output descriptions structurally via `BEq`.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
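The verify-before-stamp ordering can be sketched as follows. This is an illustrative model only: `Unpacked`, `consume`, `CacheIntegrityError`, and the `wfail` parameter are hypothetical stand-ins, and raising an exception merely approximates how `--wfail` turns the warning into a failure.

```python
from dataclasses import dataclass, field


class CacheIntegrityError(Exception):
    """Raised in strict mode when unpacked outputs disagree with the mapping."""


@dataclass
class Unpacked:
    outputs: dict[str, str]  # artifact name -> content hash of the unpacked file
    trace: dict = field(default_factory=lambda: {"depHash": None})


def consume(unpacked, input_hash, recorded, wfail=False, warn=print):
    """Return the stamped trace, or None to signal a cache miss."""
    # Verify first, so a rejected unpack never gets a trace claiming
    # that this input produced these outputs.
    if recorded is not None and unpacked.outputs != recorded:
        msg = f"cache integrity: outputs for input {input_hash} do not match the mapping"
        if wfail:
            raise CacheIntegrityError(msg)
        warn(msg)
        return None  # caller rebuilds from source
    # Only a verified (or unverifiable, old-style) unpack is stamped.
    unpacked.trace = {"depHash": input_hash, "synthetic": True}
    return unpacked.trace
```

A caller would treat `None` as a miss, compile the module from source, and overwrite the offending mapping entry with the freshly built outputs.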
When a cache mapping pairs an archive bundle with recorded output hashes, resolve the individual artifacts from the local artifact cache first and fall back to the bundle only when some are missing. A consumer with a warm cache then restores a revision's outputs without downloading or unpacking any bundles. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
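The local-first resolution described here might look roughly like the sketch below. It assumes a content-addressed local cache modeled as a dictionary from hash to bytes; `resolve`, `local_cache`, and `fetch_bundle` are hypothetical names, not Lake's API.

```python
def resolve(recorded, local_cache, fetch_bundle):
    """Resolve a mapping entry's outputs.

    recorded:     artifact name -> content hash, or None for old-style entries
    local_cache:  content hash -> bytes (stand-in for the local artifact cache)
    fetch_bundle: zero-argument callable that downloads and unpacks the bundle
    """
    # Warm cache: every recorded output is already present locally,
    # so the bundle is never downloaded or unpacked.
    if recorded and all(h in local_cache for h in recorded.values()):
        return {name: local_cache[h] for name, h in recorded.items()}, "local"
    # Otherwise fall back to the bundle, as before.
    return fetch_bundle(), "bundle"
```

Entries without recorded outputs always take the bundle path, which matches the behavior for mappings written by older Lake versions.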
Replace the ad-hoc 'receipt' jargon in module build comments and tests with the mapping-entry terminology the cache layer already uses (`CacheMap.Entry`, mapping lines), rename `receiptDescrs` accordingly, and expand the local-resolution comment with its rationale: a warm artifact cache satisfies a bundle entry without fetching or unpacking, and trusting the recorded outputs is equivalent to resolving an entry that holds only output descriptions. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The restructured insertion is not a pure refactor: it also fixes the pre-existing-archive path, which previously dropped the platform-independence flag. Record that in the comment so the intent survives review. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Note that the new metadata is included in the mappings to avoid extra round trips to the cache. Given the bounded number of outputs, this is fine: the mappings don't grow much (and they are compressed later). If our processes had a huge number of outputs it might be a better idea to store this separately, but that change can be made later without much hassle.
And some experiments with mathlib (admittedly a small sample, N=18) gave me these projections:
Remove redundant phrasing, markdown emphasis in line comments, label prefixes, and change-history narration from the comments added on this branch; reword the test narration to plain English. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The field docstring should say when `outputs?` is absent in real data; the constraint that additions to the mapping line must stay trailing and optional belongs where the line is written. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The uniform-insertion remark in trackOutputsIfEnabled described what the single insert call site already shows, and the log-replay note in mkLtarMetadata framed behavior as precedent-matching rather than stating it. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
When a mapping entry's recorded outputs are all served from the local artifact cache, also attach the locally cached bundle so that a subsequent mapping-producing build rehashes it instead of repacking every module. On a 100-module project this turns the first `-o` build after a no-unpack restore from one leantar invocation per module (~2.5s) into none (no-op time). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
`lake cache unstage` copies artifacts into the local cache unconditionally, and artifacts cached by builds are read-only, so unstaging over a cache that already holds them failed with a permission error. Skip files that already exist — content-addressed names guarantee identical contents — in both `stage` and `unstage`. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
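The skip-if-present fix can be sketched as below. This is a minimal illustration assuming artifacts are plain files whose names are derived from their content hash; `copy_artifact` and the directory names are hypothetical, not Lake's implementation.

```python
import shutil
import stat
import tempfile
from pathlib import Path


def copy_artifact(src: Path, dst_dir: Path) -> bool:
    """Copy a content-addressed artifact; return False if it was already there."""
    dst = dst_dir / src.name
    if dst.exists():
        # Same content-addressed name means same contents, so there is
        # nothing to do (and overwriting a read-only file could fail).
        return False
    shutil.copyfile(src, dst)
    # Mark the cached copy read-only, as build-produced artifacts are.
    dst.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return True


with tempfile.TemporaryDirectory() as tmp:
    staged, cache = Path(tmp, "staged"), Path(tmp, "cache")
    staged.mkdir()
    cache.mkdir()
    artifact = staged / "abc123.olean"
    artifact.write_bytes(b"compiled bytes")

    first = copy_artifact(artifact, cache)   # copies the file
    second = copy_artifact(artifact, cache)  # skipped: already cached
    cached_bytes = (cache / "abc123.olean").read_bytes()
```

Because the second call returns early, it never attempts to write over the read-only file, which is the situation that previously produced the permission error.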
Assert that an unpacked archive's trace is stamped with the mapping's input hash and marked synthetic, and that a mapping entry with an ill-formed third element is reported and consumed via its bundle while an explicit null is treated as absent. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
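The handling of the optional third mapping element can be modeled as follows. This is a hedged sketch: it assumes each mapping line is a JSON array of the form `[inputHash, "<hash>.ltar", {outputs}]`, represents hashes as strings for simplicity, and uses the hypothetical names `parse_mapping_line` and `warn`.

```python
import json


def parse_mapping_line(line: str, warn):
    """Parse one JSONL mapping line into (input_hash, bundle, outputs or None)."""
    fields = json.loads(line)
    input_hash, bundle = fields[0], fields[1]
    # Two-element lines (older writers) and an explicit null both mean
    # "no recorded outputs".
    outputs = fields[2] if len(fields) > 2 else None
    if outputs is not None and not isinstance(outputs, dict):
        # Ill-formed third element: report it, then consume via the bundle.
        warn(f"ill-formed recorded outputs for input {input_hash}")
        outputs = None
    return input_hash, bundle, outputs
```

Keeping any future additions trailing and optional is what lets a parser of this shape, old or new, read lines written by either side.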
This PR makes Lake's module archives (`.ltar`) content-stable and records each module's output content hashes in cache mappings alongside the archive reference. Previously, the trace file packed into an archive carried the build's input hash, input tree, and log (including absolute paths from the `lean` invocation), so the same outputs produced different archive bytes for every input revision and machine.

Byte-identical module outputs now produce a byte-identical archive independent of the inputs, checkout path, or machine that produced them, so input-only changes (e.g., comment edits in an imported module) upload no new archive bytes and identical outputs deduplicate across revisions on cache services.
Consumers verify unpacked archives against the hashes recorded in their cache mapping entry, recovering from mismatches by rebuilding, and restore outputs directly from the local artifact cache without fetching or unpacking the archive when they are already present.
The archive now packs a canonical input-free trace (`depHash := .nil`, no inputs, empty log), swapped in only for the duration of the pack; consumers stamp the input hash from the cache mapping into the trace after unpacking (marked `synthetic`), preserving the invariant that the on-disk trace carries the real input hash.

Implementation details
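Cache mapping lines gain an optional third element carrying the module's output descriptions (`[inputHash, "<hash>.ltar", {outputs}]`), and locally cached output entries gain a corresponding `outputs` field. Only the bundle remains an upload target; the recorded hashes are descriptive metadata. The recorded outputs serve two purposes on the consumer side:

- Verification: an unpacked archive is checked against the recorded hashes before use. A mismatch is treated as a cache miss with a warning, and the build recovers by compiling from source; `--wfail` restores strict failure for CI.
- Local resolution: when the recorded outputs are already present in the local artifact cache, they are restored directly, without fetching or unpacking the archive.

Compatibility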
The change is compatible in both directions: older Lake versions ignore the extra mapping element and the `outputs` field, mappings and archives produced by older Lake versions are consumed as before (archives with an embedded input hash pass through unchanged, mapping entries without recorded outputs skip verification), and mixed-version cache sharing is unaffected since module input hashes already incorporate the toolchain.

Validation
A/B results on Batteries (187 modules, 306 MB of artifacts), feature branch vs. a stage1 built from the merge-base — same compiler, only Lake differs:
Covered by the new `tests/lake/tests/ltarCache` test: mapping entry shape, archive byte-stability across an input-only edit, bundle-only staging, fresh consume with verification, rejection of and self-healing from a corrupted mapping entry, backward compatibility with two-element mappings, and the no-unpack restore from a warm artifact cache.

Closes #13996
Closes #13997