
Add persistent program cache for Program.compile#1912

Open
cpcloud wants to merge 34 commits into NVIDIA:main from cpcloud:persistent-program-cache-178

Conversation

@cpcloud
Contributor

@cpcloud cpcloud commented Apr 14, 2026

Summary

Closes #176.
Closes #177.
Closes #178.
Closes #179.

Adds a persistent on-disk cache for cuda.core.Program.compile outputs. The high-level integration is one keyword on Program.compile:

```python
from cuda.core import Program, ProgramOptions
from cuda.core.utils import FileStreamProgramCache

source = 'extern "C" __global__ void k(int *a){ *a = 1; }'
options = ProgramOptions(arch="sm_80")

with FileStreamProgramCache() as cache:  # default: $XDG_CACHE_HOME/cuda-python/program-cache
    obj = Program(source, "c++", options=options).compile("cubin", cache=cache)
    obj.get_kernel("k")
```

A second invocation with the same inputs short-circuits the entire NVRTC compile — cache.get(key) (one stat + one read) and an ObjectCode._init from the bytes. No Program_compile is invoked. This is the fast path the cache exists to provide:

```python
# Fresh process / second run -- same source, same options.
with FileStreamProgramCache() as cache:
    obj = Program(source, "c++", options=options).compile("cubin", cache=cache)
    # ~10 µs round-trip on a warm page cache, vs hundreds of ms to seconds
    # for an actual NVRTC invocation.
```

Public API

  • Program.compile(target_type, *, cache=...) — convenience wrapper. Derives the key, returns a fresh ObjectCode on hit, stores the compile output on miss.
  • cuda.core.utils.ProgramCacheResource — abstract bytes-in / bytes-out interface for custom backends. Provides get, update (Mapping or pairs), clear, and the mapping mutators (__getitem__/__setitem__/__delitem__/__len__). __contains__ is intentionally omitted: cache.get(key) is the recommended idiom because the two-call if key in cache: cache[key] pattern is racy across processes.
  • cuda.core.utils.InMemoryProgramCache — single-process LRU on OrderedDict, threading.RLock, size-only cap. For "compile once, look up many" workflows that don't need persistence.
  • cuda.core.utils.FileStreamProgramCache — directory of atomic per-entry files. Safe across processes via os.replace + Windows sharing-violation retries on os.replace / read / unlink.
  • cuda.core.utils.make_program_cache_key — escape hatch when the compile inputs require an extra_digest (include_path, pre_include, pch, use_pch, pch_dir, NVVM use_libdevice=True, NVRTC options.name with a directory component). Program.compile(cache=...) rejects those compiles with a ValueError pointing here.
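The fixed-order digest assembly behind the key derivation can be pictured with a minimal, hypothetical sketch (the function name `derive_cache_key` and its parameters are illustrative, not the PR's actual implementation): each input is length-prefixed before hashing so that distinct input tuples cannot collide by concatenation, and `extra_digest` folds caller-supplied state into the key.

```python
import hashlib

def derive_cache_key(source: bytes, options_fingerprint: bytes,
                     target_type: str, extra_digest: bytes = b"") -> str:
    # Length-prefix every part so (b"ab", b"c") and (b"a", b"bc") hash
    # differently, then digest the parts in a fixed order.
    h = hashlib.sha256()
    for part in (target_type.encode(), source, options_fingerprint, extra_digest):
        h.update(len(part).to_bytes(8, "little"))
        h.update(part)
    return h.hexdigest()
```

A caller whose compile depends on external state (e.g. an include directory) would hash that state into `extra_digest` so neighbour-file changes invalidate the entry.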

On-disk format

Each entry is the raw compiled binary verbatim — cubin / PTX / LTO-IR — with no pickle, JSON, length prefix, or framing of any kind. Cache files are directly consumable by external NVIDIA tools (cuobjdump, nvdisasm, cuda-gdb).

ObjectCode.symbol_mapping from name_expressions is not preserved across a cache round-trip; the wrapper rejects Program.compile(name_expressions=..., cache=...) outright so the first-call-works/second-call-breaks footgun can't surface. Callers that need get_kernel(name_expression) should compile without cache=.

FileStreamProgramCache

  • Atomic writes: stage to tmp/, fsync, os.replace into entries/<2char>/<hash>. Concurrent readers never observe partial writes. Windows os.replace retries on ERROR_ACCESS_DENIED / ERROR_SHARING_VIOLATION / ERROR_LOCK_VIOLATION (winerrors 5/32/33) within a bounded backoff (~185 ms); after the budget, the write is dropped and the next call recompiles. The same retry covers reads and path.unlink so eviction doesn't crash the writer that triggered it on win-64.
  • Sharing-violation predicate: _is_windows_sharing_violation(exc) filters EACCES only when winerror is absent — non-sharing winerrors are real config errors and propagate. Off-Windows PermissionError always propagates.
  • Transparent input forms: cache[key] = value (and cache.update({key: value, ...})) accept raw bytes, bytearray, memoryview, or any ObjectCode (path-backed too — the file is read at write time so the cached entry is the binary content, not a path that could move). Reads return the same bytes that went in.
  • Size-only bound: max_size_bytes is the only knob — no element-count cap. None means unbounded.
  • True LRU via atime: every successful read calls os.utime (fd-based on Linux/macOS via os.supports_fd, path-based on Windows) to bump st_atime regardless of mount options or NtfsDisableLastAccessUpdate. Eviction sorts by oldest st_atime first. The atime touch is stat-guarded so a racing rewriter's freshly-replaced file never has its mtime rolled back.
  • Stat-guarded prunes: clear(), _enforce_size_cap(), and the atime touch all snapshot (ino, size, mtime_ns) per entry and refuse to unlink / overwrite stamps if a writer replaced the file mid-operation.
  • Cache key derivation (make_program_cache_key): a backend-strategy pattern with one class per code_type (_NvrtcBackend / _LinkerBackend / _NvvmBackend). Each owns its own validate / encode_code / option_fingerprint / encode_name_expressions / hash_version_probe / hash_extra_payload. The orchestrator validates code_type/target_type, dispatches to the right backend, and assembles the digest in fixed order. Adding a new backend is one new class, not a five-place edit.
  • NVRTC options.name with a directory component: rejected without extra_digest because NVRTC resolves quoted #include directives relative to that directory — neighbour-header changes wouldn't invalidate the cache otherwise.
  • PTX-loadability warning on cache hit: when the active driver can't load freshly-generated PTX, the wrapper emits the same RuntimeWarning the uncached path emits — loadability depends on the driver, not on whether the bytes were freshly compiled.
  • Default cache directory: when path is omitted, resolves via platformdirs.user_cache_path("cuda-python", appauthor=False, opinion=False) / "program-cache":
    • Linux/BSD: $XDG_CACHE_HOME/cuda-python/program-cache (default ~/.cache/cuda-python/program-cache)
    • macOS: ~/Library/Caches/cuda-python/program-cache
    • Windows: %LOCALAPPDATA%\cuda-python\program-cache
  • tmp/ self-heal: if something deletes tmp/ after the cache is opened, the next write recreates it rather than crashing with FileNotFoundError.
  • Crashed-writer cleanup: stale temp files older than 1 hour are swept on open and on size-cap enforcement.
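The stage/fsync/replace discipline described above is a standard pattern; a self-contained sketch (not the PR's exact code — `atomic_write` and its layout constants are illustrative) shows why concurrent readers never see a torn entry: the data is durable before `os.replace` atomically publishes it.

```python
import os
import tempfile
from pathlib import Path

def atomic_write(root: Path, digest: str, data: bytes) -> Path:
    # Shard entries as entries/<2char>/<rest>, mirroring the layout above.
    dest = root / "entries" / digest[:2] / digest[2:]
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp_dir = root / "tmp"
    tmp_dir.mkdir(parents=True, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=tmp_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # data on disk before the rename publishes it
        os.replace(tmp_path, dest)  # atomic on POSIX and NTFS
    except BaseException:
        os.unlink(tmp_path)  # never leave a partial temp file behind
        raise
    return dest
```

Readers either see the old complete file or the new complete file, never a prefix of the new one.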

Test plan

  • tests/test_program_cache.py — unit coverage:
    • abstract-class contract; update accepts a mapping or pairs
    • transparent input-form equivalence (bytes / bytearray / memoryview / bytes-backed ObjectCode / path-backed ObjectCode all round-trip to the same on-disk bytes)
    • make_program_cache_key semantics: deterministic keys, a supported-target matrix that mirrors Program.compile, backend probe failures fail closed but stable, env-version changes don't perturb the key on the wrong backends, options-fingerprint canonicalization for the linker path, side-effect / external-content / NVRTC options.name-dir-component guards, schema version mixing
    • filestream CRUD, atomic-write race coverage, stat-guarded prune / atime-touch / clear / size-cap, atime LRU promotes recently-read entries
    • default dir uses platformdirs; _is_windows_sharing_violation predicate's truth table including the regression case (non-sharing winerror plus EACCES propagates); tmp/ recreation after external wipe
  • tests/test_program_cache_multiprocess.py — concurrent writers same key, distinct keys, reader-vs-writer torn-file safety, size-cap eviction race (rewriter vs. churner) under stat-guarded eviction.
  • tests/test_program_compile_cache.py — Program.compile(cache=...) miss/hit/error paths against a recording stub, name_expressions rejection, extra_digest-required / side-effect / NVRTC options.name-dir-component rejection, PTX loadability warning on cache hit (positive + negative), real-NVRTC end-to-end roundtrip across reopen.

@cpcloud cpcloud added this to the cuda.core v1.0.0 milestone Apr 14, 2026
@cpcloud cpcloud added P0 High priority - Must do! feature New feature or request cuda.core Everything related to the cuda.core module labels Apr 14, 2026
@cpcloud cpcloud self-assigned this Apr 14, 2026
@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch from de57bd8 to ac38a68 Compare April 14, 2026 22:15

@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch 23 times, most recently from f1ae40e to b27ed2c Compare April 19, 2026 13:28
cpcloud added 21 commits May 6, 2026 14:15
_LinkerBackend.validate, option_fingerprint, and hash_version_probe each
re-probed _decide_nvjitlink_or_driver(), so a flapping probe could mint
a key whose option fingerprint and version probe disagreed on which
linker is in use.

Cache the decision (and any probe exception) on a per-instance basis,
instantiate _BACKENDS_BY_CODE_TYPE entries fresh per make_program_cache_key
call so the cache lives exactly one call, and thread the decision into
_linker_backend_and_version() instead of letting it probe a third time.

Tests that monkeypatched _linker_backend_and_version now accept the
extra use_driver argument (or *args/**kwargs in the failure-path test).
code_type was already normalised at Program init, but target_type was
checked case-sensitively against {"ptx", "cubin", "ltoir"} in
Program_compile (so compile(target_type="PTX") used to raise) and the
cache key path inherited the same asymmetry from
make_program_cache_key. Lowercase target_type at the top of
Program.compile and at the entry to make_program_cache_key so callers
who pass "PTX" get the same dispatch and the same cache key as "ptx".
…IELD_GATES

The names tuple and the gates dict had to be kept in sync by hand: a
field added to one but forgotten in the other would silently slip out
of the PTX fingerprint. Drop the tuple and iterate the dict directly,
so the dict is the single source of truth for which ProgramOptions
fields perturb a PTX cache key.
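The single-source-of-truth shape this commit describes can be sketched as follows (the names `PTX_FIELD_GATES` and `ptx_fingerprint` are hypothetical stand-ins for the PR's internals): the dict maps each option field to a gate predicate, and the fingerprint iterates the dict directly, so adding a field is a one-place edit.

```python
# Hypothetical gates: field name -> predicate deciding whether the field
# participates in the PTX fingerprint. The dict is the only registry.
PTX_FIELD_GATES = {
    "arch": lambda v: v is not None,
    "debug": lambda v: bool(v),
}

def ptx_fingerprint(options: dict) -> tuple:
    # Sort for determinism, gate each field, emit (name, value) pairs.
    return tuple(
        (name, options.get(name))
        for name, gate in sorted(PTX_FIELD_GATES.items())
        if gate(options.get(name))
    )
```

A field added to the dict automatically perturbs the fingerprint; there is no parallel tuple to forget.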
_extract_bytes used to let a bare FileNotFoundError bubble up from
Path(code).read_bytes(), so cache[key] = obj failures pointed only at
the missing path with no hint that the cache was reading a path-backed
ObjectCode. Wrap the FileNotFoundError with a message that names both
the cache operation and the missing file so debugging the case stays
self-explanatory.
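The wrapping described here is the standard exception-chaining pattern; a hedged sketch (the helper name `read_backing_file` is illustrative) shows the shape — `raise ... from exc` preserves the original traceback while the new message names both the cache operation and the missing file.

```python
from pathlib import Path

def read_backing_file(path: Path, key: str) -> bytes:
    try:
        return path.read_bytes()
    except FileNotFoundError as exc:
        # Name the cache operation and the file so the error is
        # self-explanatory; `from exc` keeps the original chain.
        raise FileNotFoundError(
            f"cache write for key {key!r} could not read the "
            f"path-backed ObjectCode file '{path}'"
        ) from exc
```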
The reader test only checked that every read returned non-None; a
half-written file with non-empty bytes would pass. Carry the seeded
payload into the worker, count exact-byte mismatches, and require
zero. The eviction-race test wrote a final uncontested cache[key] =
payload + b"final" after the churner exited, so the post-race
endswith assertion would pass even with a broken stat-guard. Drop the
final write and assert the entry survives carrying a rewriter payload
prefix -- if the stat-guarded eviction path is broken, the in-race
write is the one that vanishes.
… overwrite test

InMemoryProgramCache claims thread-safety via the RLock that wraps every
method but had no concurrent-thread coverage. Add a stress test with 4
writers + 4 readers x 200 ops against a size-capped cache that verifies
no exceptions, no deadlocks (RLock reentrance through __setitem__ ->
_evict_to_caps -> popitem), and that internal accounting (_total_bytes,
len(_entries), len(cache)) stays consistent under contention.

FileStreamProgramCache had no overwrite test analogous to
test_inmemory_cache_overwrite_replaces_value_and_updates_size. Add one
that writes a key twice and asserts the second value reads back, len
stays at 1, and exactly one entry file lives on disk -- so a leaked
entry from a botched os.replace would surface here.
max_size_bytes=0 used to slip past the >=0 guard but turned the cache
into a black hole: every write was immediately evicted on its own
size-cap pass. There is no legitimate use for that, so tighten the
guard to >0 (or None for unbounded) and update both backends and the
matching tests.
The (st_ino, st_size, st_mtime_ns) triple was open-coded in
_touch_atime (fd-based and path-based fallbacks),
_prune_if_stat_unchanged, and _enforce_size_cap. Centralise the
fingerprint as _stat_key(st) so all four readers compare the same
fields and the invariant has one place to read.
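The centralised fingerprint and one of its guards can be sketched in a few lines (function names here are illustrative; the PR's `_stat_key` takes the same three fields):

```python
import os

def stat_key(st: os.stat_result) -> tuple:
    # One canonical fingerprint so every guard compares the same fields.
    return (st.st_ino, st.st_size, st.st_mtime_ns)

def prune_if_stat_unchanged(path: str, snapshot: tuple) -> bool:
    # Unlink only if the file is still the one we snapshotted. A writer's
    # os.replace swaps in a new inode/size/mtime, which this detects.
    try:
        if stat_key(os.stat(path)) != snapshot:
            return False
        os.unlink(path)
        return True
    except FileNotFoundError:
        return False
```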
Replace, stat-and-read, and unlink each carried their own copy of the
_REPLACE_RETRY_DELAYS / sleep / try-op / PermissionError loop. Centralise
the loop as _with_sharing_retry(op, on_exhausted=...) and let each
caller plug in its own success-on-success and exhausted-budget
behaviour. Net behaviour is unchanged (exhaustion semantics for each
public helper are preserved via the on_exhausted callback).
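The centralised loop can be sketched as follows (the name `with_sharing_retry` and the delay schedule are illustrative; the PR's `_with_sharing_retry` follows the same shape): each caller supplies the operation and its own exhausted-budget behaviour.

```python
import time

def with_sharing_retry(op, delays=(0.005, 0.01, 0.02), on_exhausted=None):
    # Run `op`; on PermissionError, back off per the schedule and retry.
    for delay in delays:
        try:
            return op()
        except PermissionError:
            time.sleep(delay)
    try:
        return op()  # final attempt after the last backoff
    except PermissionError as exc:
        if on_exhausted is None:
            raise  # default: propagate when the budget is spent
        return on_exhausted(exc)  # caller-specific exhaustion semantics
```

A write path might pass `on_exhausted=lambda exc: None` to drop the write and let the next call recompile, while a read path lets the error propagate.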
FileStreamProgramCache shards cache files into
entries/<digest[:2]>/<digest[2:]>, so the overwrite test's iterdir
filter on entries/ saw only the digest-prefix subdir (no is_file()
match) and reported 0 entries. Switch to rglob so the assertion
counts actual entry files. CI from the previous push caught this.
…iendly

`{code!r}` doubles every backslash on Windows, so the error string holds
`'C:\\Users\\...\\file'` while `str(src)` only has single
backslashes. Naive `str(path) in str(exc)` checks (and the
`test_filestream_cache_path_backed_object_code_missing_file_message`
assertion) failed on win-64 as a result. Drop the `!r` so the path
appears verbatim; the message still quotes it via the surrounding
sentence punctuation.
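The backslash-doubling effect is easy to demonstrate in isolation (a generic Python fact, not PR-specific code): `!r` renders the string's repr, in which each literal backslash appears doubled, so a substring check against the plain path fails.

```python
# A Windows-style path: each \\ in the literal is ONE backslash character.
path = "C:\\Users\\me\\file.cu"

msg_repr = f"could not read {path!r}"   # repr doubles every backslash
msg_str = f"could not read '{path}'"    # str keeps them single
```

The fix in the commit is exactly this: interpolate `str(path)` and let the surrounding punctuation quote it.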
Avoid the per-write directory walk in `_enforce_size_cap` by maintaining
a running byte total. The tracker is seeded once from `_compute_total_size`
at open time, updated on `__setitem__` (net delta from the old entry's
size), `__delitem__` (subtract the unlinked file's size), and `clear`
(re-derive from the post-clear state). When a write doesn't push the
running total above `max_size_bytes`, eviction stays a no-op -- writes
become O(1) instead of O(n) in the cache size.

Cross-process drift (other writers/deleters working on the same root)
self-corrects: any time `_enforce_size_cap` actually runs its scan, it
reseeds `_tracked_size_bytes` from the observed disk total, so
overestimates trigger one extra scan and then settle.

Skipped entirely when `max_size_bytes is None` (no cap, no need for a
tracker). Mutations are guarded by `_size_lock` so multi-threaded
writers in the same process don't interleave the read-modify-write on
the int.

Also switch `_iter_entry_paths` from `Path.iterdir` to `os.scandir`:
the dirent type cache lets `is_dir`/`is_file` answer without a
separate `stat` syscall on filesystems that report it (ext4, NTFS,
...). Behaviourally equivalent (the cache layout never creates symlinks,
and we pass `follow_symlinks=False` to match `pathlib`'s previous
no-symlink-confusion default for our paths).

Tests cover three guarantees: (1) writes that stay under the cap don't
call `_enforce_size_cap`, (2) writes that cross the cap do call it,
and (3) external deletion behind the cache's back is reconciled by the
next eviction pass.
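The running-total discipline (including the clamp adopted later in review) can be condensed into a small sketch — `SizeTracker` is an illustrative stand-in for the `_tracked_size_bytes` field plus `_size_lock`, not the PR's actual class:

```python
import threading

class SizeTracker:
    # O(1) updates on every mutation; a full re-scan reseeds the total
    # only when eviction actually runs.
    def __init__(self, initial_total: int):
        self._lock = threading.Lock()
        self.total = initial_total

    def on_write(self, old_size: int, new_size: int):
        with self._lock:
            self.total += new_size - old_size  # net delta, not a rescan

    def on_delete(self, size: int):
        with self._lock:
            # Clamp so a racing reseed can't drive the total negative;
            # an undercount self-corrects at the next reseed, a negative
            # total would disable eviction permanently.
            self.total = max(0, self.total - size)

    def reseed(self, observed_total: int):
        with self._lock:
            self.total = observed_total
```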
@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch 2 times, most recently from 0364eeb to 08f9f56 Compare May 6, 2026 19:17
The test parsed `_program.pyx` to extract `SUPPORTED_TARGETS` and
compare against `_SUPPORTED_TARGETS_BY_CODE_TYPE` in the cache. Source
parsing for a duplication check is not worth the maintenance cost --
a reviewer eyeballing both definitions catches drift just as well, and
the test was already broken once by upstream's StrEnum migration.
@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch from 08f9f56 to 4aaca63 Compare May 6, 2026 19:19
cpcloud added 2 commits May 6, 2026 15:26
`os.scandir` returns an iterator that holds an OS directory handle
(POSIX dirent fd, Windows FindFirstFileW); the context manager closes
that handle deterministically. Using `with os.scandir(...) as outer`
is the conventional spelling -- splitting the call from the `with` to
land the FileNotFoundError catch on the call alone added one binding
for no behavioural difference. Wrap each `with` in its own
`try/except FileNotFoundError` instead.

The inner loop is also a plain filter-and-yield over the DirEntry
iterator, which is exactly what `yield from <generator-expr>`
expresses.
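The conventional spelling this commit lands can be sketched end to end (the function name `iter_entry_files` is illustrative; `os.scandir` as a context manager and `yield from` over a generator expression are the two idioms it describes):

```python
import os
from pathlib import Path

def iter_entry_files(entries_dir: Path):
    # DirEntry caches dirent type info, so is_dir/is_file usually avoid a
    # separate stat; the `with` closes each directory handle promptly.
    try:
        with os.scandir(entries_dir) as shards:
            for shard in shards:
                if not shard.is_dir(follow_symlinks=False):
                    continue
                try:
                    with os.scandir(shard.path) as files:
                        yield from (Path(e.path) for e in files
                                    if e.is_file(follow_symlinks=False))
                except FileNotFoundError:
                    continue  # shard deleted mid-walk
    except FileNotFoundError:
        return  # entries/ itself is gone; yield nothing
```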
The same scandir-then-stat-for-size loop appeared in three places
(`_compute_total_size`, `_sweep_stale_tmp_files`, `_enforce_size_cap`).

`_iter_tmp_entries` covers the scandir + is_file filter +
context-managed cleanup that all three need.

`_sum_tmp_sizes` covers the size accumulation that both
`_compute_total_size` and `_enforce_size_cap` need.

Net result: each callsite now reads as one obvious line instead of
seven, and a future change to the temp-walk discipline (e.g. switching
to a different scan strategy) lands in one place.
Member

@leofang leofang left a comment


Re-review after the latest round of updates (33 commits). All 21 previous review comments have been addressed — nice work.

One new finding on the incremental size tracker (_tracked_size_bytes) that can affect correctness: a __delitem__ racing with _enforce_size_cap's re-seed can make the tracker go permanently negative, which disables all future eviction. Fix is a one-line max(0, ...) clamp. See inline comment for details.

Everything else looks solid. The per-instance _decide_driver() caching is correct, the key derivation is collision-free, the os.scandir migration is clean, and the test coverage has improved substantially.

Comment on lines +525 to +526
```python
with self._size_lock:
    self._tracked_size_bytes -= size
```
Member


Consideration (correctness): _tracked_size_bytes can go permanently negative, disabling all future eviction.

Scenario: thread A calls __delitem__ on key K and stats the file (gets size). Concurrently, thread B runs _enforce_size_cap, scans the directory (sees K, includes its size in total), finishes eviction, and re-seeds: self._tracked_size_bytes = total (which includes K's size). Then thread A's subtract runs here: self._tracked_size_bytes -= size. The tracker is now lower than reality by size. Repeat enough times across concurrent delete+eviction races, and the tracker goes negative.

Once negative, the check self._tracked_size_bytes > self._max_size_bytes never fires, so _enforce_size_cap is never called again. The cache grows without bound despite having a configured max_size_bytes. Unlike the "soft cap temporarily exceeded" behavior (which is fine), this is "eviction permanently disabled" — there is no self-healing path because the reconciliation point lives inside _enforce_size_cap, which is the function that never gets called.

Fix: clamp to zero so the tracker can't cross into the permanently-broken state:

```python
with self._size_lock:
    self._tracked_size_bytes = max(0, self._tracked_size_bytes - size)
```

This doesn't make the tracker perfectly accurate (it may still undercount), but the next _enforce_size_cap call will re-scan and correct the drift. The key property is that undercount eventually self-corrects (the cap fires late), while negative never self-corrects.

Contributor Author


Good catch. Fixed in c347f3c with the clamp you suggested.

Race walkthrough confirmed: if _enforce_size_cap's scan runs AFTER __delitem__'s unlink (so its reseed value excludes the deleted entry), the subtract here would undercount by size. The clamp leaves the tracker in the self-healing range — worst case is undercounting reality, which the next eviction's reseed corrects.

Added a regression test that simulates the post-reseed state directly and asserts the tracker stays at zero (test_filestream_cache_tracker_clamps_at_zero_under_delete_race).

`__delitem__` could walk `_tracked_size_bytes` negative under a race
with `_enforce_size_cap`'s reseed: if the eviction scan runs AFTER
this delete unlinks (so its reseed value excludes the deleted entry)
but BEFORE this delete's subtract, the subtract undercounts by
`size`. Repeated under contention, the tracker crosses zero -- and
once negative, the `tracker > cap` check that gates eviction never
fires again, so the cache grows without bound and there is no
self-healing path (the only reseed point is the function that no
longer runs).

Clamp `tracker = max(0, tracker - size)` so the tracker can't enter
the permanently-broken state. Worst case after the race is
undercounting reality, which the next eviction's reseed corrects;
that's the same self-healing path the existing tests already exercise.

Reported by leofang in PR review.
@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch from c347f3c to 19d7cf7 Compare May 6, 2026 20:28