fix: download data atomically to avoid corrupt cache (#4097) by gaoflow · Pull Request #4142 · scverse/scanpy

gaoflow · 2026-06-01T22:17:38Z

Problem

_download (used by _check_datafile_present_and_download for every cached dataset) writes directly to the destination cache path:

with path.open("wb") as f:
    ...

As @flying-sheep diagnosed in #4097, this races under parallel execution (e.g. pytest-xdist workers sharing settings.datasetdir):

worker 1 doesn't see the cache file and starts downloading, creating path and writing into it;
worker 2 checks path.is_file(), sees the partially-written file, and reads it → corrupted file / error.

Fix

Download to a NamedTemporaryFile in the same directory and atomically Path.replace() it into place once the download is complete (the second option suggested in the issue). A rename on the same filesystem is atomic, so path only ever becomes visible fully written; concurrent downloads still work, the loser's temp file just replaces in turn.

On failure the temporary file is removed, and path itself is left untouched — previously the error path unconditionally unlink-ed path, which under the race could delete a sibling process's already-completed download.

Verification

Added test_download_atomic in tests/test_datasets.py: it mocks urlopen and asserts that the destination path does not exist while bytes are still being streamed, that the final content is correct, and that no temporary file is left behind. The test fails on the current code (destination appeared before the download finished) and passes with this change. I also verified that both an immediate failure (e.g. HTTPError) and a mid-stream failure propagate the exception and leave neither a destination nor a leftover temporary file.

codecov · 2026-06-01T22:24:17Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.64%. Comparing base (8f1d746) to head (94ae93c).
✅ All tests successful. No failed tests found.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4142      +/-   ##
==========================================
+ Coverage   79.61%   79.64%   +0.03%     
==========================================
  Files         120      120              
  Lines       12786    12791       +5     
==========================================
+ Hits        10180    10188       +8     
+ Misses       2606     2603       -3

Flag	Coverage Δ
hatch-test.low-vers	`78.87% <100.00%> (+0.03%)`	⬆️
hatch-test.pre	`79.50% <100.00%> (+0.03%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
src/scanpy/readwrite.py	`82.43% <100.00%> (+1.15%)`	⬆️

... and 1 file with indirect coverage changes

Zethson · 2026-06-05T09:43:42Z

Thanks!

Two comments:

Could you please ensure that pre-commit, the docs build, and the CI passes? The RTD build failed with

/home/docs/checkouts/readthedocs.org/user_builds/icb-scanpy/checkouts/4142/docs/release-notes/1.13.0.dev76+g53dc24bf0.md:36: WARNING: py:mod reference target not found: pytest-xdist [ref.mod]

We're kind of moving to pooch instead of making custom requests. Would using pooch be simpler here and also solve this issue?

`_download` wrote directly to the destination cache path, so a concurrent reader (e.g. a parallel pytest worker sharing `settings.datasetdir`) could find the file present via `path.is_file()` and read it while it was still being written, getting a corrupted/partial file. Download to a temporary file in the same directory and atomically rename it into place once complete, so the destination only ever appears fully written. On failure the temporary file is removed and the destination is left untouched (it may belong to another process that finished first).

gaoflow · 2026-06-10T05:14:52Z

Thanks for the review!

Fixed the docs build — the pytest-xdist reference in the release note used a {mod} cross-reference role, but pytest-xdist isn't part of the inventory, so it raised the py:mod reference target not found warning. Changed it to a plain literal. pre-commit (ruff) is also green now; I'd missed a PYI034 on the test's __enter__.
On pooch: it would solve the race, since pooch.retrieve downloads to a temporary file and moves it into place atomically. But it's a bigger change than it looks here — _download is the generic backup-URL fetcher behind check_presence_download / _check_datafile_present_and_download and the EBI atlas loader, and those URLs have no known hashes, so a pooch migration means deciding on known_hash=None handling, the cache-dir/registry layout, and the progress-bar behavior across all of those call sites. I'd suggest landing this minimal atomic fix for race condition breaks parallel run of sc.datasets functions (e.g. in tests) #4097 now (it's self-contained and doesn't change the cache layout) and doing the pooch migration as its own PR — I'm happy to open that follow-up if you'd like to go that direction.

Disclosure: these changes were made by an AI coding agent (Claude) running under my account and direction.

for more information, see https://pre-commit.ci

Zethson · 2026-06-10T06:42:35Z

No pre-commit does not pass. Please use your brain in addition to an agent. I have infinite tokens and can run agents myself.
Please also ensure that there's no useless too detailed comments in the code.

gaoflow · 2026-06-10T07:13:13Z

pre-commit passes now — that was a ruff EM101 I fixed just after your comment landed. I've also cut the comments down: _download now has a single line above the temp-file write and one in the cleanup branch, and I trimmed the test docstrings/inline notes.

Zethson · 2026-06-22T14:39:47Z

https://github.com/scverse/scverse-misc recently gained some dataloader utilities and I think the right fix would be to move to scverse-misc and not patch what we have here further.
Do you want to have a stab at that?

gaoflow · 2026-06-22T17:01:25Z

Yes, I can take a stab at that. I did a quick pass over scverse_misc.datasets: it looks registry/loader-based and lives behind the datasets extra, so I do not want to blindly turn this already-small atomic write fix into a broader cache-layout/dependency migration. I will prototype the scverse-misc version separately and report back with either a replacement patch or the concrete blockers/tradeoffs.

gaoflow · 2026-06-22T17:06:34Z

Quick prototype result: scverse_misc.datasets can be wired into this path, but it is not a clean drop-in replacement for the current _download(url, path) helper.

The main issues I found:

The API is registry/loader-oriented, so preserving the existing exact target path needs a custom single-file loader around DatasetEntry/FileEntry/fetch. That works mechanically, but it is a little artificial for this generic helper.
With no hash, pooch/scverse-misc treats an existing target file as a cache hit. In a local HTTP probe, a second _download(url, path) kept the old bytes instead of re-downloading/replacing them. The current helper overwrites the target when called directly, and ebi_expression_atlas calls it directly for its design/archive files.
The current regression test proves the destination path is not visible while bytes are still streaming. After moving behind fetch, a small unit test mostly proves only that we pass the right DatasetEntry to scverse-misc; preserving the same race coverage would require a heavier pooch/downloader-level or end-to-end test.
Pulling in scverse-misc[datasets] adds the runtime datasets extra (pooch, pyyaml, etc.). In a fresh local uv test env, scanpy import also currently trips scverse-misc 0.1.x Settings deprecation warnings under pytest’s filterwarnings = error because Settings(..., exported_object_name=..., docstring_style=...) is still used. That is fixable, but separate from this dataset race.

Given that, I think this PR should either stay as the small atomic-write fix, or the scverse-misc migration should be a separate dataset-loading refactor that intentionally changes the registry/cache behavior and updates the dependency/settings pieces together.

gaoflow · 2026-06-22T17:15:06Z

Small metadata note: the remaining failing check is the milestone/label gate (milestone or no milestone). I tried to add the existing no milestone label myself, but GitHub says I do not have permission to label this PR.

flying-sheep · 2026-06-23T12:20:35Z

Thank you! I agree with both of you: we should migrate to scverse-misc but also this works and is clean, so why not keep this until we do so

…avoid corrupt cache (#4097)) (#4175) Co-authored-by: Vincent Gao <gaobing1230@gmail.com>

gaoflow added a commit to gaoflow/scanpy that referenced this pull request Jun 1, 2026

docs: add release note for scverse#4142

53dc24b

gaoflow added 4 commits June 10, 2026 07:13

docs: add release note for scverse#4142

02b7889

docs: avoid cross-ref role for pytest-xdist in release note

085cafd

test: return Self from FakeResponse.__enter__ (PYI034)

a367e0e

gaoflow force-pushed the fix-4097-atomic-download branch from 53dc24b to a367e0e Compare June 10, 2026 05:14

gaoflow and others added 2 commits June 10, 2026 07:33

test: cover the download error path that preserves an existing file

7a1c3ef

[pre-commit.ci] auto fixes from pre-commit.com hooks

9d0e4af

for more information, see https://pre-commit.ci

gaoflow added 2 commits June 10, 2026 08:55

test: assign exception message to variable (EM101)

cdc1336

refactor: trim verbose comments in download fix and tests

94ae93c

flying-sheep added this to the 1.12.2 milestone Jun 23, 2026

flying-sheep changed the title ~~Download dataset files atomically to avoid corrupt cache (#4097)~~ fix: download data atomically to avoid corrupt cache (#4097) Jun 23, 2026

flying-sheep merged commit a98d4ad into scverse:main Jun 23, 2026
16 of 18 checks passed

meeseeksmachine mentioned this pull request Jun 23, 2026

Backport PR #4142 on branch 1.12.x (fix: download data atomically to avoid corrupt cache (#4097)) #4175

Merged

flying-sheep pushed a commit that referenced this pull request Jun 23, 2026

Backport PR #4142 on branch 1.12.x (fix: download data atomically to …

288e3b9

…avoid corrupt cache (#4097)) (#4175) Co-authored-by: Vincent Gao <gaobing1230@gmail.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: download data atomically to avoid corrupt cache (#4097)#4142

fix: download data atomically to avoid corrupt cache (#4097)#4142
flying-sheep merged 8 commits into
scverse:mainfrom
gaoflow:fix-4097-atomic-download

gaoflow commented Jun 1, 2026

Uh oh!

codecov Bot commented Jun 1, 2026 •

edited

Loading

Uh oh!

Zethson commented Jun 5, 2026 •

edited

Loading

Uh oh!

gaoflow commented Jun 10, 2026

Uh oh!

Zethson commented Jun 10, 2026

Uh oh!

gaoflow commented Jun 10, 2026

Uh oh!

Zethson commented Jun 22, 2026

Uh oh!

gaoflow commented Jun 22, 2026

Uh oh!

gaoflow commented Jun 22, 2026

Uh oh!

gaoflow commented Jun 22, 2026

Uh oh!

Uh oh!

flying-sheep commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

gaoflow commented Jun 1, 2026

Problem

Fix

Verification

Uh oh!

codecov Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Zethson commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gaoflow commented Jun 10, 2026

Uh oh!

Zethson commented Jun 10, 2026

Uh oh!

gaoflow commented Jun 10, 2026

Uh oh!

Zethson commented Jun 22, 2026

Uh oh!

gaoflow commented Jun 22, 2026

Uh oh!

gaoflow commented Jun 22, 2026

Uh oh!

gaoflow commented Jun 22, 2026

Uh oh!

Uh oh!

flying-sheep commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov Bot commented Jun 1, 2026 •

edited

Loading

Zethson commented Jun 5, 2026 •

edited

Loading