Skip to content

geotiff: share VRT chunked-path logic with eager, parse XML once (#1825)#1840

Merged
brendancol merged 2 commits into
mainfrom
issue-1825
May 14, 2026
Merged

geotiff: share VRT chunked-path logic with eager, parse XML once (#1825)#1840
brendancol merged 2 commits into
mainfrom
issue-1825

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Summary

Consolidates four pieces of duplicated logic between the eager and chunked VRT read paths and removes the per-task XML parse the chunked path was paying.

  • _sentinel_for_dtype lifted out of the read_vrt closure (and merged with the module-level twin _vrt_sentinel_for_dtype from the chunked path) into a single helper in _vrt.py.
  • _effective_dtype_for_bands collapses the ComplexSource ScaleRatio / ScaleOffset dtype rule that both paths re-implemented inline.
  • _apply_integer_sentinel_mask replaces the per-band integer-sentinel NaN-masking branch both paths duplicated.
  • read_vrt in _vrt.py accepts a pre-parsed VRTDataset via a new parsed= kwarg. The chunked dispatcher parses the VRT once and threads the parsed object through every per-chunk task in the dask graph; an N-chunk read now pays one XML parse and one source-path allowlist validation, not N+1.

Public signatures of read_vrt, open_geotiff, etc. are unchanged. VRTDataset is a dataclass of plain dataclasses + dtypes and round-trips through stdlib pickle and cloudpickle, so no serialise-and-reparse fallback was needed.

Fixes #1825.

Test plan

  • pytest xrspatial/geotiff/tests/ -k vrt -- 375 passed (one pre-existing failure unrelated to VRT excluded: test_tile_size_positive_works)
  • pytest xrspatial/geotiff/tests/test_vrt_single_parse_1825.py -v -- new tests cover single XML parse, single file read, picklable parsed VRT, eager/chunked parity, and no per-block re-parse
  • pytest xrspatial/geotiff/tests/ -- the only failures are pre-existing (palette/plot and big-endian GPU predictor tests, all failing on main as well)
  • Manual cloudpickle round-trip of the chunked dask graph + .compute() returns the same array

Collapses four pieces of duplicate logic between the eager and chunked
VRT read paths:

- ``_sentinel_for_dtype`` lifted from a closure inside ``read_vrt``
  (and the module-level twin ``_vrt_sentinel_for_dtype`` in the chunked
  path) into a single helper in ``_vrt.py``.
- ``_effective_dtype_for_bands`` consolidates the ComplexSource
  scale/offset dtype rule that both paths re-implemented inline.
- ``_apply_integer_sentinel_mask`` replaces the per-band integer-sentinel
  masking branch that both paths duplicated.
- ``read_vrt`` in ``_vrt.py`` now accepts a pre-parsed ``VRTDataset`` via
  ``parsed=``; the chunked dispatcher parses the VRT once and threads it
  through every per-chunk task in the dask graph. An N-chunk read now
  pays one XML parse and one source-path allowlist validation, not N+1.

Pickling: ``VRTDataset`` is a dataclass of plain dataclasses + dtypes
and round-trips through stdlib ``pickle`` (and cloudpickle), so no
serialise-and-reparse fallback is needed.

Fixes #1825
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 14, 2026
@brendancol brendancol requested a review from Copilot May 14, 2026 16:13
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors the GeoTIFF VRT reader to remove duplicated eager/chunked logic and to avoid re-parsing VRT XML for every dask chunk task by parsing once in the dispatcher and threading a pre-parsed VRTDataset through the task graph (fixes #1825).

Changes:

  • Centralizes shared VRT helpers in xrspatial/geotiff/_vrt.py (sentinel casting, effective dtype computation, integer-sentinel NaN masking).
  • Extends internal _vrt.read_vrt to accept a pre-parsed VRT (parsed=) and updates the chunked dispatcher/tasks to reuse it (single XML read/parse + single containment validation).
  • Adds targeted tests asserting single-parse behavior, picklability of the parsed VRT, and eager/chunked numerical parity.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
xrspatial/geotiff/_vrt.py Adds shared helper functions and a parsed= fast-path to avoid repeated XML parsing in chunked reads.
xrspatial/geotiff/__init__.py Updates eager and chunked VRT paths to use shared helpers and to pass a single parsed VRT into all per-chunk tasks.
xrspatial/geotiff/tests/test_vrt_single_parse_1825.py Adds regression tests ensuring chunked reads parse/read XML once, parsed VRT is picklable, and results match eager reads.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread xrspatial/geotiff/_vrt.py
Comment on lines +890 to +895
if parsed is not None:
vrt = parsed
else:
xml_str = _read_vrt_xml(vrt_path)
vrt_dir = os.path.dirname(os.path.abspath(vrt_path))
vrt = parse_vrt(xml_str, vrt_dir)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in c3b4316. read_vrt now does vrt = dataclasses.replace(parsed, holes=[]) when parsed= is supplied, so the per-call appends stay on a per-task copy and never touch the dispatcher's shared VRTDataset. Bands/sources/dtypes are still shared by reference (the parsed object is otherwise immutable in the hot path).

In practice the chunked dispatcher does its own static os.path.exists sweep for attrs['vrt_holes'] and never reads vrt.holes post-compute, so the only observable consequence today would be cross-task accumulation under the threaded scheduler. But the moment a caller reused the parsed object across reads, or a future change started surfacing parsed_vrt.holes, the leakage would be a real bug — so the defensive copy is worth it.

Added test_parsed_kwarg_does_not_mutate_caller_holes to pin the no-mutation contract.

Resolve __init__.py conflict between the chunked-dispatcher refactor in
this branch (#1825) and the chunked XML-cap routing that landed on main
(#1831): keep the four helper imports plus the security cap comment, and
merge the parsed-VRT plumbing note alongside it.

Also harden ``_vrt.read_vrt`` against caller-visible mutation of the
shared ``VRTDataset``: when ``parsed=`` is supplied, work on a
``dataclasses.replace(parsed, holes=[])`` so the per-call ``holes``
appends do not leak onto the dispatcher's object across per-chunk
tasks. Adds a regression test covering the no-mutation contract.
@brendancol brendancol merged commit daf1ecd into main May 14, 2026
10 of 11 checks passed
@brendancol brendancol deleted the issue-1825 branch May 15, 2026 04:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

geotiff: refactor read_vrt chunked path to share logic with eager

2 participants