When merge() is called with same-CRS dask-backed rasters, each output
chunk triggers a full .compute() on the entire source dask array, not
just the source window the chunk needs.
_merge_block_adapter at xrspatial/reproject/__init__.py:1796-1806
carries the dask source array into the closure via functools.partial,
and the same_crs_list[i] branch calls src_data.compute() on the full
array before passing it to _place_same_crs.
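The shape of the problem can be sketched in isolation (the function and variable names below are illustrative, not xrspatial's actual code): `functools.partial` bakes the dask source into the per-chunk function, and every output chunk then calls `.compute()` on the whole source array.

```python
import functools
import dask.array as da
import numpy as np

# Tally how many source pixels get materialized inside the chunk function.
materialized = {"pixels": 0}

def _merge_block_adapter_sketch(block, src_data=None):
    full = src_data.compute()        # BUG: full-array compute per output chunk
    materialized["pixels"] += full.size
    return np.zeros_like(block)      # stand-in for _place_same_crs

src = da.from_array(np.arange(64.0).reshape(8, 8), chunks=(4, 4))
out = da.zeros((8, 8), chunks=(4, 4))
adapter = functools.partial(_merge_block_adapter_sketch, src_data=src)
out.map_blocks(adapter, dtype=float, meta=np.empty((0, 0))).compute(
    scheduler="synchronous")

# 4 output chunks x 64 source pixels each = 256 pixels materialized for a
# 64-pixel source: 4x amplification, the same mechanism as the measurement below.
print(materialized["pixels"])
```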
I measured this with two 256x256 sources split into 64x64 chunks, merged
with 32x32 output chunks:
- total source pixels: 131,072 (2 sources x 256x256)
- pixels materialized inside the chunk fn: 8,912,896
- amplification: 68x
For an 8192x8192 source merged with 256x256 output chunks (1024 chunks),
the amplification is ~1024x and pushes driver-side data flow into
terabyte territory.
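The arithmetic behind those figures, assuming float64 pixels:

```python
# Measured case: 8,912,896 pixels materialized vs. 131,072 source pixels.
measured_pixels = 8_912_896
total_src_pixels = 2 * 256 * 256            # two 256x256 sources
amplification = measured_pixels // total_src_pixels
print(amplification)                        # 68

# Extrapolated case: 8192x8192 source, 256x256 output chunks over the
# same extent, each chunk computing the full source.
n_chunks = (8192 // 256) ** 2               # 1024 chunks
bytes_materialized = n_chunks * 8192 * 8192 * 8
print(bytes_materialized / 1e12)            # ~0.55 TB per source
```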
The fix is to slice the dask source to the chunk's window before calling
.compute(), mirroring the pattern in _reproject_chunk_numpy
(lines 273-276) and _reproject_chunk_cupy (lines 425-428), both of which
slice first and then compute only the window.
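A minimal sketch of that pattern (the helper name is hypothetical): slicing a dask array is lazy, so computing the slice materializes only the source chunks that overlap the window.

```python
import dask.array as da
import numpy as np

def compute_window(src_dask, row_slice, col_slice):
    # Slicing is lazy; .compute() on the slice pulls only the overlapping
    # source chunks instead of the whole array.
    return src_dask[row_slice, col_slice].compute()

src = da.from_array(np.arange(256 * 256, dtype="f8").reshape(256, 256),
                    chunks=(64, 64))
window = compute_window(src, slice(0, 32), slice(0, 32))
print(window.shape)   # (32, 32)
```

For a 32x32 window over 64x64 source chunks, only a single source chunk is loaded rather than all sixteen.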
Reproducer:
import dask.array as da
import xarray as xr
import numpy as np
from xrspatial.reproject import merge
t1 = xr.DataArray(
    da.from_array(np.arange(256 * 256, dtype='f8').reshape(256, 256),
                  chunks=(64, 64)),
    dims=['y', 'x'],
    coords={'y': np.linspace(40, 35, 256),
            'x': np.linspace(-10, -5, 256)},
    attrs={'crs': 'EPSG:4326'},
)
t2 = xr.DataArray(
    da.from_array(np.ones((256, 256)) * 2, chunks=(64, 64)),
    dims=['y', 'x'],
    coords={'y': np.linspace(40, 35, 256),
            'x': np.linspace(-5, 0, 256)},
    attrs={'crs': 'EPSG:4326'},
)
merge([t1, t2], strategy='first', chunk_size=32).compute()
# Patch da.Array.compute to count -- 136 calls, 68x source size materialized
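One way to obtain those counts (a sketch, not the sweep's actual harness): monkey-patch da.Array.compute to tally calls and materialized pixels around the merge call.

```python
import dask.array as da

# Wrap da.Array.compute to count calls and the pixels each call materializes.
counts = {"calls": 0, "pixels": 0}
_orig_compute = da.Array.compute

def _counting_compute(self, *args, **kwargs):
    counts["calls"] += 1
    counts["pixels"] += self.size
    return _orig_compute(self, *args, **kwargs)

da.Array.compute = _counting_compute
try:
    # Substitute the merge(...).compute() call here; a tiny array stands in.
    da.ones((4, 4), chunks=(2, 2)).compute()
finally:
    da.Array.compute = _orig_compute

print(counts)
```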
Surfaced by the 2026-05-10 reproject performance sweep.