reproject.merge: same-CRS dask path materializes full source per output chunk #1571

@brendancol

When merge() is called with same-CRS dask-backed rasters, each output
chunk triggers a full .compute() on the entire source dask array, not
just the source window the chunk needs.

_merge_block_adapter at xrspatial/reproject/__init__.py:1796-1806
carries the dask source array into the closure via functools.partial,
and the same_crs_list[i] branch calls src_data.compute() on the full
array before passing it to _place_same_crs.

I measured this with two 256x256 sources, each split into 64x64 chunks,
and 32x32 output chunks:

  • total source pixels: 131,072 (2 sources x 256x256)
  • pixels materialized inside the chunk fn: 8,912,896
  • amplification: 68x
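As a sanity check on the figures above, here is a small sketch of the arithmetic. It assumes that each of the 136 counted compute() calls materialized one full 256x256 source; that matches the totals, but the per-call breakdown is an assumption.

```python
# Two 256x256 sources, as in the measurement above.
pixels_per_source = 256 * 256
source_pixels = 2 * pixels_per_source

# Assumption: every counted compute() call pulled one full source array.
compute_calls = 136
materialized = compute_calls * pixels_per_source

amplification = materialized / source_pixels
print(source_pixels, materialized, amplification)
```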

For an 8192x8192 source merged with 256x256 output chunks (1024 chunks),
the amplification is ~1024x and pushes driver-side data flow into
terabyte territory.
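A rough back-of-the-envelope version of that estimate, assuming float64 data (8 bytes per pixel) and one full-source materialization per output chunk. The dtype and the per-source accounting are assumptions, not measurements.

```python
# Assumed sizes: 8192x8192 source, 256x256 output chunks, float64 pixels.
source_side = 8192
chunk_side = 256
bytes_per_pixel = 8  # assumption: float64

n_chunks = (source_side // chunk_side) ** 2          # output chunks
bytes_per_source = source_side * source_side * bytes_per_pixel

# If every output chunk materializes the whole source once:
total_bytes = n_chunks * bytes_per_source
total_gib = total_bytes / 2**30
print(n_chunks, total_gib)
```

Under these assumptions a single source accounts for 512 GiB of materialized data, so two sources land at about 1 TiB.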

The fix is to slice the dask source to the chunk window before calling
.compute(), mirroring the pattern in _reproject_chunk_numpy
(lines 273-276) and _reproject_chunk_cupy (lines 425-428), which slice
first, then compute the window.
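A minimal sketch of the difference between the two orderings, using plain dask rather than the xrspatial internals. The function names and the row/column slices are hypothetical; the real code would derive the window from the output chunk's bounds.

```python
import dask.array as da
import numpy as np


def compute_then_window(src, rows, cols):
    # Pattern described in this issue: materialize everything, then index.
    return src.compute()[rows, cols]


def window_then_compute(src, rows, cols):
    # Proposed pattern: index the lazy array, then materialize the window.
    return src[rows, cols].compute()


src = da.from_array(
    np.arange(256 * 256, dtype="f8").reshape(256, 256), chunks=(64, 64)
)

# Hypothetical window for one 32x32 output chunk.
rows, cols = slice(32, 64), slice(96, 128)

full_first = compute_then_window(src, rows, cols)
window_first = window_then_compute(src, rows, cols)

# Same values either way; only the amount of data materialized differs.
same = np.array_equal(full_first, window_first)

# The lazy window spans a single source block instead of all 16.
window_blocks = src[rows, cols].numblocks
print(same, window_first.shape, window_blocks)
```

Both functions return the same 32x32 result, but the sliced dask array only depends on the source chunks that overlap the window, so the second form avoids pulling the full array for every output chunk.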

Reproducer:

import dask.array as da
import xarray as xr
import numpy as np
from xrspatial.reproject import merge

t1 = xr.DataArray(
    da.from_array(np.arange(256*256, dtype='f8').reshape(256, 256),
                  chunks=(64, 64)),
    dims=['y', 'x'],
    coords={'y': np.linspace(40, 35, 256),
            'x': np.linspace(-10, -5, 256)},
    attrs={'crs': 'EPSG:4326'},
)
t2 = xr.DataArray(
    da.from_array(np.ones((256, 256)) * 2, chunks=(64, 64)),
    dims=['y', 'x'],
    coords={'y': np.linspace(40, 35, 256),
            'x': np.linspace(-5, 0, 256)},
    attrs={'crs': 'EPSG:4326'},
)
merge([t1, t2], strategy='first', chunk_size=32).compute()
# With da.Array.compute patched to count calls: 136 calls, 68x the source size materialized

Surfaced by the 2026-05-10 reproject performance sweep.

Labels: performance (PR touches performance-sensitive code)