Skip to content

Fix GPU predictor=2 decode on big-endian TIFFs (#1517)#1524

Merged
brendancol merged 3 commits intoxarray-contrib:mainfrom
brendancol:fix/1517-gpu-predictor2-be
May 8, 2026
Merged

Fix GPU predictor=2 decode on big-endian TIFFs (#1517)#1524
brendancol merged 3 commits intoxarray-contrib:mainfrom
brendancol:fix/1517-gpu-predictor2-be

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Summary

The per-dtype GPU predictor=2 kernels (_predictor_decode_kernel_u16/u32/u64) view the raw decompressed byte buffer as native unsigned integers. On big-endian files the bytes are stored MSB-first, so the kernel reads the wrong integer interpretation and the prefix-sum differencing produces garbage. After PR #1515 stopped the BE byteswap from crashing, this remained as a silent correctness bug: the reproducer in #1517 showed ~93% pixel mismatch against the CPU baseline.

Fix

Option A from the issue: byteswap the decompressed buffer to native order BEFORE running the predictor=2 kernel, then skip the post-assembly byteswap. Predictor=1 and predictor=3 keep the existing post-decode swap (predictor=3 still drives byte-lane reordering through its own big_endian kernel flag).

A small helper, _swap_bytes_inplace, reverses bytes per sample on the flat uint8 buffer for both numpy and cupy arrays. The change is gated on byte_order == '>' and dtype.itemsize > 1, so the LE and uint8 paths are untouched.

Files

  • xrspatial/geotiff/_gpu_decode.py: pre-swap in _apply_predictor_and_assemble and gpu_decode_tiles; skip the post-swap when predictor == 2. Adds the _swap_bytes_inplace helper.
  • xrspatial/geotiff/tests/test_predictor2_big_endian_gpu_1517.py: regression tests covering the int32 tiled deflate reproducer, uint16/int16/uint32/int32 tiled, stripped-layout fallback, LE predictor=2, BE predictor=3, and two pure-numpy unit tests for the byte-swap helper.

Test plan

  • pytest xrspatial/geotiff/tests/test_predictor2_big_endian_gpu_1517.py (10 passed on a CUDA host)
  • pytest xrspatial/geotiff/tests/test_predictor2_big_endian.py (CPU regression, 6 passed)
  • pytest xrspatial/geotiff/tests/test_gpu_byteswap_1508.py xrspatial/geotiff/tests/test_predictor3_big_endian.py xrspatial/geotiff/tests/test_predictor_multisample.py (51 passed)
  • Full geotiff suite minus an unrelated pre-existing test_features deepcopy failure: 652 passed.

Closes #1517. Builds on #1515.

The per-dtype predictor=2 kernels view the raw decompressed byte buffer
as native uint16/uint32/uint64. On big-endian files the bytes are
MSB-first, so the prefix-sum runs on the wrong integer interpretation
and the differencing produces garbage (~93% pixel mismatch in the
reproducer from xarray-contrib#1517).

Swap the buffer to native byte order before running the predictor=2
kernel, then skip the post-assembly byteswap that previously tried to
fix things up. Predictor=1 and predictor=3 paths keep the existing
post-decode swap.

Closes xarray-contrib#1517. Builds on the byteswap helper added in xarray-contrib#1515.
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 8, 2026
@brendancol brendancol requested a review from Copilot May 8, 2026 16:17
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a silent correctness bug in the GeoTIFF GPU decode pipeline for big-endian TIFFs with predictor=2, where the predictor kernels were interpreting decompressed bytes in native order and producing incorrect values. This aligns GPU results with the established CPU baseline behavior for these files and adds targeted regression coverage.

Changes:

  • Add _swap_bytes_inplace to byteswap the decompressed byte stream to native order before running predictor=2 kernels on big-endian, multi-byte dtypes.
  • Skip the post-assembly byteswap for big-endian predictor=2 since the buffer is now pre-normalized.
  • Add a GPU regression test suite for issue #1517 plus small unit tests for the new helper.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
xrspatial/geotiff/_gpu_decode.py Pre-byteswaps decompressed buffers for BE predictor=2 and conditionally skips the final output swap; adds _swap_bytes_inplace.
xrspatial/geotiff/tests/test_predictor2_big_endian_gpu_1517.py Adds regression tests for BE predictor=2 GPU correctness across dtypes/layouts plus helper unit tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread xrspatial/geotiff/_gpu_decode.py Outdated
Comment on lines +94 to +101
if bps <= 1:
return
n = buf.size
if n % bps != 0:
raise ValueError(
f"buffer size {n} is not a multiple of bps={bps}")
view = buf.reshape(n // bps, bps)
view[:, :] = view[:, ::-1].copy()
Comment on lines +66 to +72
gpu_da = read_geotiff_gpu(str(path))
assert isinstance(gpu_da.data, cupy.ndarray), (
"expected cupy-backed DataArray; GPU path may have fallen back"
)
assert gpu_da.data.dtype == np.dtype(np.int32)
assert gpu_da.data.dtype.isnative
np.testing.assert_array_equal(gpu_da.data.get(), cpu)
Comment on lines +106 to +110
gpu_da = read_geotiff_gpu(str(path))
assert isinstance(gpu_da.data, cupy.ndarray)
assert gpu_da.data.dtype == np.dtype(dtype)
assert gpu_da.data.dtype.isnative
np.testing.assert_array_equal(gpu_da.data.get(), cpu)
brendancol added 2 commits May 8, 2026 10:11
- Rename _swap_bytes_inplace -> _swap_byte_lanes. The previous name
  implied a zero-copy operation but view[:, ::-1].copy() allocates a
  same-sized temporary. The docstring now states the temp explicitly so
  callers can size their working set, and a CUDA kernel is left as a
  follow-up if peak GPU memory becomes an issue.
- Add _block_cpu_fallback helper that monkeypatches the
  read_to_array binding inside xrspatial.geotiff. When the GPU tests
  call read_geotiff_gpu, any silent CPU fallback now raises an
  AssertionError instead of producing a passing-via-fallback result.
- Apply the guard to the four tests where the GPU decode path is what
  is being exercised (BE int32 reproducer, BE dtype matrix, LE pred2,
  BE pred3). The stripped-uint16 test still relies on the CPU
  fallback by design and stays unguarded.
Replaces the view[:, ::-1].copy() form with in-place byte reversal:

- numpy: dispatch to ndarray.byteswap(inplace=True) via a uint16/32/64
  view of the original buffer, so the original allocation is mutated
  with no temp.
- cupy: launch _byte_swap_lanes_kernel, one thread per sample, swapping
  bytes through register-resident temporaries. Peak GPU memory for the
  swap is now O(1) instead of O(buffer size).

Adds explicit validation: bps must be 2, 4, or 8 (3/5/6/7 are rejected
rather than silently corrupting), and size must be a multiple of bps.

Tests cover bps=2/4/8 on numpy and cupy, the bps=1 no-op, the
validation errors, and address-equality assertions confirming both
paths really are in-place. The existing integration tests
(test_gpu_predictor2_big_endian_*, predictor3_big_endian) still pass
unchanged.
@brendancol brendancol merged commit 44dfbbc into xarray-contrib:main May 8, 2026
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

GPU predictor=2 produces wrong values on big-endian TIFFs

2 participants