Skip to content

read_geotiff_gpu: warn on CPU fallback, add gpu='strict' (#1516)#1523

Merged
brendancol merged 3 commits intoxarray-contrib:mainfrom
brendancol:fix/1516-gpu-strict-fallback
May 8, 2026
Merged

read_geotiff_gpu: warn on CPU fallback, add gpu='strict' (#1516)#1523
brendancol merged 3 commits intoxarray-contrib:mainfrom
brendancol:fix/1516-gpu-strict-fallback

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Summary

read_geotiff_gpu wrapped the GPU decode call in try/except Exception: pass, so any failure on the GPU path was silently absorbed and execution fell through to the CPU path. The user-visible result stayed correct, which is why issue #1508 (an AttributeError from cupy.ndarray.byteswap going away) was invisible until tracked down by hand.

This PR keeps the fallback but makes failures observable.

Changes

  • New gpu keyword on read_geotiff_gpu:
    • gpu='auto' (default): existing fallback behaviour, plus a RuntimeWarning that names the original exception type and message. stacklevel=2 so the warning points at the caller.
    • gpu='strict': re-raise the original exception instead of falling back. Useful for tests and CI that exercise the GPU fast path.
    • Unknown values raise ValueError early.
  • The except clause still catches Exception rather than a narrower set. CUDA failures arrive as many distinct types (numba.cuda.cudadrv.driver.CudaAPIError, cupy.cuda.runtime.CUDARuntimeError, plain RuntimeError, ...). BaseException is intentionally not caught so KeyboardInterrupt and SystemExit propagate.
  • Docstring updated to describe both modes.

Tests

xrspatial/geotiff/tests/test_gpu_strict_fallback_1516.py monkeypatches gpu_decode_tiles_from_file to raise a synthetic RuntimeError and asserts:

  • gpu='auto' emits a RuntimeWarning and returns the CPU result.
  • gpu='strict' re-raises.
  • gpu='loose' (or any other value) raises ValueError.

Tests do not require a real GPU. A numpy-backed cupy stub is installed when the real package is unavailable, so they run on CPU-only CI.

Test plan

  • pytest xrspatial/geotiff/tests/test_gpu_strict_fallback_1516.py -x -q -> 3 passed
  • pytest xrspatial/geotiff/tests/test_gpu_byteswap_1508.py -q -> 7 passed (no regression on adjacent tests)
  • Full xrspatial/geotiff/tests/ run shows no new failures (3 pre-existing palette-plot failures are unrelated)

Closes #1516

…rib#1516)

The GPU decode path was wrapped in a too-broad ``try/except Exception:
pass`` that silently fell back to CPU on any error. Real GPU bugs (xarray-contrib#1508
was an AttributeError) lived undetected because the user-visible result
stayed correct.

Behaviour now:

- Default ``gpu='auto'`` keeps the CPU fallback but emits a
  ``RuntimeWarning`` that includes the original exception type and
  message. Stacklevel=2 so the warning points at the caller.
- New ``gpu='strict'`` re-raises the original exception instead of
  falling back. Useful for tests and CI that exercise the GPU fast
  path.
- Unknown ``gpu=`` values raise ``ValueError`` early.

Catch is still ``Exception`` rather than a narrower set: CUDA failures
arrive as many distinct types (``cuda.RuntimeError``,
``CudaAPIError``, ``CUDARuntimeError``, plain ``RuntimeError``, ...).
``BaseException`` is intentionally not caught so KeyboardInterrupt and
SystemExit propagate.

Test ``xrspatial/geotiff/tests/test_gpu_strict_fallback_1516.py``
monkeypatches ``gpu_decode_tiles_from_file`` to raise a synthetic
``RuntimeError`` and asserts the warn-and-fallback and strict-reraise
paths. Tests do not need a real GPU; a ``cupy`` shim is installed when
the real package is unavailable.

Closes xarray-contrib#1516
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 8, 2026
@brendancol brendancol requested a review from Copilot May 8, 2026 16:17
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR makes failures in read_geotiff_gpu’s GPU fast path observable instead of being silently swallowed, while also adding a “strict” mode to force GPU errors to surface (useful for CI/tests that want to exercise the GPU pipeline).

Changes:

  • Add a gpu={'auto','strict'} keyword to read_geotiff_gpu to control CPU fallback vs re-raising GPU exceptions.
  • Emit a RuntimeWarning (with caller-facing stacklevel=2) when the GPU decode fails in gpu='auto' mode.
  • Add regression tests that monkeypatch the GPU decode to raise and assert warning/raise/ValueError behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
xrspatial/geotiff/__init__.py Adds gpu mode selection, warning on fallback, and validation of gpu kwarg in read_geotiff_gpu.
xrspatial/geotiff/tests/test_gpu_strict_fallback_1516.py Adds tests covering gpu='auto' warning+fallback, gpu='strict' re-raise, and invalid kwarg rejection (with a CuPy stub for CPU-only runs).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1557 to +1566
except Exception as e:
if gpu == 'strict':
raise
warnings.warn(
f"read_geotiff_gpu: GPU decode failed "
f"({type(e).__name__}: {e}); falling back to CPU.",
RuntimeWarning,
stacklevel=2,
)
arr_gpu = None
Comment thread xrspatial/geotiff/__init__.py Outdated
Comment thread xrspatial/geotiff/tests/test_gpu_strict_fallback_1516.py Outdated
brendancol added 2 commits May 8, 2026 10:04
- Plumb gpu='strict'/'auto' through the second GPU decode stage too.
  Previously only gpu_decode_tiles_from_file failures honored the mode;
  the gpu_decode_tiles call after CPU-mmap tile extraction silently
  fell back regardless. Strict now raises and auto warns from either
  stage.
- Update the gpu kwarg docstring to describe the actual two-stage
  fallback (GDS GPU decode -> CPU-mmap + GPU decode -> full CPU + transfer).
- Strengthen _ensure_cupy_stub: also stub when cupy is installed but
  CUDA is not available, and restore the original module after each test.
- Add tests covering the second-stage strict re-raise and second-stage
  warning behaviour.
- Document that stripped layouts and sparse-tile files route directly
  to the CPU reader before any GPU decode stage runs, so the gpu kwarg
  does not affect them. cupy.asarray upload failures in those paths
  propagate unchanged.
- Tighten test_default_mode_warns_on_second_stage_failure to assert
  exactly two RuntimeWarning records (one per stage). The previous
  pytest.warns context only verified that >=1 matching warning fired,
  so a regression where one of the two handlers stopped warning would
  not have been caught.
@brendancol brendancol merged commit 6147d0a into xarray-contrib:main May 8, 2026
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

read_geotiff_gpu: silent CPU fallback hides GPU bugs

2 participants