diff --git a/.claude/commands/backend-parity.md b/.claude/commands/backend-parity.md new file mode 100644 index 00000000..f6fd804d --- /dev/null +++ b/.claude/commands/backend-parity.md @@ -0,0 +1,159 @@ +# Backend Parity: Cross-Backend Consistency Audit + +Verify that all implemented backends produce consistent results for a given +function or set of functions. The prompt is: $ARGUMENTS + +--- + +## Step 1 -- Identify targets + +1. If $ARGUMENTS names specific functions (e.g. `slope`, `aspect`), use those. +2. If $ARGUMENTS names a category (e.g. `hydrology`, `surface`, `focal`), read + `README.md` to find all functions in that category. +3. If $ARGUMENTS is empty or says "all", scan the full feature matrix in `README.md` + and test every function that claims support for 2+ backends. +4. For each function, read its source file and find the `ArrayTypeFunctionMapping` + call to determine which backends are actually implemented (not just what the + README claims). + +## Step 2 -- Build test inputs + +For each target function, create test rasters at three scales: + +| Name | Size | Purpose | +|---------|---------|--------------------------------------------------| +| tiny | 8x6 | Fast, easy to inspect cell-by-cell | +| medium | 64x64 | Catches chunk-boundary artifacts in dask | +| large | 256x256 | Stress test, exposes numerical accumulation drift | + +For each size, generate two variants: +- **Clean:** no NaN, realistic value range for the function + (e.g. 0-5000m for elevation, 0-1 for NDVI inputs) +- **Dirty:** 5-10% random NaN, some extreme values near dtype limits + +Use `np.random.default_rng(42)` for reproducibility. For functions that require +specific input structure (e.g. `flow_direction` needs a DEM with drainage, not +random noise), use the project's `perlin` module or a synthetic cone/valley. + +Also test with at least two dtypes: `float32` and `float64`. + +## Step 3 -- Run every backend + +For each function, input variant, and dtype: + +1. 
**NumPy:** `create_test_raster(data, backend='numpy')` -- always the baseline. +2. **Dask+NumPy:** test with two chunk configurations: + - `chunks=(size//2, size//2)` -- even split + - `chunks=(size//3, size//3)` -- ragged remainder +3. **CuPy:** `create_test_raster(data, backend='cupy')` -- skip if CUDA unavailable. +4. **Dask+CuPy:** `create_test_raster(data, backend='dask+cupy')` -- skip if CUDA + unavailable. + +If the function has parameter variants (e.g. `boundary`, `method`), test the +default parameters first. If $ARGUMENTS includes "thorough", also sweep all +parameter combinations. + +## Step 4 -- Pairwise comparison + +For every non-NumPy result, compare against the NumPy baseline. Extract data using +the project conventions: +- Dask: `.data.compute()` +- CuPy: `.data.get()` +- Dask+CuPy: `.data.compute().get()` + +For each pair, compute and record: + +### 4a. Value agreement +```python +abs_diff = np.abs(result - baseline) +max_abs = np.nanmax(abs_diff) +rel_diff = abs_diff / (np.abs(baseline) + 1e-30) # avoid div-by-zero +max_rel = np.nanmax(rel_diff) +mean_abs = np.nanmean(abs_diff) +``` + +### 4b. NaN mask agreement +```python +nan_match = np.array_equal(np.isnan(result), np.isnan(baseline)) +nan_only_in_result = np.sum(np.isnan(result) & ~np.isnan(baseline)) +nan_only_in_baseline = np.sum(np.isnan(baseline) & ~np.isnan(result)) +``` + +### 4c. Metadata preservation +Using `general_output_checks` from `general_checks.py`: +- Output type matches input type (DataArray backed by the same array type) +- Shape, dims, coords, attrs preserved + +### 4d. Pass/fail thresholds + +| Comparison | rtol | atol | +|-----------------------|----------|----------| +| NumPy vs Dask+NumPy | 1e-5 | 0 | +| NumPy vs CuPy | 1e-6 | 1e-6 | +| NumPy vs Dask+CuPy | 1e-6 | 1e-6 | + +A comparison **fails** if `max_abs > atol` AND `max_rel > rtol`, or if NaN masks +disagree. 

## Step 5 -- Chunk boundary analysis

Dask backends are the most likely source of parity issues due to `map_overlap`
boundary handling. For any Dask comparison that fails or is borderline:

1. Identify which cells diverge from the NumPy result.
2. Map those cells to chunk boundaries (cells within `depth` pixels of a chunk edge).
3. Report what percentage of divergent cells are at chunk boundaries vs interior.
4. If all divergence is at boundaries, the issue is likely in the `map_overlap`
   `depth` or `boundary` parameter. Say so explicitly.

## Step 6 -- Generate the report

```
## Backend Parity Report

### Functions tested
| Function            | Backends implemented         | Source file              |
|---------------------|------------------------------|--------------------------|
| slope               | numpy, cupy, dask, dask+cupy | xrspatial/slope.py       |
| ...                 | ...                          | ...                      |

### Parity Matrix

#### <function name>
| Comparison             | Input       | Dtype   | Max |Δ|  | Max |Δ/ref| | NaN match | Metadata | Status |
|------------------------|-------------|---------|----------|------------|-----------|----------|--------|
| NumPy vs Dask+NumPy    | tiny clean  | float32 | ...      | ...        | yes       | ok       | PASS   |
| NumPy vs Dask+NumPy    | medium dirty| float64 | ...      | ...        | yes       | ok       | PASS   |
| NumPy vs CuPy          | tiny clean  | float32 | ...      | ...        | no (3)    | ok       | FAIL   |
| ...                    | ...         | ...     | ...      | ...        | ...       | ...      | ...    |

### Failures
For each FAIL row:
- Which cells diverged
- Whether divergence correlates with chunk boundaries (Dask) or specific
  input values (CuPy)
- Likely root cause
- Suggested fix

### Summary
- Functions tested: N
- Total comparisons: N
- Passed: N
- Failed: N
- Skipped (no CUDA): N
```

---

## General rules

- Do not modify any source or test files. This command is read-only.
- Use `create_test_raster` from `general_checks.py` for all raster construction.
- Any temporary files must include the function name for uniqueness.
- If CUDA is unavailable, skip CuPy and Dask+CuPy gracefully. 
Report them + as SKIPPED, not FAIL. +- If $ARGUMENTS includes "fix", still do not auto-fix. Report the issue and ask. +- If a function is not in `ArrayTypeFunctionMapping` (e.g. it only has a numpy + path), note it as "single-backend only" and skip parity checks for it. +- If $ARGUMENTS includes a specific tolerance (e.g. `rtol=1e-3`), override the + defaults in the threshold table. diff --git a/.claude/commands/bench.md b/.claude/commands/bench.md new file mode 100644 index 00000000..60c56d1f --- /dev/null +++ b/.claude/commands/bench.md @@ -0,0 +1,127 @@ +# Bench: Local Performance Comparison + +Run ASV benchmarks for the current branch against master and report regressions +and improvements. The prompt is: $ARGUMENTS + +--- + +## Step 1 -- Identify what changed + +1. If $ARGUMENTS names specific benchmark classes or functions (e.g. `Slope`, + `flow_accumulation`), use those directly. +2. If $ARGUMENTS is empty or says "auto", run `git diff origin/master --name-only` + to find changed source files under `xrspatial/`. Map each changed file to the + corresponding benchmark module in `benchmarks/benchmarks/`. Use the filename + and imports to match (e.g. changes to `slope.py` map to `benchmarks/benchmarks/slope.py`). +3. If no benchmark exists for the changed code, note this in the report and + suggest whether one should be added. + +## Step 2 -- Check prerequisites + +1. Verify ASV is installed: `python -c "import asv"`. If missing, tell the user + to install it (`pip install asv`) and stop. +2. Verify the benchmarks directory exists at `benchmarks/`. +3. Read `benchmarks/asv.conf.json` to confirm the project name and branch settings. +4. Check whether the ASV machine file exists (`.asv/machine.json`). If not, run + `cd benchmarks && asv machine --yes` to initialize it. 

## Step 3 -- Run the comparison

Run ASV in continuous-comparison mode from the `benchmarks/` directory:

```bash
cd benchmarks && asv continuous origin/master HEAD -b "<pattern>" -e
```

Where `<pattern>` matches the benchmark classes identified in Step 1
(e.g. `Slope|Aspect` or `FlowAccumulation`). The `-e` flag shows stderr on failure.

If $ARGUMENTS contains "quick", add `--quick` to run each benchmark only once
(faster but noisier).

If $ARGUMENTS contains "full", omit the `-b` filter to run all benchmarks.

## Step 4 -- Parse and interpret results

ASV continuous outputs lines like:
```
BENCHMARKS NOT SIGNIFICANTLY CHANGED.
```
or:
```
REGRESSION: benchmarks.slope.Slope.time_numpy 3.45ms -> 5.67ms (1.64x)
IMPROVED: benchmarks.slope.Slope.time_dask 8.12ms -> 4.23ms (0.52x)
```

Parse the output and classify each result:

| Category     | Criteria                    |
|--------------|-----------------------------|
| REGRESSION   | Ratio > 1.2x (matches CI)   |
| IMPROVED     | Ratio < 0.8x                |
| UNCHANGED    | Between 0.8x and 1.2x       |

## Step 5 -- Generate the report

```
## Benchmark Report: <branch> vs master

### Changed files
- <list of changed source files>

### Benchmarks run
- <list of benchmark classes>

### Results

| Benchmark                          | master    | HEAD      | Ratio | Status     |
|------------------------------------|-----------|-----------|-------|------------|
| slope.Slope.time_numpy             | 3.45 ms   | 3.51 ms   | 1.02x | UNCHANGED  |
| slope.Slope.time_dask_numpy        | 8.12 ms   | 4.23 ms   | 0.52x | IMPROVED   |
| ...                                | ...       | ...       | ...   | ...        |

### Regressions
<for each regression: benchmark, ratio, likely cause>

### Improvements
<for each improvement: benchmark, ratio>

### Missing benchmarks
<changed code lacking benchmark coverage, or "None">

### Recommendation
- [ ] Safe to merge (no regressions)
- [ ] Add "performance" label to PR (regressions found, CI will recheck)
- [ ] Consider adding benchmarks for: <list>
```

## Step 6 -- Suggest benchmark additions (if gaps found)

If Step 1 found changed functions with no benchmark coverage:

1. Read an existing benchmark file in `benchmarks/benchmarks/` that covers a
   similar function (same category or same backend pattern).
2. Describe what a new benchmark should test:
   - Which function and parameter variants
   - Suggested array sizes (match `common.py` conventions)
   - Which backends to benchmark (numpy at minimum, dask if applicable)
3. Ask the user whether they want you to write the benchmark file.

Do NOT write benchmark files automatically. Report the gap and propose, then wait.

---

## General rules

- Always run benchmarks from the `benchmarks/` directory, not the project root.
- The regression threshold is 1.2x, matching `.github/workflows/benchmarks.yml`.
  Do not change this unless $ARGUMENTS overrides it.
- If ASV setup or machine detection fails, report the error clearly and suggest
  the fix. Do not retry in a loop.
- If benchmarks take longer than 5 minutes per class, note the elapsed time so
  the user can plan accordingly.
- Do not modify any source, test, or benchmark files. This command is read-only
  analysis (unless the user explicitly asks for a benchmark to be written in
  response to Step 6).
- If $ARGUMENTS says "compare <ref1> <ref2>", run
  `asv continuous <ref1> <ref2>` instead of the default origin/master vs HEAD.

diff --git a/.claude/commands/efficiency-audit.md b/.claude/commands/efficiency-audit.md
new file mode 100644
index 00000000..471e463a
--- /dev/null
+++ b/.claude/commands/efficiency-audit.md
@@ -0,0 +1,274 @@

# Efficiency Audit: Compute Waste and Anti-Pattern Detection

Analyze source code for performance anti-patterns specific to the NumPy / CuPy /
Dask / Numba stack. 
The prompt is: $ARGUMENTS + +--- + +## Step 0 -- Determine mode + +Check $ARGUMENTS for a mode keyword: + +- **`compare`**: Skip straight to Step 7 (post-fix comparison). Requires a saved + baseline file from a previous run. +- **`no-bench`**: Run the static audit only (Steps 1-6), skip benchmarking entirely. +- **Otherwise** (default): Run the full audit with baseline benchmarks. + +## Step 1 -- Scope the audit + +1. If $ARGUMENTS names specific files or functions, audit only those. +2. If $ARGUMENTS names a category (e.g. `hydrology`, `surface`), identify all + source files in that category from the README feature matrix. +3. If $ARGUMENTS is empty or says "all", audit every `.py` file under `xrspatial/` + (excluding `tests/`, `datasets/`, and `__pycache__/`). +4. Read each file in scope. + +## Step 2 -- Static analysis: Dask anti-patterns + +Search for these patterns in each file. For every hit, record the file, line +number, the offending code, and the severity (HIGH / MEDIUM / LOW). + +### 2a. Premature materialization (HIGH) +- **`.values` on a Dask-backed DataArray or CuPy array:** forces a full compute + or GPU-to-CPU transfer. Search for `.values` usage outside of tests. +- **`.compute()` inside a loop or repeated call:** materializes the full graph + each iteration instead of building a lazy pipeline. +- **`np.array()` or `np.asarray()` wrapping a Dask or CuPy array:** silent + materialization. + +### 2b. Chunking issues (MEDIUM) +- **`da.stack()` without a following `.rechunk()`:** creates size-1 chunks on the + new axis, causing extreme task-graph overhead. +- **`map_overlap` with depth >= chunk_size / 2:** overlap regions dominate the + chunk, wasting memory and compute. Flag if depth is not obviously small relative + to expected chunk sizes. +- **Missing `boundary` argument in `map_overlap`:** defaults may not match the + function's intended boundary handling. + +### 2c. 
Redundant computation (MEDIUM) +- **Calling the same function twice on the same input** without caching the result + (e.g. computing slope inside aspect when aspect already computes slope internally). +- **Building large intermediate arrays** that could be fused into the kernel + (e.g. allocating a full-size output array, then filling it cell by cell in Numba + instead of writing directly). + +## Step 3 -- Static analysis: GPU anti-patterns + +### 3a. Register pressure (HIGH) +- **CUDA kernels with many float64 local variables:** count the number of named + float64 locals in each `@cuda.jit` kernel. Flag kernels with more than 20 + float64 locals (likely to spill to slow local memory). +- **Thread blocks larger than 16x16 on register-heavy kernels:** check the + `cuda_args()` call or any custom dims function. If the kernel has high register + count and uses 32x32 blocks, flag it. + +### 3b. Unnecessary transfers (HIGH) +- **`.data.get()` followed by CuPy operations:** data round-trips GPU -> CPU -> GPU. +- **`cupy.asarray(numpy_array)` inside a hot path:** repeated CPU -> GPU transfers + that could be hoisted outside the loop. +- **Mixing NumPy and CuPy operations** in the same function without an obvious + reason (e.g. `np.where` on a CuPy array silently converts to NumPy). + +### 3c. Kernel launch overhead (LOW) +- **Per-cell kernel launches:** launching a CUDA kernel inside a Python loop over + cells instead of processing the full grid in one kernel launch. +- **Small array kernel launches:** calling a CUDA kernel on arrays smaller than + the thread block (overhead dominates). + +## Step 4 -- Static analysis: Numba anti-patterns + +### 4a. JIT compilation issues (MEDIUM) +- **Missing `@ngjit` or `@jit(nopython=True)`:** pure-Python loops over arrays + without JIT compilation. Search for nested `for` loops operating on `.data` + arrays without a Numba decorator. +- **Object-mode fallback:** `@jit` without `nopython=True` may silently fall back + to object mode. 
Only `@ngjit` or `@jit(nopython=True)` guarantees compilation.
- **Type instability:** mixing int and float in Numba functions (e.g. initializing
  with `0` then assigning a float) can cause unnecessary casts.

### 4b. Memory layout (LOW)
- **Column-major iteration on row-major arrays:** Numba loops that iterate
  `for col ... for row` on C-contiguous arrays (cache-unfriendly access pattern).
  The inner loop should iterate over the last axis (columns for row-major).

## Step 5 -- Static analysis: General Python anti-patterns

### 5a. Unnecessary copies (MEDIUM)
- **`.copy()` on arrays that are never mutated:** wasted allocation.
- **`np.zeros_like()` + fill loop:** when `np.empty()` + fill or direct
  computation would avoid zero-initialization overhead.

### 5b. Inefficient I/O patterns (LOW)
- **Reading the same file multiple times** in a function.
- **Writing intermediate results to disk** when they could stay in memory.

## Step 6 -- Baseline benchmarks

**Skip this step if mode is `no-bench` or `compare`.**

For each public function in the audited scope, capture rough baseline timings.
This does not use ASV; it runs quick inline timings so the user gets a
before-snapshot without heavyweight setup.

### 6a. Build a benchmark script

Create a temporary script at `/tmp/efficiency_audit_bench_<hash>.py` (use a
short hash of the audited file list to keep the name unique). The script should:

1. Import the public functions found in the audited files.
2. Generate a test array using the same helper pattern as
   `benchmarks/benchmarks/common.py`:
   ```python
   import numpy as np, xarray as xr
   ny, nx = 512, 512  # moderate size -- fast but meaningful
   x = np.linspace(-180, 180, nx)
   y = np.linspace(-90, 90, ny)
   x2, y2 = np.meshgrid(x, y)
   z = 100.0 * np.exp(-x2**2 / 5e5 - y2**2 / 2e5)
   z += np.random.default_rng(71942).normal(0, 2, (ny, nx))
   raster = xr.DataArray(z, dims=['y', 'x'])
   ```
   Adjust as needed (e.g. 
add coords for geodesic functions, integer data for
   zonal, etc.).
3. For each function, time it with `timeit.repeat(number=1, repeat=3)` and take
   the **median** of the repeats. One iteration is enough -- we want a rough
   ballpark, not precise statistics.
4. Print results as JSON to stdout:
   ```json
   {
     "scope": ["slope.py", "aspect.py"],
     "array_shape": [512, 512],
     "backend": "numpy",
     "timings": {
       "slope": {"median_ms": 12.3, "runs": [12.1, 12.3, 13.0]},
       "aspect": {"median_ms": 8.7, "runs": [8.5, 8.7, 9.1]}
     }
   }
   ```

### 6b. Run the benchmark script

Execute the script and capture stdout. If a function errors (e.g. missing
optional dependency), record `"error": "<message>"` instead of timings and
continue with the rest.

### 6c. Save the baseline

Write the JSON output to `.efficiency-audit-baseline.json` in the project root.
This file is gitignored-by-convention (do not add it to git). Tell the user the
baseline has been saved and what it contains.

If a baseline file already exists, back it up to
`.efficiency-audit-baseline.prev.json` before overwriting.

## Step 7 -- Generate the report

```
## Efficiency Audit Report

### Scope
- Files audited: N
- Functions audited: N

### Findings

#### HIGH severity
| # | File:Line          | Pattern                   | Description                           | Fix                              |
|---|--------------------|---------------------------|---------------------------------------|----------------------------------|
| 1 | slope.py:142       | Premature materialization | `.values` on dask input in _run_dask  | Use `.data.compute()` instead    |
| 2 | geodesic.py:87     | Register pressure         | 24 float64 locals in _gpu kernel      | Split kernel or use 16x16 blocks |
| ...| ...               | ...                       | ...                                   | ...                              |

#### MEDIUM severity
| # | File:Line          | Pattern                   | Description                           | Fix                              |
|---|--------------------|---------------------------|---------------------------------------|----------------------------------|
| ...| ... | ... | ... | ...
|

#### LOW severity
| # | File:Line          | Pattern                   | Description                           | Fix                              |
|---|--------------------|---------------------------|---------------------------------------|----------------------------------|
| ...| ... | ... | ... | ... |

### Baseline Timings (512x512, numpy)
| Function   | Median (ms) | Runs (ms)           |
|------------|-------------|---------------------|
| slope      | 12.3        | 12.1, 12.3, 13.0    |
| aspect     | 8.7         | 8.5, 8.7, 9.1       |
| ...        | ...         | ...                 |

(If any function errored, show "ERROR: <message>" in the Median column.)

### Summary
- HIGH: N findings
- MEDIUM: N findings
- LOW: N findings
- Clean files (no issues): <list>

### Recommendations
<prioritized fixes, most impactful first>
```

## Step 8 -- Post-fix comparison (mode=`compare`)

**Only run this step when $ARGUMENTS contains `compare`.**

1. Read `.efficiency-audit-baseline.json` from the project root. If it does not
   exist, tell the user to run the audit without `compare` first to capture a
   baseline, and stop.
2. Regenerate the benchmark script from Step 6a using the `scope` and
   `array_shape` recorded in the baseline file (so the comparison is apples to
   apples).
3. Run the benchmark script (Step 6b) and capture the new timings.
4. For each function, compute the ratio: `new_median / old_median`.

Generate a comparison report:

```
## Efficiency Audit: Post-Fix Comparison

### Baseline
- Captured: <timestamp>
- Array shape: <shape>
- Backend: <backend>

### Results

| Function   | Before (ms) | After (ms) | Ratio | Verdict      |
|------------|-------------|------------|-------|--------------|
| slope      | 12.3        | 7.1        | 0.58x | IMPROVED     |
| aspect     | 8.7         | 8.5        | 0.98x | UNCHANGED    |
| ...        | ...         | ...        | ...   | ...          |

Thresholds: IMPROVED < 0.8x, REGRESSION > 1.2x, else UNCHANGED.

### Net impact
- Functions improved: N
- Functions regressed: N
- Functions unchanged: N
- Overall: <assessment>
```

5. Save the new timings to `.efficiency-audit-after.json` for reference.

---

## General rules

- Do not modify source, test, or benchmark files. 
Temporary scripts go in `/tmp/`.
- Only flag patterns that are actually present in the code. Do not report
  hypothetical issues or patterns that "could" occur.
- Include the exact file path and line number for every finding so the user
  can navigate directly to the issue.
- False positives are worse than missed issues. If you are not confident a
  pattern is actually harmful in context (e.g. `.values` used intentionally
  on a known-numpy array), do not flag it.
- If $ARGUMENTS includes "fix", still do not auto-fix. Report and ask.
- If $ARGUMENTS includes a severity filter (e.g. "high only"), only report
  findings at that severity level.
- If $ARGUMENTS includes "diff" or "changed", restrict the audit to files
  changed on the current branch vs origin/master.
- Baseline benchmark scripts are disposable. Clean up `/tmp/` scripts after
  capturing results.
- The 512x512 array size is a default. If $ARGUMENTS includes a size like
  `1024x1024` or `small`, adjust accordingly. "small" = 128x128, "large" = 2048x2048.

diff --git a/.claude/commands/review-pr.md b/.claude/commands/review-pr.md
new file mode 100644
index 00000000..3cd092b6
--- /dev/null
+++ b/.claude/commands/review-pr.md
@@ -0,0 +1,186 @@

# Review PR: Domain-Aware Pull Request Review

Review a pull request with checks specific to a geospatial raster library built on
NumPy, Dask, CuPy, and Numba. The prompt is: $ARGUMENTS

---

## Step 1 -- Load the PR

1. If $ARGUMENTS contains a PR number (e.g. `123`), fetch it:
   ```bash
   gh pr view <number> --json title,body,files,commits,baseRefName,headRefName
   ```
2. If $ARGUMENTS is empty, check whether the current branch has an open PR:
   ```bash
   gh pr view --json title,body,files,commits,baseRefName,headRefName
   ```
3. If neither works, tell the user to provide a PR number and stop.
4. Get the full diff:
   ```bash
   gh pr diff <number>
   ```
5. Read every changed file in full (not just the diff) to understand surrounding
   context.
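Step 1's fetch-and-extract logic could be scripted roughly as below. `load_pr` and `changed_paths` are illustrative names; the JSON field names follow the `gh pr view --json` invocation above, where each entry of `files` carries a `path` key:

```python
import json
import subprocess

def load_pr(pr_number=None):
    """Fetch PR metadata via the gh CLI, using the fields from Step 1."""
    cmd = ["gh", "pr", "view", "--json",
           "title,body,files,commits,baseRefName,headRefName"]
    if pr_number is not None:
        # `gh pr view 123 --json ...` targets a specific PR instead of the branch
        cmd.insert(3, str(pr_number))
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        raise RuntimeError("No PR found -- ask the user for a PR number.")
    return json.loads(proc.stdout)

def changed_paths(pr_info):
    """List changed file paths from the gh JSON payload."""
    return [f["path"] for f in pr_info.get("files", [])]
```

Each path returned by `changed_paths` should then be read in full, per item 5 above.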
+ +## Step 2 -- Correctness review + +Check the changed code for numerical and algorithmic correctness: + +### 2a. Algorithm accuracy +- Does the implementation match the cited algorithm or paper? If a paper or + standard is referenced (in comments, docstring, or PR body), verify the + formulas match. +- Are there off-by-one errors in neighborhood indexing (common in 3x3 kernels)? +- Is the output in the correct units and range? (e.g. slope in degrees 0-90, + aspect in degrees 0-360, NDVI in -1 to 1) + +### 2b. Floating point concerns +- Are there divisions that could produce inf or NaN on valid input? +- Is there catastrophic cancellation risk (subtracting nearly equal large numbers)? +- Does the code handle the float32 vs float64 distinction correctly? (e.g. using + float64 intermediates for accumulation, returning the expected output dtype) + +### 2c. NaN handling +- Does the function propagate NaN correctly for its semantics? +- For neighborhood operations with `boundary='nan'`: do edge cells become NaN? +- Are NaN checks using `np.isnan` (not `== np.nan`)? + +### 2d. Edge cases +- Empty input, single-row, single-column, 1x1 rasters +- All-NaN input +- Constant-value input (derivative operations should return zero) +- Very large or very small values + +## Step 3 -- Backend completeness review + +### 3a. Dispatch registration +- Does the `ArrayTypeFunctionMapping` include all four backends? +- If a backend is intentionally omitted, is there a comment explaining why? +- Does the public function's docstring mention which backends are supported? + +### 3b. Dask correctness +- Does `map_overlap` use the correct `depth` for the kernel size? + (depth should be `kernel_radius`, e.g. 1 for a 3x3 kernel) +- Is the `boundary` parameter forwarded correctly from the public API to + `map_overlap`? +- Does the chunk function return the same shape as its input? +- For 3D stacked arrays: is `.rechunk({0: N})` called after `da.stack()`? + +### 3c. 
CuPy correctness +- Does the CUDA kernel handle array bounds correctly (guard against + out-of-bounds thread indices)? +- Is the thread block size appropriate for the kernel's register usage? +- Are results extracted with `.data.get()`, not `.values`? + +## Step 4 -- Performance review + +### 4a. Anti-patterns +Run the same checks as `/efficiency-audit` but scoped to only the changed files. +Specifically check for: +- Premature materialization (`.values`, `.compute()` in loops) +- Unnecessary copies +- GPU register pressure in new CUDA kernels +- Missing `@ngjit` on CPU loops + +### 4b. Benchmark coverage +- Does a benchmark exist in `benchmarks/benchmarks/` for the changed function? +- If this PR adds a new function, does it also add a benchmark? +- If the PR modifies performance-critical code, should the "performance" label + be added? + +## Step 5 -- Test coverage review + +### 5a. Test existence +- Are there tests for the changed code? +- Do tests cover all implemented backends (using the helpers from + `general_checks.py`)? + +### 5b. Test quality +- Do tests compare against known reference values (QGIS, analytical, etc.), + not just "does it run without crashing"? +- Are edge cases tested (NaN, constant surface, boundary modes)? +- Do dask tests use multiple chunk sizes (including ragged chunks)? +- Are temporary files uniquely named? + +### 5c. Missing tests +- List any code paths or parameter combinations that have no test coverage. + +## Step 6 -- Documentation and API review + +### 6a. Docstrings +- Does every new public function have a docstring with Parameters, Returns, + and a short description? +- Are parameter types and defaults documented? + +### 6b. README feature matrix +- If a new function was added, is it in the README feature matrix? +- Are the backend checkmarks accurate? + +### 6c. API consistency +- Does the function signature follow the project's conventions? + (e.g. 
`agg` for input DataArray, `name` for output name, `boundary` for
  boundary mode)
- Does it return an `xr.DataArray` with coords, dims, and attrs preserved?

## Step 7 -- Generate the review

Format the review as a structured comment suitable for posting on the PR.
Organize findings by severity:

```
## PR Review: <PR title>

### Blockers (must fix before merge)
- [ ] <finding with file:line reference>

### Suggestions (should fix, not blocking)
- [ ] <finding with file:line reference>

### Nits (optional improvements)
- [ ] <finding with file:line reference>

### What looks good
- <positive observations, kept brief>

### Checklist
- [ ] Algorithm matches reference/paper
- [ ] All implemented backends produce consistent results
- [ ] NaN handling is correct
- [ ] Edge cases are covered by tests
- [ ] Dask chunk boundaries handled correctly
- [ ] No premature materialization or unnecessary copies
- [ ] Benchmark exists or is not needed
- [ ] README feature matrix updated (if applicable)
- [ ] Docstrings present and accurate
```

After generating the review, **run it through the `/humanizer` skill** before
showing it to the user or posting it to GitHub.

## Step 8 -- Post (if requested)

If $ARGUMENTS includes "post" or "comment":
1. Post the review as a PR comment using `gh pr comment <number> --body "..."`.
2. Confirm the comment was posted successfully.

If $ARGUMENTS does not include "post", show the review to the user and ask
whether they want it posted.

---

## General rules

- Do not approve or request changes on the PR via GitHub's review system. Only
  post comments.
- Read the full context of changed files, not just the diff. Many bugs are only
  visible when you understand the surrounding code.
- Be specific. Every finding must include a file path and line number. Vague
  feedback ("consider improving performance") is not useful.
+- Do not suggest changes to code that was not modified in the PR unless the + existing code has a clear bug that the PR makes worse. +- False positives erode trust. If you are uncertain whether something is a + problem, say so explicitly rather than presenting it as a definite issue. +- Run `/humanizer` on the final review text before posting or displaying. +- If $ARGUMENTS includes "quick", skip Steps 4 and 6 (performance and docs) + and focus only on correctness, backend parity, and test coverage. diff --git a/.claude/commands/validate.md b/.claude/commands/validate.md new file mode 100644 index 00000000..23b130ab --- /dev/null +++ b/.claude/commands/validate.md @@ -0,0 +1,216 @@ +# Validate: Numerical Accuracy and Backend Parity Check + +Take a function name (or detect the changed function from the current branch diff) +and verify its numerical accuracy against reference implementations and across all +four backends. The prompt is: $ARGUMENTS + +--- + +## Step 1 -- Identify the target + +1. If $ARGUMENTS names a specific function (e.g. `slope`, `flow_accumulation`), + use that. +2. If $ARGUMENTS is empty or says "auto", run `git diff origin/master --name-only` + to find changed source files under `xrspatial/`. Identify which public functions + were added or modified. If multiple functions changed, validate each one. +3. Read the function's source to understand: + - Which backends are implemented (check the `ArrayTypeFunctionMapping` call) + - What parameters it accepts (boundary modes, method variants, etc.) + - What the expected output range and dtype should be + - Whether it's a neighborhood operation (uses `map_overlap`) or a per-cell operation + +## Step 2 -- Select or build reference data + +Build **three** test datasets, each serving a different purpose: + +### 2a. Analytical known-answer dataset +Create a small synthetic raster where the correct answer can be computed by hand +or from a closed-form formula. 
+Examples:
+
+- **Slope/aspect:** a perfect plane tilted at a known angle (e.g. `z = 2x + 3y`
+  gives slope = arctan(sqrt(13)) for the planar method with unit cell size)
+- **Flow direction:** a simple cone or V-shaped valley where flow paths are obvious
+- **Focal:** a raster with a single non-zero cell surrounded by zeros
+- **Multispectral indices:** bands with known ratios so NDVI/NDWI etc. are trivially
+  verifiable
+
+Compute the expected result array by hand (or with basic numpy math) and store it
+as a numpy array. This is the **ground truth** for this dataset.
+
+### 2b. QGIS / rasterio / scipy reference dataset
+Check whether the function's existing test file already has a reference fixture
+(like `qgis_slope` in `test_slope.py`). If so, reuse it.
+
+If no reference exists, attempt to compute one:
+1. Check if `rasterio` is installed (`python -c "import rasterio"`). If available,
+   write the test raster to a temporary GeoTIFF (unique name including the function
+   name, e.g. `tmp_validate_slope.tif`) and run the equivalent rasterio/GDAL operation.
+2. If rasterio is not available, check for `scipy.ndimage` equivalents (e.g.
+   `generic_filter`, `uniform_filter`, `sobel`).
+3. If neither is available, skip this dataset and note it in the report.
+
+### 2c. Realistic stress dataset
+Generate a larger raster (at least 256x256) with terrain-like features using the
+project's `perlin` module or `np.random.default_rng(42)`. Include:
+- NaN patches (5-10% of cells) to test NaN propagation
+- A mix of flat and steep areas
+- Edge values near dtype limits for the tested dtypes
+
+This dataset is for backend parity and performance, not absolute accuracy.
+
+## Step 3 -- Run across all backends
+
+For each dataset and each parameter combination (e.g. boundary modes, method
+variants), run the function on every implemented backend:
+
+1. **NumPy** -- always available, treat as the baseline
+2. **Dask+NumPy** -- use `create_test_raster(data, backend='dask+numpy')` with
+   at least two different chunk sizes:
+   - Chunks that evenly divide the array
+   - Ragged chunks (array size not divisible by chunk size)
+3. **CuPy** -- skip with a note if CUDA is not available
+4. **Dask+CuPy** -- skip with a note if CUDA is not available
+
+Use the helpers from `general_checks.py`:
+- `create_test_raster()` to build DataArrays for each backend
+- For CuPy results, extract with `.data.get()`
+- For Dask results, extract with `.data.compute()`
+
+## Step 4 -- Compare results
+
+Run four categories of comparison, reporting pass/fail and numeric details for each:
+
+### 4a. Ground truth comparison (dataset 2a)
+Compare the NumPy backend result against the hand-computed expected array.
+```python
+np.testing.assert_allclose(result, expected, rtol=1e-6, atol=1e-10, equal_nan=True)
+```
+If this fails, the algorithm itself has a bug. Report the max absolute error,
+max relative error, and the cell location(s) where divergence is worst.
+
+### 4b. Reference implementation comparison (dataset 2b)
+Compare the NumPy result against the rasterio/scipy/QGIS reference.
+Use `rtol=1e-5` (matching the project's existing QGIS tolerance convention).
+Exclude edge cells if the implementations handle boundaries differently (document
+which edges were excluded and why).
+
+### 4c. Backend parity (all datasets)
+Compare every non-NumPy backend against the NumPy result:
+
+| Comparison            | Default tolerance      |
+|-----------------------|------------------------|
+| NumPy vs Dask+NumPy   | `rtol=1e-5`            |
+| NumPy vs CuPy         | `atol=1e-6, rtol=1e-6` |
+| NumPy vs Dask+CuPy    | `atol=1e-6, rtol=1e-6` |
+
+For each comparison, report:
+- Max absolute difference
+- Max relative difference
+- Whether NaN locations match exactly (`np.isnan` masks must be identical)
+- Whether output shape, dims, coords, and attrs are preserved (use
+  `general_output_checks`)
+
+### 4d. Edge case and invariant checks
+Run these regardless of which function is being validated:
+
+- **NaN propagation:** cells neighboring NaN input should behave correctly for the
+  function (NaN output for most neighborhood ops with `boundary='nan'`)
+- **Constant surface:** if the input is uniform (e.g. all 42.0), the output should
+  be zero for derivative operations (slope, curvature) or uniform for pass-through
+  operations
+- **Single-cell raster:** 1x1 input should not crash (may return NaN)
+- **Dtype preservation:** run with float32 and float64 inputs; verify the output
+  dtype matches expectations
+- **Boundary modes:** if the function accepts a `boundary` parameter, test all
+  valid modes (`nan`, `nearest`, `reflect`, `wrap`) and verify:
+  - Shape is preserved
+  - Non-nan modes produce no NaN output when source has no NaN
+  - NumPy and Dask results agree for each mode
+
+## Step 5 -- Generate the report
+
+Print a structured report with these sections:
+
+```
+## Validation Report: <function_name>
+
+### Target
+- Function: <name>
+- Source: <file_path>
+- Backends implemented: <list>
+- Parameter variants tested: <list>
+
+### Datasets
+| Dataset          | Shape   | Dtype   | NaN% | Notes                    |
+|------------------|---------|---------|------|--------------------------|
+| Analytical       | ...     | ...     | ...  | <description>            |
+| Reference (src)  | ...     | ...     | ...  | <reference tool used>    |
+| Stress           | ...     | ...     | ...  | <generation method>      |
+
+### Results
+
+#### Ground Truth (analytical dataset)
+- Status: PASS / FAIL
+- Max absolute error: ...
+- Max relative error: ...
+- Worst cell: (row, col) expected=... got=...
+
+#### Reference Implementation
+- Reference: <rasterio / scipy / QGIS fixture / skipped>
+- Status: PASS / FAIL / SKIPPED
+- Max absolute error: ...
+- Notes: <edge exclusions, known differences>
+
+#### Backend Parity
+| Comparison               | Dataset     | Max |Δ|   | Max |Δ/ref| | NaN match | Status |
+|--------------------------|-------------|-----------|-------------|-----------|--------|
+| NumPy vs Dask+NumPy      | analytical  | ...       | ...         | yes/no    | ...    |
+| NumPy vs Dask+NumPy      | stress      | ...       | ...         | yes/no    | ...    |
+| NumPy vs CuPy            | analytical  | ...       | ...         | yes/no    | ...    |
+| ...                      | ...         | ...       | ...         | ...       | ...    |
+
+#### Edge Cases
+| Check              | Status | Notes                               |
+|--------------------|--------|-------------------------------------|
+| NaN propagation    | ...    |                                     |
+| Constant surface   | ...    |                                     |
+| Single-cell        | ...    |                                     |
+| Dtype float32      | ...    |                                     |
+| Dtype float64      | ...    |                                     |
+| Boundary modes     | ...    | <modes tested>                      |
+
+### Verdict
+- Overall: PASS / FAIL
+- <1-3 sentence summary of findings>
+- <action items if anything failed>
+```
+
+## Step 6 -- Suggest fixes (if failures found)
+
+If any check failed:
+1. Identify the root cause (algorithm bug, boundary handling, dtype casting,
+   chunking artifact, GPU precision, etc.)
+2. Describe the fix concisely.
+3. Ask the user whether they want you to apply the fix now.
+
+Do NOT apply fixes automatically. The purpose of `/validate` is to report, not to
+change code.
+
+---
+
+## General rules
+
+- Run all comparisons in a Python script or inline pytest, not by eyeballing
+  print output. Use `np.testing.assert_allclose` for numeric checks.
+- Any temporary files (GeoTIFFs, intermediate arrays) must use unique names
+  including the function name (e.g. `tmp_validate_slope_256x256.tif`). Clean them
+  up at the end.
+- If CUDA is not available, skip GPU backends gracefully and note it in the report.
+  Never fail the validation just because a backend is unavailable.
+- If $ARGUMENTS specifies a tolerance override (e.g. "validate slope rtol=1e-3"),
+  use the provided tolerances instead of the defaults.
+- If $ARGUMENTS specifies "quick", skip the stress dataset and boundary mode sweep
+  to give a faster result.
+- Do not modify any source or test files. This command is read-only analysis.
+- If the function has a `method` parameter (e.g. `slope(method='geodesic')`),
+  validate each method variant separately.
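The core of the Step 4c parity check can be sketched as follows (synthetic stand-in arrays, not real backend output; the Dask+NumPy tolerance from the 4c table is assumed):

```python
import numpy as np

# Synthetic stand-ins: a "NumPy baseline" with NaN holes and a fake
# "Dask result" that differs only by a tiny multiplicative drift.
rng = np.random.default_rng(42)
baseline = rng.normal(loc=100.0, scale=10.0, size=(64, 64))
baseline[rng.random((64, 64)) < 0.05] = np.nan   # ~5% dirty cells
result = baseline * (1.0 + 1e-7)                 # well inside rtol=1e-5

# 1. NaN masks must agree exactly before values are compared.
assert np.array_equal(np.isnan(result), np.isnan(baseline))

# 2. Value agreement at the Dask+NumPy tolerance from the 4c table.
np.testing.assert_allclose(result, baseline, rtol=1e-5, atol=0, equal_nan=True)

# 3. Diagnostics to include in the report either way.
abs_diff = np.abs(result - baseline)
print("max |diff|:", np.nanmax(abs_diff))
print("mean |diff|:", np.nanmean(abs_diff))
```

A real run would substitute the extracted backend outputs (`.data.compute()`, `.data.get()`) for the synthetic arrays and repeat this per backend, dataset, and dtype.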