diff --git a/.claude/commands/backend-parity.md b/.claude/commands/backend-parity.md new file mode 100644 index 00000000..f6fd804d --- /dev/null +++ b/.claude/commands/backend-parity.md @@ -0,0 +1,159 @@ +# Backend Parity: Cross-Backend Consistency Audit + +Verify that all implemented backends produce consistent results for a given +function or set of functions. The prompt is: $ARGUMENTS + +--- + +## Step 1 -- Identify targets + +1. If $ARGUMENTS names specific functions (e.g. `slope`, `aspect`), use those. +2. If $ARGUMENTS names a category (e.g. `hydrology`, `surface`, `focal`), read + `README.md` to find all functions in that category. +3. If $ARGUMENTS is empty or says "all", scan the full feature matrix in `README.md` + and test every function that claims support for 2+ backends. +4. For each function, read its source file and find the `ArrayTypeFunctionMapping` + call to determine which backends are actually implemented (not just what the + README claims). + +## Step 2 -- Build test inputs + +For each target function, create test rasters at three scales: + +| Name | Size | Purpose | +|---------|---------|--------------------------------------------------| +| tiny | 8x6 | Fast, easy to inspect cell-by-cell | +| medium | 64x64 | Catches chunk-boundary artifacts in dask | +| large | 256x256 | Stress test, exposes numerical accumulation drift | + +For each size, generate two variants: +- **Clean:** no NaN, realistic value range for the function + (e.g. 0-5000m for elevation, 0-1 for NDVI inputs) +- **Dirty:** 5-10% random NaN, some extreme values near dtype limits + +Use `np.random.default_rng(42)` for reproducibility. For functions that require +specific input structure (e.g. `flow_direction` needs a DEM with drainage, not +random noise), use the project's `perlin` module or a synthetic cone/valley. + +Also test with at least two dtypes: `float32` and `float64`. + +## Step 3 -- Run every backend + +For each function, input variant, and dtype: + +1. 
**NumPy:** `create_test_raster(data, backend='numpy')` -- always the baseline. +2. **Dask+NumPy:** test with two chunk configurations: + - `chunks=(size//2, size//2)` -- even split + - `chunks=(size//3, size//3)` -- ragged remainder +3. **CuPy:** `create_test_raster(data, backend='cupy')` -- skip if CUDA unavailable. +4. **Dask+CuPy:** `create_test_raster(data, backend='dask+cupy')` -- skip if CUDA + unavailable. + +If the function has parameter variants (e.g. `boundary`, `method`), test the +default parameters first. If $ARGUMENTS includes "thorough", also sweep all +parameter combinations. + +## Step 4 -- Pairwise comparison + +For every non-NumPy result, compare against the NumPy baseline. Extract data using +the project conventions: +- Dask: `.data.compute()` +- CuPy: `.data.get()` +- Dask+CuPy: `.data.compute().get()` + +For each pair, compute and record: + +### 4a. Value agreement +```python +abs_diff = np.abs(result - baseline) +max_abs = np.nanmax(abs_diff) +rel_diff = abs_diff / (np.abs(baseline) + 1e-30) # avoid div-by-zero +max_rel = np.nanmax(rel_diff) +mean_abs = np.nanmean(abs_diff) +``` + +### 4b. NaN mask agreement +```python +nan_match = np.array_equal(np.isnan(result), np.isnan(baseline)) +nan_only_in_result = np.sum(np.isnan(result) & ~np.isnan(baseline)) +nan_only_in_baseline = np.sum(np.isnan(baseline) & ~np.isnan(result)) +``` + +### 4c. Metadata preservation +Using `general_output_checks` from `general_checks.py`: +- Output type matches input type (DataArray backed by the same array type) +- Shape, dims, coords, attrs preserved + +### 4d. Pass/fail thresholds + +| Comparison | rtol | atol | +|-----------------------|----------|----------| +| NumPy vs Dask+NumPy | 1e-5 | 0 | +| NumPy vs CuPy | 1e-6 | 1e-6 | +| NumPy vs Dask+CuPy | 1e-6 | 1e-6 | + +A comparison **fails** if `max_abs > atol` AND `max_rel > rtol`, or if NaN masks +disagree. 

## Step 5 -- Chunk boundary analysis

Dask backends are the most likely source of parity issues due to `map_overlap`
boundary handling. For any Dask comparison that fails or is borderline:

1. Identify which cells diverge from the NumPy result.
2. Map those cells to chunk boundaries (cells within `depth` pixels of a chunk edge).
3. Report what percentage of divergent cells are at chunk boundaries vs interior.
4. If all divergence is at boundaries, the issue is likely in the `map_overlap`
   `depth` or `boundary` parameter. Say so explicitly.

## Step 6 -- Generate the report

```
## Backend Parity Report

### Functions tested
| Function            | Backends implemented         | Source file              |
|---------------------|------------------------------|--------------------------|
| slope               | numpy, cupy, dask, dask+cupy | xrspatial/slope.py       |
| ...                 | ...                          | ...                      |

### Parity Matrix

#### <function name>
| Comparison             | Input       | Dtype   | Max |Δ|  | Max |Δ/ref| | NaN match | Metadata | Status |
|------------------------|-------------|---------|----------|------------|-----------|----------|--------|
| NumPy vs Dask+NumPy    | tiny clean  | float32 | ...      | ...        | yes       | ok       | PASS   |
| NumPy vs Dask+NumPy    | medium dirty| float64 | ...      | ...        | yes       | ok       | PASS   |
| NumPy vs CuPy          | tiny clean  | float32 | ...      | ...        | no (3)    | ok       | FAIL   |
| ...                    | ...         | ...     | ...      | ...        | ...       | ...      | ...    |

### Failures
For each FAIL row:
- Which cells diverged
- Whether divergence correlates with chunk boundaries (Dask) or specific
  input values (CuPy)
- Likely root cause
- Suggested fix

### Summary
- Functions tested: N
- Total comparisons: N
- Passed: N
- Failed: N
- Skipped (no CUDA): N
```

---

## General rules

- Do not modify any source or test files. This command is read-only.
- Use `create_test_raster` from `general_checks.py` for all raster construction.
- Any temporary files must include the function name for uniqueness.
- If CUDA is unavailable, skip CuPy and Dask+CuPy gracefully. 
Report them + as SKIPPED, not FAIL. +- If $ARGUMENTS includes "fix", still do not auto-fix. Report the issue and ask. +- If a function is not in `ArrayTypeFunctionMapping` (e.g. it only has a numpy + path), note it as "single-backend only" and skip parity checks for it. +- If $ARGUMENTS includes a specific tolerance (e.g. `rtol=1e-3`), override the + defaults in the threshold table. diff --git a/.claude/commands/bench.md b/.claude/commands/bench.md new file mode 100644 index 00000000..60c56d1f --- /dev/null +++ b/.claude/commands/bench.md @@ -0,0 +1,127 @@ +# Bench: Local Performance Comparison + +Run ASV benchmarks for the current branch against master and report regressions +and improvements. The prompt is: $ARGUMENTS + +--- + +## Step 1 -- Identify what changed + +1. If $ARGUMENTS names specific benchmark classes or functions (e.g. `Slope`, + `flow_accumulation`), use those directly. +2. If $ARGUMENTS is empty or says "auto", run `git diff origin/master --name-only` + to find changed source files under `xrspatial/`. Map each changed file to the + corresponding benchmark module in `benchmarks/benchmarks/`. Use the filename + and imports to match (e.g. changes to `slope.py` map to `benchmarks/benchmarks/slope.py`). +3. If no benchmark exists for the changed code, note this in the report and + suggest whether one should be added. + +## Step 2 -- Check prerequisites + +1. Verify ASV is installed: `python -c "import asv"`. If missing, tell the user + to install it (`pip install asv`) and stop. +2. Verify the benchmarks directory exists at `benchmarks/`. +3. Read `benchmarks/asv.conf.json` to confirm the project name and branch settings. +4. Check whether the ASV machine file exists (`.asv/machine.json`). If not, run + `cd benchmarks && asv machine --yes` to initialize it. 

## Step 3 -- Run the comparison

Run ASV in continuous-comparison mode from the `benchmarks/` directory:

```bash
cd benchmarks && asv continuous origin/master HEAD -b "<pattern>" -e
```

Where `<pattern>` matches the benchmark classes identified in Step 1
(e.g. `Slope|Aspect` or `FlowAccumulation`). The `-e` flag shows stderr on failure.

If $ARGUMENTS contains "quick", add `--quick` to run each benchmark only once
(faster but noisier).

If $ARGUMENTS contains "full", omit the `-b` filter to run all benchmarks.

## Step 4 -- Parse and interpret results

ASV continuous outputs lines like:
```
BENCHMARKS NOT SIGNIFICANTLY CHANGED.
```
or:
```
REGRESSION: benchmarks.slope.Slope.time_numpy 3.45ms -> 5.67ms (1.64x)
IMPROVED: benchmarks.slope.Slope.time_dask 8.12ms -> 4.23ms (0.52x)
```

Parse the output and classify each result:

| Category     | Criteria                    |
|--------------|-----------------------------|
| REGRESSION   | Ratio > 1.2x (matches CI)   |
| IMPROVED     | Ratio < 0.8x                |
| UNCHANGED    | Between 0.8x and 1.2x       |

## Step 5 -- Generate the report

```
## Benchmark Report: <branch> vs master

### Changed files
- <list of changed source files>

### Benchmarks run
- <list of benchmark classes>

### Results

| Benchmark                          | master    | HEAD      | Ratio | Status     |
|------------------------------------|-----------|-----------|-------|------------|
| slope.Slope.time_numpy             | 3.45 ms   | 3.51 ms   | 1.02x | UNCHANGED  |
| slope.Slope.time_dask_numpy        | 8.12 ms   | 4.23 ms   | 0.52x | IMPROVED   |
| ...                                | ...       | ...       | ...   | ...        |

### Regressions
<for each regression: benchmark, ratio, likely cause>

### Improvements
<for each improvement: benchmark, ratio>

### Missing benchmarks
<changed code lacking benchmark coverage, or "None">

### Recommendation
- [ ] Safe to merge (no regressions)
- [ ] Add "performance" label to PR (regressions found, CI will recheck)
- [ ] Consider adding benchmarks for: <list>
```

## Step 6 -- Suggest benchmark additions (if gaps found)

If Step 1 found changed functions with no benchmark coverage:

1. Read an existing benchmark file in `benchmarks/benchmarks/` that covers a
   similar function (same category or same backend pattern).
2. Describe what a new benchmark should test:
   - Which function and parameter variants
   - Suggested array sizes (match `common.py` conventions)
   - Which backends to benchmark (numpy at minimum, dask if applicable)
3. Ask the user whether they want you to write the benchmark file.

Do NOT write benchmark files automatically. Report the gap and propose, then wait.

---

## General rules

- Always run benchmarks from the `benchmarks/` directory, not the project root.
- The regression threshold is 1.2x, matching `.github/workflows/benchmarks.yml`.
  Do not change this unless $ARGUMENTS overrides it.
- If ASV setup or machine detection fails, report the error clearly and suggest
  the fix. Do not retry in a loop.
- If benchmarks take longer than 5 minutes per class, note the elapsed time so
  the user can plan accordingly.
- Do not modify any source, test, or benchmark files. This command is read-only
  analysis (unless the user explicitly asks for a benchmark to be written in
  response to Step 6).
- If $ARGUMENTS says "compare <ref1> <ref2>", run
  `asv continuous <ref1> <ref2>` instead of the default origin/master vs HEAD.

diff --git a/.claude/commands/efficiency-audit.md b/.claude/commands/efficiency-audit.md
new file mode 100644
index 00000000..471e463a
--- /dev/null
+++ b/.claude/commands/efficiency-audit.md
@@ -0,0 +1,274 @@

# Efficiency Audit: Compute Waste and Anti-Pattern Detection

Analyze source code for performance anti-patterns specific to the NumPy / CuPy /
Dask / Numba stack. 
The prompt is: $ARGUMENTS + +--- + +## Step 0 -- Determine mode + +Check $ARGUMENTS for a mode keyword: + +- **`compare`**: Skip straight to Step 7 (post-fix comparison). Requires a saved + baseline file from a previous run. +- **`no-bench`**: Run the static audit only (Steps 1-6), skip benchmarking entirely. +- **Otherwise** (default): Run the full audit with baseline benchmarks. + +## Step 1 -- Scope the audit + +1. If $ARGUMENTS names specific files or functions, audit only those. +2. If $ARGUMENTS names a category (e.g. `hydrology`, `surface`), identify all + source files in that category from the README feature matrix. +3. If $ARGUMENTS is empty or says "all", audit every `.py` file under `xrspatial/` + (excluding `tests/`, `datasets/`, and `__pycache__/`). +4. Read each file in scope. + +## Step 2 -- Static analysis: Dask anti-patterns + +Search for these patterns in each file. For every hit, record the file, line +number, the offending code, and the severity (HIGH / MEDIUM / LOW). + +### 2a. Premature materialization (HIGH) +- **`.values` on a Dask-backed DataArray or CuPy array:** forces a full compute + or GPU-to-CPU transfer. Search for `.values` usage outside of tests. +- **`.compute()` inside a loop or repeated call:** materializes the full graph + each iteration instead of building a lazy pipeline. +- **`np.array()` or `np.asarray()` wrapping a Dask or CuPy array:** silent + materialization. + +### 2b. Chunking issues (MEDIUM) +- **`da.stack()` without a following `.rechunk()`:** creates size-1 chunks on the + new axis, causing extreme task-graph overhead. +- **`map_overlap` with depth >= chunk_size / 2:** overlap regions dominate the + chunk, wasting memory and compute. Flag if depth is not obviously small relative + to expected chunk sizes. +- **Missing `boundary` argument in `map_overlap`:** defaults may not match the + function's intended boundary handling. + +### 2c. 
Redundant computation (MEDIUM) +- **Calling the same function twice on the same input** without caching the result + (e.g. computing slope inside aspect when aspect already computes slope internally). +- **Building large intermediate arrays** that could be fused into the kernel + (e.g. allocating a full-size output array, then filling it cell by cell in Numba + instead of writing directly). + +## Step 3 -- Static analysis: GPU anti-patterns + +### 3a. Register pressure (HIGH) +- **CUDA kernels with many float64 local variables:** count the number of named + float64 locals in each `@cuda.jit` kernel. Flag kernels with more than 20 + float64 locals (likely to spill to slow local memory). +- **Thread blocks larger than 16x16 on register-heavy kernels:** check the + `cuda_args()` call or any custom dims function. If the kernel has high register + count and uses 32x32 blocks, flag it. + +### 3b. Unnecessary transfers (HIGH) +- **`.data.get()` followed by CuPy operations:** data round-trips GPU -> CPU -> GPU. +- **`cupy.asarray(numpy_array)` inside a hot path:** repeated CPU -> GPU transfers + that could be hoisted outside the loop. +- **Mixing NumPy and CuPy operations** in the same function without an obvious + reason (e.g. `np.where` on a CuPy array silently converts to NumPy). + +### 3c. Kernel launch overhead (LOW) +- **Per-cell kernel launches:** launching a CUDA kernel inside a Python loop over + cells instead of processing the full grid in one kernel launch. +- **Small array kernel launches:** calling a CUDA kernel on arrays smaller than + the thread block (overhead dominates). + +## Step 4 -- Static analysis: Numba anti-patterns + +### 4a. JIT compilation issues (MEDIUM) +- **Missing `@ngjit` or `@jit(nopython=True)`:** pure-Python loops over arrays + without JIT compilation. Search for nested `for` loops operating on `.data` + arrays without a Numba decorator. +- **Object-mode fallback:** `@jit` without `nopython=True` may silently fall back + to object mode. 
Only `@ngjit` or `@jit(nopython=True)` guarantees compilation.
- **Type instability:** mixing int and float in Numba functions (e.g. initializing
  with `0` then assigning a float) can cause unnecessary casts.

### 4b. Memory layout (LOW)
- **Column-major iteration on row-major arrays:** Numba loops that iterate
  `for col ... for row` on C-contiguous arrays (cache-unfriendly access pattern).
  The inner loop should iterate over the last axis (columns for row-major).

## Step 5 -- Static analysis: General Python anti-patterns

### 5a. Unnecessary copies (MEDIUM)
- **`.copy()` on arrays that are never mutated:** wasted allocation.
- **`np.zeros_like()` + fill loop:** when `np.empty()` + fill or direct
  computation would avoid zero-initialization overhead.

### 5b. Inefficient I/O patterns (LOW)
- **Reading the same file multiple times** in a function.
- **Writing intermediate results to disk** when they could stay in memory.

## Step 6 -- Baseline benchmarks

**Skip this step if mode is `no-bench` or `compare`.**

For each public function in the audited scope, capture rough baseline timings.
This does not use ASV; it runs quick inline timings so the user gets a
before-snapshot without heavyweight setup.

### 6a. Build a benchmark script

Create a temporary script at `/tmp/efficiency_audit_bench_<hash>.py` (use a
short hash of the audited file list to keep the name unique). The script should:

1. Import the public functions found in the audited files.
2. Generate a test array using the same helper pattern as
   `benchmarks/benchmarks/common.py`:
   ```python
   import numpy as np, xarray as xr
   ny, nx = 512, 512  # moderate size -- fast but meaningful
   x = np.linspace(-180, 180, nx)
   y = np.linspace(-90, 90, ny)
   x2, y2 = np.meshgrid(x, y)
   z = 100.0 * np.exp(-x2**2 / 5e5 - y2**2 / 2e5)
   z += np.random.default_rng(71942).normal(0, 2, (ny, nx))
   raster = xr.DataArray(z, dims=['y', 'x'])
   ```
   Adjust as needed (e.g. 
add coords for geodesic functions, integer data for
   zonal, etc.).
3. For each function, time it with `timeit.repeat(number=1, repeat=3)` and take
   the **median** of the repeats. One iteration is enough -- we want a rough
   ballpark, not precise statistics.
4. Print results as JSON to stdout:
   ```json
   {
     "scope": ["slope.py", "aspect.py"],
     "array_shape": [512, 512],
     "backend": "numpy",
     "timings": {
       "slope": {"median_ms": 12.3, "runs": [12.1, 12.3, 13.0]},
       "aspect": {"median_ms": 8.7, "runs": [8.5, 8.7, 9.1]}
     }
   }
   ```

### 6b. Run the benchmark script

Execute the script and capture stdout. If a function errors (e.g. missing
optional dependency), record `"error": "<message>"` instead of timings and
continue with the rest.

### 6c. Save the baseline

Write the JSON output to `.efficiency-audit-baseline.json` in the project root.
This file is gitignored-by-convention (do not add it to git). Tell the user the
baseline has been saved and what it contains.

If a baseline file already exists, back it up to
`.efficiency-audit-baseline.prev.json` before overwriting.

## Step 7 -- Generate the report

```
## Efficiency Audit Report

### Scope
- Files audited: N
- Functions audited: N

### Findings

#### HIGH severity
| # | File:Line          | Pattern                   | Description                           | Fix                              |
|---|--------------------|---------------------------|---------------------------------------|----------------------------------|
| 1 | slope.py:142       | Premature materialization | `.values` on dask input in _run_dask  | Use `.data.compute()` instead    |
| 2 | geodesic.py:87     | Register pressure         | 24 float64 locals in _gpu kernel      | Split kernel or use 16x16 blocks |
| ...| ...               | ...                       | ...                                   | ...                              |

#### MEDIUM severity
| # | File:Line          | Pattern                   | Description                           | Fix                              |
|---|--------------------|---------------------------|---------------------------------------|----------------------------------|
| ...| ... | ... | ... | ...
|

#### LOW severity
| # | File:Line          | Pattern                   | Description                           | Fix                              |
|---|--------------------|---------------------------|---------------------------------------|----------------------------------|
| ...| ... | ... | ... | ... |

### Baseline Timings (512x512, numpy)
| Function   | Median (ms) | Runs (ms)           |
|------------|-------------|---------------------|
| slope      | 12.3        | 12.1, 12.3, 13.0    |
| aspect     | 8.7         | 8.5, 8.7, 9.1       |
| ...        | ...         | ...                 |

(If any function errored, show "ERROR: <message>" in the Median column.)

### Summary
- HIGH: N findings
- MEDIUM: N findings
- LOW: N findings
- Clean files (no issues): <list>

### Recommendations
<prioritized fixes, most impactful first>
```

## Step 8 -- Post-fix comparison (mode=`compare`)

**Only run this step when $ARGUMENTS contains `compare`.**

1. Read `.efficiency-audit-baseline.json` from the project root. If it does not
   exist, tell the user to run the audit without `compare` first to capture a
   baseline, and stop.
2. Regenerate the benchmark script from Step 6a using the `scope` and
   `array_shape` recorded in the baseline file (so the comparison is apples to
   apples).
3. Run the benchmark script (Step 6b) and capture the new timings.
4. For each function, compute the ratio: `new_median / old_median`.

Generate a comparison report:

```
## Efficiency Audit: Post-Fix Comparison

### Baseline
- Captured: <timestamp>
- Array shape: <shape>
- Backend: <backend>

### Results

| Function   | Before (ms) | After (ms) | Ratio | Verdict      |
|------------|-------------|------------|-------|--------------|
| slope      | 12.3        | 7.1        | 0.58x | IMPROVED     |
| aspect     | 8.7         | 8.5        | 0.98x | UNCHANGED    |
| ...        | ...         | ...        | ...   | ...          |

Thresholds: IMPROVED < 0.8x, REGRESSION > 1.2x, else UNCHANGED.

### Net impact
- Functions improved: N
- Functions regressed: N
- Functions unchanged: N
- Overall: <assessment>
```

5. Save the new timings to `.efficiency-audit-after.json` for reference.

---

## General rules

- Do not modify source, test, or benchmark files. 
Temporary scripts go in `/tmp/`.
- Only flag patterns that are actually present in the code. Do not report
  hypothetical issues or patterns that "could" occur.
- Include the exact file path and line number for every finding so the user
  can navigate directly to the issue.
- False positives are worse than missed issues. If you are not confident a
  pattern is actually harmful in context (e.g. `.values` used intentionally
  on a known-numpy array), do not flag it.
- If $ARGUMENTS includes "fix", still do not auto-fix. Report and ask.
- If $ARGUMENTS includes a severity filter (e.g. "high only"), only report
  findings at that severity level.
- If $ARGUMENTS includes "diff" or "changed", restrict the audit to files
  changed on the current branch vs origin/master.
- Baseline benchmark scripts are disposable. Clean up `/tmp/` scripts after
  capturing results.
- The 512x512 array size is a default. If $ARGUMENTS includes a size like
  `1024x1024` or `small`, adjust accordingly. "small" = 128x128, "large" = 2048x2048.

diff --git a/.claude/commands/review-pr.md b/.claude/commands/review-pr.md
new file mode 100644
index 00000000..3cd092b6
--- /dev/null
+++ b/.claude/commands/review-pr.md
@@ -0,0 +1,186 @@

# Review PR: Domain-Aware Pull Request Review

Review a pull request with checks specific to a geospatial raster library built on
NumPy, Dask, CuPy, and Numba. The prompt is: $ARGUMENTS

---

## Step 1 -- Load the PR

1. If $ARGUMENTS contains a PR number (e.g. `123`), fetch it:
   ```bash
   gh pr view <number> --json title,body,files,commits,baseRefName,headRefName
   ```
2. If $ARGUMENTS is empty, check whether the current branch has an open PR:
   ```bash
   gh pr view --json title,body,files,commits,baseRefName,headRefName
   ```
3. If neither works, tell the user to provide a PR number and stop.
4. Get the full diff:
   ```bash
   gh pr diff <number>
   ```
5. Read every changed file in full (not just the diff) to understand surrounding
   context.
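Step 1's fetch-and-extract logic could be scripted roughly as below. `load_pr` and `changed_paths` are illustrative names; the JSON field names follow the `gh pr view --json` invocation above, where each entry of `files` carries a `path` key:

```python
import json
import subprocess

def load_pr(pr_number=None):
    """Fetch PR metadata via the gh CLI, using the fields from Step 1."""
    cmd = ["gh", "pr", "view", "--json",
           "title,body,files,commits,baseRefName,headRefName"]
    if pr_number is not None:
        # `gh pr view 123 --json ...` targets a specific PR instead of the branch
        cmd.insert(3, str(pr_number))
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        raise RuntimeError("No PR found -- ask the user for a PR number.")
    return json.loads(proc.stdout)

def changed_paths(pr_info):
    """List changed file paths from the gh JSON payload."""
    return [f["path"] for f in pr_info.get("files", [])]
```

Each path returned by `changed_paths` should then be read in full, per item 5 above.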
+ +## Step 2 -- Correctness review + +Check the changed code for numerical and algorithmic correctness: + +### 2a. Algorithm accuracy +- Does the implementation match the cited algorithm or paper? If a paper or + standard is referenced (in comments, docstring, or PR body), verify the + formulas match. +- Are there off-by-one errors in neighborhood indexing (common in 3x3 kernels)? +- Is the output in the correct units and range? (e.g. slope in degrees 0-90, + aspect in degrees 0-360, NDVI in -1 to 1) + +### 2b. Floating point concerns +- Are there divisions that could produce inf or NaN on valid input? +- Is there catastrophic cancellation risk (subtracting nearly equal large numbers)? +- Does the code handle the float32 vs float64 distinction correctly? (e.g. using + float64 intermediates for accumulation, returning the expected output dtype) + +### 2c. NaN handling +- Does the function propagate NaN correctly for its semantics? +- For neighborhood operations with `boundary='nan'`: do edge cells become NaN? +- Are NaN checks using `np.isnan` (not `== np.nan`)? + +### 2d. Edge cases +- Empty input, single-row, single-column, 1x1 rasters +- All-NaN input +- Constant-value input (derivative operations should return zero) +- Very large or very small values + +## Step 3 -- Backend completeness review + +### 3a. Dispatch registration +- Does the `ArrayTypeFunctionMapping` include all four backends? +- If a backend is intentionally omitted, is there a comment explaining why? +- Does the public function's docstring mention which backends are supported? + +### 3b. Dask correctness +- Does `map_overlap` use the correct `depth` for the kernel size? + (depth should be `kernel_radius`, e.g. 1 for a 3x3 kernel) +- Is the `boundary` parameter forwarded correctly from the public API to + `map_overlap`? +- Does the chunk function return the same shape as its input? +- For 3D stacked arrays: is `.rechunk({0: N})` called after `da.stack()`? + +### 3c. 
CuPy correctness +- Does the CUDA kernel handle array bounds correctly (guard against + out-of-bounds thread indices)? +- Is the thread block size appropriate for the kernel's register usage? +- Are results extracted with `.data.get()`, not `.values`? + +## Step 4 -- Performance review + +### 4a. Anti-patterns +Run the same checks as `/efficiency-audit` but scoped to only the changed files. +Specifically check for: +- Premature materialization (`.values`, `.compute()` in loops) +- Unnecessary copies +- GPU register pressure in new CUDA kernels +- Missing `@ngjit` on CPU loops + +### 4b. Benchmark coverage +- Does a benchmark exist in `benchmarks/benchmarks/` for the changed function? +- If this PR adds a new function, does it also add a benchmark? +- If the PR modifies performance-critical code, should the "performance" label + be added? + +## Step 5 -- Test coverage review + +### 5a. Test existence +- Are there tests for the changed code? +- Do tests cover all implemented backends (using the helpers from + `general_checks.py`)? + +### 5b. Test quality +- Do tests compare against known reference values (QGIS, analytical, etc.), + not just "does it run without crashing"? +- Are edge cases tested (NaN, constant surface, boundary modes)? +- Do dask tests use multiple chunk sizes (including ragged chunks)? +- Are temporary files uniquely named? + +### 5c. Missing tests +- List any code paths or parameter combinations that have no test coverage. + +## Step 6 -- Documentation and API review + +### 6a. Docstrings +- Does every new public function have a docstring with Parameters, Returns, + and a short description? +- Are parameter types and defaults documented? + +### 6b. README feature matrix +- If a new function was added, is it in the README feature matrix? +- Are the backend checkmarks accurate? + +### 6c. API consistency +- Does the function signature follow the project's conventions? + (e.g. 
`agg` for input DataArray, `name` for output name, `boundary` for
  boundary mode)
- Does it return an `xr.DataArray` with coords, dims, and attrs preserved?

## Step 7 -- Generate the review

Format the review as a structured comment suitable for posting on the PR.
Organize findings by severity:

```
## PR Review: <PR title>

### Blockers (must fix before merge)
- [ ] <finding with file:line reference>

### Suggestions (should fix, not blocking)
- [ ] <finding with file:line reference>

### Nits (optional improvements)
- [ ] <finding with file:line reference>

### What looks good
- <positive observations, kept brief>

### Checklist
- [ ] Algorithm matches reference/paper
- [ ] All implemented backends produce consistent results
- [ ] NaN handling is correct
- [ ] Edge cases are covered by tests
- [ ] Dask chunk boundaries handled correctly
- [ ] No premature materialization or unnecessary copies
- [ ] Benchmark exists or is not needed
- [ ] README feature matrix updated (if applicable)
- [ ] Docstrings present and accurate
```

After generating the review, **run it through the `/humanizer` skill** before
showing it to the user or posting it to GitHub.

## Step 8 -- Post (if requested)

If $ARGUMENTS includes "post" or "comment":
1. Post the review as a PR comment using `gh pr comment <number> --body "..."`.
2. Confirm the comment was posted successfully.

If $ARGUMENTS does not include "post", show the review to the user and ask
whether they want it posted.

---

## General rules

- Do not approve or request changes on the PR via GitHub's review system. Only
  post comments.
- Read the full context of changed files, not just the diff. Many bugs are only
  visible when you understand the surrounding code.
- Be specific. Every finding must include a file path and line number. Vague
  feedback ("consider improving performance") is not useful.
+- Do not suggest changes to code that was not modified in the PR unless the + existing code has a clear bug that the PR makes worse. +- False positives erode trust. If you are uncertain whether something is a + problem, say so explicitly rather than presenting it as a definite issue. +- Run `/humanizer` on the final review text before posting or displaying. +- If $ARGUMENTS includes "quick", skip Steps 4 and 6 (performance and docs) + and focus only on correctness, backend parity, and test coverage. diff --git a/.claude/commands/validate.md b/.claude/commands/validate.md new file mode 100644 index 00000000..23b130ab --- /dev/null +++ b/.claude/commands/validate.md @@ -0,0 +1,216 @@ +# Validate: Numerical Accuracy and Backend Parity Check + +Take a function name (or detect the changed function from the current branch diff) +and verify its numerical accuracy against reference implementations and across all +four backends. The prompt is: $ARGUMENTS + +--- + +## Step 1 -- Identify the target + +1. If $ARGUMENTS names a specific function (e.g. `slope`, `flow_accumulation`), + use that. +2. If $ARGUMENTS is empty or says "auto", run `git diff origin/master --name-only` + to find changed source files under `xrspatial/`. Identify which public functions + were added or modified. If multiple functions changed, validate each one. +3. Read the function's source to understand: + - Which backends are implemented (check the `ArrayTypeFunctionMapping` call) + - What parameters it accepts (boundary modes, method variants, etc.) + - What the expected output range and dtype should be + - Whether it's a neighborhood operation (uses `map_overlap`) or a per-cell operation + +## Step 2 -- Select or build reference data + +Build **three** test datasets, each serving a different purpose: + +### 2a. Analytical known-answer dataset +Create a small synthetic raster where the correct answer can be computed by hand +or from a closed-form formula. 
+Examples:
+
+- **Slope/aspect:** a perfect plane tilted at a known angle (e.g. `z = 2x + 3y`
+  gives slope = arctan(sqrt(13)) for the planar method with unit cell size)
+- **Flow direction:** a simple cone or V-shaped valley where flow paths are obvious
+- **Focal:** a raster with a single non-zero cell surrounded by zeros
+- **Multispectral indices:** bands with known ratios so NDVI/NDWI etc. are trivially
+  verifiable
+
+Compute the expected result array by hand (or with basic numpy math) and store it
+as a numpy array. This is the **ground truth** for this dataset.
+
+### 2b. QGIS / rasterio / scipy reference dataset
+Check whether the function's existing test file already has a reference fixture
+(like `qgis_slope` in `test_slope.py`). If so, reuse it.
+
+If no reference exists, attempt to compute one:
+1. Check if `rasterio` is installed (`python -c "import rasterio"`). If available,
+   write the test raster to a temporary GeoTIFF (unique name including the function
+   name, e.g. `tmp_validate_slope.tif`) and run the equivalent rasterio/GDAL operation.
+2. If rasterio is not available, check for `scipy.ndimage` equivalents (e.g.
+   `generic_filter`, `uniform_filter`, `sobel`).
+3. If neither is available, skip this dataset and note it in the report.
+
+### 2c. Realistic stress dataset
+Generate a larger raster (at least 256x256) with terrain-like features using the
+project's `perlin` module or `np.random.default_rng(42)`. Include:
+- NaN patches (5-10% of cells) to test NaN propagation
+- A mix of flat and steep areas
+- Edge values near dtype limits for the tested dtypes
+
+This dataset is for backend parity and performance, not absolute accuracy.
+
+## Step 3 -- Run across all backends
+
+For each dataset and each parameter combination (e.g. boundary modes, method
+variants), run the function on every implemented backend:
+
+1. **NumPy** -- always available, treat as the baseline
+2. **Dask+NumPy** -- use `create_test_raster(data, backend='dask+numpy')` with
+   at least two different chunk sizes:
+   - Chunks that evenly divide the array
+   - Ragged chunks (array size not divisible by chunk size)
+3. **CuPy** -- skip with a note if CUDA is not available
+4. **Dask+CuPy** -- skip with a note if CUDA is not available
+
+Use the helpers from `general_checks.py`:
+- `create_test_raster()` to build DataArrays for each backend
+- For CuPy results, extract with `.data.get()`
+- For Dask results, extract with `.data.compute()`
+
+## Step 4 -- Compare results
+
+Run four categories of comparison, reporting pass/fail and numeric details for each:
+
+### 4a. Ground truth comparison (dataset 2a)
+Compare the NumPy backend result against the hand-computed expected array.
+```python
+np.testing.assert_allclose(result, expected, rtol=1e-6, atol=1e-10, equal_nan=True)
+```
+If this fails, the algorithm itself has a bug. Report the max absolute error,
+max relative error, and the cell location(s) where divergence is worst.
+
+### 4b. Reference implementation comparison (dataset 2b)
+Compare the NumPy result against the rasterio/scipy/QGIS reference.
+Use `rtol=1e-5` (matching the project's existing QGIS tolerance convention).
+Exclude edge cells if the implementations handle boundaries differently (document
+which edges were excluded and why).
+
+### 4c. Backend parity (all datasets)
+Compare every non-NumPy backend against the NumPy result:
+
+| Comparison            | Default tolerance      |
+|-----------------------|------------------------|
+| NumPy vs Dask+NumPy   | `rtol=1e-5`            |
+| NumPy vs CuPy         | `atol=1e-6, rtol=1e-6` |
+| NumPy vs Dask+CuPy    | `atol=1e-6, rtol=1e-6` |
+
+For each comparison, report:
+- Max absolute difference
+- Max relative difference
+- Whether NaN locations match exactly (`np.isnan` masks must be identical)
+- Whether output shape, dims, coords, and attrs are preserved (use
+  `general_output_checks`)
+
+### 4d. Edge case and invariant checks
+Run these regardless of which function is being validated:
+
+- **NaN propagation:** cells neighboring NaN input should behave correctly for the
+  function (NaN output for most neighborhood ops with `boundary='nan'`)
+- **Constant surface:** if the input is uniform (e.g. all 42.0), the output should
+  be zero for derivative operations (slope, curvature) or uniform for pass-through
+  operations
+- **Single-cell raster:** 1x1 input should not crash (may return NaN)
+- **Dtype preservation:** run with float32 and float64 inputs; verify the output
+  dtype matches expectations
+- **Boundary modes:** if the function accepts a `boundary` parameter, test all
+  valid modes (`nan`, `nearest`, `reflect`, `wrap`) and verify:
+  - Shape is preserved
+  - Non-nan modes produce no NaN output when source has no NaN
+  - NumPy and Dask results agree for each mode
+
+## Step 5 -- Generate the report
+
+Print a structured report with these sections:
+
+```
+## Validation Report: <function_name>
+
+### Target
+- Function: <name>
+- Source: <file_path>
+- Backends implemented: <list>
+- Parameter variants tested: <list>
+
+### Datasets
+| Dataset          | Shape   | Dtype   | NaN% | Notes                    |
+|------------------|---------|---------|------|--------------------------|
+| Analytical       | ...     | ...     | ...  | <description>            |
+| Reference (src)  | ...     | ...     | ...  | <reference tool used>    |
+| Stress           | ...     | ...     | ...  | <generation method>      |
+
+### Results
+
+#### Ground Truth (analytical dataset)
+- Status: PASS / FAIL
+- Max absolute error: ...
+- Max relative error: ...
+- Worst cell: (row, col) expected=... got=...
+
+#### Reference Implementation
+- Reference: <rasterio / scipy / QGIS fixture / skipped>
+- Status: PASS / FAIL / SKIPPED
+- Max absolute error: ...
+- Notes: <edge exclusions, known differences>
+
+#### Backend Parity
+| Comparison               | Dataset     | Max |Δ|   | Max |Δ/ref| | NaN match | Status |
+|--------------------------|-------------|-----------|-------------|-----------|--------|
+| NumPy vs Dask+NumPy      | analytical  | ...       | ...         | yes/no    | ...    |
+| NumPy vs Dask+NumPy      | stress      | ...       | ...         | yes/no    | ...    |
+| NumPy vs CuPy            | analytical  | ...       | ...         | yes/no    | ...    |
+| ...                      | ...         | ...       | ...         | ...       | ...    |
+
+#### Edge Cases
+| Check              | Status | Notes                               |
+|--------------------|--------|-------------------------------------|
+| NaN propagation    | ...    |                                     |
+| Constant surface   | ...    |                                     |
+| Single-cell        | ...    |                                     |
+| Dtype float32      | ...    |                                     |
+| Dtype float64      | ...    |                                     |
+| Boundary modes     | ...    | <modes tested>                      |
+
+### Verdict
+- Overall: PASS / FAIL
+- <1-3 sentence summary of findings>
+- <action items if anything failed>
+```
+
+## Step 6 -- Suggest fixes (if failures found)
+
+If any check failed:
+1. Identify the root cause (algorithm bug, boundary handling, dtype casting,
+   chunking artifact, GPU precision, etc.)
+2. Describe the fix concisely.
+3. Ask the user whether they want you to apply the fix now.
+
+Do NOT apply fixes automatically. The purpose of `/validate` is to report, not to
+change code.
+
+---
+
+## General rules
+
+- Run all comparisons in a Python script or inline pytest, not by eyeballing
+  print output. Use `np.testing.assert_allclose` for numeric checks.
+- Any temporary files (GeoTIFFs, intermediate arrays) must use unique names
+  including the function name (e.g. `tmp_validate_slope_256x256.tif`). Clean them
+  up at the end.
+- If CUDA is not available, skip GPU backends gracefully and note it in the report.
+  Never fail the validation just because a backend is unavailable.
+- If $ARGUMENTS specifies a tolerance override (e.g. "validate slope rtol=1e-3"),
+  use the provided tolerances instead of the defaults.
+- If $ARGUMENTS specifies "quick", skip the stress dataset and boundary mode sweep
+  to give a faster result.
+- Do not modify any source or test files. This command is read-only analysis.
+- If the function has a `method` parameter (e.g. `slope(method='geodesic')`),
+  validate each method variant separately.
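The core of the Step 4c parity check can be sketched as follows (synthetic stand-in arrays, not real backend output; the Dask+NumPy tolerance from the 4c table is assumed):

```python
import numpy as np

# Synthetic stand-ins: a "NumPy baseline" with NaN holes and a fake
# "Dask result" that differs only by a tiny multiplicative drift.
rng = np.random.default_rng(42)
baseline = rng.normal(loc=100.0, scale=10.0, size=(64, 64))
baseline[rng.random((64, 64)) < 0.05] = np.nan   # ~5% dirty cells
result = baseline * (1.0 + 1e-7)                 # well inside rtol=1e-5

# 1. NaN masks must agree exactly before values are compared.
assert np.array_equal(np.isnan(result), np.isnan(baseline))

# 2. Value agreement at the Dask+NumPy tolerance from the 4c table.
np.testing.assert_allclose(result, baseline, rtol=1e-5, atol=0, equal_nan=True)

# 3. Diagnostics to include in the report either way.
abs_diff = np.abs(result - baseline)
print("max |diff|:", np.nanmax(abs_diff))
print("mean |diff|:", np.nanmean(abs_diff))
```

A real run would substitute the extracted backend outputs (`.data.compute()`, `.data.get()`) for the synthetic arrays and repeat this per backend, dataset, and dtype.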