File: `.claude/commands/backend-parity.md`

# Backend Parity: Cross-Backend Consistency Audit

Verify that all implemented backends produce consistent results for a given
function or set of functions. The prompt is: $ARGUMENTS

---

## Step 1 -- Identify targets

1. If $ARGUMENTS names specific functions (e.g. `slope`, `aspect`), use those.
2. If $ARGUMENTS names a category (e.g. `hydrology`, `surface`, `focal`), read
`README.md` to find all functions in that category.
3. If $ARGUMENTS is empty or says "all", scan the full feature matrix in `README.md`
and test every function that claims support for 2+ backends.
4. For each function, read its source file and find the `ArrayTypeFunctionMapping`
call to determine which backends are actually implemented (not just what the
README claims).

## Step 2 -- Build test inputs

For each target function, create test rasters at three scales:

| Name | Size | Purpose |
|---------|---------|--------------------------------------------------|
| tiny | 8x6 | Fast, easy to inspect cell-by-cell |
| medium | 64x64 | Catches chunk-boundary artifacts in dask |
| large | 256x256 | Stress test, exposes numerical accumulation drift |

For each size, generate two variants:
- **Clean:** no NaN, realistic value range for the function
(e.g. 0-5000m for elevation, 0-1 for NDVI inputs)
- **Dirty:** 5-10% random NaN, some extreme values near dtype limits

Use `np.random.default_rng(42)` for reproducibility. For functions that require
specific input structure (e.g. `flow_direction` needs a DEM with drainage, not
random noise), use the project's `perlin` module or a synthetic cone/valley.

Also test with at least two dtypes: `float32` and `float64`.
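The clean/dirty generation above can be sketched as follows. `make_test_inputs` is a hypothetical helper, and the value range assumes elevation-like data; substitute the perlin module or a synthetic cone for structure-sensitive functions:

```python
import numpy as np

def make_test_inputs(shape, dtype=np.float32, dirty=False, seed=42):
    """Build a reproducible test raster: clean, or with NaN holes and extremes."""
    rng = np.random.default_rng(seed)
    # Realistic elevation-like range (0-5000 m).
    data = rng.uniform(0.0, 5000.0, size=shape).astype(dtype)
    if dirty:
        # Punch 5-10% random NaN holes.
        nan_frac = rng.uniform(0.05, 0.10)
        data[rng.random(shape) < nan_frac] = np.nan
        # Sprinkle a few values near the dtype limits.
        extremes = rng.integers(0, data.size, size=5)
        data.flat[extremes] = np.finfo(dtype).max / 2
    return data

variants = {
    (name, kind): make_test_inputs(shape, dirty=(kind == "dirty"))
    for name, shape in [("tiny", (8, 6)), ("medium", (64, 64)), ("large", (256, 256))]
    for kind in ("clean", "dirty")
}
```

Repeat the dictionary comprehension with `dtype=np.float64` to cover both dtypes.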

## Step 3 -- Run every backend

For each function, input variant, and dtype:

1. **NumPy:** `create_test_raster(data, backend='numpy')` -- always the baseline.
2. **Dask+NumPy:** test with two chunk configurations:
- `chunks=(size//2, size//2)` -- even split
- `chunks=(size//3, size//3)` -- ragged remainder
3. **CuPy:** `create_test_raster(data, backend='cupy')` -- skip if CUDA unavailable.
4. **Dask+CuPy:** `create_test_raster(data, backend='dask+cupy')` -- skip if CUDA
unavailable.

If the function has parameter variants (e.g. `boundary`, `method`), test the
default parameters first. If $ARGUMENTS includes "thorough", also sweep all
parameter combinations.

## Step 4 -- Pairwise comparison

For every non-NumPy result, compare against the NumPy baseline. Extract data using
the project conventions:
- Dask: `.data.compute()`
- CuPy: `.data.get()`
- Dask+CuPy: `.data.compute().get()`
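The three extraction conventions collapse into one duck-typed helper (a sketch; `to_numpy` is not a project function):

```python
import numpy as np

def to_numpy(agg):
    """Materialize a result's data as a NumPy array, whatever backend made it.

    `agg` is an xarray.DataArray whose `.data` may be numpy, dask, cupy,
    or dask-wrapped cupy.
    """
    data = agg.data
    if hasattr(data, "compute"):   # dask (possibly wrapping cupy chunks)
        data = data.compute()
    if hasattr(data, "get"):       # cupy device array -> host copy
        data = data.get()
    return np.asarray(data)
```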

For each pair, compute and record:

### 4a. Value agreement
```python
abs_diff = np.abs(result - baseline)
max_abs = np.nanmax(abs_diff)
rel_diff = abs_diff / (np.abs(baseline) + 1e-30) # avoid div-by-zero
max_rel = np.nanmax(rel_diff)
mean_abs = np.nanmean(abs_diff)
```

### 4b. NaN mask agreement
```python
nan_match = np.array_equal(np.isnan(result), np.isnan(baseline))
nan_only_in_result = np.sum(np.isnan(result) & ~np.isnan(baseline))
nan_only_in_baseline = np.sum(np.isnan(baseline) & ~np.isnan(result))
```

### 4c. Metadata preservation
Using `general_output_checks` from `general_checks.py`:
- Output type matches input type (DataArray backed by the same array type)
- Shape, dims, coords, attrs preserved

### 4d. Pass/fail thresholds

| Comparison | rtol | atol |
|-----------------------|----------|----------|
| NumPy vs Dask+NumPy | 1e-5 | 0 |
| NumPy vs CuPy | 1e-6 | 1e-6 |
| NumPy vs Dask+CuPy | 1e-6 | 1e-6 |

A comparison **fails** if `max_abs > atol` AND `max_rel > rtol`, or if NaN masks
disagree.
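The threshold logic can be expressed directly. Note the asymmetry: a comparison passes if *either* tolerance is satisfied, but fails immediately on a NaN-mask mismatch (a sketch combining the 4a/4b metrics):

```python
import numpy as np

def parity_verdict(result, baseline, rtol, atol):
    """FAIL only when both tolerances are exceeded, or NaN masks disagree."""
    if not np.array_equal(np.isnan(result), np.isnan(baseline)):
        return "FAIL", "nan-mask mismatch"
    abs_diff = np.abs(result - baseline)
    max_abs = np.nanmax(abs_diff)
    max_rel = np.nanmax(abs_diff / (np.abs(baseline) + 1e-30))
    if max_abs > atol and max_rel > rtol:
        return "FAIL", f"max_abs={max_abs:.3g}, max_rel={max_rel:.3g}"
    return "PASS", ""
```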

## Step 5 -- Chunk boundary analysis

Dask backends are the most likely source of parity issues due to `map_overlap`
boundary handling. For any Dask comparison that fails or is borderline:

1. Identify which cells diverge from the NumPy result.
2. Map those cells to chunk boundaries (cells within `depth` pixels of a chunk edge).
3. Report what percentage of divergent cells are at chunk boundaries vs interior.
4. If all divergence is at boundaries, the issue is likely in the `map_overlap`
`depth` or `boundary` parameter. Say so explicitly.
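Mapping divergent cells to chunk boundaries can be sketched as below, assuming dask-style per-axis chunk tuples:

```python
import numpy as np

def boundary_fraction(diverged, chunks, depth):
    """Fraction of divergent cells within `depth` pixels of an interior
    chunk edge.

    `chunks` is a dask-style per-axis tuple of chunk sizes, e.g.
    ((32, 32), (32, 32)) for a 64x64 array split into four chunks.
    """
    near_edge = np.zeros(diverged.shape, dtype=bool)
    for axis, axis_chunks in enumerate(chunks):
        edges = np.cumsum(axis_chunks)[:-1]   # interior boundaries only
        idx = np.arange(diverged.shape[axis])
        close = np.zeros(idx.shape, dtype=bool)
        for e in edges:
            # boundary e sits between rows/cols e-1 and e
            close |= (idx >= e - depth) & (idx < e + depth)
        shape = [1] * diverged.ndim
        shape[axis] = -1
        near_edge |= close.reshape(shape)   # broadcast across other axes
    n_div = int(diverged.sum())
    return float((diverged & near_edge).sum()) / n_div if n_div else 0.0
```

A fraction near 1.0 points at `map_overlap` `depth`/`boundary` handling; a uniform spread points at a numerical difference in the kernel itself.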

## Step 6 -- Generate the report

```
## Backend Parity Report

### Functions tested
| Function | Backends implemented | Source file |
|---------------------|---------------------------|--------------------------|
| slope | numpy, cupy, dask, dask+cupy | xrspatial/slope.py |
| ... | ... | ... |

### Parity Matrix

#### <function_name>
| Comparison             | Input        | Dtype   | Max \|Δ\| | Max \|Δ/ref\| | NaN match | Metadata | Status |
|------------------------|--------------|---------|-----------|---------------|-----------|----------|--------|
| NumPy vs Dask+NumPy | tiny clean | float32 | ... | ... | yes | ok | PASS |
| NumPy vs Dask+NumPy | medium dirty| float64 | ... | ... | yes | ok | PASS |
| NumPy vs CuPy | tiny clean | float32 | ... | ... | no (3) | ok | FAIL |
| ... | ... | ... | ... | ... | ... | ... | ... |

### Failures
For each FAIL row:
- Which cells diverged
- Whether divergence correlates with chunk boundaries (Dask) or specific
input values (CuPy)
- Likely root cause
- Suggested fix

### Summary
- Functions tested: N
- Total comparisons: N
- Passed: N
- Failed: N
- Skipped (no CUDA): N
```

---

## General rules

- Do not modify any source or test files. This command is read-only.
- Use `create_test_raster` from `general_checks.py` for all raster construction.
- Any temporary files must include the function name for uniqueness.
- If CUDA is unavailable, skip CuPy and Dask+CuPy gracefully. Report them
as SKIPPED, not FAIL.
- If $ARGUMENTS includes "fix", still do not auto-fix. Report the issue and ask.
- If a function is not in `ArrayTypeFunctionMapping` (e.g. it only has a numpy
path), note it as "single-backend only" and skip parity checks for it.
- If $ARGUMENTS includes a specific tolerance (e.g. `rtol=1e-3`), override the
defaults in the threshold table.

File: `.claude/commands/bench.md`

# Bench: Local Performance Comparison

Run ASV benchmarks for the current branch against master and report regressions
and improvements. The prompt is: $ARGUMENTS

---

## Step 1 -- Identify what changed

1. If $ARGUMENTS names specific benchmark classes or functions (e.g. `Slope`,
`flow_accumulation`), use those directly.
2. If $ARGUMENTS is empty or says "auto", run `git diff origin/master --name-only`
to find changed source files under `xrspatial/`. Map each changed file to the
corresponding benchmark module in `benchmarks/benchmarks/`. Use the filename
and imports to match (e.g. changes to `slope.py` map to `benchmarks/benchmarks/slope.py`).
3. If no benchmark exists for the changed code, note this in the report and
suggest whether one should be added.
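The filename-matching heuristic can be sketched as a pure function (hypothetical helper; real matching may also need to inspect imports). The changed-file list comes from `git diff origin/master --name-only`:

```python
from pathlib import Path

def match_benchmarks(changed_files, bench_files):
    """Pair changed xrspatial modules with benchmark modules sharing a name.

    Returns (matched, missing): module names with and without benchmark
    coverage. Non-source paths (docs, configs) are ignored.
    """
    bench_names = {Path(b).name for b in bench_files}
    matched, missing = [], []
    for p in changed_files:
        name = Path(p).name
        if p.startswith("xrspatial/") and name.endswith(".py"):
            (matched if name in bench_names else missing).append(name)
    return matched, missing
```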

## Step 2 -- Check prerequisites

1. Verify ASV is installed: `python -c "import asv"`. If missing, tell the user
to install it (`pip install asv`) and stop.
2. Verify the benchmarks directory exists at `benchmarks/`.
3. Read `benchmarks/asv.conf.json` to confirm the project name and branch settings.
4. Check whether the ASV machine file exists (`.asv/machine.json`). If not, run
`cd benchmarks && asv machine --yes` to initialize it.

## Step 3 -- Run the comparison

Run ASV in continuous-comparison mode from the `benchmarks/` directory:

```bash
cd benchmarks && asv continuous origin/master HEAD -b "<regex>" -e
```

Where `<regex>` is a pattern matching the benchmark classes identified in Step 1
(e.g. `Slope|Aspect` or `FlowAccumulation`). The `-e` (`--show-stderr`) flag
displays stderr output from the benchmarks, which helps diagnose failures.

If $ARGUMENTS contains "quick", add `--quick` to run each benchmark only once
(faster but noisier).

If $ARGUMENTS contains "full", omit the `-b` filter to run all benchmarks.

## Step 4 -- Parse and interpret results

ASV continuous outputs lines like:
```
BENCHMARKS NOT SIGNIFICANTLY CHANGED.
```
or:
```
REGRESSION: benchmarks.slope.Slope.time_numpy 3.45ms -> 5.67ms (1.64x)
IMPROVED: benchmarks.slope.Slope.time_dask 8.12ms -> 4.23ms (0.52x)
```

Parse the output and classify each result:

| Category | Criteria |
|--------------|-----------------------------|
| REGRESSION | Ratio > 1.2x (matches CI) |
| IMPROVED | Ratio < 0.8x |
| UNCHANGED | Between 0.8x and 1.2x |
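A sketch of parsing and classifying these lines. The regex assumes the exact format shown above, which may vary between ASV versions:

```python
import re

# Matches e.g. "REGRESSION: benchmarks.slope.Slope.time_numpy 3.45ms -> 5.67ms (1.64x)"
LINE_RE = re.compile(
    r"^(REGRESSION|IMPROVED):\s+(\S+)\s+([\d.]+)ms\s+->\s+([\d.]+)ms\s+\(([\d.]+)x\)"
)

def classify(ratio, threshold=1.2):
    """Apply the table above: >1.2x is a regression, <0.8x an improvement."""
    if ratio > threshold:
        return "REGRESSION"
    if ratio < 0.8:
        return "IMPROVED"
    return "UNCHANGED"

def parse_asv(output):
    """Extract (name, before_ms, after_ms, status) tuples from asv output."""
    results = []
    for line in output.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            _, name, before, after, ratio = m.groups()
            results.append((name, float(before), float(after),
                            classify(float(ratio))))
    return results
```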

## Step 5 -- Generate the report

```
## Benchmark Report: <branch> vs master

### Changed files
- <list of changed source files>

### Benchmarks run
- <list of benchmark classes/functions matched>

### Results

| Benchmark | master | HEAD | Ratio | Status |
|------------------------------------|-----------|-----------|-------|------------|
| slope.Slope.time_numpy | 3.45 ms | 3.51 ms | 1.02x | UNCHANGED |
| slope.Slope.time_dask_numpy | 8.12 ms | 4.23 ms | 0.52x | IMPROVED |
| ... | ... | ... | ... | ... |

### Regressions
<details for each regression: which benchmark, how much slower, likely cause>

### Improvements
<details for each improvement>

### Missing benchmarks
<list any changed functions that have no benchmark coverage>

### Recommendation
- [ ] Safe to merge (no regressions)
- [ ] Add "performance" label to PR (regressions found, CI will recheck)
- [ ] Consider adding benchmarks for: <uncovered functions>
```

## Step 6 -- Suggest benchmark additions (if gaps found)

If Step 1 found changed functions with no benchmark coverage:

1. Read an existing benchmark file in `benchmarks/benchmarks/` that covers a
similar function (same category or same backend pattern).
2. Describe what a new benchmark should test:
- Which function and parameter variants
- Suggested array sizes (match `common.py` conventions)
- Which backends to benchmark (numpy at minimum, dask if applicable)
3. Ask the user whether they want you to write the benchmark file.

Do NOT write benchmark files automatically. Report the gap and propose, then wait.

---

## General rules

- Always run benchmarks from the `benchmarks/` directory, not the project root.
- The regression threshold is 1.2x, matching `.github/workflows/benchmarks.yml`.
Do not change this unless $ARGUMENTS overrides it.
- If ASV setup or machine detection fails, report the error clearly and suggest
the fix. Do not retry in a loop.
- If benchmarks take longer than 5 minutes per class, note the elapsed time so
the user can plan accordingly.
- Do not modify any source, test, or benchmark files. This command is read-only
analysis (unless the user explicitly asks for a benchmark to be written in
response to Step 6).
- If $ARGUMENTS says "compare <branch1> <branch2>", run
`asv continuous <branch1> <branch2>` instead of the default origin/master vs HEAD.