Speed up generate_terrain numpy backend with a fused parallel kernel by brendancol · Pull Request #3566 · xarray-contrib/xarray-spatial

brendancol · 2026-06-27T23:02:16Z

Closes #3565

Speeds up generate_terrain on the numpy backend (and dask+numpy, which routes through the same _gen_terrain).

What changed

The numpy octave loop called _perlin once per octave, and each call allocated ~20 full-size numpy temporaries (floor, fancy-index gather, fade, lerp). At 16 octaves that's hundreds of allocations.
Replaced it with one fused @njit(parallel=True) kernel that computes warp + fbm/ridged + cube per pixel, prange over rows. No per-octave temporaries.
_gen_terrain takes the fast path when worley_blend == 0 (the default). The worley path keeps the existing numpy code, since it needs a global min/max pass.

Results (numpy, default fbm, 16 octaves)

512x512: 244 ms -> 7.8 ms
1024x1024: 1506 ms -> 21 ms
2048x2048: 7368 ms -> 76 ms (~97x)

dask+numpy at 2048x2048 drops to 78 ms.

Correctness

Output matches the previous implementation to float32 epsilon: fbm ~1.2e-7 relative, ridged ~1.6e-6 relative, worley path unchanged exactly.

Backends

numpy: fused kernel (faster)
dask+numpy: faster for free (same _gen_terrain)
cupy, dask+cupy: unchanged (GPU was already ~28 ms at 2048x2048)

Test plan

pytest xrspatial/tests/test_terrain.py (52 passed, includes GPU + dask parity)
Added a reference-based test built from the un-fused _perlin so the kernel has an independent anchor on CPU-only CI (fbm, ridged, non-default params)
perlin / validation / templates / dask-task-name suites green

https://claude.ai/code/session_012qsDtsjQFiiNQ3MZjW3HBR

) The numpy octave loop called _perlin once per octave, and each call allocated ~20 full-size temporaries. Collapse warp + fbm/ridged + cube into one @njit(parallel=True) per-pixel kernel with prange over rows. numpy and dask+numpy both route through _gen_terrain, so both speed up; cupy and dask+cupy are unchanged. Worley blending keeps the numpy path since it needs a global min/max pass. ~97x faster at 2048x2048 (7368 -> 76 ms); output matches to float32 epsilon. Claude-Session: https://claude.ai/code/session_012qsDtsjQFiiNQ3MZjW3HBR

dask+numpy and numpy now share the fused kernel and the GPU parity tests are skipped on CPU-only CI, so add a reference built from the un-fused _perlin to independently check the kernel for fbm, ridged, and non-default lacunarity/persistence. Claude-Session: https://claude.ai/code/session_012qsDtsjQFiiNQ3MZjW3HBR

brendancol

PR Review: Speed up generate_terrain numpy backend with a fused parallel kernel

Blockers (must fix before merge)

None.

Suggestions (should fix, not blocking)

terrain.py:61 _check_memory still budgets _SCRATCH_BYTES_PER_PIXEL = 24 (the old 5-7 same-shape arrays). The fused fast path allocates far less per pixel: linx/liny plus one output array, no per-octave temporaries. So for worley-off calls the guard now over-estimates and can raise MemoryError on grids that would actually fit. Not a correctness problem (it errs toward refusing), but the budget could be revisited for the fast path. Fine as a follow-up.

Nits (optional improvements)

test_terrain.py: the reference test uses atol=2e-2, which is loose in absolute terms on zfactor-scaled values. The rtol=1e-4 is what actually guards against drift; the atol is only there to absorb the ridged feedback path's ~6e-3 absolute gap. A one-line comment would stop a future reader from tightening atol and hitting a flaky failure.
terrain.py: the dask path rebuilds the permutation tables per chunk via _make_perm_table. Cheap (256-entry shuffle) and it matches the pre-existing per-chunk structure, so no action needed. Noting it for the record.

What looks good

Output verified against the pre-change implementation to float32 epsilon: fbm 1.2e-7 rel, ridged 1.6e-6 rel, worley path byte-identical.
Edge cases confirmed: 1x1, single-row, single-column, 2x2, and strong warp pushing coordinates negative (the & 255 two's-complement wrap matches the numpy path).
prange writes distinct rows and acc/amp/freq/weight are loop-local, so there's no data race.
Worley stays on the numpy path because it needs a global min/max pass; the fast-path gate is worley_blend <= 0.
The new reference test gives the kernel an independent anchor on CPU-only CI, where dask-vs-numpy parity can't catch a shared-kernel bug and the GPU tests are skipped.
Memory footprint is lower than before (no meshgrid, no per-octave temporaries).

Checklist

Algorithm matches reference (perlin octave loop preserved, verified numerically)
All implemented backends produce consistent results (numpy/dask parity tests pass; GPU unchanged)
NaN handling is correct (all-NaN template overwritten, output finite)
Edge cases are covered (verified 1x1/row/col/warp; existing all-NaN tests pass)
Dask chunk boundaries handled correctly (per-chunk block_info coords unchanged)
No premature materialization or unnecessary copies
Benchmark exists (benchmarks/benchmarks/terrain.py)
README feature matrix not applicable (no new function, no backend change)
Docstrings present and accurate (public signature unchanged)

Pull the rtol/atol into named constants with a comment so a future reader does not tighten atol and flake the ridged case. Claude-Session: https://claude.ai/code/session_012qsDtsjQFiiNQ3MZjW3HBR

brendancol

Follow-up review

Pushed one follow-up commit for the review findings.

Nit (reference-test tolerance): fixed. The rtol/atol are now named constants with a comment explaining that rtol is the real drift guard and atol only absorbs the ridged path's ~6e-3 gap.
Suggestion (_check_memory 24 B/pixel budget): not changing it. The guard is a deliberately conservative memory safety check (caps scratch at 50% of RAM) and erring toward refusing is the safe direction. Loosening it to admit a few larger worley-off grids trades real OOM protection for a niche win, so I'd rather leave it.
Nit (per-chunk perm rebuild on dask): no action, it was informational and matches the existing per-chunk structure.

Terrain suite still green (52 passed). No blockers outstanding.

CI aborted on macOS 3.14 (SIGABRT) in the dask terrain test: numba's 'workqueue' threading layer is not threadsafe, so launching the parallel=True kernel concurrently from dask's threaded scheduler crashes the process. Take a module-level lock around the launch, the same hazard and fix the reproject kernels use (#3141). Claude-Session: https://claude.ai/code/session_012qsDtsjQFiiNQ3MZjW3HBR

brendancol added 2 commits June 27, 2026 16:00

brendancol commented Jun 27, 2026

View reviewed changes

Address review nit: document reference-test tolerance (#3565)

ffa7427

Pull the rtol/atol into named constants with a comment so a future reader does not tighten atol and flake the ridged case. Claude-Session: https://claude.ai/code/session_012qsDtsjQFiiNQ3MZjW3HBR

brendancol commented Jun 27, 2026

View reviewed changes

brendancol added 2 commits June 27, 2026 16:05

Merge remote-tracking branch 'origin/main' into issue-3565

5e75003

brendancol merged commit 36101af into main Jun 27, 2026
10 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Speed up generate_terrain numpy backend with a fused parallel kernel#3566

Speed up generate_terrain numpy backend with a fused parallel kernel#3566
brendancol merged 5 commits into
mainfrom
issue-3565

brendancol commented Jun 27, 2026

Uh oh!

brendancol left a comment

Uh oh!

brendancol left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

brendancol commented Jun 27, 2026

What changed

Results (numpy, default fbm, 16 octaves)

Correctness

Backends

Test plan

Uh oh!

brendancol left a comment

Choose a reason for hiding this comment

PR Review: Speed up generate_terrain numpy backend with a fused parallel kernel

Blockers (must fix before merge)

Suggestions (should fix, not blocking)

Nits (optional improvements)

What looks good

Checklist

Uh oh!

brendancol left a comment

Choose a reason for hiding this comment

Follow-up review

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant