Speed up generate_terrain numpy backend with a fused parallel kernel#3566
Merged
Conversation
) The numpy octave loop called _perlin once per octave, and each call allocated ~20 full-size temporaries. Collapse warp + fbm/ridged + cube into one @njit(parallel=True) per-pixel kernel with prange over rows. numpy and dask+numpy both route through _gen_terrain, so both speed up; cupy and dask+cupy are unchanged. Worley blending keeps the numpy path since it needs a global min/max pass. ~97x faster at 2048x2048 (7368 -> 76 ms); output matches to float32 epsilon. Claude-Session: https://claude.ai/code/session_012qsDtsjQFiiNQ3MZjW3HBR
dask+numpy and numpy now share the fused kernel and the GPU parity tests are skipped on CPU-only CI, so add a reference built from the un-fused _perlin to independently check the kernel for fbm, ridged, and non-default lacunarity/persistence. Claude-Session: https://claude.ai/code/session_012qsDtsjQFiiNQ3MZjW3HBR
brendancol
commented
Jun 27, 2026
brendancol
left a comment
Contributor
Author
There was a problem hiding this comment.
PR Review: Speed up generate_terrain numpy backend with a fused parallel kernel
Blockers (must fix before merge)
None.
Suggestions (should fix, not blocking)
- terrain.py:61
_check_memorystill budgets_SCRATCH_BYTES_PER_PIXEL = 24(the old 5-7 same-shape arrays). The fused fast path allocates far less per pixel: linx/liny plus one output array, no per-octave temporaries. So for worley-off calls the guard now over-estimates and can raise MemoryError on grids that would actually fit. Not a correctness problem (it errs toward refusing), but the budget could be revisited for the fast path. Fine as a follow-up.
Nits (optional improvements)
- test_terrain.py: the reference test uses
atol=2e-2, which is loose in absolute terms on zfactor-scaled values. Thertol=1e-4is what actually guards against drift; the atol is only there to absorb the ridged feedback path's ~6e-3 absolute gap. A one-line comment would stop a future reader from tightening atol and hitting a flaky failure. - terrain.py: the dask path rebuilds the permutation tables per chunk via
_make_perm_table. Cheap (256-entry shuffle) and it matches the pre-existing per-chunk structure, so no action needed. Noting it for the record.
What looks good
- Output verified against the pre-change implementation to float32 epsilon: fbm 1.2e-7 rel, ridged 1.6e-6 rel, worley path byte-identical.
- Edge cases confirmed: 1x1, single-row, single-column, 2x2, and strong warp pushing coordinates negative (the
& 255two's-complement wrap matches the numpy path). - prange writes distinct rows and acc/amp/freq/weight are loop-local, so there's no data race.
- Worley stays on the numpy path because it needs a global min/max pass; the fast-path gate is
worley_blend <= 0. - The new reference test gives the kernel an independent anchor on CPU-only CI, where dask-vs-numpy parity can't catch a shared-kernel bug and the GPU tests are skipped.
- Memory footprint is lower than before (no meshgrid, no per-octave temporaries).
Checklist
- Algorithm matches reference (perlin octave loop preserved, verified numerically)
- All implemented backends produce consistent results (numpy/dask parity tests pass; GPU unchanged)
- NaN handling is correct (all-NaN template overwritten, output finite)
- Edge cases are covered (verified 1x1/row/col/warp; existing all-NaN tests pass)
- Dask chunk boundaries handled correctly (per-chunk block_info coords unchanged)
- No premature materialization or unnecessary copies
- Benchmark exists (benchmarks/benchmarks/terrain.py)
- README feature matrix not applicable (no new function, no backend change)
- Docstrings present and accurate (public signature unchanged)
Pull the rtol/atol into named constants with a comment so a future reader does not tighten atol and flake the ridged case. Claude-Session: https://claude.ai/code/session_012qsDtsjQFiiNQ3MZjW3HBR
brendancol
commented
Jun 27, 2026
brendancol
left a comment
Contributor
Author
There was a problem hiding this comment.
Follow-up review
Pushed one follow-up commit for the review findings.
- Nit (reference-test tolerance): fixed. The rtol/atol are now named constants with a comment explaining that rtol is the real drift guard and atol only absorbs the ridged path's ~6e-3 gap.
- Suggestion (
_check_memory24 B/pixel budget): not changing it. The guard is a deliberately conservative memory safety check (caps scratch at 50% of RAM) and erring toward refusing is the safe direction. Loosening it to admit a few larger worley-off grids trades real OOM protection for a niche win, so I'd rather leave it. - Nit (per-chunk perm rebuild on dask): no action, it was informational and matches the existing per-chunk structure.
Terrain suite still green (52 passed). No blockers outstanding.
CI aborted on macOS 3.14 (SIGABRT) in the dask terrain test: numba's 'workqueue' threading layer is not threadsafe, so launching the parallel=True kernel concurrently from dask's threaded scheduler crashes the process. Take a module-level lock around the launch, the same hazard and fix the reproject kernels use (#3141). Claude-Session: https://claude.ai/code/session_012qsDtsjQFiiNQ3MZjW3HBR
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #3565
Speeds up
generate_terrainon the numpy backend (and dask+numpy, which routes through the same_gen_terrain).What changed
_perlinonce per octave, and each call allocated ~20 full-size numpy temporaries (floor, fancy-index gather, fade, lerp). At 16 octaves that's hundreds of allocations.@njit(parallel=True)kernel that computes warp + fbm/ridged + cube per pixel,prangeover rows. No per-octave temporaries._gen_terraintakes the fast path whenworley_blend == 0(the default). The worley path keeps the existing numpy code, since it needs a global min/max pass.Results (numpy, default fbm, 16 octaves)
dask+numpy at 2048x2048 drops to 78 ms.
Correctness
Output matches the previous implementation to float32 epsilon: fbm ~1.2e-7 relative, ridged ~1.6e-6 relative, worley path unchanged exactly.
Backends
_gen_terrain)Test plan
pytest xrspatial/tests/test_terrain.py(52 passed, includes GPU + dask parity)_perlinso the kernel has an independent anchor on CPU-only CI (fbm, ridged, non-default params)https://claude.ai/code/session_012qsDtsjQFiiNQ3MZjW3HBR