Skip to content

Speed up generate_terrain numpy backend with a fused parallel kernel#3566

Merged
brendancol merged 5 commits into
mainfrom
issue-3565
Jun 27, 2026
Merged

Speed up generate_terrain numpy backend with a fused parallel kernel#3566
brendancol merged 5 commits into
mainfrom
issue-3565

Conversation

@brendancol

Copy link
Copy Markdown
Contributor

Closes #3565

Speeds up generate_terrain on the numpy backend (and dask+numpy, which routes through the same _gen_terrain).

What changed

  • The numpy octave loop called _perlin once per octave, and each call allocated ~20 full-size numpy temporaries (floor, fancy-index gather, fade, lerp). At 16 octaves that's hundreds of allocations.
  • Replaced it with one fused @njit(parallel=True) kernel that computes warp + fbm/ridged + cube per pixel, prange over rows. No per-octave temporaries.
  • _gen_terrain takes the fast path when worley_blend == 0 (the default). The worley path keeps the existing numpy code, since it needs a global min/max pass.

Results (numpy, default fbm, 16 octaves)

  • 512x512: 244 ms -> 7.8 ms
  • 1024x1024: 1506 ms -> 21 ms
  • 2048x2048: 7368 ms -> 76 ms (~97x)

dask+numpy at 2048x2048 drops to 78 ms.

Correctness

Output matches the previous implementation to float32 epsilon: fbm ~1.2e-7 relative, ridged ~1.6e-6 relative, worley path unchanged exactly.

Backends

  • numpy: fused kernel (faster)
  • dask+numpy: faster for free (same _gen_terrain)
  • cupy, dask+cupy: unchanged (GPU was already ~28 ms at 2048x2048)

Test plan

  • pytest xrspatial/tests/test_terrain.py (52 passed, includes GPU + dask parity)
  • Added a reference-based test built from the un-fused _perlin so the kernel has an independent anchor on CPU-only CI (fbm, ridged, non-default params)
  • perlin / validation / templates / dask-task-name suites green

https://claude.ai/code/session_012qsDtsjQFiiNQ3MZjW3HBR

)

The numpy octave loop called _perlin once per octave, and each call
allocated ~20 full-size temporaries. Collapse warp + fbm/ridged + cube
into one @njit(parallel=True) per-pixel kernel with prange over rows.
numpy and dask+numpy both route through _gen_terrain, so both speed up;
cupy and dask+cupy are unchanged. Worley blending keeps the numpy path
since it needs a global min/max pass.

~97x faster at 2048x2048 (7368 -> 76 ms); output matches to float32
epsilon.

Claude-Session: https://claude.ai/code/session_012qsDtsjQFiiNQ3MZjW3HBR
dask+numpy and numpy now share the fused kernel and the GPU parity tests
are skipped on CPU-only CI, so add a reference built from the un-fused
_perlin to independently check the kernel for fbm, ridged, and
non-default lacunarity/persistence.

Claude-Session: https://claude.ai/code/session_012qsDtsjQFiiNQ3MZjW3HBR

@brendancol brendancol left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review: Speed up generate_terrain numpy backend with a fused parallel kernel

Blockers (must fix before merge)

None.

Suggestions (should fix, not blocking)

  • terrain.py:61 _check_memory still budgets _SCRATCH_BYTES_PER_PIXEL = 24 (the old 5-7 same-shape arrays). The fused fast path allocates far less per pixel: linx/liny plus one output array, no per-octave temporaries. So for worley-off calls the guard now over-estimates and can raise MemoryError on grids that would actually fit. Not a correctness problem (it errs toward refusing), but the budget could be revisited for the fast path. Fine as a follow-up.

Nits (optional improvements)

  • test_terrain.py: the reference test uses atol=2e-2, which is loose in absolute terms on zfactor-scaled values. The rtol=1e-4 is what actually guards against drift; the atol is only there to absorb the ridged feedback path's ~6e-3 absolute gap. A one-line comment would stop a future reader from tightening atol and hitting a flaky failure.
  • terrain.py: the dask path rebuilds the permutation tables per chunk via _make_perm_table. Cheap (256-entry shuffle) and it matches the pre-existing per-chunk structure, so no action needed. Noting it for the record.

What looks good

  • Output verified against the pre-change implementation to float32 epsilon: fbm 1.2e-7 rel, ridged 1.6e-6 rel, worley path byte-identical.
  • Edge cases confirmed: 1x1, single-row, single-column, 2x2, and strong warp pushing coordinates negative (the & 255 two's-complement wrap matches the numpy path).
  • prange writes distinct rows and acc/amp/freq/weight are loop-local, so there's no data race.
  • Worley stays on the numpy path because it needs a global min/max pass; the fast-path gate is worley_blend <= 0.
  • The new reference test gives the kernel an independent anchor on CPU-only CI, where dask-vs-numpy parity can't catch a shared-kernel bug and the GPU tests are skipped.
  • Memory footprint is lower than before (no meshgrid, no per-octave temporaries).

Checklist

  • Algorithm matches reference (perlin octave loop preserved, verified numerically)
  • All implemented backends produce consistent results (numpy/dask parity tests pass; GPU unchanged)
  • NaN handling is correct (all-NaN template overwritten, output finite)
  • Edge cases are covered (verified 1x1/row/col/warp; existing all-NaN tests pass)
  • Dask chunk boundaries handled correctly (per-chunk block_info coords unchanged)
  • No premature materialization or unnecessary copies
  • Benchmark exists (benchmarks/benchmarks/terrain.py)
  • README feature matrix not applicable (no new function, no backend change)
  • Docstrings present and accurate (public signature unchanged)

Pull the rtol/atol into named constants with a comment so a future
reader does not tighten atol and flake the ridged case.

Claude-Session: https://claude.ai/code/session_012qsDtsjQFiiNQ3MZjW3HBR

@brendancol brendancol left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up review

Pushed one follow-up commit for the review findings.

  • Nit (reference-test tolerance): fixed. The rtol/atol are now named constants with a comment explaining that rtol is the real drift guard and atol only absorbs the ridged path's ~6e-3 gap.
  • Suggestion (_check_memory 24 B/pixel budget): not changing it. The guard is a deliberately conservative memory safety check (caps scratch at 50% of RAM) and erring toward refusing is the safe direction. Loosening it to admit a few larger worley-off grids trades real OOM protection for a niche win, so I'd rather leave it.
  • Nit (per-chunk perm rebuild on dask): no action, it was informational and matches the existing per-chunk structure.

Terrain suite still green (52 passed). No blockers outstanding.

CI aborted on macOS 3.14 (SIGABRT) in the dask terrain test: numba's
'workqueue' threading layer is not threadsafe, so launching the
parallel=True kernel concurrently from dask's threaded scheduler
crashes the process. Take a module-level lock around the launch, the
same hazard and fix the reproject kernels use (#3141).

Claude-Session: https://claude.ai/code/session_012qsDtsjQFiiNQ3MZjW3HBR
@brendancol brendancol merged commit 36101af into main Jun 27, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Speed up generate_terrain on the numpy backend with a fused parallel kernel

1 participant