Skip to content

NanoVDB Python: GPU/device bindings (nanovdb.cuda)#2225

Open
swahtz wants to merge 19 commits into
AcademySoftwareFoundation:feature/nanovdb_pythonfrom
swahtz:feature/nanovdb_python_gpu
Open

NanoVDB Python: GPU/device bindings (nanovdb.cuda)#2225
swahtz wants to merge 19 commits into
AcademySoftwareFoundation:feature/nanovdb_pythonfrom
swahtz:feature/nanovdb_python_gpu

Conversation

@swahtz
Copy link
Copy Markdown
Contributor

@swahtz swahtz commented Jun 2, 2026

Builds on feature/nanovdb_python (the CPU api-mirror epoch) to add the GPU/device half of the NanoVDB Python bindings: a real nanovdb.cuda namespace, zero-copy interop, streams, device-side managers, the tools.cuda algorithm surface, and cuda::* infrastructure.

(Note: This is a PR into the feature/nanovdb_python branch and #2219 is a PR of that branch into main. This is to make this PR easier to read so that it only has the deltas of the CUDA bindings. The idea is we review/approve both and then merge this PR into the feature branch and #2219 can then merge into main.)

Example (CuPy)

For functionality (beyond the functionality this PR provides) that needs to be compiled as a kernel for speed, you can now use CuPy to write custom kernels that get JIT compiled and can i/o buffers from NanoVDB Python which is kind of neato.

Build a grid on the device, view its buffer zero-copy as a CuPy array, then run your own CUDA kernel over it — compiled against the headers the wheel ships:

import cupy as cp
import nanovdb

# Build a float level-set sphere directly on the device, then upload it.
handle = nanovdb.tools.cuda.createLevelSetSphere(nanovdb.GridType.Float, 20)
handle.deviceUpload(0, True)            # raw int stream handle; 0 == default stream

# Zero-copy: view the whole device buffer as a CuPy array (no copy).
buf = cp.asarray(handle)                # via __cuda_array_interface__ / __dlpack__
assert int(buf.data.ptr) == handle.device_ptr()

# Run a custom kernel over the device grid. compile_options() supplies the
# -I for the NanoVDB headers bundled in the wheel.
device_grid = handle.deviceGrid(0)      # raw NanoGrid<float>* in GPU memory
kernel = cp.RawKernel(r'''
#include <nanovdb/NanoVDB.h>
extern "C" __global__
void inspect(const nanovdb::NanoGrid<float>* grid, float* out) {
    auto acc = grid->getAccessor();
    out[0] = acc.getValue(nanovdb::Coord(0, 0, 0));
    out[1] = float(grid->activeVoxelCount());
}
''', "inspect", options=nanovdb.cuda.compile_options("-std=c++17"), backend="nvrtc")

out = cp.zeros(2, dtype=cp.float32)
kernel((1,), (1,), (device_grid.data_ptr(), out.data.ptr))
cp.cuda.runtime.deviceSynchronize()
print(out)   # -> [-3.0, 30254.0]  (value@origin, activeVoxelCount)

nanovdb.io.deviceReadGrid("grid.nvdb") reads a grid straight onto the device the same way. See examples/ for the runnable versions (gpu_load_inspect.py, cupy_rawkernel.py) and the examples README for the full GPU walkthrough.

What's new

nanovdb.cuda (infrastructure)

  • DeviceBuffer / DeviceGridHandle moved from the root module into nanovdb.cuda (breaking — see Migration), now with __cuda_array_interface__ v3, __dlpack__/__dlpack_device__, device_ptr/host_ptr, from_external (wrap caller-owned device/managed memory) and from_buffer.
  • Stream-aware deviceUpload/deviceDownload (raw int stream handles).
  • UnifiedBuffer/UnifiedGridHandle (managed memory), DeviceMesh, DeviceStreamMap, DeviceResource, TempDevicePool.
  • Device createNodeManager and buildVoxelBlockManager (with device firstLeafID()/jumpMap() views).

nanovdb.tools.cuda (algorithms) — all take an optional stream

  • Generalized pointsToGrid (world-space float/double → Point) alongside the voxel rasterizers (voxelsTo{OnIndex,Index,RGBA8}Grid, pointsToRGBA8Grid).
  • Morphology/topology: dilateGrid/coarsenGrid/refineGrid/pruneGrid/mergeGrids.
  • Index/QC: indexToGrid, addBlindData, updateGridStats, evalChecksum/validateChecksum/updateChecksum, isValid, signedFloodFill, sampleFromVoxels.
  • DistributedPointsToGrid (multi-GPU; single-GPU path exercised here).
  • tools.cuda.inject / injectPredicateToMask / injectGridMask — bind the sidecar-injection operators from util/cuda/Injection.cuh (the NanoVDB 2.0 paper's §3.4 injectData): copy per-voxel data (scalar or, via a 2-D overload, vector-valued) across a topology change over the src∩dst intersection, and build the per-leaf Mask<3> that pruneGrid consumes.
  • tools.cuda.gatherBoxStencil / activeVoxelCoords — dense decode of the VBM stencil pipeline (decodeInverseMaps + computeBoxStencil are device-only): gatherBoxStencil writes each active voxel's 3×3×3 neighbour values to a (valueCount, 27) device array (centre 13; faces 4/10/12/14/16/22), activeVoxelCoords writes index-space positions to (valueCount, 3) — so dense array/tile frameworks (CuPy, cuTile) can run VDB stencils, and bake results back, without a hand-written decode kernel.

Custom-kernel interopgrid.data_ptr() exposes the device grid pointer and nanovdb.cuda.compile_options() returns the NVRTC include flags, e.g. a CuPy RawKernel that #include <nanovdb/NanoVDB.h> and reads a const NanoGrid<float>*.

Tests / examples / docsTestGpuInterop.py, a 37-test CuPy/Torch-guarded suite (self-skips without CUDA/GPU) wired into ctest; four runnable examples (gpu_load_inspect, cupy_rawkernel, numba_cuda, triton_kernel); and a GPU section in the examples README.

Build/CI — nanobind bumped to 2.12.0 scoped to the NanoVDB CI job + the wheel (pyproject.toml pinned >=2.12.0,<3). The other install sites and FUTURE_MINIMUM_NANOBIND_VERSION stay at 2.5.0, and the binding code remains ≥2.5.0 source-compatible (the shared build.yml co-compiles it against 2.5.0).

examples/levelset_filter.py + three backends — one full GPU level-set filter (Laplacian deform + Godunov reinit + narrow-band retrack via the VBM and dilateGrid/inject/injectPredicateToMask/pruneGrid, style-preserving over FloatGrid / OnIndexGrid+blind-SDF .nvdb files). The driver levelset_filter.py owns everything that isn't backend-specific (read / write / retrack / outer loop) behind a small Backend protocol and a {rawkernel|cupy|cutile} selector; each backend supplies only the dense per-voxel stencils and also runs standalone as a stencil-only smoke test. The three differ only in that compute: levelset_filter_rawkernel.py (a fused cupy.RawModule CUDA kernel), levelset_filter_cupy.py (kernel-free CuPy array ops), and levelset_filter_cutile.py (NVIDIA cuTile @ct.kernels) — all producing identical results.

Also included

A core fix in tools/cuda/GridChecksum.cuh: both evalChecksum overloads copied a 256-entry-LUT-sized region back into the 4-byte checksum field (OOB read → runtime abort); corrected to sizeof(uint32_t). Independent of the Python work, but tools.cuda.evalChecksum aborts without it.

Validation

Validated locally on a Blackwell (sm_120) GPU with CuPy: CAI/DLPack zero-copy round-trips, streams, from_external, device NodeManager/VBM, the rasterizer/morphology/QC ops, and the managed-memory distributed path; nanobind 2.12.0 confirmed to compile both nanovdb_python and openvdb_python. CI does not exercise GPU pathsctest excludes cuda/mgpu and the Python GPU tests self-skip via isCudaAvailable()/isGpuAvailable().

swahtz added 10 commits June 2, 2026 02:31
…andle seam

First step of the GPU/device bindings epoch (issue AcademySoftwareFoundation#2208 follow-on).

- Create a real root `nanovdb.cuda` submodule (mirrors C++ nanovdb::cuda)
  and MOVE DeviceBuffer + DeviceGridHandle into it. Breaking, pre-1.0
  relocation: `nanovdb.DeviceBuffer` -> `nanovdb.cuda.DeviceBuffer`,
  `nanovdb.DeviceGridHandle` -> `nanovdb.cuda.DeviceGridHandle`.
  `nanovdb.tools.cuda` is untouched. No in-repo call site named the moved
  classes (handles come from io.deviceReadGrid / tools.cuda.create*), so
  only a migration note in python/examples/README.md was needed.
- Add a BufferT-templated seam in PyDeviceBuffer.h: defineDeviceBufferLike
  <BufferT>() + addDeviceInterop<BufferT>() keyed only on the duck-typed
  buffer surface (size()/data()/deviceData()), so a future resource-backed
  buffer or UnifiedBuffer registers via one call. Phase A binds only the
  additive size() accessor; CAI/DLPack/pointers/streams land in Phase B.
- Expose nanovdb.cuda.compile_options(*extra) via __init__.py, hasattr-
  guarded so non-CUDA builds still import.

Compatible with nanobind >= 2.5.0 (no newer-only APIs). Verified by per-TU
g++/nvcc compile checks of the changed TUs; runtime import is exercised by
the shared build.yml pytest target.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Fills the Phase-A addDeviceInterop<BufferT> seam and the device GridHandle
with the GPU interop surface:

- __cuda_array_interface__ v3 (read-only dict, whole device buffer as 1-D
  uint8) on DeviceBuffer and DeviceGridHandle.
- __dlpack__ / __dlpack_device__ on both: __dlpack__ builds an
  nb::ndarray<device::cuda,uint8> view parented to the owner (keep_alive)
  and returns nanobind's "dltensor" capsule directly — nanobind owns the
  capsule/deleter/alignment; no hand-built DLManagedTensor.
- Raw pointers: DeviceBuffer.device_ptr()/host_ptr(), DeviceGridHandle
  .device_ptr(), and a generic grid.data_ptr() (host or device per
  provenance — grid(n) vs deviceGrid(n)).
- stream:int=0 threaded through deviceUpload/deviceDownload (was hardcoded
  nullptr), via the current-device (void*,bool) overload so the targeted
  device matches deviceData().
- DeviceBuffer.from_external(size, gpu_ptr, cpu_ptr) (non-owning wrap of
  external host+device memory; mManaged==0 so neither ptr is freed) and
  DeviceGridHandle.from_buffer(buffer) (moves the buffer; ctor validates
  the grid header).

Device-only array protocols live ONLY on DeviceBuffer/DeviceGridHandle,
never the shared NanoGrid<T> class (which also wraps host grids).

Two correctness fixes from adversarial review folded in: __dlpack__ now
returns the cast capsule directly (calling .attr("__dlpack__") on it would
AttributeError, since nb::cast of a no-framework device ndarray already IS
the capsule); deviceUpload/deviceDownload use the current-device overload
instead of hardcoding device 0.

Known limitation: from_external requires BOTH a host and device pointer
because nanovdb::cuda::DeviceBuffer's external ctor asserts both non-null;
wrapping a device-only CuPy pointer would need a device-only external ctor
in the core (follow-up / resource-backed buffer). Compatible with nanobind
>= 2.5.0. Verified by per-TU g++/nvcc compile checks.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
…BlockManager

- nanovdb.cuda.createDeviceNodeManager(device_grid, stream=0) + a
  DeviceNodeManagerHandle (over NodeManagerHandle<cuda::DeviceBuffer>):
  polymorphic over BuildTypes.def; takes a device grid (from
  DeviceGridHandle.deviceGrid(n)); mgr() returns the typed device
  NodeManager (kernel-only `this`). New cuda/PyDeviceNodeManager.cu.
- nanovdb.tools.cuda.buildVoxelBlockManager(device_onindex_grid,
  log2_block_width=6, ..., stream=0) + a DeviceVoxelBlockManagerHandle
  whose firstLeafID() (uint32) / jumpMap() (uint64) are zero-copy DEVICE
  arrays exposing the Phase-B CAI/DLPack interop. New
  cuda/PyDeviceVoxelBlockManager.cu. The __device__ decodeInverseMaps is
  intentionally not bound (kernel-only; use the shipped headers).

Four adversarial-review/runtime fixes folded in (all validated on the GPU):
1. Dropped nb::keep_alive<0,1> on the device firstLeafID()/jumpMap() views
   — the returned no-framework device ndarray is a DLPack capsule (not
   weak-referenceable), so keep_alive threw "could not create a weak
   reference"; the ndarray's py_self owner already anchors lifetime.
2. DeviceNodeManagerHandle mgr()/__bool__ now gate on size(), not the host
   data() pointer (which is null for a device-only NodeManager).
3. Dropped DeviceNodeManagerHandle.deviceUpload/deviceDownload — they
   null-deref on a device-built NodeManager (no host mirror); the handle
   is created device-resident, nothing to upload.
4. Device VoxelBlockManager registered on the existing nanovdb.tools.cuda
   (mirrors nanovdb::tools::cuda), not a bogus new nanovdb.cuda.tools.

Compatible with nanobind >= 2.5.0. Verified by per-TU compile + a real
incremental build + GPU runtime smoke (createDeviceNodeManager, device VBM
build, cupy CAI/DLPack round-trips).

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
…params

- Generalize the CUDA points->grid surface beyond Rgba8 (nanovdb.tools.cuda):
  - Coordinate input (int32 (N,3) voxel coords) -> grid via voxelsToGrid<T>
    for Rgba8 / ValueOnIndex / ValueIndex (voxelsToRGBA8Grid /
    voxelsToOnIndexGrid / voxelsToIndexGrid). Legacy pointsToRGBA8Grid kept
    (routes to voxelsToGrid<Rgba8>), so existing callers/tests are unchanged.
  - World-position input ((N,3) float OR double) -> NanoGrid<Point> via a new
    pointsToGrid (both precisions under one name; nanobind dtype dispatch).
  - Point is intentionally NOT offered on the coord path: PointsToGrid<Point>
    static_asserts Vec3f/Vec3d coords, so Point grids build from world
    positions only (documented).
- Add stream:int=0 (reinterpret_cast<cudaStream_t>) to every CUDA-launching
  binding — pointsToGrid/voxelsToGrid, signedFloodFill, both sampleFromVoxels
  overloads — with nb::gil_scoped_release around each kernel launch. Stream is
  a trailing default kwarg, so existing call sites are unaffected.
- World path constructs PointsToGrid<Point>(voxelSize, ...) directly rather
  than the fixed-voxelSize free function, which is declared-but-undefined in
  the header (would link-fail).

Returns DeviceGridHandle (Phase-A/B device interop). Compatible with
nanobind >= 2.5.0. Verified by per-TU compile + real incremental build + GPU
runtime smoke (world float32/float64 -> Point grids, coord -> OnIndex/Index/
RGBA8, non-default streams) and the full Python suite (140/140 pass).

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Bind 5 of the 6 tools::cuda morphology/topology grid operators into
nanovdb.tools.cuda, one Py*.cu + .h per header:

- dilateGrid(d_grid, op=26, stream=0)  (NN_FACE=6 / NN_FACE_EDGE_VERTEX=26)
- coarsenGrid(d_grid, stream=0)        (2x topological downsample)
- refineGrid(d_grid, stream=0)         (2x topological upsample)
- pruneGrid(d_grid, leafMask, stream=0) (leafMask: device uint64 array
  reinterpreted as Mask<3>* per leaf; word count validated as a multiple of 8)
- mergeGrids(d_grid1, d_grid2, stream=0) (binary active-mask union; chain for >2)

All take a device grid (NanoGrid<ValueOnIndex>* from deviceGrid(n)) + a
stream:int=0, run getHandle() under nb::gil_scoped_release, and return a
DeviceGridHandle (Phase-A/B CAI/DLPack). Instantiated for ValueOnIndex ONLY:
every op embeds TopologyBuilder<BuildT>, which static_asserts is_onindex.

TopologyBuilder itself is intentionally NOT bound — it is the low-level
multi-stage engine (ordered allocate/count/process* calls, no one-shot
getHandle), not a user-facing op.

Each .cu drops `using namespace nanovdb;` and fully-qualifies nanovdb:: to
avoid a CUB DeviceScan vs nanovdb::cuda 'reference to cuda is ambiguous'
error (same fix PyPointsToGrid.cu already uses).

Compatible with nanobind >= 2.5.0. Verified by per-TU compile + real
incremental build + GPU runtime smoke: all 5 ops run on a device OnIndex
grid with correct active-voxel deltas (dilate grows, coarsen ~8x shrink,
refine ~8x grow; prune + binary merge succeed).

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Both evalChecksum overloads (GridData* and NanoGrid<BuildT>*) copied the
computed head/tail CRC from device to host with size `headSize`
(sizeof(GridData)+sizeof(TreeData), ~672 bytes) into the 4-byte
Checksum::head()/tail() uint32_t destinations, reading ~672 bytes from
`d_lut.get()+256` — one uint32_t past the 1024-byte CRC LUT. On a real
device this trips "CUDA error 1: invalid argument" and cudaCheck aborts.
The sibling validateChecksum copies the CRC correctly with
sizeof(uint32_t); match it. The full-GridData header copies (which
legitimately use headSize) are unchanged.

Surfaced by the new nanovdb.tools.cuda.evalChecksum Python binding;
verified fixed on a Blackwell GPU (evalChecksum returns a Checksum for
CheckMode.Partial and Full instead of aborting).

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Bind into nanovdb.tools.cuda (one Py*.cu + .h per header; stream:int=0 +
nb::gil_scoped_release; fully-qualified nanovdb:: per the CUB ambiguity fix):

- indexToGrid(index_grid, values, stream=0) -> DeviceGridHandle: device
  ValueIndex/ValueOnIndex grid + a device values ndarray -> typed grid.
  Scalar (float/double, flat 1-D) and Vec3 (Vec3f/Vec3d, (N,3) c_contig)
  destinations. Dst restricted to non-special types (processLeafsKernel
  static_assert !is_special); Src restricted to index types (SFINAE).
- addBlindData(grid, blind_data, class, semantic, name, stream=0) ->
  DeviceGridHandle: appends a device blind-data channel (valueCount from
  the array length). Reuses existing GridBlindDataClass/Semantic enums.
- Device QC mirroring the host names on tools.cuda:
  - updateGridStats(grid, mode=StatsMode.Default, stream=0) -> None
  - isValid(grid, mode=CheckMode.Default, verbose=False, stream=0) -> bool
  - evalChecksum/validateChecksum/updateChecksum(grid, mode, stream=0)
    (device grid reinterpreted as GridData*; the GridData* overloads only
    copy the header host-side).
  updateGridStats bound for scalar/vector/bool BuildTs (special/index/mask
  trip Extrema/Stats static_asserts); isValid + checksum bound across the
  full 20-type set.

Include ordering: the QC .cu files include <nanovdb/cuda/GridHandle.cuh>
before the tool .cuh (the tool headers aren't self-contained for the
device GridHandle's updateChecksum/NodeManager uses), matching
unittest/TestNanoVDB.cu.

Compatible with nanobind >= 2.5.0. Verified by per-TU compile + real
incremental build + GPU runtime smoke (updateGridStats/isValid/evalChecksum/
validateChecksum/updateChecksum, indexToGrid float+OnIndex, addBlindData).

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
…PointsToGrid

On nanovdb.cuda (NanoVDBModule.cc):
- UnifiedBuffer: registered via the Phase-A defineDeviceBufferLike seam, so
  it gets CAI/DLPack/device_ptr/host_ptr/size for free (managed memory:
  host_ptr == device_ptr). Plus ctors, capacity/resize/clear, advise,
  prefetch, deviceUpload/deviceDownload.
- UnifiedGridHandle (GridHandle<UnifiedBuffer>): needed because
  DistributedPointsToGrid.getHandle returns one; full GridHandle surface +
  deviceGrid/deviceUpload/deviceDownload/device_ptr.
- DeviceMesh (+ DeviceNode), DeviceStreamMap (+ DeviceType enum): host-query
  surface for multi-device/stream coordination.
- DeviceResource (static allocateAsync/deallocateAsync + DEFAULT_ALIGNMENT)
  and TempDevicePool (TempPool<DeviceResource>) bound in their CURRENT shape
  only — not reshaped into the future resource concept.

On nanovdb.tools.cuda (PyTools.cc):
- DistributedPointsToGrid / DistributedIndexPointsToGrid /
  DistributedRGBA8PointsToGrid (ValueOnIndex / ValueIndex / Rgba8): ctor
  (DeviceMesh + scale/translation, mesh kept alive via keep_alive) +
  getHandle((N,3) int32 coords) -> UnifiedGridHandle.

getHandle requires nb::device::cuda_managed input: the C++ pipeline issues
cudaMemAdvise/memPrefetchAsync directly on the pointer, so the array MUST be
CUDA managed/unified memory. (Adversarial review caught that the initial
nb::device::cuda tag made it uncallable — managed input rejected, plain
device input crashed; fixed to cuda_managed.)

Forward-compat guardrail honored: none of the reserved future names
(Resource/MemoryResource/AsyncResource/PinnedResource/ResourceDeviceBuffer/
DeviceBuffer2/default_resource/set_default_resource) are bound. NCCL-gated
and __device__-only internals left unbound.

Compatible with nanobind >= 2.5.0. Verified by concurrent per-TU compile +
real incremental build + GPU runtime smoke: UnifiedBuffer zero-copy
CAI/DLPack round-trip, DeviceMesh/DeviceStreamMap construct, and all three
Distributed*PointsToGrid build OnIndex/Index/RGBA8 grids from managed coords
on a single GPU (plain-device input cleanly rejected).

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
…examples, and docs

Adds the test/example/documentation surface for the NanoVDB Python GPU
bindings landed in the preceding phases. No C++ binding changes.

- test/TestGpuInterop.py: stdlib-unittest suite (37 tests) mirroring
  TestNanoVDB.py gating — every case is class-guarded on isCudaAvailable()
  and isGpuAvailable(), CuPy is guarded per-test, and Torch via try-import +
  skipTest, so the suite self-skips cleanly on a non-CUDA/no-GPU build.
  Covers CAI v3 dict + DLPack round-trips aliasing device_ptr (nbytes==size),
  from_external (managed wrap + null rejection), from_buffer move/reject,
  non-default streams, the host-accessor-on-device-grid SIGSEGV contract
  (asserted in a subprocess so it cannot abort the runner), UnifiedBuffer,
  the generalized point/voxel rasterizers, device morphology/QC/flood-fill/
  sampling, device NodeManager and VoxelBlockManager (firstLeafID/jumpMap
  device views), the cuda infrastructure types, and DistributedPointsToGrid
  over managed memory.

- examples/{gpu_load_inspect,cupy_rawkernel,numba_cuda,triton_kernel}.py:
  runnable, import-guarded examples. cupy_rawkernel.py demonstrates the
  device-pointer ABI — a cp.RawKernel that #includes NanoVDB.h via
  compile_options() and reads a const NanoGrid<float>* from grid.data_ptr().
  numba/triton examples self-skip with an install hint when absent.

- examples/README.md: a GPU/CUDA section covering the nanovdb.cuda (infra)
  vs nanovdb.tools.cuda (algorithms) split, attribute-not-submodule access,
  the device-grid build recipe, zero-copy interop + device-pointer ABI, the
  host-accessor caveat, streams, from_external/from_buffer, the managed-memory
  distributed pipeline, compile_options for NVRTC, the relocation migration
  note, and a forward-looking pluggable-memory-resources roadmap.

- python/CMakeLists.txt: registers TestGpuInterop.py as its own ctest test
  (pytest_nanovdb_gpu_interop) mirroring pytest_nanovdb; self-skips without
  a GPU so it is harmless in non-CUDA CI.

Validated locally on a Blackwell GPU with CuPy: full suite green (1 torch
skip), both required examples run end-to-end, TestNanoVDB.py shows no
regression, and a CMake reconfigure confirms both ctest entries register.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
…e wheel

The NanoVDB Python GPU bindings (CUDA array interface, DLPack export, device
interop) benefit from a newer nanobind than the repo-wide 2.5.0 floor. Scope
the bump to the two places that build the published NanoVDB surface:

- .github/workflows/nanovdb.yml: install nanobind 2.12.0 (latest 2.x) for the
  dedicated NanoVDB job.
- pyproject.toml: pin the wheel's nanobind to >=2.12.0,<3 (it was previously
  unpinned and silently floated to latest, diverging from CI).

The other seven install sites (build/houdini/docs/ax/weekly) and the global
FUTURE_MINIMUM_NANOBIND_VERSION stay at 2.5.0: nanobind is a shared dependency
and the co-compiled pyopenvdb module plus the shared build.yml still target
2.5.0, so the binding code remains >=2.5.0 source-compatible. The <3 ceiling
keeps the wheel on the 2.x series.

Verified locally on a Blackwell GPU: nanobind 2.12.0 (installed to a separate
prefix, CMake-resolved version confirmed 2.12.0) compiles both nanovdb_python
and openvdb_python with no errors and no deprecation warnings.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
swahtz added 9 commits June 2, 2026 22:54
Expose NanoVDB's sidecar-injection operators (util/cuda/Injection.cuh) on the
nanovdb.tools.cuda submodule. These carry a grid's per-voxel data across a
topology change (dilate/prune/merge), which the morphology ops alone do not do
— the OnIndex value-index space changes, so values must be re-injected.

- inject(src_grid, dst_grid, src_sidecar, dst_sidecar, stream=0): copies sidecar
  values for every voxel present in BOTH grids (the intersection); destination
  voxels with no source counterpart are left unchanged. Wraps
  InjectGridDataFunctor via util::cuda::operatorKernel (one block per source
  leaf, bit-parallel per warp). Bound for float and double sidecars.
- injectPredicateToMask(grid, predicate, leaf_masks, stream=0): turns a boolean
  predicate over a grid's value indices into the per-leaf Mask<3> retain mask
  that pruneGrid consumes, so a value-driven trim (e.g. |phi| <= halfWidth) is
  two device ops with no hand-rolled kernel. Wraps InjectPredicateToMaskFunctor.
  leaf_masks need only be >= (leaf count)*8 uint64; activeVoxelCount*8 is a safe
  size since every leaf holds at least one active voxel.

This is the GPU realization of the injectData abstraction in the NanoVDB 2.0
paper (section 3.4), and completes the dilate -> inject -> prune narrow-band
rebuild loop entirely with bound device ops.

Validated locally on a Blackwell GPU: inject copies exactly the src/dst
intersection (leaving the rest, including the background slot, untouched);
injectPredicateToMask + pruneGrid trims to exactly the predicate-true voxel set;
a second inject carries values onto the pruned grid; bad grid arguments raise.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
…section mask)

Completes the util/cuda/Injection.cuh surface on nanovdb.tools.cuda:

- inject(...) gains a vector-valued overload accepting 2-D (value count, dim)
  device sidecars, wrapping InjectGridFeatureFunctor. nanobind dispatches on
  ndim, so the same `inject` name handles both scalar (1-D) and feature (2-D)
  sidecars; the feature dimension is taken from shape[1].
- injectGridMask(src_grid, dst_grid, leaf_masks, stream=0): builds a per-leaf
  Mask<3> over the destination grid marking the voxels also active in the
  source (the src/dst intersection), ready to feed pruneGrid. Wraps
  InjectGridMaskFunctor via util::cuda::lambdaKernel. Like
  injectPredicateToMask, leaf_masks need only be >= (dst leaf count)*8 uint64.

All four Injection.cuh functors are now exposed (inject scalar + feature,
injectPredicateToMask, injectGridMask).

Validated locally on a Blackwell GPU: the feature overload copies the
intersection rows (leaving the rest, including the background row, untouched);
injectGridMask + pruneGrid recovers exactly the source/destination intersection;
the 1-D scalar inject overload still dispatches correctly.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
…eable owner

getBlindData() constructed its NumPy view with the generic nb::ndarray<T, ...>
type (no framework tag) owned by the grid, then returned it with
rv_policy::reference. Without the nb::numpy tag nanobind takes a keep-alive
path that needs the owner (the grid) to be weak-referenceable; grid classes are
not, so every call raised "nb::detail::keep_alive(): could not create a weak
reference!" and the function was unusable.

Tag the views as nb::ndarray<nb::numpy, T, ...> (1-D and 2-D), matching the
working zero-copy views elsewhere (LeafNode.values(), the VoxelBlockManager
firstLeafID()/jumpMap() host views). nanobind then attaches the owner as the
NumPy array's base (a strong reference, no weakref), and the existing
keep_alive<0,1> on the def keeps the grid — and the GridHandle that owns the
buffer — alive for the view's lifetime.

Verified locally: grid.getBlindData(0) on an OnIndex grid with a float SDF
channel returns the expected (valueCount,) float32 array, and the view stays
valid after the grid handle is dropped and garbage-collected.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
A full GPU level-set filter on a .nvdb file, driven by the device
VoxelBlockManager, exercising the inject / injectPredicateToMask bindings and
grid.getBlindData added in this branch. Each iteration runs a Laplacian-flow
deform and a first-order Godunov reinitialisation as VBM stencil kernels
(buildVoxelBlockManager + decodeInverseMaps/computeBoxStencil from a
cupy.RawModule), then retracks the narrow band natively with dilateGrid ->
inject -> injectPredicateToMask -> pruneGrid -> inject.

Reads either a FloatGrid or an OnIndexGrid with the SDF in a blind channel
(detected via gridType), normalising both to an OnIndex topology + a float SDF
sidecar read with grid.getBlindData, and writes the result back in the same
style. Run with no arguments it self-tests both forms (sphere shrinks under
curvature flow); it self-skips when CUDA / a GPU / CuPy is unavailable.

Listed in python/examples/README.md.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Materialise, for every active voxel of an OnIndex device grid, the values of
its 3x3x3 neighbourhood into a dense (valueCount, 27) device array, so tile /
array frameworks that cannot pointer-chase the VDB tree (CuPy, cuTile, ...) can
run VDB stencils on a plain dense array. A transient VoxelBlockManager is built
internally; each block decodes its inverse maps and calls the device
computeBoxStencil, then writes the 27 neighbour values per voxel (inactive
spokes read the sidecar's background slot, value index 0). Bound for float and
double sidecars.

Column j is the 3x3x3 spoke (di+1)*9+(dj+1)*3+(dk+1): centre j=13, the six faces
j = 4, 10, 12, 14, 16, 22. This is the gather half of the VBM stencil pipeline
(decodeInverseMaps + computeBoxStencil are device-only); pairing it with a dense
sidecar lets the per-voxel arithmetic run in CuPy/cuTile without a hand-written
CUDA kernel.

Validated locally on a Blackwell GPU: with values[k]=k the centre column equals
each voxel's own value index and all six opposite-face pairs are symmetric
(k's +x neighbour's -x neighbour is k); float and double overloads agree.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
…example

The kernel-free counterpart to cupy_levelset_filter.py. Instead of calling the
device VoxelBlockManager decode + computeBoxStencil from a hand-written
cupy.RawModule kernel, it uses tools.cuda.gatherBoxStencil to materialise every
active voxel's 3x3x3 neighbourhood into a dense (valueCount, 27) array, then
runs the Laplacian-flow deform and first-order Godunov reinitialisation as plain
CuPy array math -- no CUDA kernel, no thread indexing, no coordinates. (The same
dense array is what a cuTile @ct.kernel would ct.load as tiles.)

The sidecar's background slot is set to a sentinel so inactive neighbours are
detectable in the gathered array, which lets the pure-CuPy code apply the same
sign-consistent clamped boundary condition (copysign(background, phi_centre))
the kernel version uses. Fixed-topology stencils only; the narrow-band retrack
and file I/O stay in cupy_levelset_filter.py.

Requires only CuPy (no nvcc / NanoVDB headers, since nothing is compiled at
runtime); self-skips otherwise. Verified on a Blackwell GPU: a few Laplacian
steps raise mean||grad phi|-1| from ~0 to ~0.057, and the Godunov reinit brings
it back to ~0.010 -- all computed kernel-free from gatherBoxStencil. Listed in
python/examples/README.md.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
…encil a full kernel-free filter

Binding: activeVoxelCoords(device_grid, out) writes each active voxel's
index-space coordinate into a dense (valueCount, 3) int32 array keyed by value
index -- the decode companion to gatherBoxStencil. It is the VBM coordinate
decode exposed as one bound op (transient VBM built internally), so callers can
recover per-voxel positions without a hand-written decode kernel (e.g. to bake a
sidecar result back into a grid).

Example: with activeVoxelCoords closing the last gap (the write needed
per-voxel coordinates), cupy_gather_stencil.py becomes the full kernel-free
counterpart to cupy_levelset_filter.py -- the same .nvdb->.nvdb level-set filter
(deform + Godunov reinit + narrow-band retrack, style-preserving) with no
cupy.RawModule / CUDA kernel anywhere:
  * read   : createOnIndexGrid + grid.getBlindData
  * deform : gatherBoxStencil -> dense (N,27); Laplacian in CuPy
  * renorm : same gather; first-order Godunov in CuPy
  * retrack: dilateGrid -> inject -> extrapolate (CuPy on the gather) ->
             |phi|<=halfWidth predicate (CuPy) -> injectPredicateToMask ->
             pruneGrid -> inject
  * write  : activeVoxelCoords -> tools.build.FloatGrid, written in the input's
             style (FloatGrid, or OnIndex with the SDF in a blind channel)
All per-voxel data lives in dense arrays (the shape a cuTile @ct.kernel would
ct.load). Requires only CuPy -- no nvcc / NanoVDB headers, since nothing is
compiled at runtime.

Validated on a Blackwell GPU: activeVoxelCoords reproduces the exact active
coordinate set; the kernel-free filter self-test matches the RawModule version
(FloatGrid sphere 20.02 -> 19.52, OnIndex sphere 18.05 -> 17.67) for both input
styles, output style preserved. Listed in python/examples/README.md.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
… trio to levelset_filter_<backend>

The three GPU level-set filter examples are one and the same filter
(tools::LevelSetFilter-style Laplacian deform + Godunov reinit +
narrow-band retrack, driven by the device VoxelBlockManager, style-
preserving over FloatGrid / OnIndexGrid+blind-SDF inputs). They differ
only in how the dense per-voxel stencils are computed, so name them by
that backend so they sort together and read as a set:

  cupy_levelset_filter.py -> levelset_filter_rawkernel.py  (fused cupy.RawModule CUDA kernel)
  cupy_gather_stencil.py  -> levelset_filter_cupy.py       (kernel-free, plain CuPy array ops)
  (new)                      levelset_filter_cutile.py     (NVIDIA cuTile tile kernels)

levelset_filter_cutile.py is levelset_filter_cupy.py with the per-voxel
stencil math (deform / renorm / extrapolate) moved from CuPy array ops
into cuda.tile (`@ct.kernel`, ct.load/ct.store over (TILE,) tiles); the
sparse half (gatherBoxStencil / activeVoxelCoords / inject / dilateGrid /
pruneGrid) is shared. Imports of cupy / cuda.tile are guarded so the file
imports without a GPU and the self-test skips cleanly.

Docstrings cross-reference the trio; README groups them under one table
framing them as the same filter in three compute backends.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
…pluggable backends

The three GPU level-set filter examples shared ~150 lines of identical code
(the .nvdb read, the style-preserving write, the narrow-band retrack, the
surface probe, the outer loop, the self-test) and differed only in the three
per-voxel stencil functions. Split that common half into a runnable driver and
reduce each backend to just its distinctive compute:

  levelset_filter.py            -- driver: all shared I/O / retrack / loop, a
                                   small Backend protocol, and a `--backend`
                                   selector; runs the full file->file filter.
  levelset_filter_rawkernel.py  -- Backend: the fused cupy.RawModule CUDA kernel
  levelset_filter_cupy.py       -- Backend: kernel-free CuPy array ops
  levelset_filter_cutile.py     -- Backend: NVIDIA cuTile tile kernels

A backend implements setup / active_coords / laplacian / godunov / extrapolate
plus a module-level make_backend(); the gather-based CuPy and cuTile backends
share a GatherBackend base (gatherBoxStencil / activeVoxelCoords), while the
rawkernel backend keeps its own VBM bookkeeping in the opaque per-grid context.
Each backend file also runs standalone as a stencil-only smoke test (deform +
renorm on a sphere, no file I/O and no retrack) of just that backend's math.

Net -197 lines. Behaviour is unchanged: all three backends produce identical
results to before (FloatGrid + OnIndex spheres shrink 18.0525 -> 17.6678, 23964
active voxels) and to each other. README updated to describe the driver + the
three backends.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant