NanoVDB Python: GPU/device bindings (`nanovdb.cuda`) by swahtz · Pull Request #2225 · AcademySoftwareFoundation/openvdb

swahtz · 2026-06-02T06:26:32Z

Builds on feature/nanovdb_python (the CPU api-mirror epoch) to add the GPU/device half of the NanoVDB Python bindings: a real nanovdb.cuda namespace, zero-copy interop, streams, device-side managers, the tools.cuda algorithm surface, and cuda::* infrastructure.

(Note: This is a PR into the feature/nanovdb_python branch and #2219 is a PR of that branch into main. This is to make this PR easier to read so that it only has the deltas of the CUDA bindings. The idea is we review/approve both and then merge this PR into the feature branch and #2219 can then merge into main.)

Example (CuPy)

For functionality (beyond the functionality this PR provides) that needs to be compiled as a kernel for speed, you can now use CuPy to write custom kernels that get JIT compiled and can i/o buffers from NanoVDB Python which is kind of neato.

Build a grid on the device, view its buffer zero-copy as a CuPy array, then run your own CUDA kernel over it — compiled against the headers the wheel ships:

import cupy as cp
import nanovdb

# Build a float level-set sphere directly on the device, then upload it.
handle = nanovdb.tools.cuda.createLevelSetSphere(nanovdb.GridType.Float, 20)
handle.deviceUpload(0, True)            # raw int stream handle; 0 == default stream

# Zero-copy: view the whole device buffer as a CuPy array (no copy).
buf = cp.asarray(handle)                # via __cuda_array_interface__ / __dlpack__
assert int(buf.data.ptr) == handle.device_ptr()

# Run a custom kernel over the device grid. compile_options() supplies the
# -I for the NanoVDB headers bundled in the wheel.
device_grid = handle.deviceGrid(0)      # raw NanoGrid<float>* in GPU memory
kernel = cp.RawKernel(r'''
#include <nanovdb/NanoVDB.h>
extern "C" __global__
void inspect(const nanovdb::NanoGrid<float>* grid, float* out) {
    auto acc = grid->getAccessor();
    out[0] = acc.getValue(nanovdb::Coord(0, 0, 0));
    out[1] = float(grid->activeVoxelCount());
}
''', "inspect", options=nanovdb.cuda.compile_options("-std=c++17"), backend="nvrtc")

out = cp.zeros(2, dtype=cp.float32)
kernel((1,), (1,), (device_grid.data_ptr(), out.data.ptr))
cp.cuda.runtime.deviceSynchronize()
print(out)   # -> [-3.0, 30254.0]  (value@origin, activeVoxelCount)

nanovdb.io.deviceReadGrid("grid.nvdb") reads a grid straight onto the device the same way. See examples/ for the runnable versions (gpu_load_inspect.py, cupy_rawkernel.py) and the examples README for the full GPU walkthrough.

What's new

nanovdb.cuda (infrastructure)

DeviceBuffer / DeviceGridHandle moved from the root module into nanovdb.cuda (breaking — see Migration), now with __cuda_array_interface__ v3, __dlpack__/__dlpack_device__, device_ptr/host_ptr, from_external (wrap caller-owned device/managed memory) and from_buffer.
Stream-aware deviceUpload/deviceDownload (raw int stream handles).
UnifiedBuffer/UnifiedGridHandle (managed memory), DeviceMesh, DeviceStreamMap, DeviceResource, TempDevicePool.
Device createNodeManager and buildVoxelBlockManager (with device firstLeafID()/jumpMap() views).

nanovdb.tools.cuda (algorithms) — all take an optional stream

Generalized pointsToGrid (world-space float/double → Point) alongside the voxel rasterizers (voxelsTo{OnIndex,Index,RGBA8}Grid, pointsToRGBA8Grid).
Morphology/topology: dilateGrid/coarsenGrid/refineGrid/pruneGrid/mergeGrids.
Index/QC: indexToGrid, addBlindData, updateGridStats, evalChecksum/validateChecksum/updateChecksum, isValid, signedFloodFill, sampleFromVoxels.
DistributedPointsToGrid (multi-GPU; single-GPU path exercised here).
tools.cuda.inject / injectPredicateToMask / injectGridMask — bind the sidecar-injection operators from util/cuda/Injection.cuh (the NanoVDB 2.0 paper's §3.4 injectData): copy per-voxel data (scalar or, via a 2-D overload, vector-valued) across a topology change over the src∩dst intersection, and build the per-leaf Mask<3> that pruneGrid consumes.
tools.cuda.gatherBoxStencil / activeVoxelCoords — dense decode of the VBM stencil pipeline (decodeInverseMaps + computeBoxStencil are device-only): gatherBoxStencil writes each active voxel's 3×3×3 neighbour values to a (valueCount, 27) device array (centre 13; faces 4/10/12/14/16/22), activeVoxelCoords writes index-space positions to (valueCount, 3) — so dense array/tile frameworks (CuPy, cuTile) can run VDB stencils, and bake results back, without a hand-written decode kernel.

Custom-kernel interop — grid.data_ptr() exposes the device grid pointer and nanovdb.cuda.compile_options() returns the NVRTC include flags, e.g. a CuPy RawKernel that #include <nanovdb/NanoVDB.h> and reads a const NanoGrid<float>*.

Tests / examples / docs — TestGpuInterop.py, a 37-test CuPy/Torch-guarded suite (self-skips without CUDA/GPU) wired into ctest; four runnable examples (gpu_load_inspect, cupy_rawkernel, numba_cuda, triton_kernel); and a GPU section in the examples README.

Build/CI — nanobind bumped to 2.12.0 scoped to the NanoVDB CI job + the wheel (pyproject.toml pinned >=2.12.0,<3). The other install sites and FUTURE_MINIMUM_NANOBIND_VERSION stay at 2.5.0, and the binding code remains ≥2.5.0 source-compatible (the shared build.yml co-compiles it against 2.5.0).

examples/levelset_filter.py + three backends — one full GPU level-set filter (Laplacian deform + Godunov reinit + narrow-band retrack via the VBM and dilateGrid/inject/injectPredicateToMask/pruneGrid, style-preserving over FloatGrid / OnIndexGrid+blind-SDF .nvdb files). The driver levelset_filter.py owns everything that isn't backend-specific (read / write / retrack / outer loop) behind a small Backend protocol and a {rawkernel|cupy|cutile} selector; each backend supplies only the dense per-voxel stencils and also runs standalone as a stencil-only smoke test. The three differ only in that compute: levelset_filter_rawkernel.py (a fused cupy.RawModule CUDA kernel), levelset_filter_cupy.py (kernel-free CuPy array ops), and levelset_filter_cutile.py (NVIDIA cuTile @ct.kernels) — all producing identical results.

Also included

A core fix in tools/cuda/GridChecksum.cuh: both evalChecksum overloads copied a 256-entry-LUT-sized region back into the 4-byte checksum field (OOB read → runtime abort); corrected to sizeof(uint32_t). Independent of the Python work, but tools.cuda.evalChecksum aborts without it.

Validation

Validated locally on a Blackwell (sm_120) GPU with CuPy: CAI/DLPack zero-copy round-trips, streams, from_external, device NodeManager/VBM, the rasterizer/morphology/QC ops, and the managed-memory distributed path; nanobind 2.12.0 confirmed to compile both nanovdb_python and openvdb_python. CI does not exercise GPU paths — ctest excludes cuda/mgpu and the Python GPU tests self-skip via isCudaAvailable()/isGpuAvailable().

…andle seam First step of the GPU/device bindings epoch (issue AcademySoftwareFoundation#2208 follow-on). - Create a real root `nanovdb.cuda` submodule (mirrors C++ nanovdb::cuda) and MOVE DeviceBuffer + DeviceGridHandle into it. Breaking, pre-1.0 relocation: `nanovdb.DeviceBuffer` -> `nanovdb.cuda.DeviceBuffer`, `nanovdb.DeviceGridHandle` -> `nanovdb.cuda.DeviceGridHandle`. `nanovdb.tools.cuda` is untouched. No in-repo call site named the moved classes (handles come from io.deviceReadGrid / tools.cuda.create*), so only a migration note in python/examples/README.md was needed. - Add a BufferT-templated seam in PyDeviceBuffer.h: defineDeviceBufferLike <BufferT>() + addDeviceInterop<BufferT>() keyed only on the duck-typed buffer surface (size()/data()/deviceData()), so a future resource-backed buffer or UnifiedBuffer registers via one call. Phase A binds only the additive size() accessor; CAI/DLPack/pointers/streams land in Phase B. - Expose nanovdb.cuda.compile_options(*extra) via __init__.py, hasattr- guarded so non-CUDA builds still import. Compatible with nanobind >= 2.5.0 (no newer-only APIs). Verified by per-TU g++/nvcc compile checks of the changed TUs; runtime import is exercised by the shared build.yml pytest target. Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

Fills the Phase-A addDeviceInterop<BufferT> seam and the device GridHandle with the GPU interop surface: - __cuda_array_interface__ v3 (read-only dict, whole device buffer as 1-D uint8) on DeviceBuffer and DeviceGridHandle. - __dlpack__ / __dlpack_device__ on both: __dlpack__ builds an nb::ndarray<device::cuda,uint8> view parented to the owner (keep_alive) and returns nanobind's "dltensor" capsule directly — nanobind owns the capsule/deleter/alignment; no hand-built DLManagedTensor. - Raw pointers: DeviceBuffer.device_ptr()/host_ptr(), DeviceGridHandle .device_ptr(), and a generic grid.data_ptr() (host or device per provenance — grid(n) vs deviceGrid(n)). - stream:int=0 threaded through deviceUpload/deviceDownload (was hardcoded nullptr), via the current-device (void*,bool) overload so the targeted device matches deviceData(). - DeviceBuffer.from_external(size, gpu_ptr, cpu_ptr) (non-owning wrap of external host+device memory; mManaged==0 so neither ptr is freed) and DeviceGridHandle.from_buffer(buffer) (moves the buffer; ctor validates the grid header). Device-only array protocols live ONLY on DeviceBuffer/DeviceGridHandle, never the shared NanoGrid<T> class (which also wraps host grids). Two correctness fixes from adversarial review folded in: __dlpack__ now returns the cast capsule directly (calling .attr("__dlpack__") on it would AttributeError, since nb::cast of a no-framework device ndarray already IS the capsule); deviceUpload/deviceDownload use the current-device overload instead of hardcoding device 0. Known limitation: from_external requires BOTH a host and device pointer because nanovdb::cuda::DeviceBuffer's external ctor asserts both non-null; wrapping a device-only CuPy pointer would need a device-only external ctor in the core (follow-up / resource-backed buffer). Compatible with nanobind >= 2.5.0. Verified by per-TU g++/nvcc compile checks. Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

…BlockManager - nanovdb.cuda.createDeviceNodeManager(device_grid, stream=0) + a DeviceNodeManagerHandle (over NodeManagerHandle<cuda::DeviceBuffer>): polymorphic over BuildTypes.def; takes a device grid (from DeviceGridHandle.deviceGrid(n)); mgr() returns the typed device NodeManager (kernel-only `this`). New cuda/PyDeviceNodeManager.cu. - nanovdb.tools.cuda.buildVoxelBlockManager(device_onindex_grid, log2_block_width=6, ..., stream=0) + a DeviceVoxelBlockManagerHandle whose firstLeafID() (uint32) / jumpMap() (uint64) are zero-copy DEVICE arrays exposing the Phase-B CAI/DLPack interop. New cuda/PyDeviceVoxelBlockManager.cu. The __device__ decodeInverseMaps is intentionally not bound (kernel-only; use the shipped headers). Four adversarial-review/runtime fixes folded in (all validated on the GPU): 1. Dropped nb::keep_alive<0,1> on the device firstLeafID()/jumpMap() views — the returned no-framework device ndarray is a DLPack capsule (not weak-referenceable), so keep_alive threw "could not create a weak reference"; the ndarray's py_self owner already anchors lifetime. 2. DeviceNodeManagerHandle mgr()/__bool__ now gate on size(), not the host data() pointer (which is null for a device-only NodeManager). 3. Dropped DeviceNodeManagerHandle.deviceUpload/deviceDownload — they null-deref on a device-built NodeManager (no host mirror); the handle is created device-resident, nothing to upload. 4. Device VoxelBlockManager registered on the existing nanovdb.tools.cuda (mirrors nanovdb::tools::cuda), not a bogus new nanovdb.cuda.tools. Compatible with nanobind >= 2.5.0. Verified by per-TU compile + a real incremental build + GPU runtime smoke (createDeviceNodeManager, device VBM build, cupy CAI/DLPack round-trips). Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

…params - Generalize the CUDA points->grid surface beyond Rgba8 (nanovdb.tools.cuda): - Coordinate input (int32 (N,3) voxel coords) -> grid via voxelsToGrid<T> for Rgba8 / ValueOnIndex / ValueIndex (voxelsToRGBA8Grid / voxelsToOnIndexGrid / voxelsToIndexGrid). Legacy pointsToRGBA8Grid kept (routes to voxelsToGrid<Rgba8>), so existing callers/tests are unchanged. - World-position input ((N,3) float OR double) -> NanoGrid<Point> via a new pointsToGrid (both precisions under one name; nanobind dtype dispatch). - Point is intentionally NOT offered on the coord path: PointsToGrid<Point> static_asserts Vec3f/Vec3d coords, so Point grids build from world positions only (documented). - Add stream:int=0 (reinterpret_cast<cudaStream_t>) to every CUDA-launching binding — pointsToGrid/voxelsToGrid, signedFloodFill, both sampleFromVoxels overloads — with nb::gil_scoped_release around each kernel launch. Stream is a trailing default kwarg, so existing call sites are unaffected. - World path constructs PointsToGrid<Point>(voxelSize, ...) directly rather than the fixed-voxelSize free function, which is declared-but-undefined in the header (would link-fail). Returns DeviceGridHandle (Phase-A/B device interop). Compatible with nanobind >= 2.5.0. Verified by per-TU compile + real incremental build + GPU runtime smoke (world float32/float64 -> Point grids, coord -> OnIndex/Index/ RGBA8, non-default streams) and the full Python suite (140/140 pass). Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

Bind 5 of the 6 tools::cuda morphology/topology grid operators into nanovdb.tools.cuda, one Py*.cu + .h per header: - dilateGrid(d_grid, op=26, stream=0) (NN_FACE=6 / NN_FACE_EDGE_VERTEX=26) - coarsenGrid(d_grid, stream=0) (2x topological downsample) - refineGrid(d_grid, stream=0) (2x topological upsample) - pruneGrid(d_grid, leafMask, stream=0) (leafMask: device uint64 array reinterpreted as Mask<3>* per leaf; word count validated as a multiple of 8) - mergeGrids(d_grid1, d_grid2, stream=0) (binary active-mask union; chain for >2) All take a device grid (NanoGrid<ValueOnIndex>* from deviceGrid(n)) + a stream:int=0, run getHandle() under nb::gil_scoped_release, and return a DeviceGridHandle (Phase-A/B CAI/DLPack). Instantiated for ValueOnIndex ONLY: every op embeds TopologyBuilder<BuildT>, which static_asserts is_onindex. TopologyBuilder itself is intentionally NOT bound — it is the low-level multi-stage engine (ordered allocate/count/process* calls, no one-shot getHandle), not a user-facing op. Each .cu drops `using namespace nanovdb;` and fully-qualifies nanovdb:: to avoid a CUB DeviceScan vs nanovdb::cuda 'reference to cuda is ambiguous' error (same fix PyPointsToGrid.cu already uses). Compatible with nanobind >= 2.5.0. Verified by per-TU compile + real incremental build + GPU runtime smoke: all 5 ops run on a device OnIndex grid with correct active-voxel deltas (dilate grows, coarsen ~8x shrink, refine ~8x grow; prune + binary merge succeed). Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

Both evalChecksum overloads (GridData* and NanoGrid<BuildT>*) copied the computed head/tail CRC from device to host with size `headSize` (sizeof(GridData)+sizeof(TreeData), ~672 bytes) into the 4-byte Checksum::head()/tail() uint32_t destinations, reading ~672 bytes from `d_lut.get()+256` — one uint32_t past the 1024-byte CRC LUT. On a real device this trips "CUDA error 1: invalid argument" and cudaCheck aborts. The sibling validateChecksum copies the CRC correctly with sizeof(uint32_t); match it. The full-GridData header copies (which legitimately use headSize) are unchanged. Surfaced by the new nanovdb.tools.cuda.evalChecksum Python binding; verified fixed on a Blackwell GPU (evalChecksum returns a Checksum for CheckMode.Partial and Full instead of aborting). Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

Bind into nanovdb.tools.cuda (one Py*.cu + .h per header; stream:int=0 + nb::gil_scoped_release; fully-qualified nanovdb:: per the CUB ambiguity fix): - indexToGrid(index_grid, values, stream=0) -> DeviceGridHandle: device ValueIndex/ValueOnIndex grid + a device values ndarray -> typed grid. Scalar (float/double, flat 1-D) and Vec3 (Vec3f/Vec3d, (N,3) c_contig) destinations. Dst restricted to non-special types (processLeafsKernel static_assert !is_special); Src restricted to index types (SFINAE). - addBlindData(grid, blind_data, class, semantic, name, stream=0) -> DeviceGridHandle: appends a device blind-data channel (valueCount from the array length). Reuses existing GridBlindDataClass/Semantic enums. - Device QC mirroring the host names on tools.cuda: - updateGridStats(grid, mode=StatsMode.Default, stream=0) -> None - isValid(grid, mode=CheckMode.Default, verbose=False, stream=0) -> bool - evalChecksum/validateChecksum/updateChecksum(grid, mode, stream=0) (device grid reinterpreted as GridData*; the GridData* overloads only copy the header host-side). updateGridStats bound for scalar/vector/bool BuildTs (special/index/mask trip Extrema/Stats static_asserts); isValid + checksum bound across the full 20-type set. Include ordering: the QC .cu files include <nanovdb/cuda/GridHandle.cuh> before the tool .cuh (the tool headers aren't self-contained for the device GridHandle's updateChecksum/NodeManager uses), matching unittest/TestNanoVDB.cu. Compatible with nanobind >= 2.5.0. Verified by per-TU compile + real incremental build + GPU runtime smoke (updateGridStats/isValid/evalChecksum/ validateChecksum/updateChecksum, indexToGrid float+OnIndex, addBlindData). Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

…PointsToGrid On nanovdb.cuda (NanoVDBModule.cc): - UnifiedBuffer: registered via the Phase-A defineDeviceBufferLike seam, so it gets CAI/DLPack/device_ptr/host_ptr/size for free (managed memory: host_ptr == device_ptr). Plus ctors, capacity/resize/clear, advise, prefetch, deviceUpload/deviceDownload. - UnifiedGridHandle (GridHandle<UnifiedBuffer>): needed because DistributedPointsToGrid.getHandle returns one; full GridHandle surface + deviceGrid/deviceUpload/deviceDownload/device_ptr. - DeviceMesh (+ DeviceNode), DeviceStreamMap (+ DeviceType enum): host-query surface for multi-device/stream coordination. - DeviceResource (static allocateAsync/deallocateAsync + DEFAULT_ALIGNMENT) and TempDevicePool (TempPool<DeviceResource>) bound in their CURRENT shape only — not reshaped into the future resource concept. On nanovdb.tools.cuda (PyTools.cc): - DistributedPointsToGrid / DistributedIndexPointsToGrid / DistributedRGBA8PointsToGrid (ValueOnIndex / ValueIndex / Rgba8): ctor (DeviceMesh + scale/translation, mesh kept alive via keep_alive) + getHandle((N,3) int32 coords) -> UnifiedGridHandle. getHandle requires nb::device::cuda_managed input: the C++ pipeline issues cudaMemAdvise/memPrefetchAsync directly on the pointer, so the array MUST be CUDA managed/unified memory. (Adversarial review caught that the initial nb::device::cuda tag made it uncallable — managed input rejected, plain device input crashed; fixed to cuda_managed.) Forward-compat guardrail honored: none of the reserved future names (Resource/MemoryResource/AsyncResource/PinnedResource/ResourceDeviceBuffer/ DeviceBuffer2/default_resource/set_default_resource) are bound. NCCL-gated and __device__-only internals left unbound. Compatible with nanobind >= 2.5.0. Verified by concurrent per-TU compile + real incremental build + GPU runtime smoke: UnifiedBuffer zero-copy CAI/DLPack round-trip, DeviceMesh/DeviceStreamMap construct, and all three Distributed*PointsToGrid build OnIndex/Index/RGBA8 grids from managed coords on a single GPU (plain-device input cleanly rejected). Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

…examples, and docs Adds the test/example/documentation surface for the NanoVDB Python GPU bindings landed in the preceding phases. No C++ binding changes. - test/TestGpuInterop.py: stdlib-unittest suite (37 tests) mirroring TestNanoVDB.py gating — every case is class-guarded on isCudaAvailable() and isGpuAvailable(), CuPy is guarded per-test, and Torch via try-import + skipTest, so the suite self-skips cleanly on a non-CUDA/no-GPU build. Covers CAI v3 dict + DLPack round-trips aliasing device_ptr (nbytes==size), from_external (managed wrap + null rejection), from_buffer move/reject, non-default streams, the host-accessor-on-device-grid SIGSEGV contract (asserted in a subprocess so it cannot abort the runner), UnifiedBuffer, the generalized point/voxel rasterizers, device morphology/QC/flood-fill/ sampling, device NodeManager and VoxelBlockManager (firstLeafID/jumpMap device views), the cuda infrastructure types, and DistributedPointsToGrid over managed memory. - examples/{gpu_load_inspect,cupy_rawkernel,numba_cuda,triton_kernel}.py: runnable, import-guarded examples. cupy_rawkernel.py demonstrates the device-pointer ABI — a cp.RawKernel that #includes NanoVDB.h via compile_options() and reads a const NanoGrid<float>* from grid.data_ptr(). numba/triton examples self-skip with an install hint when absent. - examples/README.md: a GPU/CUDA section covering the nanovdb.cuda (infra) vs nanovdb.tools.cuda (algorithms) split, attribute-not-submodule access, the device-grid build recipe, zero-copy interop + device-pointer ABI, the host-accessor caveat, streams, from_external/from_buffer, the managed-memory distributed pipeline, compile_options for NVRTC, the relocation migration note, and a forward-looking pluggable-memory-resources roadmap. - python/CMakeLists.txt: registers TestGpuInterop.py as its own ctest test (pytest_nanovdb_gpu_interop) mirroring pytest_nanovdb; self-skips without a GPU so it is harmless in non-CUDA CI. Validated locally on a Blackwell GPU with CuPy: full suite green (1 torch skip), both required examples run end-to-end, TestNanoVDB.py shows no regression, and a CMake reconfigure confirms both ctest entries register. Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

…e wheel The NanoVDB Python GPU bindings (CUDA array interface, DLPack export, device interop) benefit from a newer nanobind than the repo-wide 2.5.0 floor. Scope the bump to the two places that build the published NanoVDB surface: - .github/workflows/nanovdb.yml: install nanobind 2.12.0 (latest 2.x) for the dedicated NanoVDB job. - pyproject.toml: pin the wheel's nanobind to >=2.12.0,<3 (it was previously unpinned and silently floated to latest, diverging from CI). The other seven install sites (build/houdini/docs/ax/weekly) and the global FUTURE_MINIMUM_NANOBIND_VERSION stay at 2.5.0: nanobind is a shared dependency and the co-compiled pyopenvdb module plus the shared build.yml still target 2.5.0, so the binding code remains >=2.5.0 source-compatible. The <3 ceiling keeps the wheel on the 2.x series. Verified locally on a Blackwell GPU: nanobind 2.12.0 (installed to a separate prefix, CMake-resolved version confirmed 2.12.0) compiles both nanovdb_python and openvdb_python with no errors and no deprecation warnings. Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

Expose NanoVDB's sidecar-injection operators (util/cuda/Injection.cuh) on the nanovdb.tools.cuda submodule. These carry a grid's per-voxel data across a topology change (dilate/prune/merge), which the morphology ops alone do not do — the OnIndex value-index space changes, so values must be re-injected. - inject(src_grid, dst_grid, src_sidecar, dst_sidecar, stream=0): copies sidecar values for every voxel present in BOTH grids (the intersection); destination voxels with no source counterpart are left unchanged. Wraps InjectGridDataFunctor via util::cuda::operatorKernel (one block per source leaf, bit-parallel per warp). Bound for float and double sidecars. - injectPredicateToMask(grid, predicate, leaf_masks, stream=0): turns a boolean predicate over a grid's value indices into the per-leaf Mask<3> retain mask that pruneGrid consumes, so a value-driven trim (e.g. |phi| <= halfWidth) is two device ops with no hand-rolled kernel. Wraps InjectPredicateToMaskFunctor. leaf_masks need only be >= (leaf count)*8 uint64; activeVoxelCount*8 is a safe size since every leaf holds at least one active voxel. This is the GPU realization of the injectData abstraction in the NanoVDB 2.0 paper (section 3.4), and completes the dilate -> inject -> prune narrow-band rebuild loop entirely with bound device ops. Validated locally on a Blackwell GPU: inject copies exactly the src/dst intersection (leaving the rest, including the background slot, untouched); injectPredicateToMask + pruneGrid trims to exactly the predicate-true voxel set; a second inject carries values onto the pruned grid; bad grid arguments raise. Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

…section mask) Completes the util/cuda/Injection.cuh surface on nanovdb.tools.cuda: - inject(...) gains a vector-valued overload accepting 2-D (value count, dim) device sidecars, wrapping InjectGridFeatureFunctor. nanobind dispatches on ndim, so the same `inject` name handles both scalar (1-D) and feature (2-D) sidecars; the feature dimension is taken from shape[1]. - injectGridMask(src_grid, dst_grid, leaf_masks, stream=0): builds a per-leaf Mask<3> over the destination grid marking the voxels also active in the source (the src/dst intersection), ready to feed pruneGrid. Wraps InjectGridMaskFunctor via util::cuda::lambdaKernel. Like injectPredicateToMask, leaf_masks need only be >= (dst leaf count)*8 uint64. All four Injection.cuh functors are now exposed (inject scalar + feature, injectPredicateToMask, injectGridMask). Validated locally on a Blackwell GPU: the feature overload copies the intersection rows (leaving the rest, including the background row, untouched); injectGridMask + pruneGrid recovers exactly the source/destination intersection; the 1-D scalar inject overload still dispatches correctly. Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

…eable owner getBlindData() constructed its NumPy view with the generic nb::ndarray<T, ...> type (no framework tag) owned by the grid, then returned it with rv_policy::reference. Without the nb::numpy tag nanobind takes a keep-alive path that needs the owner (the grid) to be weak-referenceable; grid classes are not, so every call raised "nb::detail::keep_alive(): could not create a weak reference!" and the function was unusable. Tag the views as nb::ndarray<nb::numpy, T, ...> (1-D and 2-D), matching the working zero-copy views elsewhere (LeafNode.values(), the VoxelBlockManager firstLeafID()/jumpMap() host views). nanobind then attaches the owner as the NumPy array's base (a strong reference, no weakref), and the existing keep_alive<0,1> on the def keeps the grid — and the GridHandle that owns the buffer — alive for the view's lifetime. Verified locally: grid.getBlindData(0) on an OnIndex grid with a float SDF channel returns the expected (valueCount,) float32 array, and the view stays valid after the grid handle is dropped and garbage-collected. Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

A full GPU level-set filter on a .nvdb file, driven by the device VoxelBlockManager, exercising the inject / injectPredicateToMask bindings and grid.getBlindData added in this branch. Each iteration runs a Laplacian-flow deform and a first-order Godunov reinitialisation as VBM stencil kernels (buildVoxelBlockManager + decodeInverseMaps/computeBoxStencil from a cupy.RawModule), then retracks the narrow band natively with dilateGrid -> inject -> injectPredicateToMask -> pruneGrid -> inject. Reads either a FloatGrid or an OnIndexGrid with the SDF in a blind channel (detected via gridType), normalising both to an OnIndex topology + a float SDF sidecar read with grid.getBlindData, and writes the result back in the same style. Run with no arguments it self-tests both forms (sphere shrinks under curvature flow); it self-skips when CUDA / a GPU / CuPy is unavailable. Listed in python/examples/README.md. Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

Materialise, for every active voxel of an OnIndex device grid, the values of its 3x3x3 neighbourhood into a dense (valueCount, 27) device array, so tile / array frameworks that cannot pointer-chase the VDB tree (CuPy, cuTile, ...) can run VDB stencils on a plain dense array. A transient VoxelBlockManager is built internally; each block decodes its inverse maps and calls the device computeBoxStencil, then writes the 27 neighbour values per voxel (inactive spokes read the sidecar's background slot, value index 0). Bound for float and double sidecars. Column j is the 3x3x3 spoke (di+1)*9+(dj+1)*3+(dk+1): centre j=13, the six faces j = 4, 10, 12, 14, 16, 22. This is the gather half of the VBM stencil pipeline (decodeInverseMaps + computeBoxStencil are device-only); pairing it with a dense sidecar lets the per-voxel arithmetic run in CuPy/cuTile without a hand-written CUDA kernel. Validated locally on a Blackwell GPU: with values[k]=k the centre column equals each voxel's own value index and all six opposite-face pairs are symmetric (k's +x neighbour's -x neighbour is k); float and double overloads agree. Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

…example The kernel-free counterpart to cupy_levelset_filter.py. Instead of calling the device VoxelBlockManager decode + computeBoxStencil from a hand-written cupy.RawModule kernel, it uses tools.cuda.gatherBoxStencil to materialise every active voxel's 3x3x3 neighbourhood into a dense (valueCount, 27) array, then runs the Laplacian-flow deform and first-order Godunov reinitialisation as plain CuPy array math -- no CUDA kernel, no thread indexing, no coordinates. (The same dense array is what a cuTile @ct.kernel would ct.load as tiles.) The sidecar's background slot is set to a sentinel so inactive neighbours are detectable in the gathered array, which lets the pure-CuPy code apply the same sign-consistent clamped boundary condition (copysign(background, phi_centre)) the kernel version uses. Fixed-topology stencils only; the narrow-band retrack and file I/O stay in cupy_levelset_filter.py. Requires only CuPy (no nvcc / NanoVDB headers, since nothing is compiled at runtime); self-skips otherwise. Verified on a Blackwell GPU: a few Laplacian steps raise mean||grad phi|-1| from ~0 to ~0.057, and the Godunov reinit brings it back to ~0.010 -- all computed kernel-free from gatherBoxStencil. Listed in python/examples/README.md. Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

…encil a full kernel-free filter Binding: activeVoxelCoords(device_grid, out) writes each active voxel's index-space coordinate into a dense (valueCount, 3) int32 array keyed by value index -- the decode companion to gatherBoxStencil. It is the VBM coordinate decode exposed as one bound op (transient VBM built internally), so callers can recover per-voxel positions without a hand-written decode kernel (e.g. to bake a sidecar result back into a grid). Example: with activeVoxelCoords closing the last gap (the write needed per-voxel coordinates), cupy_gather_stencil.py becomes the full kernel-free counterpart to cupy_levelset_filter.py -- the same .nvdb->.nvdb level-set filter (deform + Godunov reinit + narrow-band retrack, style-preserving) with no cupy.RawModule / CUDA kernel anywhere: * read : createOnIndexGrid + grid.getBlindData * deform : gatherBoxStencil -> dense (N,27); Laplacian in CuPy * renorm : same gather; first-order Godunov in CuPy * retrack: dilateGrid -> inject -> extrapolate (CuPy on the gather) -> |phi|<=halfWidth predicate (CuPy) -> injectPredicateToMask -> pruneGrid -> inject * write : activeVoxelCoords -> tools.build.FloatGrid, written in the input's style (FloatGrid, or OnIndex with the SDF in a blind channel) All per-voxel data lives in dense arrays (the shape a cuTile @ct.kernel would ct.load). Requires only CuPy -- no nvcc / NanoVDB headers, since nothing is compiled at runtime. Validated on a Blackwell GPU: activeVoxelCoords reproduces the exact active coordinate set; the kernel-free filter self-test matches the RawModule version (FloatGrid sphere 20.02 -> 19.52, OnIndex sphere 18.05 -> 17.67) for both input styles, output style preserved. Listed in python/examples/README.md. Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

… trio to levelset_filter_<backend> The three GPU level-set filter examples are one and the same filter (tools::LevelSetFilter-style Laplacian deform + Godunov reinit + narrow-band retrack, driven by the device VoxelBlockManager, style- preserving over FloatGrid / OnIndexGrid+blind-SDF inputs). They differ only in how the dense per-voxel stencils are computed, so name them by that backend so they sort together and read as a set: cupy_levelset_filter.py -> levelset_filter_rawkernel.py (fused cupy.RawModule CUDA kernel) cupy_gather_stencil.py -> levelset_filter_cupy.py (kernel-free, plain CuPy array ops) (new) levelset_filter_cutile.py (NVIDIA cuTile tile kernels) levelset_filter_cutile.py is levelset_filter_cupy.py with the per-voxel stencil math (deform / renorm / extrapolate) moved from CuPy array ops into cuda.tile (`@ct.kernel`, ct.load/ct.store over (TILE,) tiles); the sparse half (gatherBoxStencil / activeVoxelCoords / inject / dilateGrid / pruneGrid) is shared. Imports of cupy / cuda.tile are guarded so the file imports without a GPU and the self-test skips cleanly. Docstrings cross-reference the trio; README groups them under one table framing them as the same filter in three compute backends. Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

…pluggable backends The three GPU level-set filter examples shared ~150 lines of identical code (the .nvdb read, the style-preserving write, the narrow-band retrack, the surface probe, the outer loop, the self-test) and differed only in the three per-voxel stencil functions. Split that common half into a runnable driver and reduce each backend to just its distinctive compute: levelset_filter.py -- driver: all shared I/O / retrack / loop, a small Backend protocol, and a `--backend` selector; runs the full file->file filter. levelset_filter_rawkernel.py -- Backend: the fused cupy.RawModule CUDA kernel levelset_filter_cupy.py -- Backend: kernel-free CuPy array ops levelset_filter_cutile.py -- Backend: NVIDIA cuTile tile kernels A backend implements setup / active_coords / laplacian / godunov / extrapolate plus a module-level make_backend(); the gather-based CuPy and cuTile backends share a GatherBackend base (gatherBoxStencil / activeVoxelCoords), while the rawkernel backend keeps its own VBM bookkeeping in the opaque per-grid context. Each backend file also runs standalone as a stencil-only smoke test (deform + renorm on a sphere, no file I/O and no retrack) of just that backend's math. Net -197 lines. Behaviour is unchanged: all three backends produce identical results to before (FloatGrid + OnIndex spheres shrink 18.0525 -> 17.6678, 23964 active voxels) and to each other. README updated to describe the driver + the three backends. Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

swahtz added 10 commits June 2, 2026 02:31

swahtz requested review from Idclip, apradhana, danrbailey, jmlait, kmuseth and richhones as code owners June 2, 2026 06:26

swahtz added the nanovdb label Jun 2, 2026

swahtz added 9 commits June 2, 2026 22:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NanoVDB Python: GPU/device bindings (`nanovdb.cuda`)#2225

NanoVDB Python: GPU/device bindings (`nanovdb.cuda`)#2225
swahtz wants to merge 19 commits into
AcademySoftwareFoundation:feature/nanovdb_pythonfrom
swahtz:feature/nanovdb_python_gpu

swahtz commented Jun 2, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

swahtz commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Example (CuPy)

What's new

Also included

Validation

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

swahtz commented Jun 2, 2026 •

edited

Loading