NanoVDB Python: GPU/device bindings (nanovdb.cuda)#2225
Open
swahtz wants to merge 19 commits into
Open
Conversation
…andle seam First step of the GPU/device bindings epoch (issue AcademySoftwareFoundation#2208 follow-on). - Create a real root `nanovdb.cuda` submodule (mirrors C++ nanovdb::cuda) and MOVE DeviceBuffer + DeviceGridHandle into it. Breaking, pre-1.0 relocation: `nanovdb.DeviceBuffer` -> `nanovdb.cuda.DeviceBuffer`, `nanovdb.DeviceGridHandle` -> `nanovdb.cuda.DeviceGridHandle`. `nanovdb.tools.cuda` is untouched. No in-repo call site named the moved classes (handles come from io.deviceReadGrid / tools.cuda.create*), so only a migration note in python/examples/README.md was needed. - Add a BufferT-templated seam in PyDeviceBuffer.h: defineDeviceBufferLike <BufferT>() + addDeviceInterop<BufferT>() keyed only on the duck-typed buffer surface (size()/data()/deviceData()), so a future resource-backed buffer or UnifiedBuffer registers via one call. Phase A binds only the additive size() accessor; CAI/DLPack/pointers/streams land in Phase B. - Expose nanovdb.cuda.compile_options(*extra) via __init__.py, hasattr- guarded so non-CUDA builds still import. Compatible with nanobind >= 2.5.0 (no newer-only APIs). Verified by per-TU g++/nvcc compile checks of the changed TUs; runtime import is exercised by the shared build.yml pytest target. Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Fills the Phase-A addDeviceInterop<BufferT> seam and the device GridHandle
with the GPU interop surface:
- __cuda_array_interface__ v3 (read-only dict, whole device buffer as 1-D
uint8) on DeviceBuffer and DeviceGridHandle.
- __dlpack__ / __dlpack_device__ on both: __dlpack__ builds an
nb::ndarray<device::cuda,uint8> view parented to the owner (keep_alive)
and returns nanobind's "dltensor" capsule directly — nanobind owns the
capsule/deleter/alignment; no hand-built DLManagedTensor.
- Raw pointers: DeviceBuffer.device_ptr()/host_ptr(), DeviceGridHandle
.device_ptr(), and a generic grid.data_ptr() (host or device per
provenance — grid(n) vs deviceGrid(n)).
- stream:int=0 threaded through deviceUpload/deviceDownload (was hardcoded
nullptr), via the current-device (void*,bool) overload so the targeted
device matches deviceData().
- DeviceBuffer.from_external(size, gpu_ptr, cpu_ptr) (non-owning wrap of
external host+device memory; mManaged==0 so neither ptr is freed) and
DeviceGridHandle.from_buffer(buffer) (moves the buffer; ctor validates
the grid header).
Device-only array protocols live ONLY on DeviceBuffer/DeviceGridHandle,
never the shared NanoGrid<T> class (which also wraps host grids).
Two correctness fixes from adversarial review folded in: __dlpack__ now
returns the cast capsule directly (calling .attr("__dlpack__") on it would
AttributeError, since nb::cast of a no-framework device ndarray already IS
the capsule); deviceUpload/deviceDownload use the current-device overload
instead of hardcoding device 0.
Known limitation: from_external requires BOTH a host and device pointer
because nanovdb::cuda::DeviceBuffer's external ctor asserts both non-null;
wrapping a device-only CuPy pointer would need a device-only external ctor
in the core (follow-up / resource-backed buffer). Compatible with nanobind
>= 2.5.0. Verified by per-TU g++/nvcc compile checks.
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
…BlockManager - nanovdb.cuda.createDeviceNodeManager(device_grid, stream=0) + a DeviceNodeManagerHandle (over NodeManagerHandle<cuda::DeviceBuffer>): polymorphic over BuildTypes.def; takes a device grid (from DeviceGridHandle.deviceGrid(n)); mgr() returns the typed device NodeManager (kernel-only `this`). New cuda/PyDeviceNodeManager.cu. - nanovdb.tools.cuda.buildVoxelBlockManager(device_onindex_grid, log2_block_width=6, ..., stream=0) + a DeviceVoxelBlockManagerHandle whose firstLeafID() (uint32) / jumpMap() (uint64) are zero-copy DEVICE arrays exposing the Phase-B CAI/DLPack interop. New cuda/PyDeviceVoxelBlockManager.cu. The __device__ decodeInverseMaps is intentionally not bound (kernel-only; use the shipped headers). Four adversarial-review/runtime fixes folded in (all validated on the GPU): 1. Dropped nb::keep_alive<0,1> on the device firstLeafID()/jumpMap() views — the returned no-framework device ndarray is a DLPack capsule (not weak-referenceable), so keep_alive threw "could not create a weak reference"; the ndarray's py_self owner already anchors lifetime. 2. DeviceNodeManagerHandle mgr()/__bool__ now gate on size(), not the host data() pointer (which is null for a device-only NodeManager). 3. Dropped DeviceNodeManagerHandle.deviceUpload/deviceDownload — they null-deref on a device-built NodeManager (no host mirror); the handle is created device-resident, nothing to upload. 4. Device VoxelBlockManager registered on the existing nanovdb.tools.cuda (mirrors nanovdb::tools::cuda), not a bogus new nanovdb.cuda.tools. Compatible with nanobind >= 2.5.0. Verified by per-TU compile + a real incremental build + GPU runtime smoke (createDeviceNodeManager, device VBM build, cupy CAI/DLPack round-trips). Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
…params
- Generalize the CUDA points->grid surface beyond Rgba8 (nanovdb.tools.cuda):
- Coordinate input (int32 (N,3) voxel coords) -> grid via voxelsToGrid<T>
for Rgba8 / ValueOnIndex / ValueIndex (voxelsToRGBA8Grid /
voxelsToOnIndexGrid / voxelsToIndexGrid). Legacy pointsToRGBA8Grid kept
(routes to voxelsToGrid<Rgba8>), so existing callers/tests are unchanged.
- World-position input ((N,3) float OR double) -> NanoGrid<Point> via a new
pointsToGrid (both precisions under one name; nanobind dtype dispatch).
- Point is intentionally NOT offered on the coord path: PointsToGrid<Point>
static_asserts Vec3f/Vec3d coords, so Point grids build from world
positions only (documented).
- Add stream:int=0 (reinterpret_cast<cudaStream_t>) to every CUDA-launching
binding — pointsToGrid/voxelsToGrid, signedFloodFill, both sampleFromVoxels
overloads — with nb::gil_scoped_release around each kernel launch. Stream is
a trailing default kwarg, so existing call sites are unaffected.
- World path constructs PointsToGrid<Point>(voxelSize, ...) directly rather
than the fixed-voxelSize free function, which is declared-but-undefined in
the header (would link-fail).
Returns DeviceGridHandle (Phase-A/B device interop). Compatible with
nanobind >= 2.5.0. Verified by per-TU compile + real incremental build + GPU
runtime smoke (world float32/float64 -> Point grids, coord -> OnIndex/Index/
RGBA8, non-default streams) and the full Python suite (140/140 pass).
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Bind 5 of the 6 tools::cuda morphology/topology grid operators into nanovdb.tools.cuda, one Py*.cu + .h per header: - dilateGrid(d_grid, op=26, stream=0) (NN_FACE=6 / NN_FACE_EDGE_VERTEX=26) - coarsenGrid(d_grid, stream=0) (2x topological downsample) - refineGrid(d_grid, stream=0) (2x topological upsample) - pruneGrid(d_grid, leafMask, stream=0) (leafMask: device uint64 array reinterpreted as Mask<3>* per leaf; word count validated as a multiple of 8) - mergeGrids(d_grid1, d_grid2, stream=0) (binary active-mask union; chain for >2) All take a device grid (NanoGrid<ValueOnIndex>* from deviceGrid(n)) + a stream:int=0, run getHandle() under nb::gil_scoped_release, and return a DeviceGridHandle (Phase-A/B CAI/DLPack). Instantiated for ValueOnIndex ONLY: every op embeds TopologyBuilder<BuildT>, which static_asserts is_onindex. TopologyBuilder itself is intentionally NOT bound — it is the low-level multi-stage engine (ordered allocate/count/process* calls, no one-shot getHandle), not a user-facing op. Each .cu drops `using namespace nanovdb;` and fully-qualifies nanovdb:: to avoid a CUB DeviceScan vs nanovdb::cuda 'reference to cuda is ambiguous' error (same fix PyPointsToGrid.cu already uses). Compatible with nanobind >= 2.5.0. Verified by per-TU compile + real incremental build + GPU runtime smoke: all 5 ops run on a device OnIndex grid with correct active-voxel deltas (dilate grows, coarsen ~8x shrink, refine ~8x grow; prune + binary merge succeed). Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Both evalChecksum overloads (GridData* and NanoGrid<BuildT>*) copied the computed head/tail CRC from device to host with size `headSize` (sizeof(GridData)+sizeof(TreeData), ~672 bytes) into the 4-byte Checksum::head()/tail() uint32_t destinations, reading ~672 bytes from `d_lut.get()+256` — one uint32_t past the 1024-byte CRC LUT. On a real device this trips "CUDA error 1: invalid argument" and cudaCheck aborts. The sibling validateChecksum copies the CRC correctly with sizeof(uint32_t); match it. The full-GridData header copies (which legitimately use headSize) are unchanged. Surfaced by the new nanovdb.tools.cuda.evalChecksum Python binding; verified fixed on a Blackwell GPU (evalChecksum returns a Checksum for CheckMode.Partial and Full instead of aborting). Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Bind into nanovdb.tools.cuda (one Py*.cu + .h per header; stream:int=0 +
nb::gil_scoped_release; fully-qualified nanovdb:: per the CUB ambiguity fix):
- indexToGrid(index_grid, values, stream=0) -> DeviceGridHandle: device
ValueIndex/ValueOnIndex grid + a device values ndarray -> typed grid.
Scalar (float/double, flat 1-D) and Vec3 (Vec3f/Vec3d, (N,3) c_contig)
destinations. Dst restricted to non-special types (processLeafsKernel
static_assert !is_special); Src restricted to index types (SFINAE).
- addBlindData(grid, blind_data, class, semantic, name, stream=0) ->
DeviceGridHandle: appends a device blind-data channel (valueCount from
the array length). Reuses existing GridBlindDataClass/Semantic enums.
- Device QC mirroring the host names on tools.cuda:
- updateGridStats(grid, mode=StatsMode.Default, stream=0) -> None
- isValid(grid, mode=CheckMode.Default, verbose=False, stream=0) -> bool
- evalChecksum/validateChecksum/updateChecksum(grid, mode, stream=0)
(device grid reinterpreted as GridData*; the GridData* overloads only
copy the header host-side).
updateGridStats bound for scalar/vector/bool BuildTs (special/index/mask
trip Extrema/Stats static_asserts); isValid + checksum bound across the
full 20-type set.
Include ordering: the QC .cu files include <nanovdb/cuda/GridHandle.cuh>
before the tool .cuh (the tool headers aren't self-contained for the
device GridHandle's updateChecksum/NodeManager uses), matching
unittest/TestNanoVDB.cu.
Compatible with nanobind >= 2.5.0. Verified by per-TU compile + real
incremental build + GPU runtime smoke (updateGridStats/isValid/evalChecksum/
validateChecksum/updateChecksum, indexToGrid float+OnIndex, addBlindData).
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
…PointsToGrid On nanovdb.cuda (NanoVDBModule.cc): - UnifiedBuffer: registered via the Phase-A defineDeviceBufferLike seam, so it gets CAI/DLPack/device_ptr/host_ptr/size for free (managed memory: host_ptr == device_ptr). Plus ctors, capacity/resize/clear, advise, prefetch, deviceUpload/deviceDownload. - UnifiedGridHandle (GridHandle<UnifiedBuffer>): needed because DistributedPointsToGrid.getHandle returns one; full GridHandle surface + deviceGrid/deviceUpload/deviceDownload/device_ptr. - DeviceMesh (+ DeviceNode), DeviceStreamMap (+ DeviceType enum): host-query surface for multi-device/stream coordination. - DeviceResource (static allocateAsync/deallocateAsync + DEFAULT_ALIGNMENT) and TempDevicePool (TempPool<DeviceResource>) bound in their CURRENT shape only — not reshaped into the future resource concept. On nanovdb.tools.cuda (PyTools.cc): - DistributedPointsToGrid / DistributedIndexPointsToGrid / DistributedRGBA8PointsToGrid (ValueOnIndex / ValueIndex / Rgba8): ctor (DeviceMesh + scale/translation, mesh kept alive via keep_alive) + getHandle((N,3) int32 coords) -> UnifiedGridHandle. getHandle requires nb::device::cuda_managed input: the C++ pipeline issues cudaMemAdvise/memPrefetchAsync directly on the pointer, so the array MUST be CUDA managed/unified memory. (Adversarial review caught that the initial nb::device::cuda tag made it uncallable — managed input rejected, plain device input crashed; fixed to cuda_managed.) Forward-compat guardrail honored: none of the reserved future names (Resource/MemoryResource/AsyncResource/PinnedResource/ResourceDeviceBuffer/ DeviceBuffer2/default_resource/set_default_resource) are bound. NCCL-gated and __device__-only internals left unbound. Compatible with nanobind >= 2.5.0. Verified by concurrent per-TU compile + real incremental build + GPU runtime smoke: UnifiedBuffer zero-copy CAI/DLPack round-trip, DeviceMesh/DeviceStreamMap construct, and all three Distributed*PointsToGrid build OnIndex/Index/RGBA8 grids from managed coords on a single GPU (plain-device input cleanly rejected). Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
…examples, and docs
Adds the test/example/documentation surface for the NanoVDB Python GPU
bindings landed in the preceding phases. No C++ binding changes.
- test/TestGpuInterop.py: stdlib-unittest suite (37 tests) mirroring
TestNanoVDB.py gating — every case is class-guarded on isCudaAvailable()
and isGpuAvailable(), CuPy is guarded per-test, and Torch via try-import +
skipTest, so the suite self-skips cleanly on a non-CUDA/no-GPU build.
Covers CAI v3 dict + DLPack round-trips aliasing device_ptr (nbytes==size),
from_external (managed wrap + null rejection), from_buffer move/reject,
non-default streams, the host-accessor-on-device-grid SIGSEGV contract
(asserted in a subprocess so it cannot abort the runner), UnifiedBuffer,
the generalized point/voxel rasterizers, device morphology/QC/flood-fill/
sampling, device NodeManager and VoxelBlockManager (firstLeafID/jumpMap
device views), the cuda infrastructure types, and DistributedPointsToGrid
over managed memory.
- examples/{gpu_load_inspect,cupy_rawkernel,numba_cuda,triton_kernel}.py:
runnable, import-guarded examples. cupy_rawkernel.py demonstrates the
device-pointer ABI — a cp.RawKernel that #includes NanoVDB.h via
compile_options() and reads a const NanoGrid<float>* from grid.data_ptr().
numba/triton examples self-skip with an install hint when absent.
- examples/README.md: a GPU/CUDA section covering the nanovdb.cuda (infra)
vs nanovdb.tools.cuda (algorithms) split, attribute-not-submodule access,
the device-grid build recipe, zero-copy interop + device-pointer ABI, the
host-accessor caveat, streams, from_external/from_buffer, the managed-memory
distributed pipeline, compile_options for NVRTC, the relocation migration
note, and a forward-looking pluggable-memory-resources roadmap.
- python/CMakeLists.txt: registers TestGpuInterop.py as its own ctest test
(pytest_nanovdb_gpu_interop) mirroring pytest_nanovdb; self-skips without
a GPU so it is harmless in non-CUDA CI.
Validated locally on a Blackwell GPU with CuPy: full suite green (1 torch
skip), both required examples run end-to-end, TestNanoVDB.py shows no
regression, and a CMake reconfigure confirms both ctest entries register.
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
…e wheel The NanoVDB Python GPU bindings (CUDA array interface, DLPack export, device interop) benefit from a newer nanobind than the repo-wide 2.5.0 floor. Scope the bump to the two places that build the published NanoVDB surface: - .github/workflows/nanovdb.yml: install nanobind 2.12.0 (latest 2.x) for the dedicated NanoVDB job. - pyproject.toml: pin the wheel's nanobind to >=2.12.0,<3 (it was previously unpinned and silently floated to latest, diverging from CI). The other seven install sites (build/houdini/docs/ax/weekly) and the global FUTURE_MINIMUM_NANOBIND_VERSION stay at 2.5.0: nanobind is a shared dependency and the co-compiled pyopenvdb module plus the shared build.yml still target 2.5.0, so the binding code remains >=2.5.0 source-compatible. The <3 ceiling keeps the wheel on the 2.x series. Verified locally on a Blackwell GPU: nanobind 2.12.0 (installed to a separate prefix, CMake-resolved version confirmed 2.12.0) compiles both nanovdb_python and openvdb_python with no errors and no deprecation warnings. Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Expose NanoVDB's sidecar-injection operators (util/cuda/Injection.cuh) on the nanovdb.tools.cuda submodule. These carry a grid's per-voxel data across a topology change (dilate/prune/merge), which the morphology ops alone do not do — the OnIndex value-index space changes, so values must be re-injected. - inject(src_grid, dst_grid, src_sidecar, dst_sidecar, stream=0): copies sidecar values for every voxel present in BOTH grids (the intersection); destination voxels with no source counterpart are left unchanged. Wraps InjectGridDataFunctor via util::cuda::operatorKernel (one block per source leaf, bit-parallel per warp). Bound for float and double sidecars. - injectPredicateToMask(grid, predicate, leaf_masks, stream=0): turns a boolean predicate over a grid's value indices into the per-leaf Mask<3> retain mask that pruneGrid consumes, so a value-driven trim (e.g. |phi| <= halfWidth) is two device ops with no hand-rolled kernel. Wraps InjectPredicateToMaskFunctor. leaf_masks need only be >= (leaf count)*8 uint64; activeVoxelCount*8 is a safe size since every leaf holds at least one active voxel. This is the GPU realization of the injectData abstraction in the NanoVDB 2.0 paper (section 3.4), and completes the dilate -> inject -> prune narrow-band rebuild loop entirely with bound device ops. Validated locally on a Blackwell GPU: inject copies exactly the src/dst intersection (leaving the rest, including the background slot, untouched); injectPredicateToMask + pruneGrid trims to exactly the predicate-true voxel set; a second inject carries values onto the pruned grid; bad grid arguments raise. Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
…section mask) Completes the util/cuda/Injection.cuh surface on nanovdb.tools.cuda: - inject(...) gains a vector-valued overload accepting 2-D (value count, dim) device sidecars, wrapping InjectGridFeatureFunctor. nanobind dispatches on ndim, so the same `inject` name handles both scalar (1-D) and feature (2-D) sidecars; the feature dimension is taken from shape[1]. - injectGridMask(src_grid, dst_grid, leaf_masks, stream=0): builds a per-leaf Mask<3> over the destination grid marking the voxels also active in the source (the src/dst intersection), ready to feed pruneGrid. Wraps InjectGridMaskFunctor via util::cuda::lambdaKernel. Like injectPredicateToMask, leaf_masks need only be >= (dst leaf count)*8 uint64. All four Injection.cuh functors are now exposed (inject scalar + feature, injectPredicateToMask, injectGridMask). Validated locally on a Blackwell GPU: the feature overload copies the intersection rows (leaving the rest, including the background row, untouched); injectGridMask + pruneGrid recovers exactly the source/destination intersection; the 1-D scalar inject overload still dispatches correctly. Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
…eable owner getBlindData() constructed its NumPy view with the generic nb::ndarray<T, ...> type (no framework tag) owned by the grid, then returned it with rv_policy::reference. Without the nb::numpy tag nanobind takes a keep-alive path that needs the owner (the grid) to be weak-referenceable; grid classes are not, so every call raised "nb::detail::keep_alive(): could not create a weak reference!" and the function was unusable. Tag the views as nb::ndarray<nb::numpy, T, ...> (1-D and 2-D), matching the working zero-copy views elsewhere (LeafNode.values(), the VoxelBlockManager firstLeafID()/jumpMap() host views). nanobind then attaches the owner as the NumPy array's base (a strong reference, no weakref), and the existing keep_alive<0,1> on the def keeps the grid — and the GridHandle that owns the buffer — alive for the view's lifetime. Verified locally: grid.getBlindData(0) on an OnIndex grid with a float SDF channel returns the expected (valueCount,) float32 array, and the view stays valid after the grid handle is dropped and garbage-collected. Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
A full GPU level-set filter on a .nvdb file, driven by the device VoxelBlockManager, exercising the inject / injectPredicateToMask bindings and grid.getBlindData added in this branch. Each iteration runs a Laplacian-flow deform and a first-order Godunov reinitialisation as VBM stencil kernels (buildVoxelBlockManager + decodeInverseMaps/computeBoxStencil from a cupy.RawModule), then retracks the narrow band natively with dilateGrid -> inject -> injectPredicateToMask -> pruneGrid -> inject. Reads either a FloatGrid or an OnIndexGrid with the SDF in a blind channel (detected via gridType), normalising both to an OnIndex topology + a float SDF sidecar read with grid.getBlindData, and writes the result back in the same style. Run with no arguments it self-tests both forms (sphere shrinks under curvature flow); it self-skips when CUDA / a GPU / CuPy is unavailable. Listed in python/examples/README.md. Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Materialise, for every active voxel of an OnIndex device grid, the values of its 3x3x3 neighbourhood into a dense (valueCount, 27) device array, so tile / array frameworks that cannot pointer-chase the VDB tree (CuPy, cuTile, ...) can run VDB stencils on a plain dense array. A transient VoxelBlockManager is built internally; each block decodes its inverse maps and calls the device computeBoxStencil, then writes the 27 neighbour values per voxel (inactive spokes read the sidecar's background slot, value index 0). Bound for float and double sidecars. Column j is the 3x3x3 spoke (di+1)*9+(dj+1)*3+(dk+1): centre j=13, the six faces j = 4, 10, 12, 14, 16, 22. This is the gather half of the VBM stencil pipeline (decodeInverseMaps + computeBoxStencil are device-only); pairing it with a dense sidecar lets the per-voxel arithmetic run in CuPy/cuTile without a hand-written CUDA kernel. Validated locally on a Blackwell GPU: with values[k]=k the centre column equals each voxel's own value index and all six opposite-face pairs are symmetric (k's +x neighbour's -x neighbour is k); float and double overloads agree. Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
…example The kernel-free counterpart to cupy_levelset_filter.py. Instead of calling the device VoxelBlockManager decode + computeBoxStencil from a hand-written cupy.RawModule kernel, it uses tools.cuda.gatherBoxStencil to materialise every active voxel's 3x3x3 neighbourhood into a dense (valueCount, 27) array, then runs the Laplacian-flow deform and first-order Godunov reinitialisation as plain CuPy array math -- no CUDA kernel, no thread indexing, no coordinates. (The same dense array is what a cuTile @ct.kernel would ct.load as tiles.) The sidecar's background slot is set to a sentinel so inactive neighbours are detectable in the gathered array, which lets the pure-CuPy code apply the same sign-consistent clamped boundary condition (copysign(background, phi_centre)) the kernel version uses. Fixed-topology stencils only; the narrow-band retrack and file I/O stay in cupy_levelset_filter.py. Requires only CuPy (no nvcc / NanoVDB headers, since nothing is compiled at runtime); self-skips otherwise. Verified on a Blackwell GPU: a few Laplacian steps raise mean||grad phi|-1| from ~0 to ~0.057, and the Godunov reinit brings it back to ~0.010 -- all computed kernel-free from gatherBoxStencil. Listed in python/examples/README.md. Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
…encil a full kernel-free filter
Binding: activeVoxelCoords(device_grid, out) writes each active voxel's
index-space coordinate into a dense (valueCount, 3) int32 array keyed by value
index -- the decode companion to gatherBoxStencil. It is the VBM coordinate
decode exposed as one bound op (transient VBM built internally), so callers can
recover per-voxel positions without a hand-written decode kernel (e.g. to bake a
sidecar result back into a grid).
Example: with activeVoxelCoords closing the last gap (the write needed
per-voxel coordinates), cupy_gather_stencil.py becomes the full kernel-free
counterpart to cupy_levelset_filter.py -- the same .nvdb->.nvdb level-set filter
(deform + Godunov reinit + narrow-band retrack, style-preserving) with no
cupy.RawModule / CUDA kernel anywhere:
* read : createOnIndexGrid + grid.getBlindData
* deform : gatherBoxStencil -> dense (N,27); Laplacian in CuPy
* renorm : same gather; first-order Godunov in CuPy
* retrack: dilateGrid -> inject -> extrapolate (CuPy on the gather) ->
|phi|<=halfWidth predicate (CuPy) -> injectPredicateToMask ->
pruneGrid -> inject
* write : activeVoxelCoords -> tools.build.FloatGrid, written in the input's
style (FloatGrid, or OnIndex with the SDF in a blind channel)
All per-voxel data lives in dense arrays (the shape a cuTile @ct.kernel would
ct.load). Requires only CuPy -- no nvcc / NanoVDB headers, since nothing is
compiled at runtime.
Validated on a Blackwell GPU: activeVoxelCoords reproduces the exact active
coordinate set; the kernel-free filter self-test matches the RawModule version
(FloatGrid sphere 20.02 -> 19.52, OnIndex sphere 18.05 -> 17.67) for both input
styles, output style preserved. Listed in python/examples/README.md.
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
… trio to levelset_filter_<backend> The three GPU level-set filter examples are one and the same filter (tools::LevelSetFilter-style Laplacian deform + Godunov reinit + narrow-band retrack, driven by the device VoxelBlockManager, style- preserving over FloatGrid / OnIndexGrid+blind-SDF inputs). They differ only in how the dense per-voxel stencils are computed, so name them by that backend so they sort together and read as a set: cupy_levelset_filter.py -> levelset_filter_rawkernel.py (fused cupy.RawModule CUDA kernel) cupy_gather_stencil.py -> levelset_filter_cupy.py (kernel-free, plain CuPy array ops) (new) levelset_filter_cutile.py (NVIDIA cuTile tile kernels) levelset_filter_cutile.py is levelset_filter_cupy.py with the per-voxel stencil math (deform / renorm / extrapolate) moved from CuPy array ops into cuda.tile (`@ct.kernel`, ct.load/ct.store over (TILE,) tiles); the sparse half (gatherBoxStencil / activeVoxelCoords / inject / dilateGrid / pruneGrid) is shared. Imports of cupy / cuda.tile are guarded so the file imports without a GPU and the self-test skips cleanly. Docstrings cross-reference the trio; README groups them under one table framing them as the same filter in three compute backends. Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
…pluggable backends
The three GPU level-set filter examples shared ~150 lines of identical code
(the .nvdb read, the style-preserving write, the narrow-band retrack, the
surface probe, the outer loop, the self-test) and differed only in the three
per-voxel stencil functions. Split that common half into a runnable driver and
reduce each backend to just its distinctive compute:
levelset_filter.py -- driver: all shared I/O / retrack / loop, a
small Backend protocol, and a `--backend`
selector; runs the full file->file filter.
levelset_filter_rawkernel.py -- Backend: the fused cupy.RawModule CUDA kernel
levelset_filter_cupy.py -- Backend: kernel-free CuPy array ops
levelset_filter_cutile.py -- Backend: NVIDIA cuTile tile kernels
A backend implements setup / active_coords / laplacian / godunov / extrapolate
plus a module-level make_backend(); the gather-based CuPy and cuTile backends
share a GatherBackend base (gatherBoxStencil / activeVoxelCoords), while the
rawkernel backend keeps its own VBM bookkeeping in the opaque per-grid context.
Each backend file also runs standalone as a stencil-only smoke test (deform +
renorm on a sphere, no file I/O and no retrack) of just that backend's math.
Net -197 lines. Behaviour is unchanged: all three backends produce identical
results to before (FloatGrid + OnIndex spheres shrink 18.0525 -> 17.6678, 23964
active voxels) and to each other. README updated to describe the driver + the
three backends.
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Builds on
feature/nanovdb_python(the CPU api-mirror epoch) to add the GPU/device half of the NanoVDB Python bindings: a realnanovdb.cudanamespace, zero-copy interop, streams, device-side managers, thetools.cudaalgorithm surface, andcuda::*infrastructure.(Note: This is a PR into the
feature/nanovdb_pythonbranch and #2219 is a PR of that branch intomain. This is to make this PR easier to read so that it only has the deltas of the CUDA bindings. The idea is we review/approve both and then merge this PR into the feature branch and #2219 can then merge intomain.)Example (CuPy)
For functionality (beyond the functionality this PR provides) that needs to be compiled as a kernel for speed, you can now use CuPy to write custom kernels that get JIT compiled and can i/o buffers from NanoVDB Python which is kind of neato.
Build a grid on the device, view its buffer zero-copy as a CuPy array, then run your own CUDA kernel over it — compiled against the headers the wheel ships:
nanovdb.io.deviceReadGrid("grid.nvdb")reads a grid straight onto the device the same way. Seeexamples/for the runnable versions (gpu_load_inspect.py,cupy_rawkernel.py) and the examples README for the full GPU walkthrough.What's new
nanovdb.cuda(infrastructure)DeviceBuffer/DeviceGridHandlemoved from the root module intonanovdb.cuda(breaking — see Migration), now with__cuda_array_interface__v3,__dlpack__/__dlpack_device__,device_ptr/host_ptr,from_external(wrap caller-owned device/managed memory) andfrom_buffer.deviceUpload/deviceDownload(raw int stream handles).UnifiedBuffer/UnifiedGridHandle(managed memory),DeviceMesh,DeviceStreamMap,DeviceResource,TempDevicePool.createNodeManagerandbuildVoxelBlockManager(with devicefirstLeafID()/jumpMap()views).nanovdb.tools.cuda(algorithms) — all take an optionalstreampointsToGrid(world-space float/double → Point) alongside the voxel rasterizers (voxelsTo{OnIndex,Index,RGBA8}Grid,pointsToRGBA8Grid).dilateGrid/coarsenGrid/refineGrid/pruneGrid/mergeGrids.indexToGrid,addBlindData,updateGridStats,evalChecksum/validateChecksum/updateChecksum,isValid,signedFloodFill,sampleFromVoxels.DistributedPointsToGrid(multi-GPU; single-GPU path exercised here).tools.cuda.inject/injectPredicateToMask/injectGridMask— bind the sidecar-injection operators fromutil/cuda/Injection.cuh(the NanoVDB 2.0 paper's §3.4injectData): copy per-voxel data (scalar or, via a 2-D overload, vector-valued) across a topology change over the src∩dst intersection, and build the per-leafMask<3>thatpruneGridconsumes.tools.cuda.gatherBoxStencil/activeVoxelCoords— dense decode of the VBM stencil pipeline (decodeInverseMaps+computeBoxStencilare device-only):gatherBoxStencilwrites each active voxel's 3×3×3 neighbour values to a(valueCount, 27)device array (centre 13; faces 4/10/12/14/16/22),activeVoxelCoordswrites index-space positions to(valueCount, 3)— so dense array/tile frameworks (CuPy, cuTile) can run VDB stencils, and bake results back, without a hand-written decode kernel.Custom-kernel interop —
grid.data_ptr()exposes the device grid pointer andnanovdb.cuda.compile_options()returns the NVRTC include flags, e.g. a CuPyRawKernelthat#include <nanovdb/NanoVDB.h>and reads aconst NanoGrid<float>*.Tests / examples / docs —
TestGpuInterop.py, a 37-test CuPy/Torch-guarded suite (self-skips without CUDA/GPU) wired into ctest; four runnable examples (gpu_load_inspect,cupy_rawkernel,numba_cuda,triton_kernel); and a GPU section in the examples README.Build/CI — nanobind bumped to 2.12.0 scoped to the NanoVDB CI job + the wheel (
pyproject.tomlpinned>=2.12.0,<3). The other install sites andFUTURE_MINIMUM_NANOBIND_VERSIONstay at 2.5.0, and the binding code remains ≥2.5.0 source-compatible (the sharedbuild.ymlco-compiles it against 2.5.0).examples/levelset_filter.py+ three backends — one full GPU level-set filter (Laplacian deform + Godunov reinit + narrow-band retrack via the VBM anddilateGrid/inject/injectPredicateToMask/pruneGrid, style-preserving overFloatGrid/OnIndexGrid+blind-SDF.nvdbfiles). The driverlevelset_filter.pyowns everything that isn't backend-specific (read / write / retrack / outer loop) behind a smallBackendprotocol and a{rawkernel|cupy|cutile}selector; each backend supplies only the dense per-voxel stencils and also runs standalone as a stencil-only smoke test. The three differ only in that compute:levelset_filter_rawkernel.py(a fusedcupy.RawModuleCUDA kernel),levelset_filter_cupy.py(kernel-free CuPy array ops), andlevelset_filter_cutile.py(NVIDIA cuTile@ct.kernels) — all producing identical results.Also included
A core fix in
tools/cuda/GridChecksum.cuh: bothevalChecksumoverloads copied a 256-entry-LUT-sized region back into the 4-byte checksum field (OOB read → runtime abort); corrected tosizeof(uint32_t). Independent of the Python work, buttools.cuda.evalChecksumaborts without it.Validation
Validated locally on a Blackwell (sm_120) GPU with CuPy: CAI/DLPack zero-copy round-trips, streams,
from_external, device NodeManager/VBM, the rasterizer/morphology/QC ops, and the managed-memory distributed path; nanobind 2.12.0 confirmed to compile bothnanovdb_pythonandopenvdb_python. CI does not exercise GPU paths —ctestexcludescuda/mgpuand the Python GPU tests self-skip viaisCudaAvailable()/isGpuAvailable().