[WIP-Don't review] Templatized scratch buffers for topology/rasterization ops by sifakis · Pull Request #2224 · AcademySoftwareFoundation/openvdb

sifakis · 2026-06-01T21:53:25Z

⚠️ [WIP — please don't review yet] This is a draft to share direction and run CI early. The API and commit history are still in flux.

Summary

Make the transient scratch buffer type a (defaulted) template parameter of the CUDA topology/rasterization operators, so a downstream can inject its own buffer type for nanoVDB's internal scratch allocations without forking any nanoVDB headers.

Today these operators hardcode nanovdb::cuda::DeviceBuffer (i.e. cudaMallocAsync/cudaFreeAsync) for all of their internal scratch. A consumer that manages GPU memory through a different allocator — e.g. PyTorch's caching allocator in fvdb-core — currently has to fork the headers to redirect those allocations, otherwise the two allocator pools fragment each other and large workloads hit OOM. (See fvdb-core #655 for the workaround this is meant to replace.)

With this change the consumer instead instantiates the operator with its own ScratchBufferT, no patching required.

Design

New ScratchBufferT template parameter on the operators, used for all device-only scratch.
- Defaulted to nanovdb::cuda::DeviceBuffer on the public operators, so existing call sites and behavior are unchanged (byte-identical under the default).
- Independent from the output buffer type. The grid the caller receives is still controlled by the existing getHandle<BufferT>() parameter; scratch is a separate axis. So you can mix, e.g. ScratchBufferT = DeviceBuffer with output BufferT = UnifiedBuffer, or vice-versa.
The shared internal helper TopologyBuilder takes ScratchBufferT as a non-defaulted parameter (it's an implementation detail; the default belongs only on the public operators). This way an operator that forgets to forward the type fails to compile rather than silently falling back to DeviceBuffer.
A scratch buffer type only needs a small interface: create(bytes, pool*, device, stream), deviceData(), and clear(stream).

What's covered

tools/cuda/TopologyBuilder.cuh — the shared scratch (masks, offsets, parents, count buffers).
tools/cuda/{DilateGrid,PruneGrid,MergeGrids,CoarsenGrid,RefineGrid}.cuh — thread the parameter through to TopologyBuilder.
tools/cuda/MeshToGrid.cuh — its own scratch (transformed triangles, box/triangle pairs, sort/scan temporaries, etc.), plus its internal PruneGrid.

Supporting changes:

Lifted TopologyBuilder::Data and MeshToGrid::BoxTrianglePair out to namespace scope (they don't depend on the scratch type), so the device functors that reference them see one consistent type across instantiations.
cuda::UnifiedBuffer: added a stream-accepting clear(cudaStream_t) overload so it satisfies the scratch interface.
MeshToGrid::getHandleAndUDF now actually honors its SidecarBufferT parameter (it previously hardcoded DeviceBuffer for the UDF sidecar allocation).

Dual-use buffers (intentionally left as `DeviceBuffer` for now)

Two buffers are host-populated and then uploaded (TopologyBuilder::mProcessedRoot, mData). These remain DeviceBuffer in this pass. The clean follow-up is to split each into a host buffer + a device ScratchBufferT half with an explicit copy between them; the parameter introduced here is already sufficient for that device half.

Testing

The CUDA topology unit tests (DilateInjectPrune, RefineCoarsen, MergeGrids, MeshToGrid_UnitTetrahedron) were refactored into helpers templated on every buffer axis (scratch / grid-output / sidecar-output), following the existing testVoxelBlockManager<BuildT, BufferT> pattern, and are instantiated for both all-DeviceBuffer and all-UnifiedBuffer. All pass; results are identical between the two buffer types.
Full TestNanoVDB.cu CUDA suite is green except one pre-existing, unrelated failure (UnifiedBuffer_IO, a missing test-data fixture).
The ex_dilate / ex_refine / ex_coarsen / ex_mesh_to_grid CUDA examples build and run against OpenVDB references with correct results.

Notes for later (not for review yet)

A working design note (TEMPLATIZED_BUFFERS_PLAN.md) is currently committed at the repo root and will be removed before this is opened for real review.
Commit history is intentionally milestone-by-milestone and may be squashed/rebased.

🤖 Generated with Claude Code

Plan for threading a defaulted ScratchBufferT template parameter through the CUDA topology operators (TopologyBuilder + Dilate/Prune/Merge/Coarsen/ Refine/MeshToGrid), so downstreams (e.g. fvdb) can inject a PyTorch-allocator -backed buffer type without forking nanoVDB headers (cf. fvdb-core PR AcademySoftwareFoundation#655). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add a defaulted 'ScratchBufferT = nanovdb::cuda::DeviceBuffer' template parameter to TopologyBuilder and route the device-only scratch buffers (mUpper/LowerMasks, *Offsets, *Parents, and the transient *CountsBuffer locals) through it. Dual-use buffers (mProcessedRoot, mData) intentionally stay DeviceBuffer (see plan section 2.4). Direct static_cast on scratch deviceData()/data() switched to reinterpret_cast so a uint8_t*-returning buffer type (e.g. TorchDeviceBuffer) also compiles. With the default DeviceBuffer everything resolves to one type, so behavior is byte-identical; topology CUDA unittests pass unchanged. Data struct stays nested (lift to topology::detail deferred to M5, see plan section 2.6). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…uildT> Move the nested Data struct out of TopologyBuilder into a standalone topology::detail::TopologyBuilderData<BuildT> (parameterized on BuildT only), with a 'using Data = ...' alias retained in the class. The 10 detail functors now name TopologyBuilderData<BuildT> directly instead of typename TopologyBuilder<BuildT>::Data. Data does not depend on the scratch buffer type, so this makes it one consistent type across every TopologyBuilder<BuildT, ScratchBufferT> instantiation -- a prerequisite for a non-default ScratchBufferT (the functors could no longer name TopologyBuilder<BuildT>::Data once the param is required). Behavior unchanged; topology CUDA unittests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add a defaulted 'ScratchBufferT = nanovdb::cuda::DeviceBuffer' template parameter to DilateGrid/PruneGrid/MergeGrids/CoarsenGrid/RefineGrid and forward it to their TopologyBuilder<BuildT, ScratchBufferT> mBuilder member; re-template the out-of-line method definitions accordingly. The output buffer type stays the independent method-level getHandle<BufferT> parameter; srcRoot* HostBuffer locals and d_bufferPtr casts are unchanged. Default keeps behavior byte-identical; topology CUDA unittests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add a defaulted 'ScratchBufferT = nanovdb::cuda::DeviceBuffer' template parameter to MeshToGrid and route its device-only scratch through it: the 3 members (mXformedTriangles, mBoxTrianglePairsBuffer, mUniqueRootOriginsBuffer) and the transient locals (rootBox{Counts,Offsets}, keys/sortedKeys/uniqueKeys, numSelected, mask/counts/offsets/newPairs, retainMask); forward it to TopologyBuilder<BuildT, ScratchBufferT> mBuilder and re-template the 10 method definitions. Direct static_cast on scratch deviceData() switched to reinterpret_cast. Left as DeviceBuffer: mProcessedRoot (dual-use) and the UDF sidecarBuffer (output, returned as SidecarBufferT). The output GridHandle and the void* deviceUpperMasks/deviceLowerMasks accessor casts are unchanged. BoxTrianglePair stays a nested type (resolves via the default; its lift for a non-default ScratchBufferT is an M5 concern, mirroring the Data lift). Behavior unchanged; topology CUDA unittests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…seam Complete the templatized-scratch seam so it is exercised against a non-DeviceBuffer type, and make the internal helper enforce explicit forwarding: - TopologyBuilder: drop the ScratchBufferT default. It is an internal helper; the default belongs only on the public client classes, so a client that forgets to forward now fails to compile instead of silently using DeviceBuffer scratch. - MeshToGrid: lift BoxTrianglePair to nanovdb::tools::cuda scope (+ public alias), mirroring the TopologyBuilderData lift, so it is one consistent type across all MeshToGrid<BuildT, ScratchBufferT> instantiations (the detail functors reference it via the alias, unchanged). - MeshToGrid::getHandleAndUDF: honor SidecarBufferT (SidecarBufferT::create + reinterpret_cast); it previously hardcoded DeviceBuffer for the sidecar, so the param only compiled as DeviceBuffer. - MeshToGrid: forward ScratchBufferT into the internal PruneGrid so an all-Unified instantiation uses UnifiedBuffer scratch throughout. - UnifiedBuffer: add clear(cudaStream_t) so it satisfies the scratch-buffer interface (managed memory frees synchronously via cudaFree; stream unused). Mirrors the stream-accepting clear() fvdb will add to TorchDeviceBuffer. Canary: refactor the four topology unittests (DilateInjectPrune, RefineCoarsen, MergeGrids, MeshToGrid_UnitTetrahedron) into helpers templated on every buffer axis (scratch / grid-output / sidecar-output), following the existing testVoxelBlockManager<BuildT, BufferT> pattern. Each is instantiated as all-DeviceBuffer and all-UnifiedBuffer. All 8 pass; full CUDA suite 55/56 (only the pre-existing UnifiedBuffer_IO data-file failure, unrelated). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

sifakis and others added 6 commits June 1, 2026 13:33

swahtz added the nanovdb label Jun 5, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP-Don't review] Templatized scratch buffers for topology/rasterization ops#2224

[WIP-Don't review] Templatized scratch buffers for topology/rasterization ops#2224
sifakis wants to merge 6 commits into
AcademySoftwareFoundation:masterfrom
sifakis:templatized-buffers

sifakis commented Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

sifakis commented Jun 1, 2026

Summary

Design

What's covered

Dual-use buffers (intentionally left as DeviceBuffer for now)

Testing

Notes for later (not for review yet)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Dual-use buffers (intentionally left as `DeviceBuffer` for now)