[WIP-Don't review] Templatized scratch buffers for topology/rasterization ops #2224

Draft
sifakis wants to merge 6 commits into AcademySoftwareFoundation:master from sifakis:templatized-buffers

sifakis commented Jun 1, 2026

⚠️ [WIP — please don't review yet] This is a draft to share direction and run CI early. The API and commit history are still in flux.

Summary

Make the transient scratch buffer type a (defaulted) template parameter of the CUDA topology/rasterization operators, so a downstream project can inject its own buffer type for nanoVDB's internal scratch allocations without forking any nanoVDB headers.

Today these operators hardcode nanovdb::cuda::DeviceBuffer (i.e. cudaMallocAsync/cudaFreeAsync) for all of their internal scratch. A consumer that manages GPU memory through a different allocator — e.g. PyTorch's caching allocator in fvdb-core — currently has to fork the headers to redirect those allocations, otherwise the two allocator pools fragment each other and large workloads hit OOM. (See fvdb-core #655 for the workaround this is meant to replace.)

With this change the consumer instead instantiates the operator with its own ScratchBufferT, no patching required.

Design

  • New ScratchBufferT template parameter on the operators, used for all device-only scratch.
    • Defaulted to nanovdb::cuda::DeviceBuffer on the public operators, so existing call sites and behavior are unchanged (byte-identical under the default).
    • Independent from the output buffer type. The grid the caller receives is still controlled by the existing getHandle<BufferT>() parameter; scratch is a separate axis. So you can mix, e.g. ScratchBufferT = DeviceBuffer with output BufferT = UnifiedBuffer, or vice-versa.
  • The shared internal helper TopologyBuilder takes ScratchBufferT as a non-defaulted parameter (it's an implementation detail; the default belongs only on the public operators). This way an operator that forgets to forward the type fails to compile rather than silently falling back to DeviceBuffer.
  • A scratch buffer type only needs a small interface: create(bytes, pool*, device, stream), deviceData(), and clear(stream).
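
The design points above can be sketched in plain C++. This is a minimal host-only illustration, not the PR's code: `Stream`, `MallocScratch`, `CountingScratch`, `BuilderSketch` and `DilateSketch` are hypothetical names, and `malloc`/`free` stand in for device allocation so the sketch compiles without the CUDA toolkit. It assumes only the three-member interface listed above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <type_traits>

// ASSUMPTION: host-only stand-ins. Stream replaces cudaStream_t and malloc/free
// replace device allocation; none of these names come from nanoVDB.
using Stream = void*;

// Plays the role of the default scratch buffer (DeviceBuffer in the PR).
struct MallocScratch {
    void* mData = nullptr;
    static MallocScratch create(std::size_t bytes, const MallocScratch* /*pool*/ = nullptr,
                                int /*device*/ = 0, Stream /*stream*/ = nullptr) {
        MallocScratch b;
        b.mData = std::malloc(bytes);
        return b;
    }
    void* deviceData() const { return mData; }
    void  clear(Stream /*stream*/ = nullptr) { std::free(mData); mData = nullptr; }
};

// Hypothetical downstream buffer: same interface, but it counts live allocations
// so the demo can observe which type the operator really used.
struct CountingScratch {
    void* mData = nullptr;
    static int& live() { static int n = 0; return n; }
    static CountingScratch create(std::size_t bytes, const CountingScratch* /*pool*/ = nullptr,
                                  int /*device*/ = 0, Stream /*stream*/ = nullptr) {
        CountingScratch b;
        b.mData = std::malloc(bytes);
        ++live();
        return b;
    }
    void* deviceData() const { return mData; }
    void  clear(Stream /*stream*/ = nullptr) {
        if (mData) { std::free(mData); mData = nullptr; --live(); }
    }
};

// Internal helper: ScratchBufferT has no default, so every client must forward it.
template <typename BuildT, typename ScratchBufferT>
struct BuilderSketch {
    ScratchBufferT mMasks;
    void allocate(std::size_t bytes, Stream s) {
        mMasks = ScratchBufferT::create(bytes, nullptr, 0, s);
        assert(mMasks.deviceData() != nullptr);
    }
    void release(Stream s) { mMasks.clear(s); }
};

// Public operator: the default lives here only.
template <typename BuildT, typename ScratchBufferT = MallocScratch>
struct DilateSketch {
    using ScratchType = ScratchBufferT;
    // Writing BuilderSketch<BuildT> here would fail to compile (missing argument).
    BuilderSketch<BuildT, ScratchBufferT> mBuilder;
};

static_assert(std::is_same<DilateSketch<float>::ScratchType, MallocScratch>::value,
              "existing call sites keep the default scratch type");

// Returns true when the operator allocated and released through the injected type.
bool injectedScratchIsUsed() {
    DilateSketch<float, CountingScratch> op;
    op.mBuilder.allocate(256, nullptr);
    const bool allocatedThroughInjectedType = (CountingScratch::live() == 1);
    op.mBuilder.release(nullptr);
    return allocatedThroughInjectedType && CountingScratch::live() == 0;
}
```

Keeping the default off the helper means the failure mode for a forgotten forward is a compile error at the operator's member declaration rather than a silent second allocator at run time.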

What's covered

  • tools/cuda/TopologyBuilder.cuh — the shared scratch (masks, offsets, parents, count buffers).
  • tools/cuda/{DilateGrid,PruneGrid,MergeGrids,CoarsenGrid,RefineGrid}.cuh — thread the parameter through to TopologyBuilder.
  • tools/cuda/MeshToGrid.cuh — its own scratch (transformed triangles, box/triangle pairs, sort/scan temporaries, etc.), plus its internal PruneGrid.

Supporting changes:

  • Lifted TopologyBuilder::Data and MeshToGrid::BoxTrianglePair out to namespace scope (they don't depend on the scratch type), so the device functors that reference them see one consistent type across instantiations.
  • cuda::UnifiedBuffer: added a stream-accepting clear(cudaStream_t) overload so it satisfies the scratch interface.
  • MeshToGrid::getHandleAndUDF now actually honors its SidecarBufferT parameter (it previously hardcoded DeviceBuffer for the UDF sidecar allocation).
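
The reason for lifting the nested types can be shown with a small C++ sketch. All names here (`NestedBuilder`, `LiftedBuilder`, `BuilderData`, `BufferA`, `BufferB`, `readCount`) are hypothetical stand-ins rather than the PR's identifiers; the point is only that a nested struct is a different type per instantiation, while a namespace-scope struct templated on `BuildT` alone is shared.

```cpp
#include <cassert>
#include <type_traits>

// ASSUMPTION: two empty tags standing in for two scratch buffer types.
struct BufferA {};
struct BufferB {};

// Before: Data is nested, so each (BuildT, ScratchBufferT) pair gets its own Data.
template <typename BuildT, typename ScratchBufferT>
struct NestedBuilder {
    struct Data { int nodeCount; };
};

// After: Data lives at namespace scope and depends on BuildT only.
template <typename BuildT>
struct BuilderData { int nodeCount; };

template <typename BuildT, typename ScratchBufferT>
struct LiftedBuilder {
    using Data = BuilderData<BuildT>;  // alias kept so existing spellings still work
};

// A functor-style helper written once against the lifted type. With the nested
// version it would have to name a specific ScratchBufferT, and BuildT could not
// be deduced through NestedBuilder<BuildT, ScratchBufferT>::Data.
template <typename BuildT>
int readCount(const BuilderData<BuildT>& d) { return d.nodeCount; }

static_assert(!std::is_same<NestedBuilder<float, BufferA>::Data,
                            NestedBuilder<float, BufferB>::Data>::value,
              "nested Data differs per scratch type");
static_assert(std::is_same<LiftedBuilder<float, BufferA>::Data,
                           LiftedBuilder<float, BufferB>::Data>::value,
              "lifted Data is one type across scratch types");

// The same helper serves both instantiations.
void checkLiftedData() {
    assert(readCount(LiftedBuilder<float, BufferA>::Data{3}) == 3);
    assert(readCount(LiftedBuilder<float, BufferB>::Data{7}) == 7);
}
```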

Dual-use buffers (intentionally left as DeviceBuffer for now)

Two buffers are host-populated and then uploaded (TopologyBuilder::mProcessedRoot, mData). These remain DeviceBuffer in this pass. The clean follow-up is to split each into a host buffer + a device ScratchBufferT half with an explicit copy between them; the parameter introduced here is already sufficient for that device half.
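
One possible shape for that follow-up split, as a host-only C++ sketch under stated assumptions: `SplitBufferSketch`, `MallocScratch` and `Stream` are hypothetical, `std::vector` stands in for the host half, and `std::memcpy` stands in for the host-to-device copy a real implementation would issue on the stream. The PR does not specify this layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <vector>

using Stream = void*;  // ASSUMPTION: stand-in for cudaStream_t

// ASSUMPTION: malloc-backed stand-in exposing the scratch interface from the PR text.
struct MallocScratch {
    void* mData = nullptr;
    static MallocScratch create(std::size_t bytes, const MallocScratch* /*pool*/ = nullptr,
                                int /*device*/ = 0, Stream /*stream*/ = nullptr) {
        MallocScratch b;
        b.mData = std::malloc(bytes);
        return b;
    }
    void* deviceData() const { return mData; }
    void  clear(Stream /*stream*/ = nullptr) { std::free(mData); mData = nullptr; }
};

// A dual-use buffer split into a host half and a device half of type ScratchBufferT.
template <typename T, typename ScratchBufferT>
struct SplitBufferSketch {
    std::vector<T> host;    // populated on the host
    ScratchBufferT device;  // allocated through the injected scratch type

    // Explicit copy between the halves; memcpy stands in for an async H2D copy.
    void upload(Stream stream) {
        device.clear(stream);  // release any previous device half
        const std::size_t bytes = host.size() * sizeof(T);
        if (bytes == 0) return;
        device = ScratchBufferT::create(bytes, nullptr, 0, stream);
        assert(device.deviceData() != nullptr);
        std::memcpy(device.deviceData(), host.data(), bytes);
    }
    const T* deviceData() const { return static_cast<const T*>(device.deviceData()); }
    void clear(Stream stream) { device.clear(stream); host.clear(); }
};

// Fills the host half, uploads it, and sums what landed in the device half.
int uploadedSum() {
    SplitBufferSketch<int, MallocScratch> counts;
    counts.host = {1, 2, 3};
    counts.upload(nullptr);
    const int* d = counts.deviceData();
    const int sum = d[0] + d[1] + d[2];
    counts.clear(nullptr);
    return sum;
}
```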

Testing

  • The CUDA topology unit tests (DilateInjectPrune, RefineCoarsen, MergeGrids, MeshToGrid_UnitTetrahedron) were refactored into helpers templated on every buffer axis (scratch / grid-output / sidecar-output), following the existing testVoxelBlockManager<BuildT, BufferT> pattern, and are instantiated for both all-DeviceBuffer and all-UnifiedBuffer. All pass; results are identical between the two buffer types.
  • The full TestNanoVDB.cu CUDA suite passes 55 of 56 tests; the single failure (UnifiedBuffer_IO, a missing test-data fixture) is pre-existing and unrelated.
  • The ex_dilate / ex_refine / ex_coarsen / ex_mesh_to_grid CUDA examples build and run against OpenVDB references with correct results.
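
The test-helper pattern described in the first bullet can be sketched as follows. This is a host-only illustration with hypothetical names (`HostStandIn`, `ScanSketch`, `runScanTest`); a toy inclusive scan replaces the real topology operators, and only the scratch and grid-output axes are shown (the sidecar axis is omitted for brevity).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <vector>

using Stream = void*;  // ASSUMPTION: stand-in for cudaStream_t

// ASSUMPTION: two host-backed buffer types sharing the scratch interface; they
// differ only in allocator (malloc vs calloc), to play the role of two buffer kinds.
template <bool ZeroInit>
struct HostStandIn {
    void* mData = nullptr;
    static HostStandIn create(std::size_t bytes, const HostStandIn* /*pool*/ = nullptr,
                              int /*device*/ = 0, Stream /*stream*/ = nullptr) {
        HostStandIn b;
        b.mData = ZeroInit ? std::calloc(1, bytes) : std::malloc(bytes);
        return b;
    }
    void* deviceData() const { return mData; }
    void  clear(Stream /*stream*/ = nullptr) { std::free(mData); mData = nullptr; }
};
using BufferA = HostStandIn<false>;
using BufferB = HostStandIn<true>;

// Toy operator: scratch type is a class-level parameter, output type is a
// method-level parameter, mirroring the two independent axes in the PR.
template <typename ScratchBufferT>
struct ScanSketch {
    template <typename BufferT>
    BufferT getHandle(const std::vector<int>& in, Stream s) {
        const std::size_t bytes = in.size() * sizeof(int);
        ScratchBufferT scratch = ScratchBufferT::create(bytes, nullptr, 0, s);
        int* tmp = static_cast<int*>(scratch.deviceData());
        int running = 0;
        for (std::size_t i = 0; i < in.size(); ++i) { running += in[i]; tmp[i] = running; }
        BufferT out = BufferT::create(bytes, nullptr, 0, s);
        std::memcpy(out.deviceData(), tmp, bytes);
        scratch.clear(s);  // scratch never escapes the operator
        return out;
    }
};

// Test helper templated on every buffer axis; returns the result for comparison.
template <typename ScratchBufferT, typename BufferT>
std::vector<int> runScanTest() {
    const std::vector<int> in = {1, 2, 3, 4};
    ScanSketch<ScratchBufferT> op;
    BufferT out = op.template getHandle<BufferT>(in, nullptr);
    const int* p = static_cast<const int*>(out.deviceData());
    std::vector<int> result(p, p + in.size());
    out.clear(nullptr);
    return result;
}
```

Instantiating `runScanTest<BufferA, BufferA>()` and `runScanTest<BufferB, BufferB>()` and comparing the returned vectors corresponds to the all-DeviceBuffer versus all-UnifiedBuffer comparison described above.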

Notes for later (not for review yet)

  • A working design note (TEMPLATIZED_BUFFERS_PLAN.md) is currently committed at the repo root and will be removed before this is opened for real review.
  • Commit history is intentionally milestone-by-milestone and may be squashed/rebased.

🤖 Generated with Claude Code

sifakis and others added 6 commits June 1, 2026 13:33
Plan for threading a defaulted ScratchBufferT template parameter through
the CUDA topology operators (TopologyBuilder + Dilate/Prune/Merge/Coarsen/
Refine/MeshToGrid), so downstreams (e.g. fvdb) can inject a PyTorch-allocator-backed
buffer type without forking nanoVDB headers (cf. fvdb-core PR #655).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a defaulted 'ScratchBufferT = nanovdb::cuda::DeviceBuffer' template
parameter to TopologyBuilder and route the device-only scratch buffers
(mUpper/LowerMasks, *Offsets, *Parents, and the transient *CountsBuffer
locals) through it. Dual-use buffers (mProcessedRoot, mData) intentionally
stay DeviceBuffer (see plan section 2.4). Direct static_cast on scratch
deviceData()/data() switched to reinterpret_cast so a uint8_t*-returning
buffer type (e.g. TorchDeviceBuffer) also compiles.

With the default DeviceBuffer everything resolves to one type, so behavior
is byte-identical; topology CUDA unittests pass unchanged. Data struct stays
nested (lift to topology::detail deferred to M5, see plan section 2.6).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
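
The cast change in the commit above can be illustrated with a short C++ sketch. `VoidPtrBuffer`, `BytePtrBuffer` and `scratchAs` are hypothetical names; the only assumption taken from the commit message is that some buffer types return `void*` from `deviceData()` while others return `uint8_t*`.

```cpp
#include <cassert>
#include <cstdint>

// ASSUMPTION: minimal stand-ins that differ only in deviceData()'s return type.
struct VoidPtrBuffer {
    void* mData;
    void* deviceData() const { return mData; }
};
struct BytePtrBuffer {
    std::uint8_t* mData;
    std::uint8_t* deviceData() const { return mData; }
};

// One accessor for both kinds of buffer.
template <typename T, typename BufferT>
T* scratchAs(const BufferT& buffer) {
    // static_cast<T*>(buffer.deviceData()) compiles for the void* case only;
    // uint8_t* to an unrelated T* needs reinterpret_cast, which accepts both.
    return reinterpret_cast<T*>(buffer.deviceData());
}

// Writes through the void*-returning view and reads back through the uint8_t* one.
std::uint32_t roundTrip() {
    std::uint32_t storage[4] = {0, 0, 0, 0};
    VoidPtrBuffer a{storage};
    BytePtrBuffer b{reinterpret_cast<std::uint8_t*>(storage)};
    scratchAs<std::uint32_t>(a)[2] = 42u;
    const std::uint32_t value = scratchAs<std::uint32_t>(b)[2];
    assert(value == storage[2]);
    return value;
}
```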
…uildT>

Move the nested Data struct out of TopologyBuilder into a standalone
topology::detail::TopologyBuilderData<BuildT> (parameterized on BuildT only),
with a 'using Data = ...' alias retained in the class. The 10 detail functors
now name TopologyBuilderData<BuildT> directly instead of
typename TopologyBuilder<BuildT>::Data.

Data does not depend on the scratch buffer type, so this makes it one
consistent type across every TopologyBuilder<BuildT, ScratchBufferT>
instantiation -- a prerequisite for a non-default ScratchBufferT (the functors
could no longer name TopologyBuilder<BuildT>::Data once the param is required).
Behavior unchanged; topology CUDA unittests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a defaulted 'ScratchBufferT = nanovdb::cuda::DeviceBuffer' template
parameter to DilateGrid/PruneGrid/MergeGrids/CoarsenGrid/RefineGrid and forward
it to their TopologyBuilder<BuildT, ScratchBufferT> mBuilder member; re-template
the out-of-line method definitions accordingly. The output buffer type stays the
independent method-level getHandle<BufferT> parameter; srcRoot* HostBuffer locals
and d_bufferPtr casts are unchanged. Default keeps behavior byte-identical;
topology CUDA unittests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a defaulted 'ScratchBufferT = nanovdb::cuda::DeviceBuffer' template
parameter to MeshToGrid and route its device-only scratch through it: the 3
members (mXformedTriangles, mBoxTrianglePairsBuffer, mUniqueRootOriginsBuffer)
and the transient locals (rootBox{Counts,Offsets}, keys/sortedKeys/uniqueKeys,
numSelected, mask/counts/offsets/newPairs, retainMask); forward it to
TopologyBuilder<BuildT, ScratchBufferT> mBuilder and re-template the 10 method
definitions. Direct static_cast on scratch deviceData() switched to
reinterpret_cast.

Left as DeviceBuffer: mProcessedRoot (dual-use) and the UDF sidecarBuffer
(output, returned as SidecarBufferT). The output GridHandle and the void*
deviceUpperMasks/deviceLowerMasks accessor casts are unchanged. BoxTrianglePair
stays a nested type (resolves via the default; its lift for a non-default
ScratchBufferT is an M5 concern, mirroring the Data lift). Behavior unchanged;
topology CUDA unittests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…seam

Complete the templatized-scratch seam so it is exercised against a non-DeviceBuffer
type, and make the internal helper enforce explicit forwarding:

- TopologyBuilder: drop the ScratchBufferT default. It is an internal helper; the
  default belongs only on the public client classes, so a client that forgets to
  forward now fails to compile instead of silently using DeviceBuffer scratch.
- MeshToGrid: lift BoxTrianglePair to nanovdb::tools::cuda scope (+ public alias),
  mirroring the TopologyBuilderData lift, so it is one consistent type across all
  MeshToGrid<BuildT, ScratchBufferT> instantiations (the detail functors reference it
  via the alias, unchanged).
- MeshToGrid::getHandleAndUDF: honor SidecarBufferT (SidecarBufferT::create +
  reinterpret_cast); it previously hardcoded DeviceBuffer for the sidecar, so the
  param only compiled as DeviceBuffer.
- MeshToGrid: forward ScratchBufferT into the internal PruneGrid so an all-Unified
  instantiation uses UnifiedBuffer scratch throughout.
- UnifiedBuffer: add clear(cudaStream_t) so it satisfies the scratch-buffer interface
  (managed memory frees synchronously via cudaFree; stream unused). Mirrors the
  stream-accepting clear() fvdb will add to TorchDeviceBuffer.

Canary: refactor the four topology unittests (DilateInjectPrune, RefineCoarsen,
MergeGrids, MeshToGrid_UnitTetrahedron) into helpers templated on every buffer axis
(scratch / grid-output / sidecar-output), following the existing
testVoxelBlockManager<BuildT, BufferT> pattern. Each is instantiated as all-DeviceBuffer
and all-UnifiedBuffer. All 8 pass; full CUDA suite 55/56 (only the pre-existing
UnifiedBuffer_IO data-file failure, unrelated).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
swahtz added the nanovdb label Jun 5, 2026