[WIP-Don't review] Templatized scratch buffers for topology/rasterization ops#2224
Draft
sifakis wants to merge 6 commits into
Draft
[WIP-Don't review] Templatized scratch buffers for topology/rasterization ops#2224sifakis wants to merge 6 commits into
sifakis wants to merge 6 commits into
Conversation
Plan for threading a defaulted ScratchBufferT template parameter through the CUDA topology operators (TopologyBuilder + Dilate/Prune/Merge/Coarsen/ Refine/MeshToGrid), so downstreams (e.g. fvdb) can inject a PyTorch-allocator -backed buffer type without forking nanoVDB headers (cf. fvdb-core PR AcademySoftwareFoundation#655). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a defaulted 'ScratchBufferT = nanovdb::cuda::DeviceBuffer' template parameter to TopologyBuilder and route the device-only scratch buffers (mUpper/LowerMasks, *Offsets, *Parents, and the transient *CountsBuffer locals) through it. Dual-use buffers (mProcessedRoot, mData) intentionally stay DeviceBuffer (see plan section 2.4). Direct static_cast on scratch deviceData()/data() switched to reinterpret_cast so a uint8_t*-returning buffer type (e.g. TorchDeviceBuffer) also compiles. With the default DeviceBuffer everything resolves to one type, so behavior is byte-identical; topology CUDA unittests pass unchanged. Data struct stays nested (lift to topology::detail deferred to M5, see plan section 2.6). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…uildT> Move the nested Data struct out of TopologyBuilder into a standalone topology::detail::TopologyBuilderData<BuildT> (parameterized on BuildT only), with a 'using Data = ...' alias retained in the class. The 10 detail functors now name TopologyBuilderData<BuildT> directly instead of typename TopologyBuilder<BuildT>::Data. Data does not depend on the scratch buffer type, so this makes it one consistent type across every TopologyBuilder<BuildT, ScratchBufferT> instantiation -- a prerequisite for a non-default ScratchBufferT (the functors could no longer name TopologyBuilder<BuildT>::Data once the param is required). Behavior unchanged; topology CUDA unittests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a defaulted 'ScratchBufferT = nanovdb::cuda::DeviceBuffer' template parameter to DilateGrid/PruneGrid/MergeGrids/CoarsenGrid/RefineGrid and forward it to their TopologyBuilder<BuildT, ScratchBufferT> mBuilder member; re-template the out-of-line method definitions accordingly. The output buffer type stays the independent method-level getHandle<BufferT> parameter; srcRoot* HostBuffer locals and d_bufferPtr casts are unchanged. Default keeps behavior byte-identical; topology CUDA unittests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a defaulted 'ScratchBufferT = nanovdb::cuda::DeviceBuffer' template
parameter to MeshToGrid and route its device-only scratch through it: the 3
members (mXformedTriangles, mBoxTrianglePairsBuffer, mUniqueRootOriginsBuffer)
and the transient locals (rootBox{Counts,Offsets}, keys/sortedKeys/uniqueKeys,
numSelected, mask/counts/offsets/newPairs, retainMask); forward it to
TopologyBuilder<BuildT, ScratchBufferT> mBuilder and re-template the 10 method
definitions. Direct static_cast on scratch deviceData() switched to
reinterpret_cast.
Left as DeviceBuffer: mProcessedRoot (dual-use) and the UDF sidecarBuffer
(output, returned as SidecarBufferT). The output GridHandle and the void*
deviceUpperMasks/deviceLowerMasks accessor casts are unchanged. BoxTrianglePair
stays a nested type (resolves via the default; its lift for a non-default
ScratchBufferT is an M5 concern, mirroring the Data lift). Behavior unchanged;
topology CUDA unittests pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…seam Complete the templatized-scratch seam so it is exercised against a non-DeviceBuffer type, and make the internal helper enforce explicit forwarding: - TopologyBuilder: drop the ScratchBufferT default. It is an internal helper; the default belongs only on the public client classes, so a client that forgets to forward now fails to compile instead of silently using DeviceBuffer scratch. - MeshToGrid: lift BoxTrianglePair to nanovdb::tools::cuda scope (+ public alias), mirroring the TopologyBuilderData lift, so it is one consistent type across all MeshToGrid<BuildT, ScratchBufferT> instantiations (the detail functors reference it via the alias, unchanged). - MeshToGrid::getHandleAndUDF: honor SidecarBufferT (SidecarBufferT::create + reinterpret_cast); it previously hardcoded DeviceBuffer for the sidecar, so the param only compiled as DeviceBuffer. - MeshToGrid: forward ScratchBufferT into the internal PruneGrid so an all-Unified instantiation uses UnifiedBuffer scratch throughout. - UnifiedBuffer: add clear(cudaStream_t) so it satisfies the scratch-buffer interface (managed memory frees synchronously via cudaFree; stream unused). Mirrors the stream-accepting clear() fvdb will add to TorchDeviceBuffer. Canary: refactor the four topology unittests (DilateInjectPrune, RefineCoarsen, MergeGrids, MeshToGrid_UnitTetrahedron) into helpers templated on every buffer axis (scratch / grid-output / sidecar-output), following the existing testVoxelBlockManager<BuildT, BufferT> pattern. Each is instantiated as all-DeviceBuffer and all-UnifiedBuffer. All 8 pass; full CUDA suite 55/56 (only the pre-existing UnifiedBuffer_IO data-file failure, unrelated). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Make the transient scratch buffer type a (defaulted) template parameter of the CUDA topology/rasterization operators, so a downstream can inject its own buffer type for nanoVDB's internal scratch allocations without forking any nanoVDB headers.
Today these operators hardcode
nanovdb::cuda::DeviceBuffer(i.e.cudaMallocAsync/cudaFreeAsync) for all of their internal scratch. A consumer that manages GPU memory through a different allocator — e.g. PyTorch's caching allocator in fvdb-core — currently has to fork the headers to redirect those allocations, otherwise the two allocator pools fragment each other and large workloads hit OOM. (See fvdb-core #655 for the workaround this is meant to replace.)With this change the consumer instead instantiates the operator with its own
ScratchBufferT, no patching required.Design
ScratchBufferTtemplate parameter on the operators, used for all device-only scratch.nanovdb::cuda::DeviceBufferon the public operators, so existing call sites and behavior are unchanged (byte-identical under the default).getHandle<BufferT>()parameter; scratch is a separate axis. So you can mix, e.g.ScratchBufferT = DeviceBufferwith outputBufferT = UnifiedBuffer, or vice-versa.TopologyBuildertakesScratchBufferTas a non-defaulted parameter (it's an implementation detail; the default belongs only on the public operators). This way an operator that forgets to forward the type fails to compile rather than silently falling back toDeviceBuffer.create(bytes, pool*, device, stream),deviceData(), andclear(stream).What's covered
tools/cuda/TopologyBuilder.cuh— the shared scratch (masks, offsets, parents, count buffers).tools/cuda/{DilateGrid,PruneGrid,MergeGrids,CoarsenGrid,RefineGrid}.cuh— thread the parameter through toTopologyBuilder.tools/cuda/MeshToGrid.cuh— its own scratch (transformed triangles, box/triangle pairs, sort/scan temporaries, etc.), plus its internalPruneGrid.Supporting changes:
TopologyBuilder::DataandMeshToGrid::BoxTrianglePairout to namespace scope (they don't depend on the scratch type), so the device functors that reference them see one consistent type across instantiations.cuda::UnifiedBuffer: added a stream-acceptingclear(cudaStream_t)overload so it satisfies the scratch interface.MeshToGrid::getHandleAndUDFnow actually honors itsSidecarBufferTparameter (it previously hardcodedDeviceBufferfor the UDF sidecar allocation).Dual-use buffers (intentionally left as
DeviceBufferfor now)Two buffers are host-populated and then uploaded (
TopologyBuilder::mProcessedRoot,mData). These remainDeviceBufferin this pass. The clean follow-up is to split each into a host buffer + a deviceScratchBufferThalf with an explicit copy between them; the parameter introduced here is already sufficient for that device half.Testing
DilateInjectPrune,RefineCoarsen,MergeGrids,MeshToGrid_UnitTetrahedron) were refactored into helpers templated on every buffer axis (scratch / grid-output / sidecar-output), following the existingtestVoxelBlockManager<BuildT, BufferT>pattern, and are instantiated for both all-DeviceBufferand all-UnifiedBuffer. All pass; results are identical between the two buffer types.TestNanoVDB.cuCUDA suite is green except one pre-existing, unrelated failure (UnifiedBuffer_IO, a missing test-data fixture).ex_dilate/ex_refine/ex_coarsen/ex_mesh_to_gridCUDA examples build and run against OpenVDB references with correct results.Notes for later (not for review yet)
TEMPLATIZED_BUFFERS_PLAN.md) is currently committed at the repo root and will be removed before this is opened for real review.🤖 Generated with Claude Code