NanoVDB ReadAccessor::getDimAndActive#2220
Open
swahtz wants to merge 9 commits into
Open
Conversation
Introduce a fused getDim + isActive traversal that shares the
Root->Internal->Leaf descent, the accessor cache update, and the
terminating-level mask read between the two queries. Adds a
DimAndActive POD, getDimAndActiveAndCache on Root/Internal/Leaf,
and getDimAndActive on all four ReadAccessor specializations.
Two policy tags control how `active` is resolved when an
InternalNode skip-flag fires along the descent:
- ActiveExact (default): preserves byte-for-byte semantics of
separate getDim + getAndCache<GetState>. `active` is correct
even when the skip-flag is set, at the cost of forcing the
child descent. Required by callers that read `active` at
upper levels (e.g. fvdb HDDA iterators).
- ActiveOnLeafOnly: cheap skip-flag short-circuit; `active` is
unspecified when the returned `dim` exceeds leaf-dim, and the
caller must gate on `dim` first
(`if (hdda.dim() > 1 || !active) continue;`).
HDDA.h ZeroCrossing is intentionally not migrated -- on the
register-tight sm_120 kernel, any fused shape grows reg/thread by
2-3 and drops occupancy from 4 to 3 blocks/SM. The kernel is
memory-latency bound (~14% achieved occupancy) and the L1 cache
absorbs the second descent at ~87% hit rate, so the occupancy hit
dominates the descent-fusion savings. Rationale recorded inline.
Also rework the ex_raytrace_level_set benchmark harness for honest
measurement: kernel-only cudaEvent timing on the CUDA path, 2
untimed warmup iters, min/median/mean/max reporting,
numIterations 50 -> 200, block size 512 -> 256.
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
The fused getDim+isActive traversal returns a {dim, active} pair. The
otherwise-natural {uint32_t, bool} struct layout passes through two
registers (or a packed 64-bit pair) at the function-return boundary.
NanoVDB's getDim return value is bounded by the largest internal-node
DIM in the default FloatTree hierarchy (4096 for an upper internal --
2^12, well below 2^31). The sign bit is unused, so we can pack:
low 31 bits = dim
sign bit = active
The struct keeps its existing API surface (aggregate-init via a 2-arg
constructor, accessors dim() and active()), so callers that used
`return {dim, active}` continue to work; consumers that previously
read `.dim` / `.active` as members switch to `.dim()` / `.active()`.
Effect on consumers:
- On the static ZeroCrossing kernel (which does not use the fused
API; rationale comment updated below): no change.
- On lean_hdda + fusion (a separate experiment) the renderOp kernel
drops from 62 regs (struct encoding) to 61 regs (packed encoding).
Doesn't cross an occupancy tier on sm_120 at b=256, so the wall-
time delta is within noise. Still useful: any consumer holding
DimAndActive live across other work gains 1 reg of headroom for
no semantic change.
- On fvdb's HDDA iterators (the productive consumer that reads
active unconditionally at upper levels), the packing reduces
return-boundary register pressure in a hot path.
Also expand the rationale comment in HDDA.h's ZeroCrossing. The
original justification framed the don't-migrate decision as a
register-cliff problem; testing against lean_hdda (which has 3 regs
of headroom for fusion) shows the same negative wall-time result
without an occupancy hit. The real reason is the second descent is
L1-warm at ~87 % hit rate, so eliminating it saves ~no cycles, and
the fused call's slightly heavier per-call overhead nets out slightly
worse. Verified both struct- and packed-encoded fusion across both
regimes.
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
The long NB block that explained why HDDA::ZeroCrossing keeps separate getDim + isActive calls (instead of using the fused API this branch adds) was useful during the iteration but doesn't earn its keep in the source tree -- a future reader doesn't need to relive the cliff/lean/descent-cost investigation that justified the decision. The rationale is preserved in the PR description: ZeroCrossing's `dim > 1 || !active` short-circuit makes the active read dead at upper levels, and the L1 cache absorbs the leaf-level descent at ~87% hit rate, so fusion saves essentially no cycles while adding a small per- call overhead. The fused API is retained for callers that read active unconditionally (fvdb's HDDA iterators -- -27% to -62% wall time win). Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
The DimAndActive constructor packed active via 'int32_t(a) << 31',
which shifts 1 into the sign bit of a signed int -- undefined behavior
under C++17 (the repo's minimum standard). Do the OR in uint32_t so the
shift is well-defined, and mask dim to 31 bits so a stray high bit can't
clobber the active bit.
Add TestNanoVDB.GetDimAndActive, which asserts the fused
getDimAndActive matches the separate getDim + isActive it replaces:
- ActiveExact: {dim, active} equals getDim(ijk, ray) and isActive(ijk)
byte-for-byte across a box spanning interior, surface and exterior.
- ActiveOnLeafOnly: dim always matches getDim; active matches isActive
whenever dim == 1 (the contract gate).
The sphere's narrow band is too thin to fill an InternalNode, so a
skip-flagged internal node above an active voxel cannot arise from the
geometry. The test forces that case by setting the skip-flag on the
lower node over a known-active voxel, then confirms ActiveExact descends
past it (active == true) while ActiveOnLeafOnly short-circuits
(active == false), both reporting the skipped dim.
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
There was a problem hiding this comment.
Pull request overview
This PR introduces a fused NanoVDB accessor query (ReadAccessor::getDimAndActive) that combines getDim(ijk, ray) and isActive(ijk) into a single root-to-leaf traversal to reduce per-iteration tree-walk cost in raymarch-style callers.
Changes:
- Add
DimAndActiveresult type plus policy tags (ActiveExact,ActiveOnLeafOnly) and implement fused traversal through Root/Internal/Leaf node descent and ReadAccessor entry points. - Add a unit test validating fused semantics against separate
getDim+isActive, including skip-flag behavior. - Update the raytrace level set example to run warmups and report min/median/mean/max timing statistics (and increase default iteration count).
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
nanovdb/nanovdb/NanoVDB.h |
Adds DimAndActive + policy tags and implements getDimAndActive* traversal through the NanoVDB tree and accessors. |
nanovdb/nanovdb/unittest/TestNanoVDB.cc |
Adds unit test coverage for ReadAccessor::getDimAndActive under both policies, including skip-flag divergence cases. |
nanovdb/nanovdb/examples/ex_raytrace_level_set/nanovdb.cu |
Collects per-iteration timings with warmup iterations and prints min/median/mean/max stats. |
nanovdb/nanovdb/examples/ex_raytrace_level_set/common.h |
Adds CUDA-event-based kernel-only timing when compiled under NVCC and adjusts block size used by computeForEach. |
nanovdb/nanovdb/examples/ex_raytrace_level_set/main.cc |
Increases default number of iterations for the example run. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- DimAndActive: store the packed field as uint32_t instead of int32_t. The previous int32_t form converted a uint32_t with bit 31 set to a signed int (implementation-defined) and left the defaulted ctor's member uninitialized. uint32_t packs/decodes with no conversion, the default ctor zero-initializes, and the size/ABI is unchanged. - ex_raytrace_level_set reportStats: take samples by const reference, early-return on an empty vector (front()/back() would otherwise be UB if numIterations were 0), and sort a local copy. - ex_raytrace_level_set common.h: check the cudaEventDestroy return codes with NANOVDB_CUDA_CHECK_ERROR, consistent with the rest of the CUDA timing path. Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
The packed encoding masks dim to 31 bits (bit 31 holds active). dim is 1u << TOTAL, bounded by the deepest node dim (4096 for the default tree), so it always fits -- but a pathological tree config with TOTAL >= 31 would be silently truncated. Add a debug assert documenting the invariant, in the same style as the version-packing asserts above. Release builds are unaffected (assert compiles out; the mask still protects the active bit). Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
ReadAccessor::getDimAndActivefuses NanoVDB'sgetDimandisActiveinto a single root-to-leaf descent. For callers that read active unconditionally — fvdb-core'sHDDASegmentIteratorandHDDAVoxelIteratorare the applicable consumers — this collapses two tree walks per iteration into one.Measured on a Blackwell RTX 6000 Pro GPU across four NanoVDB example grids (dragon SDF, crawler SDF, emu SDF, WDAS cloud fog volume; 1024×1024 ray casts, median of 50 iters):
HDDASegmentIterator
HDDAVoxelIterator
Looking at ncu metrics, numbers drop in parallel to wall time improvements: fusion drops dynamic instructions by 26–47% and long-scoreboard stall by 8–9 percentage points on dragon. Register pressure drops 4–6 regs vs unfused (72 → 66–68) — no occupancy regression. The packed
DimAndActiveencoding (uint32 storage) keeps the return boundary at one register.Details on Trying to Use the Fused API in
HDDA::ZeroCrossingThe fused API is intentionally not used by
HDDA::ZeroCrossing. That kernel was benchmarked with the fused call across two register configurations on my Blackwell RTX 6000 Pro GPU — once on master (the existing kernel sits at exactly 64 regs/thread, on the occupancy cliff) and once on top of a separate refactor I did I'll call 'Lean HDDA' which was to free up register headroom (for additional registers needed by the dim/active fusion) by removing themT1, mDelta, mStepmembers from theHDDAclass and deriving them fromRaycomponents on each call in order to reduce register pressure (63 regs/thread, with 3 regs of headroom).On master (without 'lean HDDA'), active/dim fusion grew the kernel to 66–67 regs and dropped occupancy 50% → 37.5% (2 → 1 blocks/SM at 256 threads). On 'lean HDDA', the fusion fits under the cliff (61 regs with the packed encoding) without an occupancy hit (so we could measure fusion without an occupancy regression).
Wall time, however, is unchanged in both regimes — variants land within 0–1.3% of each other across sphere / dragon / crawler / emu, with the fused versions consistently a hair slower. The reason isn't the register/occupancy cliff: it's that
ZeroCrossing'sif (hdda.dim() > 1 || !acc.isActive(ijk)) continue;makes the active read dead at upper levels, and the L1 cache absorbs the second descent at ~87% hit rate on the leaf-level remainder. Eliminating it saves essentially no cycles, while the fused call's slightly heavier per-call overhead nets out marginally worse.The fused API is retained because it's a substantial win for callers that read active unconditionally — fvdb's HDDA iterators see −27% to −62% wall time and −26% to −47% dynamic instructions.
ZeroCrossing's pattern just isn't one of them.