NanoVDB `ReadAccessor::getDimAndActive` by swahtz · Pull Request #2220 · AcademySoftwareFoundation/openvdb

swahtz · 2026-05-27T06:47:11Z

Summary

ReadAccessor::getDimAndActive fuses NanoVDB's getDim and isActive into a single root-to-leaf descent. For callers that read active unconditionally — fvdb-core's HDDASegmentIterator and HDDAVoxelIterator are the applicable consumers — this collapses two tree walks per iteration into one.

Measured on a Blackwell RTX 6000 Pro GPU across four NanoVDB example grids (dragon SDF, crawler SDF, emu SDF, WDAS cloud fog volume; 1024×1024 ray casts, median of 50 iters):

HDDASegmentIterator

Scene	Unfused	Fused	Speedup
dragon	0.972 ms	0.536 ms	1.81×
crawler	4.064 ms	2.953 ms	1.38×
emu	1.879 ms	1.055 ms	1.78×
wdas_cloud	6.493 ms	4.661 ms	1.39×

HDDAVoxelIterator

Scene	Unfused	Fused	Speedup
dragon	2.645 ms	1.235 ms	2.14×
crawler	8.016 ms	5.252 ms	1.53×
emu	5.731 ms	2.194 ms	2.61×
wdas_cloud	4.174 ms	2.508 ms	1.66×

Looking at ncu metrics, numbers drop in parallel to wall time improvements: fusion drops dynamic instructions by 26–47% and long-scoreboard stall by 8–9 percentage points on dragon. Register pressure drops 4–6 regs vs unfused (72 → 66–68) — no occupancy regression. The packed DimAndActive encoding (uint32 storage) keeps the return boundary at one register.

Details on Trying to Use the Fused API in `HDDA::ZeroCrossing`

The fused API is intentionally not used by HDDA::ZeroCrossing. That kernel was benchmarked with the fused call across two register configurations on my Blackwell RTX 6000 Pro GPU — once on master (the existing kernel sits at exactly 64 regs/thread, on the occupancy cliff) and once on top of a separate refactor I did I'll call 'Lean HDDA' which was to free up register headroom (for additional registers needed by the dim/active fusion) by removing the mT1, mDelta, mStep members from the HDDA class and deriving them from Ray components on each call in order to reduce register pressure (63 regs/thread, with 3 regs of headroom).

On master (without 'lean HDDA'), active/dim fusion grew the kernel to 66–67 regs and dropped occupancy 50% → 37.5% (2 → 1 blocks/SM at 256 threads). On 'lean HDDA', the fusion fits under the cliff (61 regs with the packed encoding) without an occupancy hit (so we could measure fusion without an occupancy regression).

Wall time, however, is unchanged in both regimes — variants land within 0–1.3% of each other across sphere / dragon / crawler / emu, with the fused versions consistently a hair slower. The reason isn't the register/occupancy cliff: it's that ZeroCrossing's if (hdda.dim() > 1 || !acc.isActive(ijk)) continue; makes the active read dead at upper levels, and the L1 cache absorbs the second descent at ~87% hit rate on the leaf-level remainder. Eliminating it saves essentially no cycles, while the fused call's slightly heavier per-call overhead nets out marginally worse.

The fused API is retained because it's a substantial win for callers that read active unconditionally — fvdb's HDDA iterators see −27% to −62% wall time and −26% to −47% dynamic instructions. ZeroCrossing's pattern just isn't one of them.

Introduce a fused getDim + isActive traversal that shares the Root->Internal->Leaf descent, the accessor cache update, and the terminating-level mask read between the two queries. Adds a DimAndActive POD, getDimAndActiveAndCache on Root/Internal/Leaf, and getDimAndActive on all four ReadAccessor specializations. Two policy tags control how `active` is resolved when an InternalNode skip-flag fires along the descent: - ActiveExact (default): preserves byte-for-byte semantics of separate getDim + getAndCache<GetState>. `active` is correct even when the skip-flag is set, at the cost of forcing the child descent. Required by callers that read `active` at upper levels (e.g. fvdb HDDA iterators). - ActiveOnLeafOnly: cheap skip-flag short-circuit; `active` is unspecified when the returned `dim` exceeds leaf-dim, and the caller must gate on `dim` first (`if (hdda.dim() > 1 || !active) continue;`). HDDA.h ZeroCrossing is intentionally not migrated -- on the register-tight sm_120 kernel, any fused shape grows reg/thread by 2-3 and drops occupancy from 4 to 3 blocks/SM. The kernel is memory-latency bound (~14% achieved occupancy) and the L1 cache absorbs the second descent at ~87% hit rate, so the occupancy hit dominates the descent-fusion savings. Rationale recorded inline. Also rework the ex_raytrace_level_set benchmark harness for honest measurement: kernel-only cudaEvent timing on the CUDA path, 2 untimed warmup iters, min/median/mean/max reporting, numIterations 50 -> 200, block size 512 -> 256. Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

The fused getDim+isActive traversal returns a {dim, active} pair. The otherwise-natural {uint32_t, bool} struct layout passes through two registers (or a packed 64-bit pair) at the function-return boundary. NanoVDB's getDim return value is bounded by the largest internal-node DIM in the default FloatTree hierarchy (4096 for an upper internal -- 2^12, well below 2^31). The sign bit is unused, so we can pack: low 31 bits = dim sign bit = active The struct keeps its existing API surface (aggregate-init via a 2-arg constructor, accessors dim() and active()), so callers that used `return {dim, active}` continue to work; consumers that previously read `.dim` / `.active` as members switch to `.dim()` / `.active()`. Effect on consumers: - On the static ZeroCrossing kernel (which does not use the fused API; rationale comment updated below): no change. - On lean_hdda + fusion (a separate experiment) the renderOp kernel drops from 62 regs (struct encoding) to 61 regs (packed encoding). Doesn't cross an occupancy tier on sm_120 at b=256, so the wall- time delta is within noise. Still useful: any consumer holding DimAndActive live across other work gains 1 reg of headroom for no semantic change. - On fvdb's HDDA iterators (the productive consumer that reads active unconditionally at upper levels), the packing reduces return-boundary register pressure in a hot path. Also expand the rationale comment in HDDA.h's ZeroCrossing. The original justification framed the don't-migrate decision as a register-cliff problem; testing against lean_hdda (which has 3 regs of headroom for fusion) shows the same negative wall-time result without an occupancy hit. The real reason is the second descent is L1-warm at ~87 % hit rate, so eliminating it saves ~no cycles, and the fused call's slightly heavier per-call overhead nets out slightly worse. Verified both struct- and packed-encoded fusion across both regimes. Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

The long NB block that explained why HDDA::ZeroCrossing keeps separate getDim + isActive calls (instead of using the fused API this branch adds) was useful during the iteration but doesn't earn its keep in the source tree -- a future reader doesn't need to relive the cliff/lean/descent-cost investigation that justified the decision. The rationale is preserved in the PR description: ZeroCrossing's `dim > 1 || !active` short-circuit makes the active read dead at upper levels, and the L1 cache absorbs the leaf-level descent at ~87% hit rate, so fusion saves essentially no cycles while adding a small per- call overhead. The fused API is retained for callers that read active unconditionally (fvdb's HDDA iterators -- -27% to -62% wall time win). Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

The DimAndActive constructor packed active via 'int32_t(a) << 31', which shifts 1 into the sign bit of a signed int -- undefined behavior under C++17 (the repo's minimum standard). Do the OR in uint32_t so the shift is well-defined, and mask dim to 31 bits so a stray high bit can't clobber the active bit. Add TestNanoVDB.GetDimAndActive, which asserts the fused getDimAndActive matches the separate getDim + isActive it replaces: - ActiveExact: {dim, active} equals getDim(ijk, ray) and isActive(ijk) byte-for-byte across a box spanning interior, surface and exterior. - ActiveOnLeafOnly: dim always matches getDim; active matches isActive whenever dim == 1 (the contract gate). The sphere's narrow band is too thin to fill an InternalNode, so a skip-flagged internal node above an active voxel cannot arise from the geometry. The test forces that case by setting the skip-flag on the lower node over a known-active voxel, then confirms ActiveExact descends past it (active == true) while ActiveOnLeafOnly short-circuits (active == false), both reporting the skipped dim. Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

Copilot

Pull request overview

This PR introduces a fused NanoVDB accessor query (ReadAccessor::getDimAndActive) that combines getDim(ijk, ray) and isActive(ijk) into a single root-to-leaf traversal to reduce per-iteration tree-walk cost in raymarch-style callers.

Changes:

Add DimAndActive result type plus policy tags (ActiveExact, ActiveOnLeafOnly) and implement fused traversal through Root/Internal/Leaf node descent and ReadAccessor entry points.
Add a unit test validating fused semantics against separate getDim + isActive, including skip-flag behavior.
Update the raytrace level set example to run warmups and report min/median/mean/max timing statistics (and increase default iteration count).

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
`nanovdb/nanovdb/NanoVDB.h`	Adds `DimAndActive` + policy tags and implements `getDimAndActive*` traversal through the NanoVDB tree and accessors.
`nanovdb/nanovdb/unittest/TestNanoVDB.cc`	Adds unit test coverage for `ReadAccessor::getDimAndActive` under both policies, including skip-flag divergence cases.
`nanovdb/nanovdb/examples/ex_raytrace_level_set/nanovdb.cu`	Collects per-iteration timings with warmup iterations and prints min/median/mean/max stats.
`nanovdb/nanovdb/examples/ex_raytrace_level_set/common.h`	Adds CUDA-event-based kernel-only timing when compiled under NVCC and adjusts block size used by `computeForEach`.
`nanovdb/nanovdb/examples/ex_raytrace_level_set/main.cc`	Increases default number of iterations for the example run.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- DimAndActive: store the packed field as uint32_t instead of int32_t. The previous int32_t form converted a uint32_t with bit 31 set to a signed int (implementation-defined) and left the defaulted ctor's member uninitialized. uint32_t packs/decodes with no conversion, the default ctor zero-initializes, and the size/ABI is unchanged. - ex_raytrace_level_set reportStats: take samples by const reference, early-return on an empty vector (front()/back() would otherwise be UB if numIterations were 0), and sort a local copy. - ex_raytrace_level_set common.h: check the cudaEventDestroy return codes with NANOVDB_CUDA_CHECK_ERROR, consistent with the rest of the CUDA timing path. Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

The packed encoding masks dim to 31 bits (bit 31 holds active). dim is 1u << TOTAL, bounded by the deepest node dim (4096 for the default tree), so it always fits -- but a pathological tree config with TOTAL >= 31 would be silently truncated. Add a debug assert documenting the invariant, in the same style as the version-packing asserts above. Release builds are unaffected (assert compiles out; the mask still protects the active bit). Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

swahtz added 3 commits May 20, 2026 17:44

swahtz added the nanovdb label May 27, 2026

swahtz marked this pull request as ready for review May 27, 2026 20:48

swahtz requested a review from kmuseth as a code owner May 27, 2026 20:48

swahtz marked this pull request as draft May 27, 2026 23:59

swahtz added 4 commits May 28, 2026 03:49

comment cleanup

0f741f3

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

comment update

047a0e3

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

comment cleanup

bdda9f6

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

swahtz marked this pull request as ready for review June 5, 2026 02:48

swahtz requested a review from Copilot June 5, 2026 02:48

Copilot started reviewing on behalf of swahtz June 5, 2026 02:48 View session

Copilot AI reviewed Jun 5, 2026

View reviewed changes

Comment thread nanovdb/nanovdb/NanoVDB.h

Comment thread nanovdb/nanovdb/examples/ex_raytrace_level_set/nanovdb.cu

Comment thread nanovdb/nanovdb/examples/ex_raytrace_level_set/common.h Outdated

swahtz requested a review from Copilot June 5, 2026 03:21

Copilot started reviewing on behalf of swahtz June 5, 2026 03:21 View session

Copilot AI reviewed Jun 5, 2026

View reviewed changes

Comment thread nanovdb/nanovdb/NanoVDB.h Outdated

swahtz mentioned this pull request Jun 6, 2026

ray_implicit_intersection improvements openvdb/fvdb-core#663

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NanoVDB `ReadAccessor::getDimAndActive`#2220

NanoVDB `ReadAccessor::getDimAndActive`#2220
swahtz wants to merge 9 commits into
AcademySoftwareFoundation:masterfrom
swahtz:experiment/dim_and_active

swahtz commented May 27, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

swahtz commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

HDDASegmentIterator

HDDAVoxelIterator

Details on Trying to Use the Fused API in HDDA::ZeroCrossing

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

swahtz commented May 27, 2026 •

edited

Loading

Details on Trying to Use the Fused API in `HDDA::ZeroCrossing`