Skip to content

NanoVDB ReadAccessor::getDimAndActive#2220

Open
swahtz wants to merge 9 commits into
AcademySoftwareFoundation:masterfrom
swahtz:experiment/dim_and_active
Open

NanoVDB ReadAccessor::getDimAndActive#2220
swahtz wants to merge 9 commits into
AcademySoftwareFoundation:masterfrom
swahtz:experiment/dim_and_active

Conversation

@swahtz
Copy link
Copy Markdown
Contributor

@swahtz swahtz commented May 27, 2026

Summary

ReadAccessor::getDimAndActive fuses NanoVDB's getDim and isActive into a single root-to-leaf descent. For callers that read active unconditionally — fvdb-core's HDDASegmentIterator and HDDAVoxelIterator are the applicable consumers — this collapses two tree walks per iteration into one.

Measured on a Blackwell RTX 6000 Pro GPU across four NanoVDB example grids (dragon SDF, crawler SDF, emu SDF, WDAS cloud fog volume; 1024×1024 ray casts, median of 50 iters):

HDDASegmentIterator

Scene Unfused Fused Speedup
dragon 0.972 ms 0.536 ms 1.81×
crawler 4.064 ms 2.953 ms 1.38×
emu 1.879 ms 1.055 ms 1.78×
wdas_cloud 6.493 ms 4.661 ms 1.39×

HDDAVoxelIterator

Scene Unfused Fused Speedup
dragon 2.645 ms 1.235 ms 2.14×
crawler 8.016 ms 5.252 ms 1.53×
emu 5.731 ms 2.194 ms 2.61×
wdas_cloud 4.174 ms 2.508 ms 1.66×

Looking at ncu metrics, numbers drop in parallel to wall time improvements: fusion drops dynamic instructions by 26–47% and long-scoreboard stall by 8–9 percentage points on dragon. Register pressure drops 4–6 regs vs unfused (72 → 66–68) — no occupancy regression. The packed DimAndActive encoding (uint32 storage) keeps the return boundary at one register.

Details on Trying to Use the Fused API in HDDA::ZeroCrossing

The fused API is intentionally not used by HDDA::ZeroCrossing. That kernel was benchmarked with the fused call across two register configurations on my Blackwell RTX 6000 Pro GPU — once on master (the existing kernel sits at exactly 64 regs/thread, on the occupancy cliff) and once on top of a separate refactor I did I'll call 'Lean HDDA' which was to free up register headroom (for additional registers needed by the dim/active fusion) by removing the mT1, mDelta, mStep members from the HDDA class and deriving them from Ray components on each call in order to reduce register pressure (63 regs/thread, with 3 regs of headroom).

On master (without 'lean HDDA'), active/dim fusion grew the kernel to 66–67 regs and dropped occupancy 50% → 37.5% (2 → 1 blocks/SM at 256 threads). On 'lean HDDA', the fusion fits under the cliff (61 regs with the packed encoding) without an occupancy hit (so we could measure fusion without an occupancy regression).

Wall time, however, is unchanged in both regimes — variants land within 0–1.3% of each other across sphere / dragon / crawler / emu, with the fused versions consistently a hair slower. The reason isn't the register/occupancy cliff: it's that ZeroCrossing's if (hdda.dim() > 1 || !acc.isActive(ijk)) continue; makes the active read dead at upper levels, and the L1 cache absorbs the second descent at ~87% hit rate on the leaf-level remainder. Eliminating it saves essentially no cycles, while the fused call's slightly heavier per-call overhead nets out marginally worse.

The fused API is retained because it's a substantial win for callers that read active unconditionally — fvdb's HDDA iterators see −27% to −62% wall time and −26% to −47% dynamic instructions. ZeroCrossing's pattern just isn't one of them.

swahtz added 3 commits May 20, 2026 17:44
Introduce a fused getDim + isActive traversal that shares the
Root->Internal->Leaf descent, the accessor cache update, and the
terminating-level mask read between the two queries. Adds a
DimAndActive POD, getDimAndActiveAndCache on Root/Internal/Leaf,
and getDimAndActive on all four ReadAccessor specializations.

Two policy tags control how `active` is resolved when an
InternalNode skip-flag fires along the descent:

  - ActiveExact (default): preserves byte-for-byte semantics of
    separate getDim + getAndCache<GetState>. `active` is correct
    even when the skip-flag is set, at the cost of forcing the
    child descent. Required by callers that read `active` at
    upper levels (e.g. fvdb HDDA iterators).

  - ActiveOnLeafOnly: cheap skip-flag short-circuit; `active` is
    unspecified when the returned `dim` exceeds leaf-dim, and the
    caller must gate on `dim` first
    (`if (hdda.dim() > 1 || !active) continue;`).

HDDA.h ZeroCrossing is intentionally not migrated -- on the
register-tight sm_120 kernel, any fused shape grows reg/thread by
2-3 and drops occupancy from 4 to 3 blocks/SM. The kernel is
memory-latency bound (~14% achieved occupancy) and the L1 cache
absorbs the second descent at ~87% hit rate, so the occupancy hit
dominates the descent-fusion savings. Rationale recorded inline.

Also rework the ex_raytrace_level_set benchmark harness for honest
measurement: kernel-only cudaEvent timing on the CUDA path, 2
untimed warmup iters, min/median/mean/max reporting,
numIterations 50 -> 200, block size 512 -> 256.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
The fused getDim+isActive traversal returns a {dim, active} pair. The
otherwise-natural {uint32_t, bool} struct layout passes through two
registers (or a packed 64-bit pair) at the function-return boundary.

NanoVDB's getDim return value is bounded by the largest internal-node
DIM in the default FloatTree hierarchy (4096 for an upper internal --
2^12, well below 2^31). The sign bit is unused, so we can pack:

  low 31 bits = dim
  sign bit    = active

The struct keeps its existing API surface (aggregate-init via a 2-arg
constructor, accessors dim() and active()), so callers that used
`return {dim, active}` continue to work; consumers that previously
read `.dim` / `.active` as members switch to `.dim()` / `.active()`.

Effect on consumers:

  - On the static ZeroCrossing kernel (which does not use the fused
    API; rationale comment updated below): no change.

  - On lean_hdda + fusion (a separate experiment) the renderOp kernel
    drops from 62 regs (struct encoding) to 61 regs (packed encoding).
    Doesn't cross an occupancy tier on sm_120 at b=256, so the wall-
    time delta is within noise. Still useful: any consumer holding
    DimAndActive live across other work gains 1 reg of headroom for
    no semantic change.

  - On fvdb's HDDA iterators (the productive consumer that reads
    active unconditionally at upper levels), the packing reduces
    return-boundary register pressure in a hot path.

Also expand the rationale comment in HDDA.h's ZeroCrossing. The
original justification framed the don't-migrate decision as a
register-cliff problem; testing against lean_hdda (which has 3 regs
of headroom for fusion) shows the same negative wall-time result
without an occupancy hit. The real reason is the second descent is
L1-warm at ~87 % hit rate, so eliminating it saves ~no cycles, and
the fused call's slightly heavier per-call overhead nets out slightly
worse. Verified both struct- and packed-encoded fusion across both
regimes.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
The long NB block that explained why HDDA::ZeroCrossing keeps
separate getDim + isActive calls (instead of using the fused API
this branch adds) was useful during the iteration but doesn't earn
its keep in the source tree -- a future reader doesn't need to relive
the cliff/lean/descent-cost investigation that justified the decision.

The rationale is preserved in the PR description: ZeroCrossing's
`dim > 1 || !active` short-circuit makes the active read dead at upper
levels, and the L1 cache absorbs the leaf-level descent at ~87% hit
rate, so fusion saves essentially no cycles while adding a small per-
call overhead.  The fused API is retained for callers that read active
unconditionally (fvdb's HDDA iterators -- -27% to -62% wall time win).

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
@swahtz swahtz added the nanovdb label May 27, 2026
@swahtz swahtz marked this pull request as ready for review May 27, 2026 20:48
@swahtz swahtz requested a review from kmuseth as a code owner May 27, 2026 20:48
@swahtz swahtz marked this pull request as draft May 27, 2026 23:59
swahtz added 4 commits May 28, 2026 03:49
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
The DimAndActive constructor packed active via 'int32_t(a) << 31',
which shifts 1 into the sign bit of a signed int -- undefined behavior
under C++17 (the repo's minimum standard). Do the OR in uint32_t so the
shift is well-defined, and mask dim to 31 bits so a stray high bit can't
clobber the active bit.

Add TestNanoVDB.GetDimAndActive, which asserts the fused
getDimAndActive matches the separate getDim + isActive it replaces:

  - ActiveExact: {dim, active} equals getDim(ijk, ray) and isActive(ijk)
    byte-for-byte across a box spanning interior, surface and exterior.
  - ActiveOnLeafOnly: dim always matches getDim; active matches isActive
    whenever dim == 1 (the contract gate).

The sphere's narrow band is too thin to fill an InternalNode, so a
skip-flagged internal node above an active voxel cannot arise from the
geometry. The test forces that case by setting the skip-flag on the
lower node over a known-active voxel, then confirms ActiveExact descends
past it (active == true) while ActiveOnLeafOnly short-circuits
(active == false), both reporting the skipped dim.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
@swahtz swahtz marked this pull request as ready for review June 5, 2026 02:48
@swahtz swahtz requested a review from Copilot June 5, 2026 02:48
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a fused NanoVDB accessor query (ReadAccessor::getDimAndActive) that combines getDim(ijk, ray) and isActive(ijk) into a single root-to-leaf traversal to reduce per-iteration tree-walk cost in raymarch-style callers.

Changes:

  • Add DimAndActive result type plus policy tags (ActiveExact, ActiveOnLeafOnly) and implement fused traversal through Root/Internal/Leaf node descent and ReadAccessor entry points.
  • Add a unit test validating fused semantics against separate getDim + isActive, including skip-flag behavior.
  • Update the raytrace level set example to run warmups and report min/median/mean/max timing statistics (and increase default iteration count).

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
nanovdb/nanovdb/NanoVDB.h Adds DimAndActive + policy tags and implements getDimAndActive* traversal through the NanoVDB tree and accessors.
nanovdb/nanovdb/unittest/TestNanoVDB.cc Adds unit test coverage for ReadAccessor::getDimAndActive under both policies, including skip-flag divergence cases.
nanovdb/nanovdb/examples/ex_raytrace_level_set/nanovdb.cu Collects per-iteration timings with warmup iterations and prints min/median/mean/max stats.
nanovdb/nanovdb/examples/ex_raytrace_level_set/common.h Adds CUDA-event-based kernel-only timing when compiled under NVCC and adjusts block size used by computeForEach.
nanovdb/nanovdb/examples/ex_raytrace_level_set/main.cc Increases default number of iterations for the example run.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread nanovdb/nanovdb/NanoVDB.h
Comment thread nanovdb/nanovdb/examples/ex_raytrace_level_set/nanovdb.cu
Comment thread nanovdb/nanovdb/examples/ex_raytrace_level_set/common.h Outdated
- DimAndActive: store the packed field as uint32_t instead of int32_t.
  The previous int32_t form converted a uint32_t with bit 31 set to a
  signed int (implementation-defined) and left the defaulted ctor's
  member uninitialized. uint32_t packs/decodes with no conversion, the
  default ctor zero-initializes, and the size/ABI is unchanged.

- ex_raytrace_level_set reportStats: take samples by const reference,
  early-return on an empty vector (front()/back() would otherwise be
  UB if numIterations were 0), and sort a local copy.

- ex_raytrace_level_set common.h: check the cudaEventDestroy return
  codes with NANOVDB_CUDA_CHECK_ERROR, consistent with the rest of the
  CUDA timing path.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comment thread nanovdb/nanovdb/NanoVDB.h Outdated
The packed encoding masks dim to 31 bits (bit 31 holds active). dim is
1u << TOTAL, bounded by the deepest node dim (4096 for the default tree),
so it always fits -- but a pathological tree config with TOTAL >= 31 would
be silently truncated. Add a debug assert documenting the invariant, in
the same style as the version-packing asserts above. Release builds are
unaffected (assert compiles out; the mask still protects the active bit).

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants