Newton + Isaac RTX Rendering Performance Optimizations #5017

Open

ncournia-nv wants to merge 8 commits into isaac-sim:develop from ncournia-nv:dev/ncournia/perf-test
Conversation

@ncournia-nv
Collaborator

Newton + Isaac RTX Rendering Performance Optimizations

This document describes four performance optimizations applied to the Newton physics simulator when used with the Isaac Sim RTX renderer inside Isaac Lab. Together they reduce per-frame time from ~323 ms to ~60 ms (a 5.4x speedup), making Newton's rendering path slightly faster than PhysX's equivalent (~65 ms).

The changes live primarily in two files:

  • source/isaaclab_newton/isaaclab_newton/physics/newton_manager.py
  • source/isaaclab_newton/isaaclab_newton/physics/_cubric.py (new)

with small additions to PhysicsManager and SimulationContext in the core isaaclab package.


Baseline: ~323 ms per frame

The starting point is the unoptimized Newton + RTX rendering loop. An Nsight Systems trace reveals the structure:

  • Two physics steps execute per frame (typical when running two physics substeps per render frame).
  • After each physics step, Newton writes updated body transforms to Fabric (Omniverse's GPU scene-graph cache) and then triggers a full CPU hierarchy update via update_world_xforms(). This hierarchy walk recomputes every world-space transform in the scene from parent-child relationships — even though Newton already computed the correct world transforms and wrote them directly.
  • The Kit renderer also runs its own, lighter, internal hierarchy update.

The per-step Fabric sync and hierarchy update dominate the frame. Because they run after every physics step (not just before rendering), the cost is multiplied by the number of substeps.

(Nsight Systems trace screenshot: newton-rtx-baseline)

Optimization 1 — Dirty-Flag Deferred Sync: ~244 ms per frame

Problem

Every physics substep was calling sync_transforms_to_usd(), which writes Newton body poses to Fabric and then invokes update_world_xforms(). The hierarchy update is expensive and only needs to happen once before the renderer reads the scene — not after every substep.

Solution

A dirty-flag pattern decouples physics stepping from Fabric synchronization:

  1. _mark_transforms_dirty() — called at the end of each _simulate() call, sets _transforms_dirty = True. This is cheap (a boolean assignment).
  2. sync_transforms_to_usd() — now checks _transforms_dirty at the top and returns immediately if transforms haven't changed. When dirty, it writes transforms and calls the hierarchy update, then clears the flag.
  3. pre_render() — a new method added to PhysicsManager (base class) and overridden by NewtonManager. It calls sync_transforms_to_usd(). The SimulationContext.render() method calls physics_manager.pre_render() before updating visualizers and cameras, ensuring transforms are flushed exactly once per render frame.

The key insight is that the renderer only reads scene transforms during render(), not during step(). By deferring the Fabric write and hierarchy update to render time, we eliminate redundant work when multiple physics substeps run per render frame. For 2 substeps per frame, this cuts the hierarchy update count in half.

Key code paths

  • _simulate() → _mark_transforms_dirty() (just sets a flag)
  • SimulationContext.render() → PhysicsManager.pre_render() → NewtonManager.sync_transforms_to_usd() (runs once, clears the flag)
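The code paths above can be sketched in a few lines. This is a minimal illustration of the dirty-flag pattern, not the real implementation: class and method names follow the PR text, bodies are placeholders, and `sync_count` is instrumentation added only for this sketch.

```python
class PhysicsManager:
    """Base class: pre_render() is a safe no-op for non-Newton backends."""

    @classmethod
    def pre_render(cls):
        pass


class NewtonManager(PhysicsManager):
    _transforms_dirty = False
    sync_count = 0  # instrumentation for this sketch only

    @classmethod
    def _simulate(cls):
        # ... run physics kernels for one substep ...
        cls._mark_transforms_dirty()

    @classmethod
    def _mark_transforms_dirty(cls):
        cls._transforms_dirty = True  # cheap: a boolean assignment

    @classmethod
    def sync_transforms_to_usd(cls):
        if not cls._transforms_dirty:
            return  # nothing changed since the last flush
        # ... write body poses to Fabric, run hierarchy update ...
        cls.sync_count += 1
        cls._transforms_dirty = False

    @classmethod
    def pre_render(cls):
        cls.sync_transforms_to_usd()


# Two physics substeps, one render: the expensive sync runs exactly once.
NewtonManager._simulate()
NewtonManager._simulate()
NewtonManager.pre_render()
print(NewtonManager.sync_count)  # → 1
```

The base-class no-op is what makes the `SimulationContext.render()` hook safe for non-Newton backends: they inherit `pre_render()` without any behavior change.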
(Nsight Systems trace screenshot: newton-rtx-dirty)

Optimization 2 — CUDA Graph Capture (Relaxed Mode): ~144 ms per frame

Problem

Looking at the physics steps in the trace, the GPU is underutilized. Each Warp kernel launch (collision detection, constraint solve, integration, FK evaluation) incurs a round-trip to the CPU via Python — launch overhead, GIL acquisition, and driver calls. For a simulation with many small kernels per substep, this CPU-side overhead becomes the bottleneck while the GPU sits idle between dispatches.

Newton already supported CUDA graphs (pre-recording a sequence of kernel launches and replaying them with a single driver call), but CUDA graph capture was disabled when RTX rendering was active. The original code had:

use_cuda_graph = cfg.use_cuda_graph and (cls._usdrt_stage is None)

This was necessary because RTX's background threads use CUDA's legacy stream (stream 0) for async operations like cudaImportExternalMemory. Warp's standard ScopedCapture() uses cudaStreamCaptureModeThreadLocal on a blocking stream, which implicitly synchronizes with legacy stream 0. If RTX ops happen during capture, the CUDA runtime raises error 906 (cudaErrorStreamCaptureImplicit).

Solution

A deferred, relaxed-mode CUDA graph capture strategy that is compatible with RTX:

Deferral: Graph capture is postponed from initialize_solver() to the first step() call. By that time, RTX has finished its initialization (all cudaImportExternalMemory calls are done) and is idle between render frames, providing a clean capture window.

# In initialize_solver():
cls._graph = None
cls._graph_capture_pending = True

# In step():
if cls._graph_capture_pending:
    cls._graph = cls._capture_relaxed_graph(device)

Relaxed-mode capture (_capture_relaxed_graph): This method works around two conflicting requirements:

  1. RTX compatibility: RTX threads use legacy stream 0. A blocking stream (Warp's default) implicitly syncs with it, causing capture failures. Solution: create a non-blocking stream (cudaStreamNonBlocking = 0x01) that has no implicit synchronization with stream 0.

  2. Warp compatibility: mujoco_warp internally calls wp.capture_while, which checks Warp's device.captures registry to decide whether to insert a conditional graph node or synchronize eagerly. Without a registered capture, it calls wp.synchronize_stream on the capturing stream — which is illegal inside graph capture. Solution: call wp.capture_begin(external=True, stream=fresh_stream) to register the capture in Warp's tracking without calling cudaStreamBeginCapture again (already done externally).

The capture sequence:

  1. Warmup run — execute _simulate_physics_only() eagerly to pre-allocate all MuJoCo-Warp scratch buffers (allocations are forbidden inside graph capture).
  2. Create a non-blocking CUDA stream via cudaStreamCreateWithFlags(..., NonBlocking).
  3. Begin capture with cudaStreamBeginCapture(..., cudaStreamCaptureModeRelaxed) — relaxed mode allows other streams to operate freely during capture.
  4. Register with Warp via wp.capture_begin(external=True, stream=...).
  5. Record physics kernels — _simulate_physics_only() inside wp.ScopedStream(fresh_stream).
  6. Finalize — wp.capture_end() then cudaStreamEndCapture() to obtain the graph.

Physics-only capture: _simulate_physics_only() was factored out of _simulate() to exclude Fabric sync operations (wp.synchronize_device, wp.fabricarray) that are incompatible with graph capture. After graph replay, step() marks transforms dirty, and pre_render() handles the Fabric sync eagerly.

The ctypes binding to libcudart.so is used directly because Warp's ScopedCapture doesn't expose control over capture mode or stream type.
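The capture sequence above can be sketched as follows. This is an illustrative outline, not the PR's actual code: the enum values come from cuda_runtime_api.h, the wrapping of a raw stream handle into a `wp.Stream` assumes Warp's external-stream constructor, and the function requires a CUDA-capable machine with libcudart and Warp installed (it is only defined, not executed, here).

```python
import ctypes

# CUDA runtime enum values (from cuda_runtime_api.h)
cudaStreamNonBlocking = 0x01       # no implicit sync with legacy stream 0
cudaStreamCaptureModeRelaxed = 2   # other streams may keep working during capture


def capture_relaxed_graph(simulate_fn, cudart_path="libcudart.so"):
    """Sketch of relaxed-mode capture; `simulate_fn` records the physics
    kernels (e.g. a physics-only step). Assumes CUDA + Warp are available."""
    import warp as wp

    cudart = ctypes.CDLL(cudart_path)
    stream = ctypes.c_void_p()
    raw_graph = ctypes.c_void_p()

    # 1. Fresh non-blocking stream: no implicit sync with RTX's stream 0.
    cudart.cudaStreamCreateWithFlags(ctypes.byref(stream), cudaStreamNonBlocking)
    # 2. Begin capture externally, in relaxed mode (not via Warp's ScopedCapture,
    #    which does not expose the capture mode or stream flags).
    cudart.cudaStreamBeginCapture(stream, cudaStreamCaptureModeRelaxed)
    # 3. Register the external capture with Warp so wp.capture_while inserts
    #    conditional graph nodes instead of synchronizing eagerly.
    wp_stream = wp.Stream(device="cuda:0", cuda_stream=stream.value)
    wp.capture_begin(stream=wp_stream, external=True)
    # 4. Record the physics kernels on the capturing stream.
    with wp.ScopedStream(wp_stream):
        simulate_fn()
    graph = wp.capture_end(stream=wp_stream)  # external: Warp bookkeeping only
    # 5. End the external capture to obtain the raw cudaGraph_t handle.
    cudart.cudaStreamEndCapture(stream, ctypes.byref(raw_graph))
    # Relies on Warp private internals (see review discussion below in this PR).
    graph.graph = raw_graph
    graph.graph_exec = None
    return graph
```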

(Nsight Systems trace screenshot: newton-rtx-cuda-graph)

Optimization 3 — GPU Transform Hierarchy via cubric: ~60 ms per frame

Problem

Even with the dirty-flag pattern reducing hierarchy updates to once per render frame, the update_world_xforms() call is still a CPU-side tree walk over the entire Fabric scene graph. For scenes with thousands of prims (typical in multi-environment RL), this CPU hierarchy propagation is a significant bottleneck.

The PhysX backend avoids this problem by using cubric — a GPU-accelerated transform hierarchy library. cubric runs the parent-child transform propagation entirely on the GPU via IAdapter::compute(), which is dramatically faster than the CPU walk. However, cubric has no Python bindings.

Solution

Pure-Python ctypes bindings to cubric's Carbonite interface (_cubric.py), allowing Newton to use the same GPU hierarchy propagation that PhysX uses.

cubric is implemented as a Carbonite plugin and exposes its API through the omni::cubric::IAdapter interface. The bindings work by:

  1. Acquiring the Carbonite Framework — libcarb.so's acquireFramework() returns the singleton Framework*.
  2. Acquiring the IAdapter interface — calling tryAcquireInterfaceWithClient() with the interface descriptor omni::cubric::IAdapter version 0.1.
  3. Wrapping function pointers — the IAdapter struct is a C++ vtable-like struct with function pointers at known offsets.

Each function pointer is read from the struct at its byte offset and wrapped with ctypes.CFUNCTYPE to make it callable from Python.
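The offset-and-CFUNCTYPE technique can be demonstrated without Carbonite at all. The toy below fabricates its own "struct" in memory, stores a callback pointer at a known byte offset, reads it back with `from_address`, and wraps it into a callable — everything here (the offset, the struct, the function) is synthetic, whereas the real offsets in _cubric.py are specific to a Carbonite SDK build.

```python
import ctypes

# Hypothetical byte offset of a function-pointer slot in our fake struct.
COMPUTE_OFFSET = 8

# Prototype of the slot: int compute(int).
ComputeFn = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_int)

@ComputeFn
def _fake_compute(x):
    return x * 2  # stand-in for the native function

# Build a fake vtable-like struct: 8 bytes of padding, then the pointer.
buf = (ctypes.c_uint64 * 2)()
buf[1] = ctypes.cast(_fake_compute, ctypes.c_void_p).value

def read_u64(addr: int) -> int:
    """Read 8 raw bytes at `addr` as an unsigned integer (no type safety)."""
    return ctypes.c_uint64.from_address(addr).value

base = ctypes.addressof(buf)
fn_ptr = read_u64(base + COMPUTE_OFFSET)   # pull the pointer out of the struct
compute = ComputeFn(fn_ptr)                # wrap the raw address as a callable
print(compute(21))  # → 42
```

As the review below points out, the flip side of this pattern is that nothing validates the offset: if the struct layout changes, the wrapped pointer silently dispatches to the wrong function.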

Integration in sync_transforms_to_usd():

The sync method now mirrors PhysX's ScopedUSDRT pattern:

  1. Pause Fabric change trackingtrack_world_xform_changes(False) and track_local_xform_changes(False). This is critical: SelectPrims with ReadWrite access internally calls getAttributeArrayGpu, which marks Fabric buffers dirty. If tracking is still active, the hierarchy records the change and Kit's updateWorldXforms will do an expensive connectivity rebuild every frame.
  2. Write transforms — the existing Warp kernel writes Newton body poses to Fabric's omni:fabric:worldMatrix.
  3. Resume tracking — re-enable change tracking (in a finally block for safety).
  4. Run cubric computeIAdapter::compute() with eRigidBody | eForceUpdate options and eAll dirty mode. The eRigidBody flag tells cubric to use inverse propagation on prims tagged with PhysicsRigidBodyAPI (preserve the world matrix that Newton wrote, derive the local transform) and forward propagation on everything else (propagate parent transforms to children). eForceUpdate bypasses cubric's change-listener dirty check since we know transforms have changed.

The adapter is lazily created on the first sync_transforms_to_usd() call rather than during initialize_solver(), to avoid startup-ordering issues with the cubric plugin.

When cubric is unavailable (e.g., plugin not loaded, CPU-only), the code falls back gracefully to the CPU update_world_xforms() path.
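The cubric-or-fallback dispatch can be sketched as a small function. The arguments stand in for NewtonManager class attributes, and all signatures here are assumptions reconstructed from the description above, not the PR's actual code.

```python
def propagate_hierarchy(cubric, adapter, usdrt_stage, fabric_hierarchy):
    """Run GPU hierarchy propagation via cubric when available; otherwise
    fall back to the CPU walk. Returns which path ran ("gpu" or "cpu")."""
    if cubric is not None and adapter is not None:
        fabric_id = usdrt_stage.GetFabricId().id
        cubric.bind_to_stage(adapter, fabric_id)
        cubric.compute(adapter)                 # GPU: parent→child propagation
        return "gpu"
    fabric_hierarchy.update_world_xforms()      # CPU hierarchy walk (fallback)
    return "cpu"
```

Returning which path ran makes the fallback observable in tests and logs, which is useful when the cubric plugin fails to load silently.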

sync_transforms_to_usd():
    ┌─────────────────────────────────┐
    │  Pause Fabric change tracking   │
    ├─────────────────────────────────┤
    │  SelectPrims (ReadWrite)        │
    │  wp.launch(_set_fabric_transforms) │  ← GPU: write Newton poses to Fabric
    │  wp.synchronize_device()        │
    ├─────────────────────────────────┤
    │  cubric IAdapter::compute()     │  ← GPU: propagate hierarchy
    ├─────────────────────────────────┤
    │  Resume Fabric change tracking  │
    └─────────────────────────────────┘

A future Kit release is expected to ship official Python bindings for cubric, at which point the ctypes approach can be replaced.

The result is a frame time of ~60 ms — slightly faster than PhysX's ~65 ms on the same scene.

(Nsight Systems trace screenshot: newton-rtx-cubric)

Summary

| Optimization | Frame Time | Speedup vs. Baseline | Key Technique |
| --- | --- | --- | --- |
| Baseline | ~323 ms | 1.0× | Sync + hierarchy after every substep |
| Dirty-flag deferred sync | ~244 ms | 1.3× | Sync once per render frame, not per substep |
| CUDA graph (relaxed mode) | ~144 ms | 2.2× | Eliminate per-kernel CPU launch overhead |
| cubric GPU hierarchy | ~60 ms | 5.4× | GPU hierarchy propagation via ctypes bindings |

All four optimizations are complementary and stack on top of each other. The final result matches or slightly beats the PhysX rendering path (~65 ms) while using Newton as the physics backend.

Co-developed with Toby Jones (NVIDIA).

@github-actions github-actions bot added the isaac-lab Related to Isaac Lab team label Mar 14, 2026
@greptile-apps
Contributor

greptile-apps bot commented Mar 14, 2026

Greptile Summary

This PR implements four complementary performance optimizations for Newton + Isaac RTX rendering, reducing per-frame time from ~323 ms to ~60 ms (5.4×). The changes are well-reasoned and the engineering is sophisticated, but several fragile patterns warrant attention before merging.

What changed:

  • Dirty-flag deferred sync (_mark_transforms_dirty / pre_render): decouples physics stepping from Fabric writes so the expensive hierarchy update runs once per render frame instead of after every substep. Clean, safe change.
  • Deferred relaxed CUDA graph capture (_capture_relaxed_graph): works around RTX's incompatibility with standard ScopedCapture by using a non-blocking stream and cudaStreamCaptureModeRelaxed. The implementation is careful and well-documented.
  • GPU transform hierarchy via cubric ctypes bindings (_cubric.py): mirrors PhysX's internal GPU hierarchy path. Pure-Python ctypes bindings over undocumented C++ struct layouts; acknowledged as a stopgap until official Python bindings ship.
  • _simulate_physics_only factoring: cleanly separates GPU-capturable physics work from Fabric-sync operations.

Key concerns:

  • In sync_transforms_to_usd(), the early return when SelectPrims finds zero prims does not clear _transforms_dirty, causing repeated execution of the full setup code on every render frame in that edge case.
  • _capture_relaxed_graph mutates private attributes of a wp.Graph object (graph.graph and graph.graph_exec) that are undocumented Warp internals; a Warp version bump could silently break graph replay.
  • IAdapter::bindToStage is called on every render frame. If this call is not cheap or idempotent, it may reduce the expected speedup from the cubric optimization.
  • The byte offsets in _cubric.py (_FW_OFF_TRY_ACQUIRE, _IA_OFF_*) are hardcoded from a specific Carbonite SDK build; there is no version guard to detect when the struct layout has changed, which would produce silent incorrect behavior or crashes.

Confidence Score: 3/5

  • Merging is risky without addressing the fragile Warp internal API mutation and the missing dirty-flag clear on empty selection; the cubric byte-offset bindings could silently break on a Kit/Isaac Sim version bump.
  • The performance optimizations are well-designed and complementary. The core logic (dirty flag, deferred capture, cubric integration) is sound and the fallback paths are properly handled. However, three fragile patterns reduce confidence: (1) the graph.graph = raw_graph mutation of Warp private internals with no version guard, (2) the missing _transforms_dirty clear on empty-selection early return which could cause per-frame overhead in edge cases, and (3) the hardcoded Carbonite struct byte offsets in _cubric.py with no defensive version check. None of these are runtime crashes under the tested configuration, but they create maintenance landmines for future Warp or Kit version updates.
  • source/isaaclab_newton/isaaclab_newton/physics/newton_manager.py (dirty-flag clear, Warp graph internals, bind_to_stage frequency) and source/isaaclab_newton/isaaclab_newton/physics/_cubric.py (hardcoded struct offsets).

Important Files Changed

| Filename | Overview |
| --- | --- |
| source/isaaclab_newton/isaaclab_newton/physics/newton_manager.py | Core file with all four optimizations: dirty-flag deferred sync, deferred relaxed CUDA graph capture, cubric GPU hierarchy integration, and _simulate_physics_only factoring. Contains fragile Warp internals mutation (graph.graph = raw_graph), a _transforms_dirty flag that isn't cleared on empty-selection early return, and bind_to_stage called every render frame. |
| source/isaaclab_newton/isaaclab_newton/physics/_cubric.py | New file providing pure-Python ctypes bindings for the cubric IAdapter GPU transform hierarchy. Relies on hardcoded byte offsets into Carbonite Framework and IAdapter struct layouts that are undocumented and version-sensitive. Fallback to CPU update_world_xforms() is well-handled. |
| source/isaaclab/isaaclab/sim/simulation_context.py | One-line addition of physics_manager.pre_render() at the start of render(). Safe change: all non-Newton backends inherit a no-op implementation from PhysicsManager. |
| source/isaaclab/isaaclab/physics/physics_manager.py | Adds a no-op pre_render() classmethod to the PhysicsManager ABC. Clean, backward-compatible addition that all existing subclasses (PhysxManager) inherit safely without change. |

Sequence Diagram

sequenceDiagram
    participant SC as SimulationContext
    participant PM as NewtonManager
    participant GPU as GPU (CUDA)
    participant FB as Fabric (USD)
    participant CB as cubric IAdapter
    participant RTX as RTX Renderer

    Note over SC,RTX: Per-frame loop
    SC->>PM: step()
    activate PM
    alt CUDA graph captured
        PM->>GPU: capture_launch(_graph)
        GPU-->>PM: (physics kernels replay)
        PM->>PM: _mark_transforms_dirty()
    else eager execution
        PM->>GPU: _simulate_physics_only()
        GPU-->>PM: (physics step complete)
        PM->>PM: _mark_transforms_dirty()
    end
    deactivate PM

    SC->>SC: render()
    SC->>PM: pre_render()
    activate PM
    PM->>PM: sync_transforms_to_usd()
    alt _transforms_dirty == True
        PM->>FB: track_world_xform_changes(False)
        PM->>FB: track_local_xform_changes(False)
        PM->>GPU: wp.launch(_set_fabric_transforms)
        GPU-->>FB: write body poses to worldMatrix
        PM->>GPU: wp.synchronize_device()
        PM->>PM: _transforms_dirty = False
        alt cubric available
            PM->>CB: bind_to_stage(adapter_id, fabric_id)
            PM->>CB: compute(adapter_id, eRigidBody|eForceUpdate, eAll)
            CB->>GPU: GPU hierarchy propagation
        else fallback
            PM->>FB: update_world_xforms() (CPU)
        end
        PM->>FB: track_world_xform_changes(True)
        PM->>FB: track_local_xform_changes(True)
    end
    deactivate PM
    SC->>RTX: update_visualizers() / Kit render
    RTX->>FB: read worldMatrix (already updated)

Last reviewed commit: a0248bc

Comment on lines +232 to +233

    if selection.GetCount() == 0:
        return

_transforms_dirty not cleared on empty selection early return

When selection.GetCount() == 0, the function returns via the finally block (which correctly re-enables Fabric change tracking), but _transforms_dirty is never set to False. On the next render frame, pre_render() calls sync_transforms_to_usd() again — which then repeats the full expensive setup: importing usdrt, lazily creating the cubric adapter, fetching IFabricHierarchy, pausing change tracking, calling SelectPrims, and resuming tracking — only to find zero prims again.

If this edge case occurs repeatedly (e.g., before prims are attached to Fabric), it will continuously execute overhead that the dirty-flag optimization was specifically designed to avoid.

Consider also clearing the dirty flag in the empty-selection path, or at minimum before the return:

Suggested change

    -if selection.GetCount() == 0:
    -    return
    +if selection.GetCount() == 0:
    +    cls._transforms_dirty = False
    +    return

Comment on lines +737 to +739
graph.graph = raw_graph
graph.graph_exec = None
return graph

Fragile mutation of Warp Graph internals

wp.capture_end(stream=fresh_stream) returns a wp.Graph object, but since the external=True path doesn't call cudaStreamEndCapture internally, the returned object has a null/stale graph handle. The code then manually overwrites it:

graph.graph = raw_graph   # ctypes.c_void_p assigned to Warp Graph attribute
graph.graph_exec = None

This pattern relies on two undocumented Warp internals:

  1. wp.Graph.graph stores the raw cudaGraph_t handle and is read as-is by wp.capture_launch.
  2. wp.Graph.graph_exec = None triggers lazy cudaGraphInstantiate on the next capture_launch call.

Both behaviors are correct for the current Warp version but are private implementation details. A Warp update that renames these attributes, changes their types, or adds null-handle validation in capture_launch would silently break graph replay or crash at runtime.

Consider adding a version guard or a comment pinning the tested Warp version, and adding a try-wrapped wp.capture_launch(graph) on a no-op warmup step right after capture to validate the graph before it enters the hot path.

Comment on lines +248 to +250
fabric_id = cls._usdrt_stage.GetFabricId().id
cls._cubric.bind_to_stage(cls._cubric_adapter, fabric_id)
cls._cubric.compute(cls._cubric_adapter)

bind_to_stage called on every render frame

IAdapter::bindToStage is part of the cubric adapter's setup lifecycle (create → bind → compute → release), and it is called here on every render frame. If bindToStage rebuilds internal data structures (bucket lists, hierarchy connectivity, GPU buffers keyed by stage contents), this per-frame call could negate some of the GPU hierarchy speedup that Optimization 3 is meant to provide.

If the Fabric ID does not change across frames (which is expected for a single-stage simulation), and assuming bindToStage is not purely idempotent/cheap, consider caching the bound fabric_id and only rebinding when it changes:

new_fabric_id = cls._usdrt_stage.GetFabricId().id
if new_fabric_id != cls._cubric_bound_fabric_id:
    cls._cubric.bind_to_stage(cls._cubric_adapter, new_fabric_id)
    cls._cubric_bound_fabric_id = new_fabric_id
cls._cubric.compute(cls._cubric_adapter)

This would require adding a _cubric_bound_fabric_id: int | None = None class variable and resetting it in clear().

Comment on lines +70 to +94
_OPT_DEFAULT = _OPT_RIGID_BODY | _OPT_FORCE_UPDATE

# AdapterDirtyMode
_DIRTY_ALL = 0 # eAll — dirty all prims in the stage
_DIRTY_COARSE = 1 # eCoarse — dirty all prims in visited buckets


# ---------------------------------------------------------------------------
# ctypes struct mirrors
# ---------------------------------------------------------------------------
class _Version(ctypes.Structure):
_fields_ = [("major", ctypes.c_uint32), ("minor", ctypes.c_uint32)]


class _InterfaceDesc(ctypes.Structure):
"""``carb::InterfaceDesc`` — {const char* name, Version version}."""
_fields_ = [
("name", ctypes.c_char_p),
("version", _Version),
]


def _read_u64(addr: int) -> int:
return ctypes.c_uint64.from_address(addr).value


Hardcoded byte offsets into native C++ struct layouts

The constants _FW_OFF_TRY_ACQUIRE = 24, _IA_OFF_CREATE = 8, _IA_OFF_RELEASE = 32, _IA_OFF_BIND = 40, and _IA_OFF_COMPUTE = 56 are hardcoded byte offsets into Carbonite's Framework vtable and IAdapter's function-pointer table, derived from a specific build of the Carbonite SDK.

If NVIDIA inserts or reorders fields in either struct in a future Kit/Isaac Sim release, every call through these pointers will silently dispatch to the wrong function. Because the pointer reads via _read_u64 bypass all type safety, the failure mode is either silent mis-computation or an immediate segfault — both hard to diagnose.

A few mitigations worth considering:

  • Add a version assertion on the carb framework (acquireFramework returns a version) to bail out early when the framework version changes.
  • Add a smoke-test after acquiring the pointers: e.g., call IAdapter::getAttribute (offset 0) and verify the returned version matches the expected IAdapter 0.1 version before using the other slots.
  • Document the exact Kit/Isaac Sim version these offsets were verified against in a comment next to each constant.

@fatimaanes fatimaanes self-requested a review March 15, 2026 15:56