Newton + Isaac RTX Rendering Performance Optimizations#5017
ncournia-nv wants to merge 8 commits into isaac-sim:develop
Conversation
This reverts commit 46830e6a6e3150da5f0fb378528ab78253e0dea0.
Greptile Summary

This PR implements four complementary performance optimizations for Newton + Isaac RTX rendering, reducing per-frame time from ~323 ms to ~60 ms (5.4×). The changes are well reasoned and the engineering is sophisticated, but several fragile patterns warrant attention before merging.
Confidence Score: 3/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant SC as SimulationContext
    participant PM as NewtonManager
    participant GPU as GPU (CUDA)
    participant FB as Fabric (USD)
    participant CB as cubric IAdapter
    participant RTX as RTX Renderer
    Note over SC,RTX: Per-frame loop
    SC->>PM: step()
    activate PM
    alt CUDA graph captured
        PM->>GPU: capture_launch(_graph)
        GPU-->>PM: (physics kernels replay)
        PM->>PM: _mark_transforms_dirty()
    else eager execution
        PM->>GPU: _simulate_physics_only()
        GPU-->>PM: (physics step complete)
        PM->>PM: _mark_transforms_dirty()
    end
    deactivate PM
    SC->>SC: render()
    SC->>PM: pre_render()
    activate PM
    PM->>PM: sync_transforms_to_usd()
    alt _transforms_dirty == True
        PM->>FB: track_world_xform_changes(False)
        PM->>FB: track_local_xform_changes(False)
        PM->>GPU: wp.launch(_set_fabric_transforms)
        GPU-->>FB: write body poses to worldMatrix
        PM->>GPU: wp.synchronize_device()
        PM->>PM: _transforms_dirty = False
        alt cubric available
            PM->>CB: bind_to_stage(adapter_id, fabric_id)
            PM->>CB: compute(adapter_id, eRigidBody|eForceUpdate, eAll)
            CB->>GPU: GPU hierarchy propagation
        else fallback
            PM->>FB: update_world_xforms() (CPU)
        end
        PM->>FB: track_world_xform_changes(True)
        PM->>FB: track_local_xform_changes(True)
    end
    deactivate PM
    SC->>RTX: update_visualizers() / Kit render
    RTX->>FB: read worldMatrix (already updated)
```
Last reviewed commit: a0248bc
```python
if selection.GetCount() == 0:
    return
```
_transforms_dirty not cleared on empty selection early return
When selection.GetCount() == 0, the function returns via the finally block (which correctly re-enables Fabric change tracking), but _transforms_dirty is never set to False. On the next render frame, pre_render() calls sync_transforms_to_usd() again — which then repeats the full expensive setup: importing usdrt, lazily creating the cubric adapter, fetching IFabricHierarchy, pausing change tracking, calling SelectPrims, and resuming tracking — only to find zero prims again.
If this edge case occurs repeatedly (e.g., before prims are attached to Fabric), it will continuously execute overhead that the dirty-flag optimization was specifically designed to avoid.
Consider also clearing the dirty flag in the empty-selection path, or at minimum before the return:
```python
if selection.GetCount() == 0:
    cls._transforms_dirty = False
    return
```
```python
graph.graph = raw_graph
graph.graph_exec = None
return graph
```
Fragile mutation of Warp Graph internals
wp.capture_end(stream=fresh_stream) returns a wp.Graph object, but since the external=True path doesn't call cudaStreamEndCapture internally, the returned object has a null/stale graph handle. The code then manually overwrites it:
```python
graph.graph = raw_graph  # ctypes.c_void_p assigned to a Warp Graph attribute
graph.graph_exec = None
```

This pattern relies on two undocumented Warp internals:

- `wp.Graph.graph` stores the raw `cudaGraph_t` handle and is read as-is by `wp.capture_launch`.
- Setting `wp.Graph.graph_exec = None` triggers lazy `cudaGraphInstantiate` on the next `capture_launch` call.
Both behaviors are correct for the current Warp version but are private implementation details. A Warp update that renames these attributes, changes their types, or adds null-handle validation in capture_launch would silently break graph replay or crash at runtime.
Consider adding a version guard or a comment pinning the tested Warp version, and adding a try-wrapped wp.capture_launch(graph) on a no-op warmup step right after capture to validate the graph before it enters the hot path.
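A guard along these lines could let the code fall back to eager stepping instead of crashing. The attribute names `graph` and `graph_exec` are the Warp internals quoted above; the helper itself and both toy classes are hypothetical, runnable stand-ins:

```python
def graph_internals_compatible(graph_cls) -> bool:
    """Return True if the private attributes this code mutates can still be
    assigned on an instance of graph_cls (hypothetical compatibility probe)."""
    probe = graph_cls.__new__(graph_cls)  # skip __init__ side effects
    for attr in ("graph", "graph_exec"):
        try:
            object.__setattr__(probe, attr, None)
        except (AttributeError, TypeError):
            return False  # slot renamed/removed or frozen: fall back to eager
    return True

# Toy stand-ins for two Warp versions (illustrative only):
class CurrentGraph:           # plain attributes, assignment works
    pass

class FutureSlottedGraph:     # hypothetical future layout without 'graph'
    __slots__ = ("handle",)

assert graph_internals_compatible(CurrentGraph)
assert not graph_internals_compatible(FutureSlottedGraph)
```

Run once at import time; if it fails, skip graph capture entirely rather than patching attributes blind.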
```python
fabric_id = cls._usdrt_stage.GetFabricId().id
cls._cubric.bind_to_stage(cls._cubric_adapter, fabric_id)
cls._cubric.compute(cls._cubric_adapter)
```
bind_to_stage called on every render frame
IAdapter::bindToStage is part of the cubric adapter's setup lifecycle (create → bind → compute → release), and it is called here on every render frame. If bindToStage rebuilds internal data structures (bucket lists, hierarchy connectivity, GPU buffers keyed by stage contents), this per-frame call could negate some of the GPU hierarchy speedup that Optimization 3 is meant to provide.
If the Fabric ID does not change across frames (which is expected for a single-stage simulation), and assuming bindToStage is not purely idempotent/cheap, consider caching the bound fabric_id and only rebinding when it changes:
```python
new_fabric_id = cls._usdrt_stage.GetFabricId().id
if new_fabric_id != cls._cubric_bound_fabric_id:
    cls._cubric.bind_to_stage(cls._cubric_adapter, new_fabric_id)
    cls._cubric_bound_fabric_id = new_fabric_id
cls._cubric.compute(cls._cubric_adapter)
```

This would require adding a `_cubric_bound_fabric_id: int | None = None` class variable and resetting it in `clear()`.
```python
_OPT_DEFAULT = _OPT_RIGID_BODY | _OPT_FORCE_UPDATE

# AdapterDirtyMode
_DIRTY_ALL = 0     # eAll — dirty all prims in the stage
_DIRTY_COARSE = 1  # eCoarse — dirty all prims in visited buckets


# ---------------------------------------------------------------------------
# ctypes struct mirrors
# ---------------------------------------------------------------------------
class _Version(ctypes.Structure):
    _fields_ = [("major", ctypes.c_uint32), ("minor", ctypes.c_uint32)]


class _InterfaceDesc(ctypes.Structure):
    """``carb::InterfaceDesc`` — {const char* name, Version version}."""
    _fields_ = [
        ("name", ctypes.c_char_p),
        ("version", _Version),
    ]


def _read_u64(addr: int) -> int:
    return ctypes.c_uint64.from_address(addr).value
```
Hardcoded byte offsets into native C++ struct layouts
The constants _FW_OFF_TRY_ACQUIRE = 24, _IA_OFF_CREATE = 8, _IA_OFF_RELEASE = 32, _IA_OFF_BIND = 40, and _IA_OFF_COMPUTE = 56 are hardcoded byte offsets into Carbonite's Framework vtable and IAdapter's function-pointer table, derived from a specific build of the Carbonite SDK.
If NVIDIA inserts or reorders fields in either struct in a future Kit/Isaac Sim release, every call through these pointers will silently dispatch to the wrong function. Because the pointer reads via _read_u64 bypass all type safety, the failure mode is either silent mis-computation or an immediate segfault — both hard to diagnose.
A few mitigations worth considering:
- Add a version assertion on the carb framework (`acquireFramework` returns a version) to bail out early when the framework version changes.
- Add a smoke test after acquiring the pointers: e.g., call `IAdapter::getAttribute` (offset 0) and verify the returned version matches the expected `IAdapter` 0.1 version before using the other slots.
- Document the exact Kit/Isaac Sim version these offsets were verified against in a comment next to each constant.
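The hazard can be demonstrated without Carbonite at all. This self-contained sketch uses the same `_read_u64`-style mechanism — reading a 64-bit slot at a byte offset from a struct's base address — against a fake function-pointer table, where `FakeVtable` and its values are purely illustrative:

```python
import ctypes

def read_u64(addr: int) -> int:
    """Read a raw 64-bit value at an absolute address (no type safety)."""
    return ctypes.c_uint64.from_address(addr).value

class FakeVtable(ctypes.Structure):
    # Three 8-byte slots, mimicking a C++ function-pointer table.
    _fields_ = [("slot0", ctypes.c_uint64),
                ("slot1", ctypes.c_uint64),
                ("slot2", ctypes.c_uint64)]

vt = FakeVtable(0x1111, 0x2222, 0x3333)
base = ctypes.addressof(vt)

# Offset 8 lands on slot1 — but only while the layout stays 3 × 8 bytes.
assert read_u64(base + 8) == 0x2222
# If a field were inserted before slot1, the same hardcoded offset would
# silently read a different slot — exactly the failure mode described above.
```

A version-pinning comment next to each `_IA_OFF_*` constant is cheap insurance against this.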
Newton + Isaac RTX Rendering Performance Optimizations
This document describes four performance optimizations applied to the Newton physics simulator when used with the Isaac Sim RTX renderer inside Isaac Lab. Together they reduce per-frame time from ~323 ms to ~60 ms (a 5.4x speedup), making Newton's rendering path slightly faster than PhysX's equivalent (~65 ms).
All live primarily in two files:
- `source/isaaclab_newton/isaaclab_newton/physics/newton_manager.py`
- `source/isaaclab_newton/isaaclab_newton/physics/_cubric.py` (new)

with small additions to `PhysicsManager` and `SimulationContext` in the core `isaaclab` package.

Baseline: ~323 ms per frame
The starting point is the unoptimized Newton + RTX rendering loop. An Nsight Systems trace shows that each physics substep ends by writing body poses to Fabric and calling `update_world_xforms()`. This hierarchy walk recomputes every world-space transform in the scene from parent-child relationships — even though Newton already computed the correct world transforms and wrote them directly.

The per-step Fabric sync and hierarchy update dominates the frame. Because it runs after every physics step (not just before rendering), the cost is multiplied by the number of substeps.
Optimization 1 — Dirty-Flag Deferred Sync: ~244 ms per frame
Problem
Every physics substep was calling `sync_transforms_to_usd()`, which writes Newton body poses to Fabric and then invokes `update_world_xforms()`. The hierarchy update is expensive and only needs to happen once before the renderer reads the scene — not after every substep.

Solution
A dirty-flag pattern decouples physics stepping from Fabric synchronization:
- `_mark_transforms_dirty()` — called at the end of each `_simulate()` call; it sets `_transforms_dirty = True`. This is cheap (a boolean assignment).
- `sync_transforms_to_usd()` — now checks `_transforms_dirty` at the top and returns immediately if transforms haven't changed. When dirty, it writes transforms and calls the hierarchy update, then clears the flag.
- `pre_render()` — a new method added to `PhysicsManager` (base class) and overridden by `NewtonManager`; it calls `sync_transforms_to_usd()`. The `SimulationContext.render()` method calls `physics_manager.pre_render()` before updating visualizers and cameras, ensuring transforms are flushed exactly once per render frame.

The key insight is that the renderer only reads scene transforms during `render()`, not during `step()`. By deferring the Fabric write and hierarchy update to render time, we eliminate redundant work when multiple physics substeps run per render frame. For 2 substeps per frame, this cuts the hierarchy update count in half.

Key code paths

- `_simulate()` → `_mark_transforms_dirty()` (just sets a flag)
- `SimulationContext.render()` → `PhysicsManager.pre_render()` → `NewtonManager.sync_transforms_to_usd()` (runs once, clears the flag)

Optimization 2 — CUDA Graph Capture (Relaxed Mode): ~144 ms per frame
Problem
Looking at the physics steps in the trace, the GPU is underutilized. Each Warp kernel launch (collision detection, constraint solve, integration, FK evaluation) incurs a round-trip to the CPU via Python — launch overhead, GIL acquisition, and driver calls. For a simulation with many small kernels per substep, this CPU-side overhead becomes the bottleneck while the GPU sits idle between dispatches.
Newton already supported CUDA graphs (pre-recording a sequence of kernel launches and replaying them with a single driver call), but CUDA graph capture was disabled when RTX rendering was active.

The disabling was necessary because RTX's background threads use CUDA's legacy stream (stream 0) for async operations like `cudaImportExternalMemory`. Warp's standard `ScopedCapture()` uses `cudaStreamCaptureModeThreadLocal` on a blocking stream, which implicitly synchronizes with legacy stream 0. If RTX ops happen during capture, the CUDA runtime raises error 906 (`cudaErrorStreamCaptureImplicit`).
cudaImportExternalMemory. Warp's standardScopedCapture()usescudaStreamCaptureModeThreadLocalon a blocking stream, which implicitly synchronizes with legacy stream 0. If RTX ops happen during capture, the CUDA runtime raises error 906 (cudaErrorStreamCaptureImplicit).Solution
A deferred, relaxed-mode CUDA graph capture strategy that is compatible with RTX:
Deferral: Graph capture is postponed from `initialize_solver()` to the first `step()` call. By that time, RTX has finished its initialization (all `cudaImportExternalMemory` calls are done) and is idle between render frames, providing a clean capture window.

Relaxed-mode capture (`_capture_relaxed_graph`): This method works around two conflicting requirements:

- RTX compatibility: RTX threads use legacy stream 0. A blocking stream (Warp's default) implicitly syncs with it, causing capture failures. Solution: create a non-blocking stream (`cudaStreamNonBlocking = 0x01`) that has no implicit synchronization with stream 0.
- Warp compatibility: `mujoco_warp` internally calls `wp.capture_while`, which checks Warp's `device.captures` registry to decide whether to insert a conditional graph node or synchronize eagerly. Without a registered capture, it calls `wp.synchronize_stream` on the capturing stream — which is illegal inside graph capture. Solution: call `wp.capture_begin(external=True, stream=fresh_stream)` to register the capture in Warp's tracking without calling `cudaStreamBeginCapture` again (already done externally).

The capture sequence:

1. Run `_simulate_physics_only()` eagerly to pre-allocate all MuJoCo-Warp scratch buffers (allocations are forbidden inside graph capture).
2. `cudaStreamCreateWithFlags(..., NonBlocking)`.
3. `cudaStreamBeginCapture(..., cudaStreamCaptureModeRelaxed)` — relaxed mode allows other streams to operate freely during capture.
4. `wp.capture_begin(external=True, stream=...)`.
5. Run `_simulate_physics_only()` inside `wp.ScopedStream(fresh_stream)`.
6. `wp.capture_end()` then `cudaStreamEndCapture()` to obtain the graph.
_simulate_physics_only()was factored out of_simulate()to exclude Fabric sync operations (wp.synchronize_device,wp.fabricarray) that are incompatible with graph capture. After graph replay,step()marks transforms dirty, andpre_render()handles the Fabric sync eagerly.The ctypes binding to
libcudart.sois used directly because Warp'sScopedCapturedoesn't expose control over capture mode or stream type.Optimization 3 — GPU Transform Hierarchy via cubric: ~60 ms per frame
Problem
Even with the dirty-flag pattern reducing hierarchy updates to once per render frame, the `update_world_xforms()` call is still a CPU-side tree walk over the entire Fabric scene graph. For scenes with thousands of prims (typical in multi-environment RL), this CPU hierarchy propagation is a significant bottleneck.

The PhysX backend avoids this problem by using cubric — a GPU-accelerated transform hierarchy library. cubric runs the parent-child transform propagation entirely on the GPU via `IAdapter::compute()`, which is dramatically faster than the CPU walk. However, cubric has no Python bindings.
Pure-Python ctypes bindings to cubric's Carbonite interface (
_cubric.py), allowing Newton to use the same GPU hierarchy propagation that PhysX uses.cubric is implemented as a Carbonite plugin and exposes its API through the
omni::cubric::IAdapterinterface. The bindings work by:libcarb.so'sacquireFramework()returns the singletonFramework*.tryAcquireInterfaceWithClient()with the interface descriptoromni::cubric::IAdapterversion0.1.IAdapterstruct is a C++ vtable-like struct with function pointers at known offsets.Each function pointer is read from the struct at its byte offset and wrapped with
ctypes.CFUNCTYPEto make it callable from Python.Integration in
sync_transforms_to_usd():The sync method now mirrors PhysX's
ScopedUSDRTpattern:track_world_xform_changes(False)andtrack_local_xform_changes(False). This is critical:SelectPrimswithReadWriteaccess internally callsgetAttributeArrayGpu, which marks Fabric buffers dirty. If tracking is still active, the hierarchy records the change and Kit'supdateWorldXformswill do an expensive connectivity rebuild every frame.omni:fabric:worldMatrix.finallyblock for safety).IAdapter::compute()witheRigidBody | eForceUpdateoptions andeAlldirty mode. TheeRigidBodyflag tells cubric to use inverse propagation on prims tagged withPhysicsRigidBodyAPI(preserve the world matrix that Newton wrote, derive the local transform) and forward propagation on everything else (propagate parent transforms to children).eForceUpdatebypasses cubric's change-listener dirty check since we know transforms have changed.The adapter is lazily created on the first
sync_transforms_to_usd()call rather than duringinitialize_solver(), to avoid startup-ordering issues with the cubric plugin.When cubric is unavailable (e.g., plugin not loaded, CPU-only), the code falls back gracefully to the CPU
update_world_xforms()path.A future Kit release is expected to ship official Python bindings for cubric, at which point the ctypes approach can be replaced.
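The pause/write/propagate/resume flow with its CPU fallback can be sketched with stub objects. All class names here are hypothetical stand-ins; only the control flow mirrors the document:

```python
class FakeHierarchy:
    """Stand-in for Fabric's IFabricHierarchy change tracking + CPU walk."""
    def __init__(self):
        self.tracking = True
        self.cpu_updates = 0
    def track_world_xform_changes(self, enabled): self.tracking = enabled
    def track_local_xform_changes(self, enabled): pass
    def update_world_xforms(self): self.cpu_updates += 1  # CPU fallback walk

class FakeCubric:
    """Stand-in for the cubric IAdapter GPU propagation."""
    def __init__(self): self.computes = 0
    def bind_to_stage(self, adapter, fabric_id): pass
    def compute(self, adapter): self.computes += 1  # GPU hierarchy propagation

def sync_transforms_to_usd(hierarchy, cubric):
    """Pause change tracking, write transforms, propagate, resume (sketch)."""
    hierarchy.track_world_xform_changes(False)
    hierarchy.track_local_xform_changes(False)
    try:
        # ... Newton body poses would be written to omni:fabric:worldMatrix ...
        if cubric is not None:
            cubric.bind_to_stage(adapter=None, fabric_id=0)
            cubric.compute(adapter=None)        # GPU path
        else:
            hierarchy.update_world_xforms()     # graceful CPU fallback
    finally:
        # Tracking is restored even if the write or compute raises.
        hierarchy.track_world_xform_changes(True)
        hierarchy.track_local_xform_changes(True)

h, c = FakeHierarchy(), FakeCubric()
sync_transforms_to_usd(h, c)
assert c.computes == 1 and h.cpu_updates == 0 and h.tracking
sync_transforms_to_usd(h, None)   # cubric unavailable → CPU fallback
assert h.cpu_updates == 1
```

The `finally` block is the safety property the document calls out: change tracking is always re-enabled, even on an early return or exception.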
The result is a frame time of ~60 ms — slightly faster than PhysX's ~65 ms on the same scene.
Summary
All four optimizations are complementary and stack on top of each other. The final result matches or slightly beats the PhysX rendering path (~65 ms) while using Newton as the physics backend.
Co-developed with Toby Jones (NVIDIA).