[STRESS / DO NOT MERGE] Enable FabricFrameView cuda:1 tests in multi-GPU CI#5788
[STRESS / DO NOT MERGE] Enable FabricFrameView cuda:1 tests in multi-GPU CI#5788hujc7 wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
🔬 Stress Test PR Review
Understanding: This PR intentionally enables the cuda:1 FabricFrameView tests that are known to hang, serving as an in-CI reproduction of the SelectPrims deadlock on non-zero CUDA device indices. The [STRESS / DO NOT MERGE] label correctly signals this is infrastructure validation, not landable work.
What's Good
-
Docker-based approach is sound — Using
nvcr.io/nvidian/isaac-lab:latest-developwith the workspace mounted at/workspace/isaaclabis cleaner than bare-metal pip installs and consistent with the repo's other containerized workflows. -
Editable install strategy is correct — The
pip install --no-deps -e source/isaaclab -e source/isaaclab_physxpattern properly overlays PR changes onto the baked-in image without pulling conflicting dependencies. -
Runner label fix — Removing the
gpulabel from theruns-onarray (per @nv-apoddubny's feedback) resolves the queue stall issue where both labels matched zero registered runners. -
EULA bypass is properly configured — All three environment variables (
OMNI_KIT_ACCEPT_EULA,ACCEPT_EULA,ISAAC_SIM_HEADLESS) are set, matching the pattern inbuild.yaml. -
Fork PR handling — The NGC_API_KEY fallback to runner cache is a reasonable compromise that keeps fork CI functional while noting the infra dependency.
Observations
The 25-minute timeout is intentional — Given the documented deadlock behavior (first cuda:1 test hangs indefinitely), the timeout serves as the test's pass/fail signal. This is appropriate for surfacing the regression.
GPU verification placement — Running nvidia-smi -L | wc -l before the Docker pull ensures early-exit on misconfigured runners. The line-count approach is sufficient for "≥2 GPUs" verification.
SHA-pinned checkout — Using actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd aligns with the repo's security policy for pinning Actions by SHA.
Status
This PR correctly sets up the infrastructure to reproduce the FabricFrameView hang in CI. Once the underlying issue in fabric_frame_view.py (the SelectPrims USDRT call on non-zero device indices) is fixed, this workflow will go green and the env-var flip can land as production gating.
No blocking issues identified for the stated purpose.
Update (9c6a69e): Reviewed incremental changes from dd9cd39 to 9c6a69e. This commit includes a substantial batch of unrelated CI and codebase improvements that appear to have been merged/rebased into this branch:
Changes Summary
-
CI Infrastructure Improvements:
- Added
exclude-patterninput torun-testsandrun-package-testsactions for finer test filtering - Folded standalone
verify-base-non-rootandverify-curobo-non-rootjobs into their respective test jobs (reducing runner burn) - Added JUnit XML report upload artifacts across CI workflows
- Added
-
OVRTX Renderer Cleanup:
- Removed legacy OVRTX 0.2.x code paths (methods like
extract_all_depth_tiles_kernel_legacy,generate_random_colors_from_ids_kernel_legacy) - Fixed default log file path to use cross-platform
tempfile.gettempdir()instead of hardcoded/tmp - Removed
_usd_handlestracking in favor ofreset_stage()API
- Removed legacy OVRTX 0.2.x code paths (methods like
-
Documentation Reorganization:
- Moved visualization docs from
features/tocore-concepts/ - Restructured bleeding-edge docs to focus on
isaaclab_contribpackage - Added dedicated RLinf VLA post-training documentation page
- Moved visualization docs from
-
Test Fixes:
- Fixed noise tests to avoid NaN from
0/0when using scale operations (torch.onesinstead oftorch.rand) - Added OVRTX log redirection helper for pytest capture in kitless rendering tests
- Removed rendering correctness tests from skip lists
- Fixed noise tests to avoid NaN from
-
Minor Fixes:
- Added
libgmp-devto arm64 Dockerfile for pytetwild builds - Changelog housekeeping for
isaaclab_teleop
- Added
Stress Test Workflow (This PR's Focus)
The test-fabric-multi-gpu.yaml workflow itself has only minor comment refinements — the core docker-based approach and 25-minute timeout strategy remain unchanged from the previous review.
Recommendation: The non-test-infrastructure changes in this commit are significant and orthogonal to the FabricFrameView stress test purpose. Consider splitting these into a separate PR to keep the stress test PR minimal and easier to revert if needed.
Update (689333e): Reviewed incremental changes from 9c6a69e to 689333e5. This commit cleans up the branch by removing the unrelated changes that were previously merged in.
Changes Summary
The workflow file test-fabric-multi-gpu.yaml now shows a focused diff compared to develop:
- Workflow enabled — Re-enables the
pull_requesttrigger with path filters for the FabricFrameView source, tests, and workflow file itself - Docker-based execution — Switches from bare-metal
./isaaclab.sh --installto running inside thenvcr.io/nvidian/isaac-lab:latest-developcontainer - Simplified GPU check — Uses direct
nvidia-smi -L | wc -linstead of Python/torch-based detection - NGC login + caching — Handles missing
NGC_API_KEYgracefully with runner cache fallback - Editable reinstall — Overlays PR changes onto container with
pip install --no-deps -e
Assessment
✅ This is the correct state for a stress-test PR. The branch is now clean and focused solely on the FabricFrameView multi-GPU workflow changes. The unrelated CI/docs/test changes from the prior commit are gone.
The workflow logic is sound for its intended purpose (reproduce the cuda:1 hang in CI). Ready for the stress test run once a multi-GPU runner picks it up.
Update (d9dccba): Reviewed incremental changes from 689333e5 to d9dccba0. Significant architectural shift — the workflow has moved from Docker-based execution back to bare-metal pip installs.
Key Changes
-
Removed Docker execution — The containerized approach using
nvcr.io/nvidian/isaac-lab:latest-developis gone. Now runs directly on runner withsetup-python@v5(Python 3.12). -
Added cmake pip install workaround —
pip install cmakebypasses thesudo apt-getpath ininstall.py:35that would fail on restricted runners. -
Minimal install strategy —
./isaaclab.sh --install nonepulls only core submodules, avoiding robomimic's egl_probe wheel (requires libEGL/X11 headers the runner lacks). -
Direct Isaac Sim pip install — Uses
isaacsim[all,extscache]==${{ vars.ISAACSIM_BASE_VERSION || '6.0.0' }}from PyPI/NVIDIA index, with explicit note about Python 3.12 compatibility (5.x series requires 3.11). -
Increased timeout — Now 60 minutes (was 25 min in Docker version), reflecting longer setup time for pip-based installs.
-
Simplified runner labels —
[self-hosted, linux, x64, multi-gpu](removedgpulabel).
Observations
Tradeoffs vs Docker approach:
- ✅ Pro: Simpler, no NGC authentication/caching complexity, no Docker image dependency
⚠️ Con: Relies on runner having correct system libraries; less reproducible across different runner configs
Isaac Sim version pinning rationale is sound — The comment explains the 6.0.0 pin (only 3.12-compatible release) and notes the stale 5.1.0 pin in source/isaaclab/setup.py.
EXP_PATH issue documented — Good comment explaining why Isaac Sim must be installed separately (AppLauncher silently suppresses missing module, leaving EXP_PATH unset).
Assessment
This is a reasonable alternative approach for runners where Docker is problematic. The extensive inline comments explain each workaround, making the intent clear. The 60-minute timeout should accommodate the pip install overhead while still catching the cuda:1 hang.
4d2e594 to
689333e
Compare
Re-enables the pull_request trigger on test-fabric-multi-gpu.yaml and wires it to run the FabricFrameView contract tests with ISAACLAB_TEST_MULTI_GPU=1, which activates the three cuda:1 -parameterised tests added in isaac-sim#5514. The cuda:1 tests target FabricFrameView's SelectPrims path on non-zero CUDA device indices. They currently hang indefinitely on real multi-GPU hardware (reproduced locally on 3x RTX 6000 Pro Blackwell and on the multi-GPU runner pool); the 60-min workflow timeout will cancel the job and surface the regression in CI for the FabricFrameView maintainers. Install pipeline matches isaac-sim#5738's proven-working layout: - Pin Python 3.12 via SHA-pinned actions/setup-python. - Pre-install cmake via pip to skip install.py's sudo apt-get branch. - ./isaaclab.sh --install none (core only, avoids egl_probe libEGL). - pip install isaacsim[all,extscache]==${vars.ISAACSIM_BASE_VERSION || '6.0.0'} --extra-index-url https://pypi.nvidia.com. - Bypass Kit's interactive EULA via OMNI_KIT_ACCEPT_EULA / ACCEPT_EULA / ISAAC_SIM_HEADLESS. Status: this PR is expected to fail with the 60-min workflow timeout. Land once the underlying hang in fabric_frame_view.py is fixed.
689333e to
d9dccba
Compare
1. Summary
develop: workflow that runs the FabricFrameView contract tests on the multi-GPU runner withISAACLAB_TEST_MULTI_GPU=1.source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.pyadded by Enable mgpu in FrameView #5514.2. Status
This PR is expected to fail with the 60-minute workflow timeout. It exists as an in-CI reproduction of the FabricFrameView cuda:1 hang, not as landable work.
Reproductions:
[self-hosted, linux, x64, multi-gpu]runner, 60-min workflow timeout cancellationThe first cuda:1 test (
test_fabric_cuda1_world_pose_roundtrip[cuda:1]) deadlocks before completing. Tests 2 and 3 never run. pytest with-vonly prints a test ID on completion, so the visible log signature is: cpu+cuda:0 tests pass throughtest_fabric_rebuild_after_topology_change[cuda:0] PASSED [92%], then silence until the workflow timeout fires.3. Local reproduction
Conda env:
env_isaaclab(Python 3.12, isaacsim 6.0.0.0). Hardware: any host withtorch.cuda.device_count() >= 2.Targets the three cuda:1-specific tests explicitly (the only tests in the repo parametrized over
["cuda:1"]), so the test selection is unambiguous:Expected behavior:
test_fabric_cuda1_world_pose_roundtrip[cuda:1]starts.timeout 90sends SIGTERM (exit 143).Pre-flight sanity (optional) — confirms cuda:1 is usable from torch + warp on the host, so the hang is isolated to the Fabric path:
4. Bug surface
source/isaaclab_physx/isaaclab_physx/sim/views/fabric_frame_view.pySelectPrimsinvocation when the device index is not 0_fabric_supported_devices = ("cpu", "cuda", "cuda:0")allowlist with the rationale "USDRT SelectPrims now accepts any CUDA device index", but the call still deadlocks in practice.5. Resolution path
Land #5738 first (multi-GPU CI infrastructure with cuda:1 gated off → green). When the underlying FabricFrameView hang is fixed in a separate PR, this PR's CI goes green and the env-var flip can land — restoring multi-GPU regression gating for FabricFrameView.
6. Dependencies
ISAACLAB_TEST_MULTI_GPU=1set to actually run the cuda:1 tests.7. Test plan