[STRESS / DO NOT MERGE] Enable FabricFrameView cuda:1 tests in multi-GPU CI#5788

Closed
hujc7 wants to merge 1 commit into isaac-sim:develop from hujc7:jichuanh/fabric-mgpu-docker

Collaborator

@hujc7 hujc7 commented May 26, 2026

1. Summary

  • Single squashed commit on top of develop: workflow that runs the FabricFrameView contract tests on the multi-GPU runner with ISAACLAB_TEST_MULTI_GPU=1.
  • Activates the three cuda:1-parameterised tests in source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py that were added by #5514 (Enable mgpu in FrameView).
  • Surfaces the FabricFrameView SelectPrims hang on non-zero CUDA device indices as a CI signal so maintainers can iterate on a fix.

2. Status

This PR is expected to fail with the 60-minute workflow timeout. It exists as an in-CI reproduction of the FabricFrameView cuda:1 hang, not as landable work.

Reproductions:

  • Local: 3× RTX 6000 Pro Blackwell on Horde DGXC, 90s SIGTERM
  • CI: [self-hosted, linux, x64, multi-gpu] runner, 60-min workflow timeout cancellation

The first cuda:1 test (test_fabric_cuda1_world_pose_roundtrip[cuda:1]) deadlocks before completing. Tests 2 and 3 never run. pytest with -v only prints a test ID on completion, so the visible log signature is: cpu+cuda:0 tests pass through test_fabric_rebuild_after_topology_change[cuda:0] PASSED [92%], then silence until the workflow timeout fires.
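
The log signature described above (a last PASSED line, then silence) can be checked mechanically. The following is a minimal sketch, assuming the log is plain `pytest -v` output; `last_completed_test` and `looks_like_hang` are hypothetical helper names, not part of Isaac Lab or pytest.

```python
import re

# Matches pytest -v result lines such as
# "path/test_x.py::test_name[cuda:0] PASSED [ 92%]".
RESULT_LINE = re.compile(
    r"^(?P<test_id>\S+::\S+)\s+"
    r"(?P<outcome>PASSED|FAILED|ERROR|SKIPPED|XFAIL|XPASS)\b"
)

# Matches the final pytest summary bar, e.g. "==== 12 passed in 30.12s ====".
SUMMARY_LINE = re.compile(r"^=+ .*\bin \d+(\.\d+)?s\b.* =+$", re.MULTILINE)


def last_completed_test(log_text):
    """Return (test_id, outcome) of the last test that reported a result, or None."""
    last = None
    for line in log_text.splitlines():
        match = RESULT_LINE.match(line.strip())
        if match:
            last = (match.group("test_id"), match.group("outcome"))
    return last


def looks_like_hang(log_text):
    """True if some tests reported results but the final summary never appeared."""
    has_summary = SUMMARY_LINE.search(log_text) is not None
    return last_completed_test(log_text) is not None and not has_summary
```

Fed the CI log of this PR, `last_completed_test` would be expected to name test_fabric_rebuild_after_topology_change[cuda:0], which puts the hang in whichever test is scheduled next.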

3. Local reproduction

Conda env: env_isaaclab (Python 3.12, isaacsim 6.0.0.0). Hardware: any host with torch.cuda.device_count() >= 2.

Targets the three cuda:1-specific tests explicitly (the only tests in the repo parametrized over ["cuda:1"]), so the test selection is unambiguous:

cd <IsaacLab worktree>   # must be on a checkout that has PR #5514 merged (latest develop)
conda activate env_isaaclab

timeout 90 env \
  ISAACLAB_TEST_MULTI_GPU=1 \
  OMNI_KIT_ACCEPT_EULA=yes \
  ACCEPT_EULA=Y \
  ISAAC_SIM_HEADLESS=1 \
  ./isaaclab.sh -p -m pytest \
    "source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py::test_fabric_cuda1_world_pose_roundtrip" \
    "source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py::test_fabric_cuda1_no_usd_writeback" \
    "source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py::test_fabric_cuda1_scales_roundtrip" \
    -v --tb=short

Expected behavior:

  • 3 tests collected, 3 selected.
  • test_fabric_cuda1_world_pose_roundtrip[cuda:1] starts.
  • No further pytest output for the remainder of the timeout (the first test never completes; tests 2 and 3 never start).
  • Outer timeout 90 sends SIGTERM (exit 143).
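
The expected behaviour above (no output, then SIGTERM from an outer timeout) can also be driven from Python. This is a minimal sketch, assuming a POSIX host; `run_with_watchdog` is a hypothetical helper and not part of Isaac Lab. The reported exit code depends on who reports it: Python's `subprocess` reports death by SIGTERM as -15, a shell reports 128 + 15 = 143, and GNU `timeout` itself exits 124 on timeout unless `--preserve-status` is passed.

```python
import subprocess
import sys


def run_with_watchdog(cmd, timeout_s):
    """Run cmd; if it exceeds timeout_s, send SIGTERM and report it as hung."""
    proc = subprocess.Popen(cmd)
    try:
        return {"hung": False, "returncode": proc.wait(timeout=timeout_s)}
    except subprocess.TimeoutExpired:
        proc.terminate()  # SIGTERM on POSIX
        try:
            proc.wait(timeout=10)
        except subprocess.TimeoutExpired:
            proc.kill()  # escalate to SIGKILL if SIGTERM is ignored
            proc.wait()
        return {"hung": True, "returncode": proc.returncode}


# Example (stand-in for the pytest invocation): a child that sleeps past the deadline.
# run_with_watchdog([sys.executable, "-c", "import time; time.sleep(60)"], timeout_s=1)
```

The SIGKILL escalation is there because a process deadlocked inside a native call may not act on SIGTERM promptly.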

Pre-flight sanity (optional) — confirms cuda:1 is usable from torch + warp on the host, so the hang is isolated to the Fabric path:

./isaaclab.sh -p -c "
import torch, warp as wp
print(f'cuda count: {torch.cuda.device_count()}')
for i in range(torch.cuda.device_count()):
    torch.cuda.set_device(i)
    assert torch.zeros(10, device=f'cuda:{i}').device.index == i
wp.init()
for i in range(torch.cuda.device_count()):
    with wp.ScopedDevice(f'cuda:{i}'):
        assert str(wp.zeros(10, dtype=wp.float32).device) == f'cuda:{i}'
print('cuda:N alloc OK for torch and warp')
"

4. Bug surface

  • File: source/isaaclab_physx/isaaclab_physx/sim/views/fabric_frame_view.py
  • Code path: USDRT SelectPrims invocation when the device index is not 0
  • Origin: Enable mgpu in FrameView #5514 removed the _fabric_supported_devices = ("cpu", "cuda", "cuda:0") allowlist with the rationale "USDRT SelectPrims now accepts any CUDA device index", but the call still deadlocks in practice.
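
For illustration, this is the kind of guard the removed allowlist implies. It is a sketch only: the tuple is the one quoted above, but `use_fabric` is a hypothetical name and this is not claimed to be how fabric_frame_view.py consumed the allowlist before #5514.

```python
# Allowlist as quoted in the bullet above (removed by #5514).
_fabric_supported_devices = ("cpu", "cuda", "cuda:0")


def use_fabric(device):
    """Hypothetical guard: take the Fabric path only for allowlisted devices."""
    return device in _fabric_supported_devices


# With such a guard in place, cuda:1 would never reach the SelectPrims call.
# use_fabric("cuda:0")  -> True
# use_fabric("cuda:1")  -> False
```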

5. Resolution path

Land #5738 first (multi-GPU CI infrastructure with cuda:1 gated off → green). When the underlying FabricFrameView hang is fixed in a separate PR, this PR's CI goes green and the env-var flip can land — restoring multi-GPU regression gating for FabricFrameView.
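
The "env-var flip" amounts to a gate on which devices the tests are parametrized over. A minimal sketch of such a gate, assuming the variable is compared against the string "1"; `fabric_test_devices` is a hypothetical helper, and the real tests may implement the gate differently (for example with a pytest skip marker).

```python
import os


def fabric_test_devices(device_count, env=None):
    """Hypothetical gate: list the devices the contract tests should cover."""
    env = os.environ if env is None else env
    devices = ["cpu"]
    if device_count >= 1:
        devices.append("cuda:0")
    # cuda:1 is opt-in: it needs a second GPU and ISAACLAB_TEST_MULTI_GPU=1.
    if device_count >= 2 and env.get("ISAACLAB_TEST_MULTI_GPU") == "1":
        devices.append("cuda:1")
    return devices
```

With the variable unset, a multi-GPU runner still exercises only cpu and cuda:0, which is the "gated off" state described for #5738.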

6. Dependencies

7. Test plan

  • Workflow appears in checks on this draft PR.
  • cpu + cuda:0 tests pass (~30s in pytest).
  • First cuda:1 test enters deadlock; job cancelled at 60-min timeout.
  • Local reproduction recipe in §3 confirms the same hang outside CI (verified on Horde, 90s SIGTERM).

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

🔬 Stress Test PR Review

Understanding: This PR intentionally enables the cuda:1 FabricFrameView tests that are known to hang, serving as an in-CI reproduction of the SelectPrims deadlock on non-zero CUDA device indices. The [STRESS / DO NOT MERGE] label correctly signals this is infrastructure validation, not landable work.

What's Good

  1. Docker-based approach is sound — Using nvcr.io/nvidian/isaac-lab:latest-develop with the workspace mounted at /workspace/isaaclab is cleaner than bare-metal pip installs and consistent with the repo's other containerized workflows.

  2. Editable install strategy is correct — The pip install --no-deps -e source/isaaclab -e source/isaaclab_physx pattern properly overlays PR changes onto the baked-in image without pulling conflicting dependencies.

  3. Runner label fix — Removing the gpu label from the runs-on array (per @nv-apoddubny's feedback) resolves the queue stall issue where both labels matched zero registered runners.

  4. EULA bypass is properly configured — All three environment variables (OMNI_KIT_ACCEPT_EULA, ACCEPT_EULA, ISAAC_SIM_HEADLESS) are set, matching the pattern in build.yaml.

  5. Fork PR handling — The NGC_API_KEY fallback to runner cache is a reasonable compromise that keeps fork CI functional while noting the infra dependency.

Observations

The 25-minute timeout is intentional — Given the documented deadlock behavior (first cuda:1 test hangs indefinitely), the timeout serves as the test's pass/fail signal. This is appropriate for surfacing the regression.

GPU verification placement — Running nvidia-smi -L | wc -l before the Docker pull ensures early-exit on misconfigured runners. The line-count approach is sufficient for "≥2 GPUs" verification.
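
As a point of comparison with the line-count check, here is a slightly stricter sketch that counts only lines shaped like GPU entries in `nvidia-smi -L` output. `count_gpus` and `require_multi_gpu` are hypothetical helpers, and the sample line in the comment uses a made-up UUID.

```python
import re

# nvidia-smi -L prints one line per GPU, e.g.
# "GPU 0: NVIDIA RTX 6000 (UUID: GPU-00000000-0000-0000-0000-000000000000)"
GPU_LINE = re.compile(r"^GPU \d+:")


def count_gpus(nvidia_smi_l_output):
    """Count lines that look like GPU entries, ignoring anything else."""
    return sum(1 for line in nvidia_smi_l_output.splitlines() if GPU_LINE.match(line))


def require_multi_gpu(nvidia_smi_l_output, minimum=2):
    """Raise early if fewer than `minimum` GPUs are listed."""
    count = count_gpus(nvidia_smi_l_output)
    if count < minimum:
        raise RuntimeError(f"expected >= {minimum} GPUs, found {count}")
    return count
```

Unlike `wc -l`, this does not count a stray non-GPU line (such as an error message) as a device, at the cost of depending on the output format.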

SHA-pinned checkout — Using actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd aligns with the repo's security policy for pinning Actions by SHA.

Status

This PR correctly sets up the infrastructure to reproduce the FabricFrameView hang in CI. Once the underlying issue in fabric_frame_view.py (the SelectPrims USDRT call on non-zero device indices) is fixed, this workflow will go green and the env-var flip can land as production gating.

No blocking issues identified for the stated purpose.


Update (9c6a69e): Reviewed incremental changes from dd9cd39 to 9c6a69e. This commit includes a substantial batch of unrelated CI and codebase improvements that appear to have been merged/rebased into this branch:

Changes Summary

  1. CI Infrastructure Improvements:

    • Added exclude-pattern input to run-tests and run-package-tests actions for finer test filtering
    • Folded standalone verify-base-non-root and verify-curobo-non-root jobs into their respective test jobs (reducing runner burn)
    • Added JUnit XML report upload artifacts across CI workflows
  2. OVRTX Renderer Cleanup:

    • Removed legacy OVRTX 0.2.x code paths (methods like extract_all_depth_tiles_kernel_legacy, generate_random_colors_from_ids_kernel_legacy)
    • Fixed default log file path to use cross-platform tempfile.gettempdir() instead of hardcoded /tmp
    • Removed _usd_handles tracking in favor of reset_stage() API
  3. Documentation Reorganization:

    • Moved visualization docs from features/ to core-concepts/
    • Restructured bleeding-edge docs to focus on isaaclab_contrib package
    • Added dedicated RLinf VLA post-training documentation page
  4. Test Fixes:

    • Fixed noise tests to avoid NaN from 0/0 when using scale operations (torch.ones instead of torch.rand)
    • Added OVRTX log redirection helper for pytest capture in kitless rendering tests
    • Removed rendering correctness tests from skip lists
  5. Minor Fixes:

    • Added libgmp-dev to arm64 Dockerfile for pytetwild builds
    • Changelog housekeeping for isaaclab_teleop

Stress Test Workflow (This PR's Focus)

The test-fabric-multi-gpu.yaml workflow itself has only minor comment refinements — the core docker-based approach and 25-minute timeout strategy remain unchanged from the previous review.

Recommendation: The non-test-infrastructure changes in this commit are significant and orthogonal to the FabricFrameView stress test purpose. Consider splitting these into a separate PR to keep the stress test PR minimal and easier to revert if needed.


Update (689333e): Reviewed incremental changes from 9c6a69e to 689333e5. This commit cleans up the branch by removing the unrelated changes that were previously merged in.

Changes Summary

The workflow file test-fabric-multi-gpu.yaml now shows a focused diff compared to develop:

  1. Workflow enabled — Re-enables the pull_request trigger with path filters for the FabricFrameView source, tests, and workflow file itself
  2. Docker-based execution — Switches from bare-metal ./isaaclab.sh --install to running inside the nvcr.io/nvidian/isaac-lab:latest-develop container
  3. Simplified GPU check — Uses direct nvidia-smi -L | wc -l instead of Python/torch-based detection
  4. NGC login + caching — Handles missing NGC_API_KEY gracefully with runner cache fallback
  5. Editable reinstall — Overlays PR changes onto container with pip install --no-deps -e

Assessment

This is the correct state for a stress-test PR. The branch is now clean and focused solely on the FabricFrameView multi-GPU workflow changes. The unrelated CI/docs/test changes from the prior commit are gone.

The workflow logic is sound for its intended purpose (reproduce the cuda:1 hang in CI). Ready for the stress test run once a multi-GPU runner picks it up.


Update (d9dccba): Reviewed incremental changes from 689333e5 to d9dccba0. Significant architectural shift — the workflow has moved from Docker-based execution back to bare-metal pip installs.

Key Changes

  1. Removed Docker execution — The containerized approach using nvcr.io/nvidian/isaac-lab:latest-develop is gone. Now runs directly on runner with setup-python@v5 (Python 3.12).

  2. Added cmake pip install workaround: pip install cmake bypasses the sudo apt-get path in install.py:35 that would fail on restricted runners.

  3. Minimal install strategy: ./isaaclab.sh --install none pulls only core submodules, avoiding robomimic's egl_probe wheel (requires libEGL/X11 headers the runner lacks).

  4. Direct Isaac Sim pip install — Uses isaacsim[all,extscache]==${{ vars.ISAACSIM_BASE_VERSION || '6.0.0' }} from PyPI/NVIDIA index, with explicit note about Python 3.12 compatibility (5.x series requires 3.11).

  5. Increased timeout — Now 60 minutes (was 25 min in Docker version), reflecting longer setup time for pip-based installs.

  6. Simplified runner labels: [self-hosted, linux, x64, multi-gpu] (removed gpu label).
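
The version pin in item 4 uses a fallback expression: the repository variable wins when set, otherwise 6.0.0 is used. A minimal Python sketch of the same logic, assuming an unset variable arrives as an empty string; `isaacsim_requirement` is a hypothetical helper, not something the workflow defines.

```python
def isaacsim_requirement(base_version=""):
    """Build the pip requirement, falling back to 6.0.0 when no version is given."""
    version = base_version or "6.0.0"  # mirrors `vars.ISAACSIM_BASE_VERSION || '6.0.0'`
    return f"isaacsim[all,extscache]=={version}"
```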

Observations

Tradeoffs vs Docker approach:

  • Pro: Simpler, no NGC authentication/caching complexity, no Docker image dependency
  • ⚠️ Con: Relies on runner having correct system libraries; less reproducible across different runner configs

Isaac Sim version pinning rationale is sound — The comment explains the 6.0.0 pin (only 3.12-compatible release) and notes the stale 5.1.0 pin in source/isaaclab/setup.py.

EXP_PATH issue documented — Good comment explaining why Isaac Sim must be installed separately (AppLauncher silently suppresses missing module, leaving EXP_PATH unset).

Assessment

This is a reasonable alternative approach for runners where Docker is problematic. The extensive inline comments explain each workaround, making the intent clear. The 60-minute timeout should accommodate the pip install overhead while still catching the cuda:1 hang.

⚠️ One concern: If the multi-GPU runner environment changes (missing headers, Python version drift), this workflow will be more fragile than the Docker approach. Consider keeping the Docker-based workflow as a fallback option.

@hujc7 hujc7 force-pushed the jichuanh/fabric-mgpu-docker branch 3 times, most recently from 4d2e594 to 689333e Compare May 26, 2026 22:45
Re-enables the pull_request trigger on test-fabric-multi-gpu.yaml and
wires it to run the FabricFrameView contract tests with
ISAACLAB_TEST_MULTI_GPU=1, which activates the three
cuda:1-parameterised tests added in isaac-sim#5514.

The cuda:1 tests target FabricFrameView's SelectPrims path on non-zero
CUDA device indices.  They currently hang indefinitely on real
multi-GPU hardware (reproduced locally on 3x RTX 6000 Pro Blackwell
and on the multi-GPU runner pool); the 60-min workflow timeout will
cancel the job and surface the regression in CI for the
FabricFrameView maintainers.

Install pipeline matches isaac-sim#5738's proven-working layout:
- Pin Python 3.12 via SHA-pinned actions/setup-python.
- Pre-install cmake via pip to skip install.py's sudo apt-get branch.
- ./isaaclab.sh --install none (core only, avoids egl_probe libEGL).
- pip install isaacsim[all,extscache]==${vars.ISAACSIM_BASE_VERSION
  || '6.0.0'} --extra-index-url https://pypi.nvidia.com.
- Bypass Kit's interactive EULA via OMNI_KIT_ACCEPT_EULA / ACCEPT_EULA
  / ISAAC_SIM_HEADLESS.

Status: this PR is expected to fail with the 60-min workflow timeout.
Land once the underlying hang in fabric_frame_view.py is fixed.
@hujc7 hujc7 force-pushed the jichuanh/fabric-mgpu-docker branch from 689333e to d9dccba Compare May 26, 2026 23:11
@hujc7 hujc7 closed this May 27, 2026
@hujc7 hujc7 deleted the jichuanh/fabric-mgpu-docker branch May 27, 2026 22:43
@pv-nvidia pv-nvidia self-assigned this May 27, 2026

2 participants