Skip to content

[Diag][DO NOT MERGE] Negative arm — revert #5633 torch-defer to validate #5656 numpy!=2.3.5 fix#5800

Draft
hujc7 wants to merge 2 commits into
isaac-sim:developfrom
hujc7:jichuanh/validate-5656-revert-5647
Draft

[Diag][DO NOT MERGE] Negative arm — revert #5633 torch-defer to validate #5656 numpy!=2.3.5 fix#5800
hujc7 wants to merge 2 commits into
isaac-sim:developfrom
hujc7:jichuanh/validate-5656-revert-5647

Conversation

@hujc7
Copy link
Copy Markdown
Collaborator

@hujc7 hujc7 commented May 27, 2026

1. Purpose

Negative arm of the validation for PR #5656 (the numpy!=2.3.5 exclusion).

Develop currently carries two independent mitigations for the OpenBLAS atfork SIGSEGV:

  1. PR Fixes OmniHub startup in Docker tests #5633 (Antoine Richard, merge a5eb9add4c3) which carried PR Prevents early numpy imports to avoid Kit crash #5620's content — defers import torch in AppLauncher until after SimulationApp starts, so numpy/OpenBLAS isn't loaded in pytest's master process before the Kit fork.
  2. PR [Fix] Exclude numpy 2.3.5 from every IsaacLab install path #5656 — excludes numpy 2.3.5 from every install path, so even an early numpy import resolves to a non-buggy version.

If either mitigation alone is sufficient, the other is redundant. This PR reverts (1) so we can see whether (2) holds on its own.

2. Setup

  • Branch base: develop (latest).
  • Commit 1: surgical revert of source/isaaclab/isaaclab/app/app_launcher.py to its state just before a5eb9add4c3 (Fixes OmniHub startup in Docker tests #5633). Other files from Fixes OmniHub startup in Docker tests #5633 left alone — the defer-torch logic in AppLauncher.__init__ is the only piece relevant to the atfork bug.
  • Commit 2: source/isaaclab/test/conftest.py that imports numpy and prints version + bundled OpenBLAS hash at pytest session start.

The conftest is deliberately placed at source/isaaclab/test/ rather than source/ to avoid the namespace-package shadowing that broke PR #5655 — anchoring pytest's rootdir at source/ prepends it to sys.path, making every source/<pkg>/ (no __init__.py) shadow the real pip-installed package and turn from isaaclab_teleop import IsaacTeleopCfg into an ImportError before any test runs.

3. Expected result

  • Install resolves to numpy 2.3.4 (highest 2.3.x that satisfies !=2.3.5 under cmeel-boost's cap from [Fix] Exclude numpy 2.3.5 from every IsaacLab install path #5656).
  • [dep-manifest] log line shows numpy 2.3.4 + libscipy_openblas64_-8fb3d286.so.
  • import torch happens in AppLauncher.__init__ BEFORE SimulationApp fork, but resolves to non-buggy numpy 2.3.4, so canary jobs (isaaclab_physx, isaaclab_newton, isaaclab_rl) pass.

If any canary still SIGSEGVs in libomni.platforminfo fork, the numpy!=2.3.5 exclusion alone is not sufficient and #5633's torch-defer is also necessary — keep both.

If all canaries pass, the numpy!=2.3.5 exclusion alone is sufficient and #5633's torch-defer is redundant — could be reverted in a follow-up cleanup.

4. Lifecycle

Diagnostic-only. Do not merge. Closes once evidence is captured.

5. Related

@github-actions github-actions Bot added the isaac-lab Related to Isaac Lab team label May 27, 2026
Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔬 Diagnostic PR Review

Summary

This diagnostic PR reverts the torch-defer mitigation from PR #5633 to validate whether the numpy!=2.3.5 exclusion (PR #5656) alone is sufficient to prevent the OpenBLAS pthread_atfork SIGSEGV.

Changes Analysis

1. source/isaaclab/isaaclab/app/app_launcher.py (-43/+6 lines)

The revert correctly undoes the deferred torch import pattern:

  • Removes _deferred_cuda_device_id tracking
  • Removes _set_deferred_cuda_device() method
  • Removes apply_rtx_determinism_settings() static method
  • Removes --deterministic CLI argument and related handling
  • Moves torch.cuda.set_device() back into _resolve_device_settings() (pre-SimulationApp fork)

This restores the original behavior where import torch (and transitively numpy/OpenBLAS) occurs before Kit's fork() call, which would trigger the SIGSEGV if numpy 2.3.5's buggy OpenBLAS atfork handler is registered.

2. source/isaaclab/test/conftest.py (+29 lines, new file)

Well-designed diagnostic fixture that:

  • Prints numpy version at pytest session start
  • Identifies the bundled OpenBLAS .so filename (contains commit hash)
  • Correctly placed under source/isaaclab/test/ to avoid the namespace-package shadowing issue documented in PR #5655

The docstring clearly explains the rootdir anchoring rationale.

CI Status Notes

  • Check changelog fragments failing is expected for diagnostic PRs
  • Docker/canary jobs still pending — these will provide the actual validation signal

Validation Criteria

Per the PR description:

  • Pass scenario: numpy resolves to 2.3.4, canaries pass → numpy!=2.3.5 exclusion is sufficient
  • Fail scenario: SIGSEGV in libomni.platforminfo fork → both mitigations needed

Recommendations

  1. No merge — correctly labeled as diagnostic-only
  2. Monitor canary jobs (isaaclab_physx, isaaclab_newton, isaaclab_rl) for the actual SIGSEGV validation
  3. Document outcome in a comment once CI completes for future reference

Clean diagnostic implementation. The conftest placement fix from #5655 learnings is a nice detail.

@hujc7 hujc7 force-pushed the jichuanh/validate-5656-revert-5647 branch from ede21ac to 4e72d56 Compare May 27, 2026 07:11
hujc7 added 2 commits May 27, 2026 16:14
…e atfork bug

Surgical hunk-level revert of commit a5eb9ad ("Fixes OmniHub startup
in Docker tests", isaac-sim#5633) applied only to
source/isaaclab/isaaclab/app/app_launcher.py.

Removes the defer-torch mechanism so that `import torch` (and
transitively `import numpy`) happens in AppLauncher.__init__ BEFORE
SimulationApp's fork() through libomni.platforminfo. If the resolved
numpy is 2.3.5, its bundled OpenBLAS pthread_atfork handler will SIGSEGV
the canary jobs.

Unlike the prior whole-file revert, this preserves PR isaac-sim#5449's
`--deterministic` CLI flag and RTX-determinism logic, which landed after
isaac-sim#5633 and was wiped as collateral damage in the previous attempt.

Companion to the relocated diagnostic conftest in the next commit:
source/isaaclab/test/conftest.py prints the resolved numpy + OpenBLAS
hash so we can confirm which numpy actually landed.

Refs PR isaac-sim#5656 (numpy!=2.3.5 exclusion fix being validated).
Prints the resolved numpy version + bundled OpenBLAS .so filename at pytest
session start. Located at the repo root so every subprocess pytest spawned
by tools/conftest.py discovers and loads it, regardless of which package's
tests are running.

Repo root has no isaaclab_* subdirectories, so importmode=prepend placing
the repo root on sys.path does NOT shadow the real pip-installed packages —
unlike source/conftest.py, where source/<pkg>/ (no __init__.py) gets
promoted to a namespace package and breaks
`from isaaclab_teleop import IsaacTeleopCfg`-style imports.

Companion to the previous commit (app_launcher torch-defer revert).
@hujc7 hujc7 force-pushed the jichuanh/validate-5656-revert-5647 branch from 4e72d56 to a6332aa Compare May 27, 2026 16:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

isaac-lab Related to Isaac Lab team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant