[Diag][DO NOT MERGE] Negative arm — revert #5633 torch-defer to validate #5656 numpy!=2.3.5 fix#5800
[Diag][DO NOT MERGE] Negative arm — revert #5633 torch-defer to validate #5656 numpy!=2.3.5 fix#5800hujc7 wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
🔬 Diagnostic PR Review
Summary
This diagnostic PR reverts the torch-defer mitigation from PR #5633 to validate whether the numpy!=2.3.5 exclusion (PR #5656) alone is sufficient to prevent the OpenBLAS pthread_atfork SIGSEGV.
Changes Analysis
1. source/isaaclab/isaaclab/app/app_launcher.py (-43/+6 lines)
The revert correctly undoes the deferred torch import pattern:
- Removes
_deferred_cuda_device_idtracking - Removes
_set_deferred_cuda_device()method - Removes
apply_rtx_determinism_settings()static method - Removes
--deterministicCLI argument and related handling - Moves
torch.cuda.set_device()back into_resolve_device_settings()(pre-SimulationApp fork)
This restores the original behavior where import torch (and transitively numpy/OpenBLAS) occurs before Kit's fork() call, which would trigger the SIGSEGV if numpy 2.3.5's buggy OpenBLAS atfork handler is registered.
2. source/isaaclab/test/conftest.py (+29 lines, new file)
Well-designed diagnostic fixture that:
- Prints numpy version at pytest session start
- Identifies the bundled OpenBLAS
.sofilename (contains commit hash) - Correctly placed under
source/isaaclab/test/to avoid the namespace-package shadowing issue documented in PR #5655
The docstring clearly explains the rootdir anchoring rationale.
CI Status Notes
Check changelog fragmentsfailing is expected for diagnostic PRs- Docker/canary jobs still pending — these will provide the actual validation signal
Validation Criteria
Per the PR description:
- ✅ Pass scenario: numpy resolves to 2.3.4, canaries pass →
numpy!=2.3.5exclusion is sufficient - ❌ Fail scenario: SIGSEGV in
libomni.platforminfofork → both mitigations needed
Recommendations
- No merge — correctly labeled as diagnostic-only
- Monitor canary jobs (
isaaclab_physx,isaaclab_newton,isaaclab_rl) for the actual SIGSEGV validation - Document outcome in a comment once CI completes for future reference
Clean diagnostic implementation. The conftest placement fix from #5655 learnings is a nice detail.
ede21ac to
4e72d56
Compare
…e atfork bug Surgical hunk-level revert of commit a5eb9ad ("Fixes OmniHub startup in Docker tests", isaac-sim#5633) applied only to source/isaaclab/isaaclab/app/app_launcher.py. Removes the defer-torch mechanism so that `import torch` (and transitively `import numpy`) happens in AppLauncher.__init__ BEFORE SimulationApp's fork() through libomni.platforminfo. If the resolved numpy is 2.3.5, its bundled OpenBLAS pthread_atfork handler will SIGSEGV the canary jobs. Unlike the prior whole-file revert, this preserves PR isaac-sim#5449's `--deterministic` CLI flag and RTX-determinism logic, which landed after isaac-sim#5633 and was wiped as collateral damage in the previous attempt. Companion to the relocated diagnostic conftest in the next commit: source/isaaclab/test/conftest.py prints the resolved numpy + OpenBLAS hash so we can confirm which numpy actually landed. Refs PR isaac-sim#5656 (numpy!=2.3.5 exclusion fix being validated).
Prints the resolved numpy version + bundled OpenBLAS .so filename at pytest session start. Located at the repo root so every subprocess pytest spawned by tools/conftest.py discovers and loads it, regardless of which package's tests are running. Repo root has no isaaclab_* subdirectories, so importmode=prepend placing the repo root on sys.path does NOT shadow the real pip-installed packages — unlike source/conftest.py, where source/<pkg>/ (no __init__.py) gets promoted to a namespace package and breaks `from isaaclab_teleop import IsaacTeleopCfg`-style imports. Companion to the previous commit (app_launcher torch-defer revert).
4e72d56 to
a6332aa
Compare
1. Purpose
Negative arm of the validation for PR #5656 (the
numpy!=2.3.5exclusion).Develop currently carries two independent mitigations for the OpenBLAS atfork SIGSEGV:
a5eb9add4c3) which carried PR Prevents early numpy imports to avoid Kit crash #5620's content — defersimport torchinAppLauncheruntil afterSimulationAppstarts, so numpy/OpenBLAS isn't loaded in pytest's master process before the Kit fork.numpy 2.3.5from every install path, so even an early numpy import resolves to a non-buggy version.If either mitigation alone is sufficient, the other is redundant. This PR reverts (1) so we can see whether (2) holds on its own.
2. Setup
develop(latest).source/isaaclab/isaaclab/app/app_launcher.pyto its state just beforea5eb9add4c3(Fixes OmniHub startup in Docker tests #5633). Other files from Fixes OmniHub startup in Docker tests #5633 left alone — the defer-torch logic inAppLauncher.__init__is the only piece relevant to the atfork bug.source/isaaclab/test/conftest.pythat imports numpy and prints version + bundled OpenBLAS hash at pytest session start.The conftest is deliberately placed at
source/isaaclab/test/rather thansource/to avoid the namespace-package shadowing that broke PR #5655 — anchoring pytest's rootdir atsource/prepends it tosys.path, making everysource/<pkg>/(no__init__.py) shadow the real pip-installed package and turnfrom isaaclab_teleop import IsaacTeleopCfginto an ImportError before any test runs.3. Expected result
!=2.3.5under cmeel-boost's cap from [Fix] Exclude numpy 2.3.5 from every IsaacLab install path #5656).[dep-manifest]log line showsnumpy 2.3.4+libscipy_openblas64_-8fb3d286.so.import torchhappens inAppLauncher.__init__BEFORESimulationAppfork, but resolves to non-buggy numpy 2.3.4, so canary jobs (isaaclab_physx,isaaclab_newton,isaaclab_rl) pass.If any canary still SIGSEGVs in
libomni.platforminfofork, thenumpy!=2.3.5exclusion alone is not sufficient and #5633's torch-defer is also necessary — keep both.If all canaries pass, the
numpy!=2.3.5exclusion alone is sufficient and #5633's torch-defer is redundant — could be reverted in a follow-up cleanup.4. Lifecycle
Diagnostic-only. Do not merge. Closes once evidence is captured.
5. Related
numpy!=2.3.5exclusion being validatedsource/conftest.pyplacement there broke test discovery before the bug could surface)