Skip to content

Fixed device binding for active learning loop#1692

Open
mehdiataei wants to merge 1 commit into
NVIDIA:mainfrom
mehdiataei:mehdiataei-active-learning-bug-device
Open

Fixed device binding for active learning loop#1692
mehdiataei wants to merge 1 commit into
NVIDIA:mainfrom
mehdiataei:mehdiataei-active-learning-bug-device

Conversation

@mehdiataei
Copy link
Copy Markdown
Collaborator

PhysicsNeMo Pull Request

Fixed device binding. DistributedManager.device binds the descriptor, not the torch.device.

Description

Checklist

Dependencies

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Jun 2, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@mehdiataei
Copy link
Copy Markdown
Collaborator Author

/blossom-ci

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Jun 2, 2026

Greptile Summary

This PR fixes a bug where DefaultTrainingLoop.__init__ read DistributedManager.device — the raw property descriptor object on the class — instead of DistributedManager().device, the actual torch.device returned by the instance property. A new test confirms the resolved value is a torch.device and matches the singleton's device.

  • loop.py: One-line change from DistributedManager.device to DistributedManager().device; safe because the call is already guarded by DistributedManager.is_initialized().
  • test_loop.py: New test initializes the singleton when needed and asserts both the type and value of loop.device; teardown uses _shared_state.clear() rather than DistributedManager.cleanup() (see inline comment).

Important Files Changed

Filename Overview
physicsnemo/active_learning/loop.py Single-line fix: DistributedManager.device (class-level property descriptor) replaced with DistributedManager().device (instance property returning a torch.device); already guarded by is_initialized().
test/active_learning/test_loop.py New test verifies the fix; teardown uses _shared_state.clear() rather than the official DistributedManager.cleanup() API.
CHANGELOG.md CHANGELOG entry added correctly under the unreleased "Fixed" section.

Reviews (1): Last reviewed commit: "Fixed device binding" | Re-trigger Greptile

Comment on lines +181 to +183
finally:
if not was_initialized:
DistributedManager._shared_state.clear()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The teardown accesses the internal Borg-pattern dict directly with .clear() instead of using DistributedManager.cleanup(), the official API. cleanup() also handles process-group teardown when _distributed is True, so the manual .clear() could skip that step in an environment where distributed training is actually active (e.g. when this test runs as part of a multi-process suite).

Suggested change
finally:
if not was_initialized:
DistributedManager._shared_state.clear()
finally:
if not was_initialized:
DistributedManager.cleanup()

Signed-off-by: Mehdi Ataei <ataei8@gmail.com>
@mehdiataei mehdiataei force-pushed the mehdiataei-active-learning-bug-device branch from 9785f5b to 8ed3be0 Compare June 2, 2026 21:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant