[CP 1242] Fix Node Assignments calculations on operator restart #507

Open
ci-penbot-01 wants to merge 1 commit into ROCm:main from ci-penbot-01:CP.O2O.pensando.gpu-operator.1242.rocm.gpu-operator.main
@ci-penbot-01 (Contributor)

Cherry-pick of pensando/gpu-operator#1242.


Source PR Description (pensando/gpu-operator#1242):

Fix two bugs in DeviceConfig node assignment management:

  1. buildNodeAssignments now logs and skips node assignment conflicts instead of returning a fatal error. A CR-level conflict should not block the entire operator — the runtime validateNodeAssignments check already handles this per-CR during reconciliation.

  2. Remove premature updateNodeAssignments call during finalization that freed nodes from the in-memory map before the finalizer was removed. Node cleanup is now handled solely via the NotFound path after CR garbage collection, preventing other DeviceConfigs from claiming nodes mid-finalization.
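
To illustrate the first fix, here is a minimal sketch of a log-and-skip rebuild of the node assignment map. It is not the operator's actual code: the `deviceConfig` type, the `buildNodeAssignments` signature, the returned conflict list, and the name-sorted processing order are all assumptions made for the example.

```go
package main

import (
	"log"
	"sort"
)

// deviceConfig is a hypothetical, simplified stand-in for a DeviceConfig CR:
// a name plus the nodes its selector currently matches.
type deviceConfig struct {
	Name  string
	Nodes []string
}

// buildNodeAssignments rebuilds the in-memory node -> DeviceConfig map, as
// might happen on operator restart. A conflicting node is logged and skipped
// for the later CR; no error is returned, so one bad CR cannot block startup.
func buildNodeAssignments(configs []deviceConfig) (map[string]string, []string) {
	// Sort a copy by name so the result is deterministic in this sketch.
	// (Assumption: the real operator may order CRs differently.)
	sorted := append([]deviceConfig(nil), configs...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Name < sorted[j].Name })

	assignments := make(map[string]string)
	var conflicts []string
	for _, cfg := range sorted {
		for _, node := range cfg.Nodes {
			if owner, taken := assignments[node]; taken && owner != cfg.Name {
				log.Printf("node %s already assigned to %s; skipping for %s", node, owner, cfg.Name)
				conflicts = append(conflicts, node+":"+cfg.Name)
				continue
			}
			assignments[node] = cfg.Name
		}
	}
	return assignments, conflicts
}
```

With two hypothetical configs `cfg-a` (matching `n1`) and `cfg-b` (matching `n1` and `n2`), this sketch assigns `n1` to `cfg-a`, assigns `n2` to `cfg-b`, and records a single conflict for `cfg-b` on `n1`. The conflicting CR would then be rejected per-CR by the runtime check during reconciliation, as described above.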

Also adds DRA driver DaemonSet cleanup to the finalization path, which was previously only handled during normal reconciliation.
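
The intended ordering of finalization and node release can be sketched as follows. This is a simplified model under stated assumptions, not the operator's implementation: `nodeTracker`, `cluster`, `finalize`, `handleNotFound`, and `canClaim` are hypothetical names, and plain maps stand in for API-server state.

```go
package main

// nodeTracker is a hypothetical in-memory node -> DeviceConfig owner map.
type nodeTracker struct {
	owners map[string]string
}

// cluster is a hypothetical stand-in for API-server state.
type cluster struct {
	finalizers map[string]bool     // DeviceConfig name -> finalizer still present
	daemonSets map[string][]string // DeviceConfig name -> owned DaemonSets
}

// finalize models the finalization path: it deletes the CR's owned DaemonSets
// (the DRA driver DaemonSet among them) and removes the finalizer. It
// deliberately leaves the node tracker untouched.
func finalize(c *cluster, name string) {
	delete(c.daemonSets, name)
	delete(c.finalizers, name)
}

// handleNotFound models the NotFound path, reached once the CR has been
// garbage collected. Only here are the CR's nodes released.
func handleNotFound(t *nodeTracker, name string) {
	for node, owner := range t.owners {
		if owner == name {
			delete(t.owners, node) // deleting during range is safe in Go
		}
	}
}

// canClaim reports whether the named DeviceConfig may take the node.
func (t *nodeTracker) canClaim(node, name string) bool {
	owner, taken := t.owners[node]
	return !taken || owner == name
}
```

In this model, calling `finalize` for `cfg-a` leaves its nodes owned, so `canClaim` still returns false for any other DeviceConfig. The nodes become claimable only after `handleNotFound` runs, which mirrors the claim that no other DeviceConfig can take nodes mid-finalization.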

Cherrypick triggered by: ACP-Automation

(cherry picked from commit a945553c4fe6061b783fd3826a761ba4d6cf5df9)