HDDS-14834. Fix race condition between DeadNodeHandler and HealthyReadOnlyNodeHandler on NetworkTopology#9926

Open
ivandika3 wants to merge 1 commit into apache:master from ivandika3:HDDS-14834

@ivandika3 ivandika3 commented Mar 15, 2026

What changes were proposed in this pull request?

DeadNodeHandler and HealthyReadOnlyNodeHandler run on separate SingleThreadExecutors, which can lead to a race condition where a resurrected datanode is removed from the NetworkTopology after being re-added. This leaves the node reachable but invisible to the placement policy.

Fix: DeadNodeHandler now checks the current node state before removing it from the topology, skipping removal if the node is no longer DEAD. HealthyReadOnlyNodeHandler uses unconditional add (idempotent) instead of a contains-then-add check, closing the TOCTOU gap.

Made-with: Cursor

A small window for the race remains, since the state check and the topology removal are not guarded by any synchronization (i.e. a lock), but the window is narrower than in the previous implementation.
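The check-before-remove and unconditional-add behaviour described above can be illustrated with a minimal, self-contained sketch. This is not the Ozone code: `TopologyRaceSketch`, its `states` map, its `topology` set, and both handler methods are hypothetical stand-ins for the node state manager, `NetworkTopology`, and the two handlers.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of the fix; names do not correspond to real Ozone classes.
public class TopologyRaceSketch {
    public enum NodeState { HEALTHY_READONLY, DEAD }

    // Stand-in for the node state manager: latest known state per node.
    public final Map<String, NodeState> states = new ConcurrentHashMap<>();
    // Stand-in for NetworkTopology: the set of nodes visible to placement.
    public final Set<String> topology = ConcurrentHashMap.newKeySet();

    // Dead-node handling: re-check the current state before removing.
    // Returns true if the node was treated as dead and removed.
    public boolean onDeadNode(String node) {
        if (states.get(node) != NodeState.DEAD) {
            // The node was resurrected after the event was queued: skip removal.
            return false;
        }
        // Not atomic with the check above, so a narrow race window remains.
        topology.remove(node);
        return true;
    }

    // Healthy-read-only handling: add unconditionally. Set.add is idempotent,
    // so no contains-then-add check is needed.
    public void onHealthyReadOnlyNode(String node) {
        topology.add(node);
    }
}
```

In this model, a stale dead-node event that is processed after the node has already moved back to `HEALTHY_READONLY` leaves the node in `topology`, which is the interleaving the fix targets.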

Alternative approaches considered

  • Use a shared SingleThreadExecutor for both DeadNodeHandler and HealthyReadOnlyNodeHandler: this would
    require a large change to the SCM event framework and might delay event processing
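For context on why a shared executor would remove the race, here is a minimal sketch using only `java.util.concurrent`. It does not use the SCM event framework; `SharedExecutorSketch` and the string "events" are hypothetical. Tasks submitted to one single-thread executor run sequentially in submission order, so the two handlers could never interleave, at the cost of one handler waiting behind the other.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; not how the SCM event framework is wired.
public class SharedExecutorSketch {
    // Runs a "dead" and a "healthy read-only" handler on one shared executor
    // and returns the order in which they executed.
    public static List<String> run() {
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        ExecutorService shared = Executors.newSingleThreadExecutor();
        try {
            // Both handlers go through the same single worker thread.
            shared.execute(() -> log.add("dead:dn1"));
            shared.execute(() -> log.add("healthy-readonly:dn1"));
        } finally {
            shared.shutdown();
        }
        try {
            if (!shared.awaitTermination(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("handlers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return log;
    }
}
```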

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14834

How was this patch tested?

Unit tests (clean CI run: https://github.com/ivandika3/ozone/actions/runs/23047544006)

@ivandika3 ivandika3 self-assigned this Mar 15, 2026