IGNITE-28711 Introduced IgniteFeature related data structures by petrov-mg · Pull Request #13195 · apache/ignite

petrov-mg · 2026-05-31T20:42:44Z

Thank you for submitting the pull request to the Apache Ignite.

In order to streamline the review of the contribution
we ask you to ensure the following steps have been taken:

The Contribution Checklist

There is a single JIRA ticket related to the pull request.
The web-link to the pull request is attached to the JIRA ticket.
The JIRA ticket has the Patch Available state.
The pull request body describes changes that have been made.
The description explains WHAT and WHY was made instead of HOW.
The pull request title is treated as the final commit message.
The following pattern must be used: IGNITE-XXXX Change summary where XXXX - number of JIRA issue.
A reviewer has been mentioned through the JIRA comments
(see the Maintainers list)
The pull request has been checked by the Teamcity Bot and
the green visa attached to the JIRA ticket (see tab PR Check at TC.Bot - Instance 1 or TC.Bot - Instance 2)

Notes

If you need any help, please email dev@ignite.apache.org or ask for advice on the http://asf.slack.com #ignite channel.

chesnokoff · 2026-06-02T13:31:09Z

+    private volatile boolean isNodeFenceActive;

-    /** Pair with current and target versions. {@code null} when rolling upgrade is disabled. */
-    @Nullable private volatile IgnitePair<IgniteProductVersion> rollUpVers;
+    /** */
+    private volatile boolean isVersionUpgradeEnabled;


Should we keep these two fields in metastorage so that they can be restored in persistence mode after a full restart of the cluster?

Storing RU state in metastorage has a drawback: any data stored in metastorage requires full backward compatibility because it is written to PDS. Furthermore, there are currently no strong reasons to support RU state recovery after a full cluster restart. This has also been confirmed by potential users of the RU mechanism. In any case, if such a need arises, it can be easily implemented.

chesnokoff · 2026-06-02T13:43:25Z

-    @Override public @Nullable IgniteNodeValidationResult validateNode(ClusterNode node) {
-        synchronized (lock) {
-            lastJoiningNode = node;
+            joiningNodes.add(joiningNode);


The problem is that a joining node may leave the cluster without producing any discovery event.

See org.apache.ignite.spi.discovery.tcp.ServerImpl.RingMessageWorker#processJoinRequestMessage, around lines 4522-4595:

err = spi.getSpiContext().validateNode(node); ... if (!Objects.equals(locMarsh, rmtMarsh)) { ... // Send message "Local node's marshaller differs from remote node's marshaller" trySendMessageDirectly(node, new TcpDiscoveryCheckFailedMessage(locNodeId, sndMsg)); return; }

In this path the node has already passed component validation, but it is rejected later by TCP discovery checks and never joins the topology. As a result, the coordinator does not receive NODE_JOINED/NODE_LEFT/NODE_FAILED for this node, so it cannot remove it from joiningNodes by listening to discovery events only.

This is why the previous implementation tracked lastJoiningNode and lastJoiningNodeTimestamp. They were needed to eventually forget such abandoned join attempts.
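To illustrate the concern, here is a minimal sketch of how abandoned join attempts could be forgotten by age rather than by discovery events. The class JoinAttemptTracker, its methods, and the TTL parameter are hypothetical names made up for this comment; this is not the Ignite implementation and not the approach taken in the PR.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper (not Ignite code): remembers join attempts and forgets stale ones. */
class JoinAttemptTracker {
    /** Join attempt start time in milliseconds, keyed by node ID. */
    private final Map<UUID, Long> attempts = new ConcurrentHashMap<>();

    /** Assumed upper bound on how long a join attempt may stay unresolved. */
    private final long ttlMillis;

    JoinAttemptTracker(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /** Called when a node passes component validation. */
    void onValidated(UUID nodeId, long nowMillis) {
        attempts.put(nodeId, nowMillis);
    }

    /** Called when a discovery event (joined, left, failed) arrives for the node. */
    void onResolved(UUID nodeId) {
        attempts.remove(nodeId);
    }

    /** Drops attempts older than the TTL and returns how many attempts are still pending. */
    int pending(long nowMillis) {
        attempts.values().removeIf(startedAt -> nowMillis - startedAt >= ttlMillis);

        return attempts.size();
    }
}
```

With a tracker like this, a node rejected by a later TCP discovery check would simply age out of the pending set instead of staying in it forever.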

Not relevant after IGNITE-28751

chesnokoff · 2026-06-02T13:57:56Z

-    /** Pair with current and target versions. {@code null} when rolling upgrade is disabled. */
-    @Nullable private volatile IgnitePair<IgniteProductVersion> rollUpVers;
+    /** */
+    private volatile boolean isVersionUpgradeEnabled;


Suggested change

private volatile boolean isVersionUpgradeEnabled;

private volatile boolean isVerUpgradeEnabled;

chesnokoff · 2026-06-15T10:06:59Z

+        private IgniteInternalFuture<Message> executePreparePhase(Message req) {
+            synchronized (topGuard) {
+                if (isNodeFenceActive) {
+                    return new GridFinishedFuture<>(new IgniteCheckedException(
+                        "Cluster version finalization procedure is already in progress"));
+                }

-        synchronized (lock) {
-            minMaxVerPair = resolveMinMaxNodeVersions();
+                Set<IgniteProductVersion> distinctNodeVersions = distinctClusterProductVersions();

-            if (!minMaxVerPair.get1().equals(minMaxVerPair.get2()))
-                throw new IgniteCheckedException("Can't disable rolling upgrade with different versions in cluster: "
-                    + minMaxVerPair.get1() + ", " + minMaxVerPair.get2());
+                if (distinctNodeVersions.size() > 1) {
+                    return new GridFinishedFuture<>(new IgniteCheckedException(
+                        "Cluster version finalization failed. The topology contains nodes running multiple different" +
+                            " versions [distinctNodeVersions=" + distinctNodeVersions + "]"
+                    ));
+                }

-            if (lastJoiningNode != null) {
-                IgniteProductVersion lastJoiningNodeVer = IgniteProductVersion.fromString(lastJoiningNode.attribute(ATTR_BUILD_VER));
+                isNodeFenceActive = true;

-                if (!minMaxVerPair.get1().equals(lastJoiningNodeVer))
-                    throw new IgniteCheckedException("Can't disable rolling upgrade with different versions in cluster: "
-                        + minMaxVerPair.get1() + ", " + lastJoiningNodeVer);
+                return new GridFinishedFuture<>();
            }
+        }


Do we need to execute the prepare phase on every node?

Consider a 3-node cluster A, B, C where A is the coordinator. A starts finalization and sends the prepare phase to all nodes. If prepare succeeds on B but fails on C for some reasons, B has already set isNodeFenceActive=true, while the coordinator aborts finalization because one node failed. Since the complete phase is not started, B never resets the fence. A later finalization attempt will fail because B now reports "Cluster version finalization procedure is already in progress", and joins routed through B will also be rejected.

Could you clarify why prepare must validate the cluster and set the fence independently on every node? If this is intended to protect against coordinator changes, we probably still need a rollback mechanism for isNodeFenceActive when prepare fails on any node.

If prepare succeeds on B but fails on C for some reasons

For what reasons? This is crucial, as we rely on message delivery guarantee provided by Discovery.

coordinator aborts finalization because one node failed

Currently the distributed process is not aborted if nodes leave the cluster.

My point is not about discovery messaging.

The prepare phase can finish with an exception in GridFinishedFuture (the version check). So isNodeFenceActive stays false on the node that failed, but becomes true on the nodes that passed.

In my scenario, coordinator A may fail its own prepare phase if its local joiningNodes still contains a node with a different version. Only the coordinator handles joiningNodes, so the discovery delivery guarantee does not help here.

Because A returns an error, finishPreparePhase does not start completePhase. And completePhase is the only place that sets isNodeFenceActive back to false. So B and C keep the fence forever, while A is not even fenced.
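The scenario and the rollback I have in mind can be modelled with a small in-memory sketch. FenceNode and FinalizationSketch are hypothetical classes written only for this comment; they model the fence flag per node and are not the distributed process code from the PR.

```java
import java.util.List;

/** Hypothetical model of a node taking part in finalization (not Ignite code). */
class FenceNode {
    final String name;

    /** Simulates a local prepare failure, e.g. the version check. */
    final boolean prepareFails;

    /** Models isNodeFenceActive. */
    boolean fenceActive;

    FenceNode(String name, boolean prepareFails) {
        this.name = name;
        this.prepareFails = prepareFails;
    }

    /** Prepare phase: raises the fence only if local validation passes. */
    boolean prepare() {
        if (prepareFails)
            return false;

        fenceActive = true;

        return true;
    }

    /** Lowers the fence (used by both rollback and completion in this model). */
    void releaseFence() {
        fenceActive = false;
    }
}

class FinalizationSketch {
    /** Runs prepare on every node; if any node fails, lowers the fence everywhere. */
    static boolean prepareAll(List<FenceNode> nodes) {
        boolean ok = true;

        // Every node executes prepare, regardless of failures on other nodes.
        for (FenceNode n : nodes)
            ok &= n.prepare();

        // Rollback: without this loop, nodes that passed would stay fenced forever.
        if (!ok) {
            for (FenceNode n : nodes)
                n.releaseFence();
        }

        return ok;
    }
}
```

If the rollback loop is removed, running prepareAll on A (failing), B, and C leaves B and C fenced while A is not, which is exactly the state described above.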

chesnokoff · 2026-06-17T08:42:37Z

+
+        if (log.isInfoEnabled())
+            log.info("Cluster version was successfully finalized [activeLogicalVer=" + clusterLogicalVersion() + ']');
+    }


Should we allow finalizeProc.start() from non-coordinator nodes?

finalizeClusterVersion can currently be called from different nodes at the same time. In that case we start two different distributed processes, for example finProc1 and finProc2.

finProc1 may execute prepare first and set isInProgress=true and isNodeFenceActive=true. Then finProc2 executes prepare and fails with "Cluster version finalization procedure is already in progress". However, when finProc2 finishes, finishProcess resets isInProgress=false for any reqId. Also, the prepare error path resets isNodeFenceActive=false.

As a result, the failed concurrent finalization can clear the fence of the still running finalization process.

Maybe we should allow finalization only from the coordinator, as the old RU implementation did, or bind isInProgress and isNodeFenceActive to a specific reqId and reset them only for the process that set them.
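A minimal sketch of the second option, assuming the fence simply remembers the reqId that raised it. OwnedFence and its methods are hypothetical names used for illustration; the actual fix in the PR may look different.

```java
import java.util.UUID;

/** Hypothetical fence that can only be lowered by the process that raised it. */
class OwnedFence {
    /** ID of the distributed process that holds the fence, or null if the fence is down. */
    private UUID owner;

    /** Raises the fence for the given process; fails if another process already holds it. */
    synchronized boolean tryAcquire(UUID reqId) {
        if (owner != null)
            return false;

        owner = reqId;

        return true;
    }

    /** Lowers the fence only if it is held by the given process. */
    synchronized boolean release(UUID reqId) {
        // equals() rather than ==, because the ID may arrive as a separate instance.
        if (!reqId.equals(owner))
            return false;

        owner = null;

        return true;
    }

    /** Returns true while some process holds the fence. */
    synchronized boolean isActive() {
        return owner != null;
    }
}
```

Here the error path of finProc2 would call release with its own reqId, which is a no-op while finProc1 holds the fence.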

…rocess ID explicitly.

sergey-chugunov-1985 · 2026-06-18T14:33:50Z

+        spi(grid(1)).blockMessages((node, msg) -> msg instanceof SingleNodeMessage);
+
+        try {
+


Suggested change

chesnokoff · 2026-06-19T12:56:24Z

+            else if (U.isLocalNodeCoordinator(ctx.discovery()))
+                completePhase.start(reqId, null);


I have some doubts about these lines.
Let's assume we start finalization and executePreparePhase has completed on all nodes. After that, finishPreparePhase runs on all nodes. If the coordinator disconnects at this moment, the other nodes may never start completePhase, so no node in the cluster starts it. Moreover, if we try to run finalization again, it will be rejected because the fences are still enabled, and we also won't be able to connect new nodes to the cluster. The only way to fix this will probably be a full restart of the cluster during the rolling upgrade.

Of course, the scenario looks rare, and we have the same pattern in other processes. But here it breaks the concept of rolling upgrade.

WDYT?

The issue you describe can indeed occur. A known example is when the coordinator enqueues an InitMessage corresponding to the finalization completion phase to the RingMessageWorker and then fails. There may be other such cases.

Even so, I propose keeping the current logic as is. There's no point in further complicating the RU logic to fix this issue. As you noted, the chances of such a scenario are low.

I think it is worth implementing the suggestion from @sergey-chugunov-1985 and introducing a mechanism for manually aborting the cluster version finalization process. This will give us a backup plan for such corner cases.

The PR contains an implementation of the mentioned proposal.

chesnokoff · 2026-06-19T13:24:42Z

+            if (isNodeFenceActive) {
+                return new IgniteNodeValidationResult(
+                    joiningNode.id(),
+                    "Node joins are not allowed during cluster version finalization [joiningNode=" + joiningNode + ']');


We have received user feedback that some error messages do not explain what to do next. In this case, the finalization process is an internal operation, so it is unclear whether the user should retry or take another action. Could we add an actionable hint, such as "Wait for the current finalization process to complete and try again later"?

We can also add similar tips to other RU errors, but it is up to you.
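For example, the message could be built roughly like this. JoinRejectionMessage is a hypothetical helper used only to show the wording; in the PR the text would stay inline in the IgniteNodeValidationResult constructor call.

```java
/** Hypothetical helper showing one possible wording of the actionable message. */
class JoinRejectionMessage {
    /** Builds the rejection text: what happened, what to do next, then the parameters. */
    static String forFinalization(String joiningNode) {
        return "Node joins are not allowed during cluster version finalization. " +
            "Wait for the current finalization process to complete and try again later " +
            "[joiningNode=" + joiningNode + ']';
    }
}
```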

petrov-mg force-pushed the IGNITE-28711 branch 5 times, most recently from 7446109 to 3abdb8e Compare June 2, 2026 11:18

chesnokoff reviewed Jun 2, 2026

petrov-mg force-pushed the IGNITE-28711 branch 6 times, most recently from 3790ced to 24e6cec Compare June 10, 2026 22:27

chesnokoff reviewed Jun 15, 2026

petrov-mg added 3 commits June 16, 2026 17:08

IGNITE-28711 Introduced IgniteFeature related data structures

7f04a33

IGNITE-28711 Fixed code style

5789d84

IGNITE-28711 Fixed reset of node fence state

21b45ed

petrov-mg force-pushed the IGNITE-28711 branch from ce91437 to 21b45ed Compare June 16, 2026 14:12

chesnokoff reviewed Jun 17, 2026

petrov-mg added 2 commits June 17, 2026 17:26

IGNITE-28711 Bounded active finalization process to its distributed p…

47d54f8

…rocess ID explicitly.

IGNITE-28711 Added test

daeb9f0

petrov-mg force-pushed the IGNITE-28711 branch from 7c2ea9a to daeb9f0 Compare June 17, 2026 16:10

IGNITE-28711 Fixed naming.

cadf7b5

sergey-chugunov-1985 requested changes Jun 18, 2026

petrov-mg added 2 commits June 18, 2026 22:44

IGNITE-28711 Fixed minor issues.

8175698

IGNITE-28711 Supported client node reconnect handling.

d7d1d97

petrov-mg force-pushed the IGNITE-28711 branch from 2093c8f to d7d1d97 Compare June 18, 2026 22:33

chesnokoff reviewed Jun 19, 2026

IGNITE-28711 Improved error message.

0a783a7

petrov-mg force-pushed the IGNITE-28711 branch from 770b421 to 0a783a7 Compare June 19, 2026 15:34

petrov-mg added 2 commits June 21, 2026 01:01

IGNITE-28711 Added support for finalization abort.

c37dfa3

IGNITE-28711 Removed local features provider.

20fb226
