IGNITE-28711 Introduced IgniteFeature related data structures#13195
IGNITE-28711 Introduced IgniteFeature related data structures#13195petrov-mg wants to merge 11 commits into
Conversation
7446109 to
3abdb8e
Compare
| private volatile boolean isNodeFenceActive; | ||
|
|
||
| /** Pair with current and target versions. {@code null} when rolling upgrade is disabled. */ | ||
| @Nullable private volatile IgnitePair<IgniteProductVersion> rollUpVers; | ||
| /** */ | ||
| private volatile boolean isVersionUpgradeEnabled; |
There was a problem hiding this comment.
Should we keep these two fields in metastorage to restore them in persistence mode after full restart of cluster?
There was a problem hiding this comment.
Storing RU state in metastorage has a drawback - any data stored in metastorage requires full backward compatibility because it is written to PDS. Furthermore, there is currently no strong reasons to support RU state recovery after a full cluster restart. This has also been confirmed by potential users of the RU mechanism. In any case, if such a need arises, it can be easily implemented.
| @Override public @Nullable IgniteNodeValidationResult validateNode(ClusterNode node) { | ||
| synchronized (lock) { | ||
| lastJoiningNode = node; | ||
| joiningNodes.add(joiningNode); |
There was a problem hiding this comment.
The problem is that a joining node may leave the cluster without producing any discovery event.
See org.apache.ignite.spi.discovery.tcp.ServerImpl.RingMessageWorker#processJoinRequestMessage, around lines 4522-4595:
err = spi.getSpiContext().validateNode(node);
...
if (!Objects.equals(locMarsh, rmtMarsh)) {
...
// Send message "Local node's marshaller differs from remote node's marshaller"
trySendMessageDirectly(node, new TcpDiscoveryCheckFailedMessage(locNodeId, sndMsg));
return;
}In this path the node has already passed component validation, but it is rejected later by TCP discovery checks and never joins the topology. As a result, the coordinator does not receive NODE_JOINED/NODE_LEFT/NODE_FAILED for this node, so it cannot remove it from joiningNodes by listening to discovery events only.
This is why the previous implementation tracked lastJoiningNode and lastJoiningNodeTimestamp. They were needed to eventually forget such abandoned join attempts
| /** Pair with current and target versions. {@code null} when rolling upgrade is disabled. */ | ||
| @Nullable private volatile IgnitePair<IgniteProductVersion> rollUpVers; | ||
| /** */ | ||
| private volatile boolean isVersionUpgradeEnabled; |
There was a problem hiding this comment.
| private volatile boolean isVersionUpgradeEnabled; | |
| private volatile boolean isVerUpgradeEnabled; |
3790ced to
24e6cec
Compare
| private IgniteInternalFuture<Message> executePreparePhase(Message req) { | ||
| synchronized (topGuard) { | ||
| if (isNodeFenceActive) { | ||
| return new GridFinishedFuture<>(new IgniteCheckedException( | ||
| "Cluster version finalization procedure is already in progress")); | ||
| } | ||
|
|
||
| synchronized (lock) { | ||
| minMaxVerPair = resolveMinMaxNodeVersions(); | ||
| Set<IgniteProductVersion> distinctNodeVersions = distinctClusterProductVersions(); | ||
|
|
||
| if (!minMaxVerPair.get1().equals(minMaxVerPair.get2())) | ||
| throw new IgniteCheckedException("Can't disable rolling upgrade with different versions in cluster: " | ||
| + minMaxVerPair.get1() + ", " + minMaxVerPair.get2()); | ||
| if (distinctNodeVersions.size() > 1) { | ||
| return new GridFinishedFuture<>(new IgniteCheckedException( | ||
| "Cluster version finalization failed. The topology contains nodes running multiple different" + | ||
| " versions [distinctNodeVersions=" + distinctNodeVersions + "]" | ||
| )); | ||
| } | ||
|
|
||
| if (lastJoiningNode != null) { | ||
| IgniteProductVersion lastJoiningNodeVer = IgniteProductVersion.fromString(lastJoiningNode.attribute(ATTR_BUILD_VER)); | ||
| isNodeFenceActive = true; | ||
|
|
||
| if (!minMaxVerPair.get1().equals(lastJoiningNodeVer)) | ||
| throw new IgniteCheckedException("Can't disable rolling upgrade with different versions in cluster: " | ||
| + minMaxVerPair.get1() + ", " + lastJoiningNodeVer); | ||
| return new GridFinishedFuture<>(); | ||
| } | ||
| } |
There was a problem hiding this comment.
Do we need to execute the prepare phase on every node?
Consider a 3-node cluster A, B, C where A is the coordinator. A starts finalization and sends the prepare phase to all nodes. If prepare succeeds on B but fails on C for some reasons, B has already set isNodeFenceActive=true, while the coordinator aborts finalization because one node failed. Since the complete phase is not started, B never resets the fence. A later finalization attempt will fail because B now reports "Cluster version finalization procedure is already in progress", and joins routed through B will also be rejected.
Could you clarify why prepare must validate cluster and set the fence independently on every node? If this is intended to protect coordinator changes, we probably still need a rollback mechanism for isNodeFenceActive when prepare fails on any node
There was a problem hiding this comment.
If prepare succeeds on B but fails on C for some reasons
For what reasons? This is crucial, as we rely on message delivery guarantee provided by Discovery.
coordinator aborts finalization because one node failed
Currently the distributed process is not aborted if nodes leave the cluster.
There was a problem hiding this comment.
My point is not about discovery messaging.
The prepare phase can finish with an exception in GridFinishedFuture (the version check). So isNodeFenceActive becomes false on the node that failed, but true on the nodes that passed.
In my scenario, coordinator A may fail its own prepare phase, if its local joiningNodes still has a node with a different version. Only the coordinator handles joiningNodes, so the discovery delivery guarantee does not help here.
Because A returns an error, finishPreparePhase does not start completePhase. And completePhase is the only place that sets isNodeFenceActive back to false. So B and C keep the fence forever, while A is not even fenced
|
|
||
| if (log.isInfoEnabled()) | ||
| log.info("Cluster version was successfully finalized [activeLogicalVer=" + clusterLogicalVersion() + ']'); | ||
| } |
There was a problem hiding this comment.
Should we allow finalizeProc.start() from non-coordinator nodes?
finalizeClusterVersion can currently be called from different nodes at the same time. In that case we start two different distributed processes, for example finProc1 and finProc2.
finProc1 may execute prepare first and set isInProgress=true and isNodeFenceActive=true. Then finProc2 executes prepare and fails with finalization procedure is already in progress. However, when finProc2 finishes, finishProcess resets isInProgress=false for any reqId. Also, the prepare error path resets isNodeFenceActive=false.
As a result, the failed concurrent finalization can clear the fence of the still running finalization process.
Maybe we should allow finalize only from the coordinator, like the old RU implementation did, or keep isInProgress and isNodeFenceActive to a specific reqId and reset them only for the process that set them
| spi(grid(1)).blockMessages((node, msg) -> msg instanceof SingleNodeMessage); | ||
|
|
||
| try { | ||
|
|
There was a problem hiding this comment.
| else if (U.isLocalNodeCoordinator(ctx.discovery())) | ||
| completePhase.start(reqId, null); |
There was a problem hiding this comment.
I have some doubts about these lines.
Let's assume we start finalization and executePreparePhase completed on all nodes. After that we start finishPreparePhase on all nodes. If coordinator disconnects at this moment, other nodes may not execute start of completePhase. As a result, none of grids will start the completePhase. Moreover, if we try to call finalization again, it will be canceled because of enabled fences. And we also won't be able to connect new nodes to cluster. And probably the only way to fix it will be full restart of cluster during rolling upgrade
Of course, looks like the scenario is rare and we have the same pattern in other processes. But here it will break concept of rolling upgrade
WDYT?
There was a problem hiding this comment.
The issue you describe can indeed occur. A known example is when the coordinator enqueues an InitMessage corresponding to the finalization completion phase to the RingMessageWorker and then fails. There may be other such cases.
Even so, I propose keeping the current logic as is. There's no point in further complicating the RU logic to fix this issue. As you noted, the chances of such a scenario are low.
I think it worth it to implement @sergey-chugunov-1985 suggestion and introduce a mechanism for manually aborting the cluster version finalization process. This will give us a backup plan for such corner cases.
The PR contains an implementation of the mentioned proposal.
| if (isNodeFenceActive) { | ||
| return new IgniteNodeValidationResult( | ||
| joiningNode.id(), | ||
| "Node joins are not allowed during cluster version finalization [joiningNode=" + joiningNode + ']'); |
There was a problem hiding this comment.
We have received user feedback that some error messages do not explain what to do next. In this case, the finalization process is an internal operation, so it is unclear whether the user should retry or take another action. Could we add an actionable hint, such as Wait for the current finalization process to complete and try again later?
We can also add similar tips to other RU errors, but it is up to you
Thank you for submitting the pull request to the Apache Ignite.
In order to streamline the review of the contribution
we ask you to ensure the following steps have been taken:
The Contribution Checklist
The description explains WHAT and WHY was made instead of HOW.
The following pattern must be used:
IGNITE-XXXX Change summarywhereXXXX- number of JIRA issue.(see the Maintainers list)
the
green visaattached to the JIRA ticket (see tabPR Checkat TC.Bot - Instance 1 or TC.Bot - Instance 2)Notes
If you need any help, please email dev@ignite.apache.org or ask anу advice on http://asf.slack.com #ignite channel.