Description
A CockroachDB node can become permanently stuck in the DECOMMISSIONING membership state if the user scales up the cluster (cockroach_cr.nodes) while a previous downscale operation is still in progress.
The operator logic for ReconcileDecommssion relies on the condition currentReplicas > cr.nodes. If a user upscales (making cr.nodes >= currentReplicas) while a node is in the intermediate DECOMMISSIONING state, that condition no longer holds and the operator skips the decommission step of the reconcile loop. Consequently, the node is neither fully decommissioned nor returned to ACTIVE status.
Root Cause Analysis
The issue lies in the entry conditions for the ReconcileDecommssion action. The operator triggers this action only if:
- The cluster is initialized.
- stsStatus.replicas == stsStatus.currentReplicas
- stsStatus.currentReplicas > cockroach_cr.nodes (intent to downscale).
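The three entry conditions above can be restated as a single predicate. This is a minimal sketch, not the operator's actual code: the type stsStatus and the function shouldReconcileDecommission are hypothetical names chosen for illustration, and only the comparisons themselves come from this report.

```go
package main

// stsStatus holds the two StatefulSet status fields named in the conditions.
// Hypothetical type for illustration; the operator's real types may differ.
type stsStatus struct {
	Replicas        int32
	CurrentReplicas int32
}

// shouldReconcileDecommission restates the entry conditions listed above.
// It is true only while the CR still asks for fewer nodes than are running.
func shouldReconcileDecommission(initialized bool, sts stsStatus, crNodes int32) bool {
	return initialized &&
		sts.Replicas == sts.CurrentReplicas &&
		sts.CurrentReplicas > crNodes
}
```

With 5 running replicas, the predicate holds for cr.nodes = 3 and stops holding as soon as cr.nodes is set back to 5, regardless of the membership state of any node.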
The Failure Scenario:
- User initiates downscale (e.g., 5 -> 3). Condition (3) is met.
- Operator calls Decommission(node). The node enters the DECOMMISSIONING state.
- Decommission(node) returns an error (e.g., network timeout, data moving too slowly). The operator returns early to retry later.
- User initiates upscale (e.g., 3 -> 5).
- On the next reconcile, Condition (3) evaluates to false because currentReplicas is no longer greater than cr.nodes.
- The operator skips the ReconcileDecommssion block entirely. The node remains DECOMMISSIONING indefinitely.
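The failure scenario can be modelled in a few lines. The sketch below is a toy simulation under stated assumptions, not the operator's code: cluster, node, decommission, reconcile, and simulate are all hypothetical names, and decommission is hard-coded to fail in order to mimic the stalled data movement.

```go
package main

import "errors"

// node tracks only the membership state of the node targeted for removal.
type node struct{ membership string }

// cluster is a toy model: the StatefulSet replica count and the desired CR size.
type cluster struct {
	currentReplicas int32
	crNodes         int32
	target          *node
}

var errStalled = errors.New("decommission stalled")

// decommission flips the node to "decommissioning" and then fails,
// mimicking a timeout while replicas are still being moved off the node.
func decommission(n *node) error {
	n.membership = "decommissioning"
	return errStalled
}

// reconcile models only the decommission branch and its entry condition.
func reconcile(c *cluster) error {
	if c.currentReplicas <= c.crNodes {
		return nil // entry condition is false: the block is skipped
	}
	return decommission(c.target)
}

// simulate replays downscale 5 -> 3, a failed decommission, then upscale 3 -> 5.
func simulate() string {
	c := &cluster{currentReplicas: 5, crNodes: 5, target: &node{membership: "active"}}
	c.crNodes = 3
	_ = reconcile(c) // fails, but the node is already decommissioning
	c.crNodes = 5
	_ = reconcile(c) // skipped: nothing reverts the node
	return c.target.membership
}
```

In this model nothing outside the guarded branch ever inspects the node's membership, so the second reconcile leaves the node in the decommissioning state.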
Steps to Reproduce
Prerequisites:
- A running CockroachDB cluster managed by the operator (5 nodes).
- (Optional) ChaosMesh installed to inject network faults.
- Initialize Workload:
Run the movr workload to generate sufficient data:
cockroach workload init movr --num-histories 1000000 --num-rides 100000 --num-users 100000 --num-vehicles 100000
- Inject Fault:
Use ChaosMesh to limit the Pod bandwidth to 1kbps. This ensures the decommissioning process stalls or errors out due to slow data replication.
- Trigger Downscale:
Update the CockroachDB CR to reduce the replica count (e.g., 5 -> 3).
- Verify State:
Wait until the target node enters the DECOMMISSIONING state:
cockroach node status --insecure --decommission
- Trigger Upscale:
Immediately update the CockroachDB CR to increase the replica count (e.g., back to 5 or higher).
Observed Behavior
The node previously targeted for removal remains in the DECOMMISSIONING state while new nodes are added. It does not revert to ACTIVE.
Log Output:
bash-5.1$ cockroach node status --insecure --decommission
id | address | sql_address | build | started_at | updated_at | locality | attrs | is_available | is_live | gossiped_replicas | is_decommissioning | membership | is_draining
-----+-----------------------------------------------------------+-----------------------------------------------------------+---------+--------------------------------------+--------------------------------------+----------+-------+--------------+---------+-------------------+--------------------+-----------------+--------------
1 | cockroachdb-0.cockroachdb.cockroach-operator-system:26258 | cockroachdb-0.cockroachdb.cockroach-operator-system:26257 | v25.4.2 | 2026-01-12 02:54:10.174444 +0000 UTC | 2026-01-12 03:07:17.954084 +0000 UTC | | [] | true | true | 104 | false | active | false
2 | cockroachdb-1.cockroachdb.cockroach-operator-system:26258 | cockroachdb-1.cockroachdb.cockroach-operator-system:26257 | v25.4.2 | 2026-01-12 02:55:32.643908 +0000 UTC | 2026-01-12 03:07:17.797383 +0000 UTC | | [] | true | true | 98 | false | active | false
3 | cockroachdb-2.cockroachdb.cockroach-operator-system:26258 | cockroachdb-2.cockroachdb.cockroach-operator-system:26257 | v25.4.2 | 2026-01-12 02:55:34.215002 +0000 UTC | 2026-01-12 03:07:16.41831 +0000 UTC | | [] | true | true | 100 | false | active | false
4 | cockroachdb-4.cockroachdb.cockroach-operator-system:26258 | cockroachdb-4.cockroachdb.cockroach-operator-system:26257 | v25.4.2 | 2026-01-12 02:39:19.128573 +0000 UTC | 2026-01-12 03:07:16.193822 +0000 UTC | | [] | true | true | 40 | true | decommissioning | false
5 | cockroachdb-3.cockroachdb.cockroach-operator-system:26258 | cockroachdb-3.cockroachdb.cockroach-operator-system:26257 | v25.4.2 | 2026-01-12 02:56:23.412081 +0000 UTC | 2026-01-12 03:07:18.188583 +0000 UTC | | [] | true | true | 104 | false | active | false
(5 rows)
Expected Behavior
If the operator detects that the desired replica count has increased (upscale) while a node is currently DECOMMISSIONING:
- The operator should detect the intermediate state.
- It should explicitly recommission the node (cancel the decommission) to return it to ACTIVE status before proceeding with the upscale.
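One possible shape for this behavior is a decision step that inspects node membership even when the downscale condition is false. This is a sketch of the idea only, assuming the operator can obtain a map of node ID to membership state; nextAction, the action constants, and the membership type are hypothetical names, not existing operator APIs. How the recommission itself is issued is left open here; the cockroach CLI offers `cockroach node recommission` for this purpose.

```go
package main

import "sort"

// membership mirrors the values shown in the `membership` column of
// `cockroach node status --decommission`.
type membership string

const (
	active          membership = "active"
	decommissioning membership = "decommissioning"
)

type action int

const (
	actionNone action = iota
	actionDecommission
	actionRecommission
)

// nextAction picks the next step for the decommission logic.
// Unlike the current entry condition, it still looks at node membership
// when no downscale is requested, so leftover nodes are not ignored.
func nextAction(currentReplicas, crNodes int32, nodes map[int]membership) (action, []int) {
	if currentReplicas > crNodes {
		return actionDecommission, nil // downscale still intended
	}
	var stuck []int
	for id, m := range nodes {
		if m == decommissioning {
			stuck = append(stuck, id)
		}
	}
	if len(stuck) == 0 {
		return actionNone, nil
	}
	sort.Ints(stuck) // deterministic order for logging and retries
	return actionRecommission, stuck
}
```

For the cluster shown in the log output (node 4 decommissioning, 5 replicas, cr.nodes = 5), this sketch would select a recommission of node 4 before any upscale work proceeds.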
Severity
Major
(But this is not a production failure)