feat(cluster): disconnect replication connections to orphan slaves by RiversJin · Pull Request #2850 · apache/kvrocks

RiversJin · 2025-03-23T11:50:08Z

It solves #2841.

RiversJin · 2025-03-23T11:59:08Z


  Status Execute([[maybe_unused]] engine::Context &ctx, [[maybe_unused]] Server *srv, Connection *conn,
                 std::string *output) override {
-    if (port_ != 0) {


Based on the replconf command definition, it's impossible to reach this code path without port_ being set, so I've removed the validation check for port_.

PragmaTwice · 2025-03-23T12:30:20Z

Thank you for your contribution! Could you add some test cases for it?

RiversJin · 2025-03-23T13:35:31Z

Thank you for your contribution! Could you add some test cases for it?

Of course, but save the boring tasks for weekdays :P.

RiversJin · 2025-03-24T07:27:47Z

Thank you for your contribution! Could you add some test cases for it?

@PragmaTwice After a closer look, this PR doesn't introduce any new changes other than disconnecting data replication for nodes removed from the cluster. I added a small snippet of assertion code in cluster_test.go for this scenario. Do you think there are any other areas that might need more testing?

mapleFU · 2025-03-24T07:49:18Z

-ClusterNode::ClusterNode(std::string id, std::string host, int port, int role, std::string master_id,
-                         const std::bitset<kClusterSlots> &slots)
-    : id(std::move(id)), host(std::move(host)), port(port), role(role), master_id(std::move(master_id)), slots(slots) {}
+ClusterNode::ClusterNode(std::string &&id, std::string &&host, int port, int role, std::string &&master_id,


Why were these changed?

Some notes:

For std::bitset, most implementations store the bits inline in the object (on the stack) rather than on the heap, so the move ctor is no different from the copy ctor. Usually you don't need to move it.

For ClassA(std::string x) : x(std::move(x)) {}, you can pass either an lvalue or an rvalue. For an rvalue, it involves two move ctors. For ClassA(std::string &&x) : x(std::move(x)) {}, you can only pass an rvalue, and the performance difference is little or none.

So these changes are not very useful.
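To make the trade-off in the notes above concrete, here is a minimal sketch comparing the two constructor styles. `ByValue`, `ByRvalueRef`, and `DemoIds` are hypothetical names used only for illustration; they are not part of kvrocks.

```cpp
#include <string>
#include <utility>

// Hypothetical stand-ins for illustration; not the kvrocks ClusterNode.
struct ByValue {
  std::string id;
  // Accepts lvalues (one copy, then one move) and rvalues (at most two moves).
  explicit ByValue(std::string id) : id(std::move(id)) {}
};

struct ByRvalueRef {
  std::string id;
  // Accepts rvalues only (one move); an lvalue argument does not compile.
  explicit ByRvalueRef(std::string &&id) : id(std::move(id)) {}
};

inline std::string DemoIds() {
  std::string name = "node-a";
  ByValue a(name);                       // lvalue: `name` is copied and stays usable
  ByValue b(std::string("node-b"));      // rvalue: moved into the member
  ByRvalueRef c(std::string("node-c"));  // rvalue: ByRvalueRef d(name) would be rejected
  return a.id + "," + b.id + "," + c.id + "," + name;
}
```

The by-value form costs at most one extra move of a `std::string`, which is cheap, and keeps the constructor usable with both lvalues and rvalues.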

mapleFU · 2025-03-24T07:49:59Z

+  if (version_ >= 0 && nodes_->count(node_id) > 0) {
+    myself_ = nodes_->at(node_id);


If performance is important, why not use find() and iterator-based operations? That way the map is searched only once.
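A small sketch of the difference the comment describes, assuming a hash map keyed by node ID. `Node`, `NodeMap`, `LookupTwice`, and `LookupOnce` are hypothetical names, not kvrocks types.

```cpp
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical node type and map alias, for illustration only.
struct Node {
  int port;
};
using NodeMap = std::unordered_map<std::string, std::shared_ptr<Node>>;

// Two searches: count() locates the key, then at() locates it again.
inline std::shared_ptr<Node> LookupTwice(const NodeMap &nodes, const std::string &id) {
  if (nodes.count(id) > 0) {
    return nodes.at(id);
  }
  return nullptr;
}

// One search: find() returns an iterator that already points at the entry.
inline std::shared_ptr<Node> LookupOnce(const NodeMap &nodes, const std::string &id) {
  auto it = nodes.find(id);
  if (it == nodes.end()) {
    return nullptr;
  }
  return it->second;
}
```

Both functions return the same result; the second simply avoids hashing and probing the key a second time.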

mapleFU · 2025-03-24T07:51:40Z

+  auto it = nodes_->find(node_id);
+  if (it == nodes_->end()) {
    return {Status::NotOK, "No this node in the cluster"};
  }

+  auto to_assign_node = it->second;


Suggested change

-  auto it = nodes_->find(node_id);
-  if (it == nodes_->end()) {
-    return {Status::NotOK, "No this node in the cluster"};
-  }
-  auto to_assign_node = it->second;
+  std::shared_ptr<ClusterNode> to_assign_node;
+  if (auto it = nodes_.find(); it != nodes_.end()) {
+    to_assign_node = ..
+  } else {
+    return {Status::NotOK, "No this node in the cluster"};
+  }

mapleFU · 2025-03-24T07:52:25Z

+    auto it = nodes_->find(myid_);
+    if (it != nodes_->end()) {


if(auto it = ;
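The abbreviated hint above appears to point at the C++17 `if` statement with an initializer. A minimal sketch of that form follows; `Result` and `DescribeNode` are hypothetical and only stand in for kvrocks' `Status` and the real call site.

```cpp
#include <map>
#include <string>

// Hypothetical result type standing in for kvrocks' Status.
struct Result {
  bool ok;
  std::string msg;
};

inline Result DescribeNode(const std::map<std::string, int> &ports, const std::string &id) {
  // `it` is visible in both branches and goes out of scope after the if/else,
  // so it cannot be reused by accident later in the function.
  if (auto it = ports.find(id); it != ports.end()) {
    return {true, id + ":" + std::to_string(it->second)};
  } else {
    return {false, "No this node in the cluster"};
  }
}
```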

mapleFU · 2025-03-24T07:52:46Z

+    if (!is_slave) {
+      srv_->CleanupOrphanSlaves(version_, *nodes_);
+    }


So this is what this patch actually does?

mapleFU · 2025-03-24T07:53:53Z

+  std::string_view ip_;
+  std::string_view addr_;


This looks really unsafe. Who would own the memory for them?
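One common way to address this kind of ownership concern is to store `std::string` members and hand out `std::string_view` only from accessors. The sketch below assumes that approach; `OwnedPeer` and `MakePeer` are hypothetical names, not the PR's `PeerInfo`.

```cpp
#include <string>
#include <string_view>
#include <utility>

// Hypothetical sketch; names are illustrative, not the kvrocks PeerInfo.
class OwnedPeer {
 public:
  // ip_ is declared before addr_, so it is initialized first and safe to read here.
  OwnedPeer(std::string ip, int port)
      : ip_(std::move(ip)), addr_(ip_ + ":" + std::to_string(port)) {}

  // The views borrow from members, so they stay valid while this object is alive
  // and is neither moved from nor modified.
  std::string_view Ip() const { return ip_; }
  std::string_view Addr() const { return addr_; }

 private:
  std::string ip_;    // owns the bytes
  std::string addr_;  // owns the bytes
};

inline OwnedPeer MakePeer() {
  std::string temp = "10.0.0.7";  // destroyed when this function returns
  return OwnedPeer(temp, 6666);   // the object holds its own copy, so nothing dangles
}
```

Had the members been `std::string_view`, the object returned by `MakePeer` would refer to `temp` after it was destroyed.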

mapleFU · 2025-03-24T07:56:48Z

+    const auto peer_info = slave_thread->GetConn()->GetPeerInfo();
+    auto peer_version = peer_info->GetPeerVersion();
+    if (peer_version < 0 || peer_version > version) {
+      // The peer version is greater than the current version,


What does peer_version < 0 mean?

mapleFU · 2025-03-24T07:58:32Z

+  std::lock_guard<std::mutex> lg(slave_threads_mu_);
+
+  for (auto &slave_thread : slave_threads_) {
+    const auto peer_info = slave_thread->GetConn()->GetPeerInfo();


Do we need to check GetPeerInfo != nullptr?
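The two questions above (the null check and the meaning of a negative version) can be pictured with a rough sketch of an orphan cleanup loop. This is not the PR's implementation: `Peer`, `Conn`, `CloseOrphans`, matching by node ID, and the reading of a negative version as "not reported yet" are all assumptions made for illustration.

```cpp
#include <cstdint>
#include <memory>
#include <set>
#include <string>
#include <vector>

// All types and names here are hypothetical.
struct Peer {
  std::string node_id;
  int64_t version;  // assumption: a negative value means the peer never reported one
};

struct Conn {
  std::unique_ptr<Peer> peer;  // may be null if no peer info was recorded
  bool closed = false;
};

// Marks connections whose peer is no longer a cluster member; returns how many were closed.
inline int CloseOrphans(std::vector<Conn> &conns, int64_t version,
                        const std::set<std::string> &members) {
  int closed = 0;
  for (auto &conn : conns) {
    const Peer *peer = conn.peer.get();
    if (peer == nullptr) continue;  // guard against missing peer info
    // Skip peers with an unknown version, or with a newer topology than ours.
    if (peer->version < 0 || peer->version > version) continue;
    if (members.count(peer->node_id) == 0) {
      conn.closed = true;
      ++closed;
    }
  }
  return closed;
}
```

Skipping peers with a newer version is a conservative choice in this sketch: the local node may simply not have received the latest topology yet.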

mapleFU · 2025-03-24T07:59:30Z

+    SetPeerInfo(std::make_unique<PeerInfo>(ip_, port_, "", -1));
+    return peer_info_.get();


Hmm, when would this happen?


git-hulk · 2025-03-28T07:40:52Z

 inline constexpr const char *errClusterNoInitialized = "The cluster is not initialized";
 inline constexpr const char *errInvalidClusterNodeInfo = "Invalid cluster nodes info";
 inline constexpr const char *errInvalidImportState = "Invalid import state";
+inline constexpr const char *errYouAreFired = "You are fired";


I don't think this is the right wording. It could be something like "disconnected due to the topology change", or similar.

git-hulk · 2025-03-28T07:47:03Z

@RiversJin, I found that this PR introduces many code style changes and fixes besides the disconnecting feature. Would you mind separating them into a few dedicated PRs, which would be easier to review and would clarify the context?

Co-authored-by: hulk <hulk.website@gmail.com>

RiversJin · 2025-03-28T08:16:38Z

I found that this PR introduces many code style changes and fixes besides the disconnecting feature. Would you mind separating them into a few dedicated PRs, which would be easier to review and would clarify the context?

@git-hulk Sure! The unrelated tweaks were carried over from our internal codebase. I've removed the obvious ones, but the iterator-based lookup optimization for Cluster::nodes_ (avoiding double lookups) is interleaved with the feature code in some places. It's technically out of scope, although it does improve performance, so I need some time to remove those.

sonarqubecloud · 2025-03-29T14:03:57Z

Quality Gate passed

Issues
6 New issues
0 Accepted issues

Measures
0 Security Hotspots
51.8% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

PragmaTwice · 2025-04-01T09:59:33Z

Yeah, these changes in parse_utils.h look good to me, so maybe you can open a separate PR for them.

RiversJin added 2 commits March 23, 2025 19:48

feat(cluster): disconnect replication connections to orphan slaves

de4272b

[squash me] Replace ToString() with GetStringView() in PeerInfo

41ddd7b

RiversJin commented Mar 23, 2025


RiversJin added 2 commits March 23, 2025 20:03

[squash me] removing custom hash and equality structs

b5bb943

Merge branch 'unstable' into fix/orphan_slave

8af6476

[squash me] fix unit test issue

a1bd734

mapleFU reviewed Mar 24, 2025


RiversJin added 5 commits March 24, 2025 16:32

[squash me] simplify node retrieval logic in SetNodeId and SetSlotRanges

1f77f03

Merge branch 'unstable' into fix/orphan_slave

2ede94c

Merge branch 'unstable' into fix/orphan_slave

7cc4315

[sqash me] remove unnecessary move semantic parameters

bf7d6f8

[squash me] change ClusterNode slots parameter to const reference and…

eaec818

… update GetPeerInfo to return by reference

RiversJin force-pushed the fix/orphan_slave branch from 7bddf14 to eaec818 Compare March 28, 2025 07:24

git-hulk reviewed Mar 28, 2025


RiversJin and others added 3 commits March 28, 2025 15:54

[squash me] revert unrelated modifications

9c4db61

[squash me] Update src/cluster/cluster.cc

7f11538

Co-authored-by: hulk <hulk.website@gmail.com>

[sqush me] update error message for decommissioned nodes

19e81f3

RiversJin force-pushed the fix/orphan_slave branch from 251ff86 to 19e81f3 Compare March 28, 2025 08:07

[squash me] revert modification in cluser

14a7750

RiversJin force-pushed the fix/orphan_slave branch from 3b0cb8f to 14a7750 Compare March 28, 2025 09:07

RiversJin added 4 commits March 28, 2025 17:28

[squash me] rename isNodeFired to isNodeDecommissioned

23a1fa8

[squash me] format

05e8a58

Merge branch 'unstable' into fix/orphan_slave

37c8cae

[squash me] Add missing includes in replication.cc

b2c0d9b

Merge branch 'unstable' into fix/orphan_slave

9571ac8
