[Subtask]: Optimizations and troubleshooting for the Master-Slave mode. #4174

Open
wardlican wants to merge 3 commits into apache:master from wardlican:amoro#4171

Conversation

@wardlican
Contributor

Why are the changes needed?

Closes #4171.

Brief change log

  1. Assign a bucket_id to existing historical tables when Master-Slave mode is enabled.
  2. Fixed an issue where tables added concurrently could be assigned the same bucket_id, leading to bucket imbalance.
  3. Fixed an issue where the -msm flag was not passed to the optimizer at startup in Master-Slave mode.
  4. Fixed an issue where tables were erroneously deleted after a migration.
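
The change log does not say how the duplicate bucket_id assignment is prevented. As a rough illustration of the problem in item 2 only, the sketch below serializes the "pick the least-loaded bucket, then increment its count" step so that two concurrent callers cannot both observe the same minimum. The class name `BucketAssignSketch` and its methods are hypothetical and do not come from the Amoro codebase; a real multi-node deployment would need coordination in the shared store (database or ZooKeeper) rather than an in-process lock.

```java
import java.util.Arrays;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not the Amoro implementation.
public class BucketAssignSketch {
    // tablesPerBucket[i] is the number of tables currently assigned to bucket i.
    private final int[] tablesPerBucket;

    public BucketAssignSketch(int bucketCount) {
        this.tablesPerBucket = new int[bucketCount];
    }

    // Selecting the least-loaded bucket and incrementing its count happen under
    // one lock, so concurrent callers never choose from the same stale counts.
    public synchronized int assignBucket() {
        int best = 0;
        for (int i = 1; i < tablesPerBucket.length; i++) {
            if (tablesPerBucket[i] < tablesPerBucket[best]) {
                best = i;
            }
        }
        tablesPerBucket[best]++;
        return best;
    }

    public synchronized int[] snapshot() {
        return tablesPerBucket.clone();
    }

    public static void main(String[] args) throws InterruptedException {
        BucketAssignSketch sketch = new BucketAssignSketch(4);
        ExecutorService pool = Executors.newFixedThreadPool(8);
        // Simulate 100 tables being added concurrently.
        for (int i = 0; i < 100; i++) {
            pool.execute(sketch::assignBucket);
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(Arrays.toString(sketch.snapshot())); // prints [25, 25, 25, 25]
    }
}
```

Without the `synchronized` keyword on `assignBucket`, several threads could read the same counts and all pick the same bucket, which is the imbalance the change log describes.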

How was this patch tested?

  • Add some test cases that check the changes thoroughly including negative and positive cases if possible

  • Add screenshots for manual tests if appropriate

  • Run test locally before making a pull request

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@github-actions github-actions bot added type:docs Improvements or additions to documentation module:ams-server Ams server module module:ams-optimizer AMS optimizer module labels Apr 10, 2026
@wardlican wardlican changed the title [Subtask]: Optimizations and troubleshooting for the Master-Slave mode. #4171 [Subtask]: Optimizations and troubleshooting for the Master-Slave mode. Apr 10, 2026
@czy006
Contributor

czy006 commented Apr 14, 2026

Offline nodes with missing last_update_time may never be reclaimed

In AmsAssignService.detectNodeChanges (around lines 528-545), a node is marked offline only when lastUpdateTime > 0 && (currentTime - lastUpdateTime) > nodeOfflineTimeoutMs.

However, both DBBucketAssignStore#getLastUpdateTime and ZkBucketAssignStore#getLastUpdateTime return 0 when the timestamp is missing. In that case, a node that is already absent from the alive-node list will never enter the offline branch, so its buckets are never redistributed. This can leave bucket ownership stranded and prevent load recovery.

Two possible fixes:

  1. Treat lastUpdateTime <= 0 as offline-eligible when the node is not in aliveNodeKeys, and reclaim its buckets directly.
  2. Keep a short grace period, but force the node offline once it expires, even if the timestamp is still missing.
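
To make the suggestion concrete, here is a minimal sketch of the decision logic, not the actual code in AmsAssignService. The names `lastUpdateTime`, `nodeOfflineTimeoutMs`, and `aliveNodeKeys` come from the comment above; the class `OfflineCheckSketch`, the method `shouldMarkOffline`, and the parameters `firstSeenMissingTime` and `missingTimestampGraceMs` are assumptions made up for illustration. Passing a grace period of 0 corresponds to reclaiming directly, and a positive value corresponds to the grace-period variant.

```java
import java.util.Set;

// Hypothetical sketch, not the Amoro implementation.
public class OfflineCheckSketch {

    /**
     * Returns true if the node's buckets should be reclaimed.
     *
     * firstSeenMissingTime (assumed): when the node was first observed absent
     * from the alive-node list.
     * missingTimestampGraceMs (assumed): grace period applied only when the
     * stored timestamp is missing; 0 means reclaim immediately.
     */
    public static boolean shouldMarkOffline(
            String nodeKey,
            Set<String> aliveNodeKeys,
            long lastUpdateTime,
            long firstSeenMissingTime,
            long currentTime,
            long nodeOfflineTimeoutMs,
            long missingTimestampGraceMs) {
        // A node that is still alive is never reclaimed.
        if (aliveNodeKeys.contains(nodeKey)) {
            return false;
        }
        // Normal case: a valid timestamp exists, so apply the existing timeout rule.
        if (lastUpdateTime > 0) {
            return (currentTime - lastUpdateTime) > nodeOfflineTimeoutMs;
        }
        // Missing timestamp (the stores return 0): instead of skipping the node
        // forever, mark it offline once the grace period has elapsed.
        return (currentTime - firstSeenMissingTime) >= missingTimestampGraceMs;
    }

    public static void main(String[] args) {
        Set<String> alive = Set.of("node-a");
        // node-b is absent and has no timestamp; with no grace period it is reclaimed.
        System.out.println(shouldMarkOffline("node-b", alive, 0L, 1_000L, 1_000L, 30_000L, 0L)); // prints true
        // Same node with a 5 s grace period, checked 1 s after it went missing.
        System.out.println(shouldMarkOffline("node-b", alive, 0L, 1_000L, 2_000L, 30_000L, 5_000L)); // prints false
    }
}
```

The design choice between the two variants is a trade-off: reclaiming immediately recovers load fastest, while a grace period avoids redistributing buckets for a node whose timestamp has simply not been written yet.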

@wardlican
Contributor Author

Okay, I will fix this issue.

@czy006 czy006 requested review from czy006 and xxubai April 15, 2026 06:04
Contributor

@czy006 czy006 left a comment

LGTM

Labels

module:ams-optimizer AMS optimizer module module:ams-server Ams server module type:docs Improvements or additions to documentation

3 participants