branch-4.1: [fix](fe) avoid concurrent tablet stat iteration failures #63298#63560
Open
github-actions[bot] wants to merge 1 commit into
Open
branch-4.1: [fix](fe) avoid concurrent tablet stat iteration failures #63298#63560github-actions[bot] wants to merge 1 commit into
github-actions[bot] wants to merge 1 commit into
Conversation
### What problem does this PR solve? Issue Number: #59138 Problem Summary: TabletStatMgr.runAfterCatalogReady() is a periodic master-FE daemon that iterates every tablet/replica to pull statistics. When DDL runs concurrently with this daemon, two races fire: Iteration race (CME). MaterializedIndex.tablets and LocalTablet.replicas were plain ArrayLists whose getters returned the internal list. A concurrent addTablet / addReplica / deleteReplica (clone, repair, schema change, restore, report handler) during iteration triggered the fail-fast iterator and threw ConcurrentModificationException. TOCTOU race. In updateTabletStat, a getTabletMeta(id) != null check is followed by getReplica(id, beId). If the tablet is removed in between, getReplica hits Preconditions.checkState(...) and throws IllegalStateException. When the daemon throws, the current cycle leaves stale tablet/partition/table sizes and skewed MetricRepo metrics until the next cycle. Solution: Close the CME race for good with copy-on-write via a volatile snapshot. A first attempt returned a defensive copy (Lists.newArrayList(...)), but the copy itself iterates the source list and can still CME mid-copy — the window shrank but did not close. This PR instead: Makes LocalTablet.replicas and MaterializedIndex.tablets volatile. Writers (addReplica / deleteReplica / deleteReplicaByBackendId / addTablet / clearTabletsForRestore) are synchronized, build a new list, and atomically swap the volatile reference — they never mutate a list in place. Readers (getReplicas() / getTablets()) do a single volatile read and return an immutable snapshot (Collections.unmodifiableList). Iteration is lock-free and can never CME, and the hot read path no longer copies elements. synchronized on writers is required (not just volatile) because some write paths do not hold the OlapTable write lock — verified by tracing call sites: InternalCatalog.createPartitionWithIndices and RestoreJob.resetPartitionForRestore call addReplica/addTablet without the table write lock, so concurrent writers are real and a plain volatile field would allow lost updates. Writers are infrequent (DDL / repair / restore), so the lock cost is negligible; reads stay lock-free. TOCTOU race is handled by catching IllegalStateException around getReplica (kept from the original fix) and counting the skip via a new TabletStatMgr.staleTabletStatSkipped counter, which makes the race observable (>0 proves the window was actually hit) instead of relying solely on log scraping. Cloud path: CloudTabletStatMgr.updateStatInfo iterates tablet.getReplicas() and is covered by the same snapshot fix; its updateTabletStat uses getReplicasByTabletId (locked, returns empty list, no checkState) and is already safe. ### Release note None ### Check List (For Author) - Test <!-- At least one of them must be included. --> - [ ] Regression test - [ ] Unit Test - [ ] Manual test (add detailed scripts or steps below) - [ ] No need to test or manual test. Explain why: - [ ] This is a refactor/code format and no logic has been changed. - [ ] Previous test can cover this change. - [ ] No code files have been changed. - [ ] Other reason <!-- Add your reason? --> - Behavior changed: - [ ] No. - [ ] Yes. <!-- Explain the behavior change --> - Does this need documentation? - [ ] No. - [ ] Yes. <!-- Add document PR link here. eg: apache/doris-website#1214 --> ### Check List (For Reviewer who merge this PR) - [ ] Confirm the release note - [ ] Confirm test cases - [ ] Confirm document - [ ] Add branch pick label <!-- Add branch pick label that this PR should merge into --> --------- Co-authored-by: morningman <yunyou@selectdb.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
Contributor
|
run buildall |
Contributor
FE Regression Coverage ReportIncrement line coverage |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Cherry-picked from #63298