Skip to content

[server] Add DoL loopback to ensure new leader is fully caught up on VT#2314

Merged
sushantmane merged 7 commits intolinkedin:mainfrom
sushantmane:ST-AckBack-DuringLeaderPromotion
Feb 14, 2026
Merged

[server] Add DoL loopback to ensure new leader is fully caught up on VT#2314
sushantmane merged 7 commits intolinkedin:mainfrom
sushantmane:ST-AckBack-DuringLeaderPromotion

Conversation

@sushantmane
Copy link
Copy Markdown
Contributor

@sushantmane sushantmane commented Nov 24, 2025

[server] Add DoL loopback to ensure new leader is fully caught up on VT

Newly elected leaders were re-consuming the NR source topic because the
promotion logic relied only on elapsed time since the last consumed
message before switching to the remote version topic (VT). This
time-based heuristic is insufficient and can cause duplicate consumption
and data inconsistencies.

Fix the issue by requiring the new leader to produce a Declaration-of-
Leadership (DoL) marker to the local VT and wait until it consumes that
same marker back. This provides a deterministic guarantee that the
leader has fully caught up on VT before switching to RT or NR sources.

New Configuration Keys

Config Key Default Description
server.leader.handover.use.dol.mechanism.for.system.stores true Enable DoL mechanism for system stores (meta stores, etc.)
server.leader.handover.use.dol.mechanism.for.user.stores true Enable DoL mechanism for user stores

Code changes

  • Added new code behind a config. Config names and default values listed above.
  • Introduced new log lines.
    • Confirmed logs are rate limited (DoL events are low-frequency, once per leader transition).

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues. Volatile fields with one-way transitions.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed. Volatile sufficient for this use case.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

  • New unit tests added. (DolStampTest.java, VeniceServerConfigTest updates)
  • New integration tests added.
  • Modified or extended existing tests.
  • Verified backward compatibility - DoL can be disabled via config to use legacy time-based mechanism.

Does this PR introduce any user-facing or breaking changes?

  • No. DoL is enabled by default but falls back to legacy mechanism on failure. Can be disabled via config if needed.

Copilot AI review requested due to automatic review settings November 24, 2025 18:46
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a Declaration-of-Leadership (DoL) loopback mechanism to ensure new leader replicas are fully caught up on the Version Topic (VT) before switching to consume from remote VT or Real-Time (RT) topics. This replaces the previous time-based heuristic with a deterministic approach that eliminates the risk of duplicate consumption and data inconsistencies during leader transitions.

Key Changes:

  • New DoL control message type with unique GUID that leaders produce to local VT during STANDBY→LEADER transition
  • Leader waits to consume its own DoL message back (loopback confirmation) before switching to remote sources
  • Configurable rollout via separate flags for system stores and user stores (SERVER_LEADER_HANDOVER_USE_DOL_MECHANISM_FOR_SYSTEM_STORES and SERVER_LEADER_HANDOVER_USE_DOL_MECHANISM_FOR_USER_STORES)

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
DolStamp.java New class tracking DoL state (produced/consumed flags, leadership term, host ID) during leader transition
DolGuidGenerator.java GUID generator for DoL control messages using UUID type 3
DoLStampGuidGenerator.java Duplicate GUID generator implementation (identical to DolGuidGenerator)
KafkaKey.java Adds DOL_STAMP constant for DoL control message key
VeniceWriter.java Implements sendDoLStamp() and getDoLStampKME() for producing DoL messages
StoreIngestionTask.java Adds checkAndHandleDoLMessage() to detect and handle consumed DoL messages, validates DoL messages like heartbeats
LeaderFollowerStoreIngestionTask.java Orchestrates DoL mechanism: initializes DoL state, sends DoL stamp, checks readiness in canSwitchToLeaderTopic(), falls back to legacy behavior when DoL disabled
PartitionConsumptionState.java Tracks DoL state and highest observed leadership term per partition
LeaderFollowerPartitionStateModel.java Uses Helix message creation timestamp as leadership term
SharedKafkaConsumer.java Adds region name and index to toString() for better debugging
ConfigKeys.java Defines two new config flags for DoL mechanism enablement
VeniceServerConfig.java Exposes DoL config flags via getters
VeniceServerWrapper.java Enables DoL mechanism for both system and user stores in integration tests
VeniceClusterWrapper.java Increases timeout for version wait from 60s to 120s to accommodate DoL latency
TestHybrid.java Adds timeout and unique store name for log compaction test
TestHybridMultiRegion.java Updates test to use sendEmptyPushAndWait() and improves error message assertion
TestTopicRequestOnHybridDelete.java Removes unused imports and deletes deleteStoreAfterStartedPushAllowsNewPush test
log4j2.properties Updates logging configuration (contains hardcoded user path)
StoreIngestionTaskTest.java Renames test method from resolveRtTopicPartitionWithPubSubBrokerAddress to resolveTopicPartitionWithPubSubBrokerAddress
SharedKafkaConsumerTest.java Updates test to pass region name and index to SharedKafkaConsumer constructor
ActiveActiveStoreIngestionTask.java Updates method calls from resolveRtTopicPartitionWithPubSubBrokerAddress to resolveTopicPartitionWithPubSubBrokerAddress
KafkaConsumerService.java Passes region name and index when creating SharedKafkaConsumer instances
PartitionWiseKafkaConsumerService.java Adds consumer instance to log output for better debugging
HelixReadWriteSchemaRepository.java Adds store name to exception log message

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/venice-common/src/main/java/com/linkedin/venice/writer/VeniceWriter.java Outdated
Comment thread internal/venice-test-common/src/integrationTest/resources/log4j2.properties Outdated
@sushantmane sushantmane force-pushed the ST-AckBack-DuringLeaderPromotion branch from 0998fab to 1bbff55 Compare November 24, 2025 21:22
Copy link
Copy Markdown
Contributor

@sixpluszero sixpluszero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your change! A great work to improve ingestion stability. I left some comments for clarification.

@github-actions
Copy link
Copy Markdown

Hi there. This pull request has been inactive for 30 days. To keep our review queue healthy, we plan to close it in 7 days unless there is new activity. If you are still working on this, please push a commit, leave a comment, or convert it to draft to signal intent. Thank you for your time and contributions.

@github-actions github-actions Bot added the stale label Jan 16, 2026
@github-actions
Copy link
Copy Markdown

Closing this pull request due to 37 days of inactivity. This is not a judgment on the value of the work. If you would like to continue, please reopen or open a new PR and we will be happy to take another look. Thank you again for contributing.

@github-actions github-actions Bot closed this Jan 23, 2026
@sushantmane sushantmane reopened this Jan 25, 2026
Copilot AI review requested due to automatic review settings January 25, 2026 10:14
@sushantmane sushantmane force-pushed the ST-AckBack-DuringLeaderPromotion branch from 1bbff55 to daadb36 Compare January 25, 2026 10:14
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 12 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/venice-test-common/src/integrationTest/resources/log4j2.properties Outdated
Copilot AI review requested due to automatic review settings January 26, 2026 01:32
@sushantmane sushantmane force-pushed the ST-AckBack-DuringLeaderPromotion branch from 5b07505 to aa253a3 Compare January 26, 2026 01:32
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 9 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/venice-test-common/src/integrationTest/resources/log4j2.properties Outdated
Comment thread internal/venice-common/src/main/java/com/linkedin/venice/writer/VeniceWriter.java Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated 8 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/venice-common/src/main/java/com/linkedin/venice/ConfigKeys.java Outdated
Comment thread internal/venice-common/src/main/java/com/linkedin/venice/ConfigKeys.java Outdated
Comment thread internal/venice-common/src/main/java/com/linkedin/venice/writer/VeniceWriter.java Outdated
@sushantmane sushantmane force-pushed the ST-AckBack-DuringLeaderPromotion branch from d77efec to 8cd08d6 Compare February 7, 2026 11:37
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

sixpluszero
sixpluszero previously approved these changes Feb 12, 2026
Copy link
Copy Markdown
Contributor

@sixpluszero sixpluszero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall lgtm.

Logic wise I think it makes sense. Left some small comments..

Comment thread internal/venice-test-common/src/integrationTest/resources/log4j2.properties Outdated
Copilot AI review requested due to automatic review settings February 13, 2026 09:21
@sushantmane sushantmane force-pushed the ST-AckBack-DuringLeaderPromotion branch from 83e2c30 to 3011f14 Compare February 13, 2026 09:21
@sushantmane sushantmane added the review-addressed Author has addressed review comments label Feb 13, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI review requested due to automatic review settings February 13, 2026 10:50
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

sushantmane and others added 7 commits February 13, 2026 15:57
Add SERVER_LEADER_HANDOVER_USE_DOL_MECHANISM config to enable the new
Declaration of Leadership (DoL) mechanism for fast leader handover.

Changes:
- Add SERVER_LEADER_HANDOVER_USE_DOL_MECHANISM config key in ConfigKeys.java
- Add leaderHandoverUseDoLMechanism field and getter in VeniceServerConfig
- Refactor canSwitchToLeaderTopic() to check config and route logic
- Extract canSwitchToLeaderTopicLegacy() with original time-based logic
- Add comprehensive design document for DoL mechanism

Default: false (maintains backward compatibility with legacy time-based
mechanism)

This is step 1 of the DoL implementation. The actual DoL loopback logic
will be implemented in subsequent commits when the config is enabled.

Address review comments - [guid] Remove duplicate DolGuidGenerator class

Consolidated duplicate GUID generator classes:
- Removed DolGuidGenerator.java (duplicate)
- Updated KafkaKey.java to use DoLStampGuidGenerator
- Both VeniceWriter and KafkaKey now use the same DoLStampGuidGenerator class

This ensures consistent GUID generation for Declaration of Leadership (DoL)
messages across the codebase.
- StoreIngestionTask: Fix NPE when hostName is null in LeaderMetadata
- StoreIngestionTask: Use Objects.equals for null-safe host comparison
- LeaderFollowerStoreIngestionTask: Add null check for produceResult
- LeaderFollowerStoreIngestionTask: Change log level from WARN to DEBUG
  to avoid log spam during legacy fallback
- DoLStampGuidGenerator: Use StandardCharsets.UTF_8 for consistent GUID
  generation across platforms
- ConfigKeys: Fix javadoc to reflect actual default values (true)
- VeniceWriter: Fix bug where completedFuture() result was not returned
The comment was only mentioning heartbeat records but the code also
skips validation for DoL stamp records. Updated for clarity.
- Fix javadoc: "Consumer/drainer thread" -> "Drainer thread" in PCS
- Cache DoL GUID as static final instead of recomputing on every call
- Remove commented-out Kafka logger settings in integration test log4j2
Acquire partitionLocks[partition] before producing DoL stamp message
to prevent potential races with concurrent writes to the same partition.
…delay

The waitForStateVersion background thread polls every 60s by default.
Override it to 100ms in the test static block, matching the existing
SCHEMA_POLLING_DELAY_MS override.
Copilot AI review requested due to automatic review settings February 14, 2026 00:01
@sushantmane sushantmane force-pushed the ST-AckBack-DuringLeaderPromotion branch from 8bed3c1 to 2373ca1 Compare February 14, 2026 00:01
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@sushantmane
Copy link
Copy Markdown
Contributor Author

Thanks a lot for the reviewer, @sixpluszero!

@sushantmane sushantmane added approved PR has been approved and removed review-addressed Author has addressed review comments labels Feb 14, 2026
@sushantmane sushantmane enabled auto-merge (squash) February 14, 2026 02:36
@sushantmane sushantmane merged commit 3df58ef into linkedin:main Feb 14, 2026
103 of 107 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved PR has been approved

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants