Skip to content

fix: Restarted the watch after stream reset #19055#19056

Open
vivek807 wants to merge 7 commits into
apache:masterfrom
deep-bi:deep/feature/19055-service-discovery-watch-not-recovering-from-stream-reset
Open

fix: Restarted the watch after stream reset #19055#19056
vivek807 wants to merge 7 commits into
apache:masterfrom
deep-bi:deep/feature/19055-service-discovery-watch-not-recovering-from-stream-reset

Conversation

@vivek807

Copy link
Copy Markdown
Contributor

Fixes #19055.

Description

Restarted the watch after stream reset

Fixed the bug #19055

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes issue #19055 by improving resiliency of the Kubernetes service-discovery watch so broker node inventory can recover after watch stream disruptions.

Changes:

  • Updated WatchResult to be usable with try-with-resources by extending AutoCloseable.
  • Refactored NodeRoleWatcher.keepWatching to use try-with-resources and added explicit handling for HTTP/2 stream reset conditions.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
extensions-core/kubernetes-extensions/src/main/java/org/apache/druid/k8s/discovery/WatchResult.java Adjusts the watch iterator contract to support automatic resource management.
extensions-core/kubernetes-extensions/src/main/java/org/apache/druid/k8s/discovery/K8sDruidNodeDiscoveryProvider.java Updates watch loop to auto-close resources and attempts to restart after stream resets/timeouts.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@FrankChen021

Copy link
Copy Markdown
Member

@vivek807 Thanks for reporting the issue and submitting this PR for fix. can you address above comments?

@vivek807

vivek807 commented Mar 4, 2026

Copy link
Copy Markdown
Contributor Author

@vivek807 Thanks for reporting the issue and submitting this PR for fix. can you address above comments?

updated, please recheck.

@vivek807 vivek807 requested a review from FrankChen021 March 4, 2026 11:00
@vivek807 vivek807 force-pushed the deep/feature/19055-service-discovery-watch-not-recovering-from-stream-reset branch from e79643a to dc86628 Compare March 4, 2026 12:02
@vivek807 vivek807 force-pushed the deep/feature/19055-service-discovery-watch-not-recovering-from-stream-reset branch from dc86628 to 268b1ff Compare March 4, 2026 12:52

@capistrant capistrant left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still think we need to push further in removing the okhttp internals from this PR. My initial thought is that we just need to handle all IOExceptions with a full re-list. We can keep the one off handling of the known ok IOException that a simple retry of the watch from same resource version will work for. But for all others force a re-list? My biggest fear is that this is an overreaction.

…-not-recovering-from-stream-reset

# Conflicts:
#	extensions-core/kubernetes-extensions/src/main/java/org/apache/druid/k8s/discovery/K8sDruidNodeDiscoveryProvider.java
#	extensions-core/kubernetes-extensions/src/main/java/org/apache/druid/k8s/discovery/WatchResult.java
Remove hard dependency of okhttp3 internals

@FrankChen021 FrankChen021 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up handled: the OkHttp internal StreamResetException dependency was removed, and the watch handling now treats non-interrupt IOExceptions as restartable channel resets while preserving timeout handling. No further inline reply needed from me.


This is an automated review by Codex GPT-5


This is an automated review by Codex GPT-5

@vivek807

vivek807 commented Jun 3, 2026

Copy link
Copy Markdown
Contributor Author

@FrankChen021 @gianm @capistrant Please let me know if there’s anything I can do to help move this PR forward.

@vivek807 vivek807 changed the title Restarted the watch after stream reset #19055 fix: Restarted the watch after stream reset #19055 Jun 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Broker retains stale Historical IP after pod recreation; service discovery watch not recovering from stream reset

5 participants