
fix(ha): Fix flaky TestWatchPrefixNilPanicWithMemberlist #7493

Merged
friedrichg merged 1 commit into master from fix/flaky-test-watch-prefix-nil-panic
May 8, 2026

Contributor

@yeya24 commented May 8, 2026

Summary

Fix flaky TestWatchPrefixNilPanicWithMemberlist test in pkg/ha.

Root Cause

The test races the WatchPrefix watcher registration in loop() against the CheckReplica call. When the HATracker starts, StartAndAwaitRunning returns as soon as the service transitions to the Running state, at which point the loop() goroutine may not yet have registered its watcher channel in the memberlist KV. If CheckReplica's CAS and the resulting notifyWatchers call fire before the watcher is registered, the notification is lost, the key never appears in the elected cache, and the 3-second poll times out.
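The lost-notification behaviour described above can be illustrated with a minimal sketch. The `toyKV` type and the `watch`, `update`, and `delivered` names are hypothetical stand-ins written for this illustration; they are not the memberlist KV or HATracker code touched by this PR. The sketch only assumes that notifications go to the watchers registered at the moment of the update.

```go
package main

import (
	"fmt"
	"sync"
)

// toyKV is a hypothetical stand-in for a KV store that notifies watchers
// when a key changes. It is NOT the memberlist KV from this PR.
type toyKV struct {
	mu       sync.Mutex
	watchers []chan string
}

// watch registers a buffered channel that receives keys updated after this call.
func (kv *toyKV) watch() <-chan string {
	ch := make(chan string, 1)
	kv.mu.Lock()
	kv.watchers = append(kv.watchers, ch)
	kv.mu.Unlock()
	return ch
}

// update notifies only the watchers that are registered right now.
// An update that happens before watch() reaches nobody.
func (kv *toyKV) update(key string) {
	kv.mu.Lock()
	defer kv.mu.Unlock()
	for _, ch := range kv.watchers {
		select {
		case ch <- key:
		default: // watcher buffer is full; drop the notification
		}
	}
}

// delivered reports whether a watcher observes the update, depending on
// whether the update runs before or after the watcher registers.
func delivered(updateFirst bool) bool {
	kv := &toyKV{}
	if updateFirst {
		kv.update("user/cluster")
	}
	ch := kv.watch()
	if !updateFirst {
		kv.update("user/cluster")
	}
	select {
	case <-ch:
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println("update before watch delivered:", delivered(true))  // false
	fmt.Println("update after watch delivered:", delivered(false)) // true
}
```

In the flaky test the ordering is decided by goroutine scheduling rather than by a flag, which is why the failure only shows up intermittently.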

Fix

  1. Added a 100ms sleep before CheckReplica to allow the WatchPrefix goroutine to register its watcher channel. This is the same pattern used in pkg/ring/kv/memberlist/memberlist_client_test.go (line 1650).
  2. Increased the poll timeout from 3s to 5s for additional CI robustness.

Testing

  • Ran the test 30 consecutive times with -count=30 — all pass.
  • Full pkg/ha test suite passes.

The test was flaky due to a race between the WatchPrefix watcher
registration in loop() and the CheckReplica call. StartAndAwaitRunning
returns before the WatchPrefix goroutine registers its watcher channel
in the memberlist KV. If CheckReplica's CAS + notifyWatchers fires
before the watcher is registered, the notification is lost and the key
never appears in the elected cache.

Fix by adding a 100ms sleep before CheckReplica to allow the WatchPrefix
goroutine to register its watcher channel (same pattern used in
memberlist_client_test.go), and increasing the poll timeout from 3s to
5s for CI robustness.

Signed-off-by: Ben Ye <benye@amazon.com>
Signed-off-by: Friedrich Gonzalez <1517449+friedrichg@users.noreply.github.com>
@friedrichg force-pushed the fix/flaky-test-watch-prefix-nil-panic branch from d17174d to 82b5283 on May 8, 2026 21:30
@dosubot added the lgtm label on May 8, 2026
@friedrichg merged commit c6a8275 into master on May 8, 2026
37 checks passed
@friedrichg deleted the fix/flaky-test-watch-prefix-nil-panic branch on May 8, 2026 21:52

Labels

component/ha-tracker, lgtm, size/XS, type/flaky-test


2 participants