Give more wait time and retry for kafka setup for ExactlyOnceKafkaRealtimeClusterIntegrationTest and separate it from the test suite run #17752

Closed
xiangfu0 wants to merge 1 commit into apache:master from xiangfu0:codex/flaky-exactly-once-integration-test

xiangfu0 (Contributor) commented Feb 24, 2026

This pull request improves the reliability and configurability of Kafka cluster startup in integration tests, particularly in resource-constrained CI environments. The main changes are: making the Kafka startup parameters overridable by subclasses, changing the Maven test execution order so that ExactlyOnceKafkaRealtimeClusterIntegrationTest runs first, and giving that test more generous Kafka startup settings when it runs in CI.

Kafka startup configurability and reliability:

  • Added new protected methods (getKafkaStartMaxAttempts, getKafkaStartRetryWaitMs, getKafkaClusterReadyTimeoutMs) in BaseClusterIntegrationTest.java to allow subclasses to override Kafka broker startup attempts, retry wait time, and cluster readiness timeout. All usages of the previous constants in Kafka startup logic now use these methods.

  • Overrode the new Kafka startup configuration methods in ExactlyOnceKafkaRealtimeClusterIntegrationTest.java to provide more generous retry and timeout values when running in CI environments (detected via GITHUB_ACTIONS), improving test reliability under resource constraints.
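
The override pattern described in the two bullets above can be sketched roughly as follows. This is a minimal, standalone illustration, not the code from this PR: the class names BaseSketch and ExactlyOnceSketch are hypothetical stand-ins for the real test classes, the base defaults (3 attempts, 2s, 60s) are assumed placeholders, and the environment is passed in as a map so the CI branch can be exercised without setting GITHUB_ACTIONS. The CI values (5 attempts, 5s retry wait, 180s timeout) are the ones reported in the review summary on this PR.

```java
import java.util.Map;

public class KafkaStartupConfigSketch {

  /** Hypothetical stand-in for the base test class; the default values are assumptions. */
  static class BaseSketch {
    protected int getKafkaStartMaxAttempts() {
      return 3;
    }

    protected long getKafkaStartRetryWaitMs() {
      return 2_000L;
    }

    protected long getKafkaClusterReadyTimeoutMs() {
      return 60_000L;
    }
  }

  /** Hypothetical stand-in for the exactly-once test; only the getters are overridden. */
  static class ExactlyOnceSketch extends BaseSketch {
    private final boolean _runningInCi;

    ExactlyOnceSketch(Map<String, String> env) {
      // GitHub Actions sets GITHUB_ACTIONS=true on its runners.
      _runningInCi = "true".equalsIgnoreCase(env.get("GITHUB_ACTIONS"));
    }

    @Override
    protected int getKafkaStartMaxAttempts() {
      return _runningInCi ? 5 : super.getKafkaStartMaxAttempts();
    }

    @Override
    protected long getKafkaStartRetryWaitMs() {
      return _runningInCi ? 5_000L : super.getKafkaStartRetryWaitMs();
    }

    @Override
    protected long getKafkaClusterReadyTimeoutMs() {
      return _runningInCi ? 180_000L : super.getKafkaClusterReadyTimeoutMs();
    }
  }

  public static void main(String[] args) {
    ExactlyOnceSketch test = new ExactlyOnceSketch(System.getenv());
    System.out.println("attempts=" + test.getKafkaStartMaxAttempts()
        + " retryWaitMs=" + test.getKafkaStartRetryWaitMs()
        + " readyTimeoutMs=" + test.getKafkaClusterReadyTimeoutMs());
  }
}
```

Keeping the values behind getters rather than constants means the startup logic in the base class does not need to know which subclass is running; outside CI the subclass simply falls back to the base values.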

Test execution order improvements:

  • Updated the Maven Surefire plugin configuration in pinot-integration-tests/pom.xml to disable the default test execution, and instead:
    • Run ExactlyOnceKafkaRealtimeClusterIntegrationTest first in its own execution.
    • Exclude this test from the subsequent set of tests, so it only runs once and before all others.
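
For illustration, a Surefire configuration along these lines could look like the fragment below. It is a sketch under assumptions, not the pom.xml change from this PR: the execution ids exactly-once-kafka-first and remaining-set-1 are made up, the **/E*Test.java include is taken from the review summary and may not be the only pattern in the real profile, and the fragment relies on Maven running executions bound to the same phase in the order they are declared.

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <executions>
    <!-- Unbind the built-in execution so the tests are not picked up a second time. -->
    <execution>
      <id>default-test</id>
      <phase>none</phase>
    </execution>
    <!-- Hypothetical id. Declared first, so it runs before the other execution. -->
    <execution>
      <id>exactly-once-kafka-first</id>
      <phase>test</phase>
      <goals>
        <goal>test</goal>
      </goals>
      <configuration>
        <includes>
          <include>**/ExactlyOnceKafkaRealtimeClusterIntegrationTest.java</include>
        </includes>
      </configuration>
    </execution>
    <!-- Hypothetical id. Runs the rest of the set and skips the isolated test. -->
    <execution>
      <id>remaining-set-1</id>
      <phase>test</phase>
      <goals>
        <goal>test</goal>
      </goals>
      <configuration>
        <includes>
          <include>**/E*Test.java</include>
        </includes>
        <excludes>
          <exclude>**/ExactlyOnceKafkaRealtimeClusterIntegrationTest.java</exclude>
        </excludes>
      </configuration>
    </execution>
  </executions>
</plugin>
```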

codecov-commenter commented Feb 24, 2026

❌ 2 Tests Failed:

Tests completed | Failed | Passed | Skipped
10836 | 2 | 10834 | 66
View the top 1 failed test(s) by shortest run time
org.apache.pinot.integration.tests.BaseRealtimeClusterIntegrationTest::@BeforeClass setUp
Stack Traces | 1256s run time
Failed to load 115545 documents; current count=0 for table=mytable expected [115545] but found [0]
View the full list of 1 ❄️ flaky test(s)
org.apache.pinot.integration.tests.ExactlyOnceKafkaRealtimeClusterIntegrationTest::setUp

Flake rate in main: 100.00% (Passed 0 times, Failed 40 times)

Stack Traces | 1256s run time
Failed to load 115545 documents; current count=0 for table=mytable expected [115545] but found [0]


xiangfu0 force-pushed the codex/flaky-exactly-once-integration-test branch 4 times, most recently from 6bae0d4 to a53312b, February 24, 2026 18:33
xiangfu0 changed the title from "trying to give more wait time and retry for kafka setup for ExactlyOnceKafkaRealtimeClusterIntegrationTest" to "Give more wait time and retry for kafka setup for ExactlyOnceKafkaRealtimeClusterIntegrationTest and separate it from the test suite run", Feb 24, 2026
Copilot AI (Contributor) left a comment

Pull request overview

This pull request addresses flakiness in the ExactlyOnceKafkaRealtimeClusterIntegrationTest when running in GitHub Actions CI environments. The test uses a 3-broker Kafka cluster with transactions, which requires more resources and time to start reliably in resource-constrained CI environments.

Changes:

  • Introduced configurable Kafka startup parameters (max attempts, retry wait time, cluster ready timeout) in BaseClusterIntegrationTest
  • Overrode these parameters in ExactlyOnceKafkaRealtimeClusterIntegrationTest to use more generous timeouts when running in GitHub Actions
  • Restructured Maven test execution to run ExactlyOnceKafkaRealtimeClusterIntegrationTest first in isolation before other tests
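
To make the three parameters concrete, here is a minimal sketch of how a startup routine might consume them: a bounded retry loop for broker startup and a deadline-based wait for cluster readiness. It is an illustration only, not the logic in BaseClusterIntegrationTest; the class and method names (KafkaStartRetrySketch, startWithRetries, waitUntilReady) are hypothetical, and the broker start and readiness check are abstracted as BooleanSupplier callbacks.

```java
import java.util.function.BooleanSupplier;

public class KafkaStartRetrySketch {

  /**
   * Calls startAttempt up to maxAttempts times, sleeping retryWaitMs after each failed attempt
   * except the last. Returns the 1-based attempt number that succeeded.
   */
  static int startWithRetries(BooleanSupplier startAttempt, int maxAttempts, long retryWaitMs) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      if (startAttempt.getAsBoolean()) {
        return attempt;
      }
      if (attempt < maxAttempts) {
        sleep(retryWaitMs);
      }
    }
    throw new IllegalStateException("Kafka did not start after " + maxAttempts + " attempts");
  }

  /** Polls the readiness check every pollMs until it passes or timeoutMs has elapsed. */
  static boolean waitUntilReady(BooleanSupplier ready, long timeoutMs, long pollMs) {
    long deadlineNanos = System.nanoTime() + timeoutMs * 1_000_000L;
    while (System.nanoTime() - deadlineNanos < 0) {
      if (ready.getAsBoolean()) {
        return true;
      }
      sleep(pollMs);
    }
    // One final check so a cluster that became ready during the last sleep is not missed.
    return ready.getAsBoolean();
  }

  private static void sleep(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while waiting for Kafka", e);
    }
  }

  public static void main(String[] args) {
    int[] calls = {0};
    // Simulated broker start that fails twice and succeeds on the third call.
    int attemptUsed = startWithRetries(() -> ++calls[0] >= 3, 5, 10L);
    System.out.println("started on attempt " + attemptUsed); // prints "started on attempt 3"
    System.out.println("ready=" + waitUntilReady(() -> true, 100L, 10L)); // prints "ready=true"
  }
}
```

With a loop of this shape, raising the attempt count and retry wait only lengthens the failure path; a cluster that starts on the first attempt is not slowed down.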

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File | Description
pinot-integration-test-base/src/test/java/org/apache/pinot/integration/tests/BaseClusterIntegrationTest.java | Added three protected methods to allow subclasses to customize Kafka startup configuration (max attempts, retry wait time, cluster ready timeout); updated all usages to call these methods instead of using constants directly
pinot-integration-tests/src/test/java/org/apache/pinot/integration/tests/ExactlyOnceKafkaRealtimeClusterIntegrationTest.java | Overrode Kafka configuration methods to use higher values (5 attempts, 5s retry wait, 180s timeout) when GITHUB_ACTIONS environment variable is true
pinot-integration-tests/pom.xml | Restructured integration-tests-set-1 profile to run ExactlyOnceKafkaRealtimeClusterIntegrationTest in a separate execution first, then exclude it from the remaining E*Test.java pattern

xiangfu0 force-pushed the codex/flaky-exactly-once-integration-test branch 2 times, most recently from a8cf5c3 to 0625763, February 25, 2026 09:40
xiangfu0 force-pushed the codex/flaky-exactly-once-integration-test branch from 0625763 to a4dbc4a, February 25, 2026 11:45
xiangfu0 added and then removed the ui (UI related issue) label, Feb 28, 2026
xiangfu0 closed this Mar 1, 2026
xiangfu0 deleted the codex/flaky-exactly-once-integration-test branch March 1, 2026 18:52