Skip to content

FLINK-38872: waitForAllTaskRunning Hang Indefinitely#27462

Open
teamconfx wants to merge 1 commit intoapache:masterfrom
teamconfx:fix-flink-38872
Open

FLINK-38872: waitForAllTaskRunning Hang Indefinitely#27462
teamconfx wants to merge 1 commit intoapache:masterfrom
teamconfx:fix-flink-38872

Conversation

@teamconfx
Copy link
Copy Markdown

This PR fixes FLINK-38872.

Issue:

CommonTestUtils.waitForAllTaskRunning() could hang indefinitely if the job is lost, cancelled, or enters an unexpected terminal state.

Changes Made:

Added timeout parameters to waitUntilCondition and waitForAllTaskRunning methods.

  1. CommonTestUtils.java (flink-runtime/src/test/java/org/apache/flink/runtime/testutils/)

Added:

  • DEFAULT_WAIT_FOR_TASKS_TIMEOUT constant (5 minutes) at line 75
  • New overload waitUntilCondition(condition, Duration timeout) at lines 161-165
  • New overload waitUntilCondition(condition, Duration retryInterval, Duration timeout) at lines 167-180 - throws TimeoutException when deadline is exceeded
  • New overload waitForAllTaskRunning(MiniCluster, JobID, boolean, Duration) at lines 211-215
  • New overload waitForAllTaskRunning(SupplierWithException, boolean, Duration) at lines 230-269
  • New overload waitForAllTaskRunning(SupplierWithException, Duration) at lines 276-298

Backward compatibility: Existing methods without timeout parameters now delegate to the new timeout-aware versions using DEFAULT_WAIT_FOR_TASKS_TIMEOUT.

  1. CommonTestUtilsTest.java (new file)

Created unit tests to verify the timeout functionality:

  • testWaitUntilConditionWithTimeoutSucceeds - verifies successful completion
  • testWaitUntilConditionWithTimeoutThrowsOnTimeout - verifies TimeoutException is thrown
  • testWaitUntilConditionWithRetryIntervalAndTimeout - tests with custom retry interval
  • testWaitUntilConditionWithRetryIntervalAndTimeoutThrowsOnTimeout - verifies timeout with custom interval

@flinkbot
Copy link
Copy Markdown
Collaborator

flinkbot commented Jan 23, 2026

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

private static final long RETRY_INTERVAL = 100L;

/** Default timeout for waiting on tasks to reach running state. */
public static final Duration DEFAULT_WAIT_FOR_TASKS_TIMEOUT = Duration.ofMinutes(5);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a workaround, similar to review comments in PR 27433. We should address the root cause of each test that causes this hang. @snuyanzin do you agree with this assessment? I assume the example scenario in the Jira is not one we currently have in our unit tests.

Also I am curious how you decided on 5 minutes as the default?

Copy link
Copy Markdown
Author

@teamconfx teamconfx Jan 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a timeout would be as a good "guard" for limit the unit test timing (especially when there would have infinite dead waiting).
For this specific case, I found that if I setup a condition (e.g., node restart) in some Flink uni tests, then the job would "loss" and the waiting would hang forever, as I described in the JIRA.

One concrete example I can show you is that in this unit test:
org.apache.flink.test.streaming.runtime.SinkMetricsITCase, if you inject a node restart (restart the taskmanager) between line 110~line 111, then this corrupted jobID will hang the unit test forever.

I set to 5 minutes as I would think 5 minutes should be a good timeout value for a unit test, as I see most unit tests in Flink suite can finish under this window.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a timeout would be as a good "guard" for limit the unit test timing (especially when there would have infinite dead waiting).

we have global Azure timeouts for that
more details are here https://flink.apache.org/how-to-contribute/code-style-and-quality-common/#avoid-timeouts-in-junit-tests

can you elaborate why it doesn't help?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@teamconfx , do you know how often the hanging problem happen and if it happens what's the error after Azure timeout?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding timeout for wait until a condition isn't a bad idea than let it hanging forever. Just the duration needs to be carefully chosen not to timeout too soon. Similar to #27767

@github-actions github-actions Bot added the community-reviewed PR has been reviewed by the community. label Jan 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-reviewed PR has been reviewed by the community.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants