FLINK-38872: waitForAllTaskRunning Hang Indefinitely#27462
FLINK-38872: waitForAllTaskRunning Hang Indefinitely#27462teamconfx wants to merge 1 commit intoapache:masterfrom
Conversation
| private static final long RETRY_INTERVAL = 100L; | ||
|
|
||
| /** Default timeout for waiting on tasks to reach running state. */ | ||
| public static final Duration DEFAULT_WAIT_FOR_TASKS_TIMEOUT = Duration.ofMinutes(5); |
There was a problem hiding this comment.
This seems like a workaround, similar to review comments in PR 27433. We should address the root cause of each test that causes this hang. @snuyanzin do you agree with this assessment? I assume the example scenario in the Jira is not one we currently have in our unit tests.
Also I am curious how you decided on 5 minutes as the default?
There was a problem hiding this comment.
I think a timeout would be as a good "guard" for limit the unit test timing (especially when there would have infinite dead waiting).
For this specific case, I found that if I setup a condition (e.g., node restart) in some Flink uni tests, then the job would "loss" and the waiting would hang forever, as I described in the JIRA.
One concrete example I can show you is that in this unit test:
org.apache.flink.test.streaming.runtime.SinkMetricsITCase, if you inject a node restart (restart the taskmanager) between line 110~line 111, then this corrupted jobID will hang the unit test forever.
I set to 5 minutes as I would think 5 minutes should be a good timeout value for a unit test, as I see most unit tests in Flink suite can finish under this window.
There was a problem hiding this comment.
I think a timeout would be as a good "guard" for limit the unit test timing (especially when there would have infinite dead waiting).
we have global Azure timeouts for that
more details are here https://flink.apache.org/how-to-contribute/code-style-and-quality-common/#avoid-timeouts-in-junit-tests
can you elaborate why it doesn't help?
There was a problem hiding this comment.
@teamconfx , do you know how often the hanging problem happen and if it happens what's the error after Azure timeout?
There was a problem hiding this comment.
I think adding timeout for wait until a condition isn't a bad idea than let it hanging forever. Just the duration needs to be carefully chosen not to timeout too soon. Similar to #27767
This PR fixes FLINK-38872.
Issue:
CommonTestUtils.waitForAllTaskRunning() could hang indefinitely if the job is lost, cancelled, or enters an unexpected terminal state.
Changes Made:
Added timeout parameters to waitUntilCondition and waitForAllTaskRunning methods.
Added:
Backward compatibility: Existing methods without timeout parameters now delegate to the new timeout-aware versions using DEFAULT_WAIT_FOR_TASKS_TIMEOUT.
Created unit tests to verify the timeout functionality: