-
Notifications
You must be signed in to change notification settings - Fork 13.9k
[FLINK-38870] Uninformative error message when job is suspended due to JobManager leadership loss #27463
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
| jobStatus == null || jobStatus.isGloballyTerminalState(), | ||
| "jobStatus must be globally terminal or unknow(null)"); | ||
| jobStatus == null || jobStatus.isTerminalState(), | ||
| "jobStatus must be terminal or unknown(null)"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does unknown(null) mean? Can we say null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes thanks for pointing this out! I think null would be good enough! I will change this.
| new JobExecutionException( | ||
| jobId, | ||
| "Job completed with illegal status: " + jobStatus + '.', | ||
| "Job completed with unexpected status: " + jobStatus + '.', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is jobStatus is null then this will look like null."some cause". If there no cause then we should not put out the . cause. I suggest to ternary ifs to deal with these situations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do!
|
@davidradl Thank you so much for the review. Can you take a look at my new changes? |
This PR fixes FLINK-38870.
Problem
When a JobManager loses leadership, jobs enter SUSPENDED state. The old error message "Job completed with illegal status: null" was uninformative and confusing.
Solution
My approach improves on the JIRA proposal by:
Files Modified
Job is in state SUSPENDED. This commonly happens when the JobManager lost leadership.
The job may recover automatically if High Availability and a persistent job store are configured.
If recovery is not possible (e.g., non-persistent ExecutionPlanStore), the job needs to be resubmitted.
Test Results
Error Message Comparison
Before:
After: