Skip to content

Conversation

@teamconfx
Copy link

This PR fixes FLINK-38870.

Problem

When a JobManager loses leadership, jobs enter SUSPENDED state. The old error message "Job completed with illegal status: null" was uninformative and confusing.

Solution

My approach improves on the JIRA proposal by:

  1. Preserving the actual SUSPENDED JobStatus instead of losing it (the proposal only added a field but kept setting it to null)
  2. Adding serialization support to preserve SUSPENDED state across REST API calls
  3. Maintaining backward compatibility with older clients

Files Modified

  1. JobResult.java
  • Constructor validation: Changed from requiring "globally terminal" to just "terminal" states, allowing SUSPENDED
  • createFrom(): Now stores actual JobStatus including SUSPENDED (was setting null for non-globally-terminal states)
  • toJobExecutionResult(): Added specific handling for SUSPENDED with detailed error message:
    Job is in state SUSPENDED. This commonly happens when the JobManager lost leadership.
    The job may recover automatically if High Availability and a persistent job store are configured.
    If recovery is not possible (e.g., non-persistent ExecutionPlanStore), the job needs to be resubmitted.
  1. JobResultSerializer.java
  • Added new job-status field to preserve actual JobStatus in JSON (alongside existing application-status for backward compatibility)
  1. JobResultDeserializer.java
  • Reads new job-status field if present (takes priority)
  • Falls back to application-status for backward compatibility with older messages
  1. Tests Added
  • JobResultTest: 3 new tests for SUSPENDED state handling
  • JobResultDeserializerTest: 3 new tests for serialization with SUSPENDED state

Test Results

  Tests run: 17, Failures: 0, Errors: 0, Skipped: 0
  BUILD SUCCESS

Error Message Comparison

Before:

  Job completed with illegal status: null.

After:

  Job is in state SUSPENDED. This commonly happens when the JobManager lost leadership.
  The job may recover automatically if High Availability and a persistent job store are configured.
  If recovery is not possible (e.g., non-persistent ExecutionPlanStore), the job needs to be resubmitted.

@flinkbot
Copy link
Collaborator

flinkbot commented Jan 23, 2026

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

jobStatus == null || jobStatus.isGloballyTerminalState(),
"jobStatus must be globally terminal or unknow(null)");
jobStatus == null || jobStatus.isTerminalState(),
"jobStatus must be terminal or unknown(null)");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does unknown(null) mean? Can we say null?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes thanks for pointing this out! I think null would be good enough! I will change this.

new JobExecutionException(
jobId,
"Job completed with illegal status: " + jobStatus + '.',
"Job completed with unexpected status: " + jobStatus + '.',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is jobStatus is null then this will look like null."some cause". If there no cause then we should not put out the . cause. I suggest to ternary ifs to deal with these situations.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do!

@github-actions github-actions bot added the community-reviewed PR has been reviewed by the community. label Jan 23, 2026
@teamconfx
Copy link
Author

@davidradl Thank you so much for the review. Can you take a look at my new changes?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-reviewed PR has been reviewed by the community.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants