fix: Multi-Node Jobs ended_at time not being set#184
fix: Multi-Node Jobs ended_at time not being set#184yuechao-qin wants to merge 1 commit intomasterfrom
ended_at time not being set#184Conversation
| _MULTI_NODE_NODE_INDEX_ENV_VAR_NAME = "_TANGLE_MULTI_NODE_NODE_INDEX" | ||
|
|
||
|
|
||
| class JobConditionType(str, enum.Enum): |
| FAILURE_TARGET = "FailureTarget" | ||
|
|
||
|
|
||
| class ConditionStatus(str, enum.Enum): |
| return datetime.datetime.now(tz=datetime.timezone.utc) | ||
|
|
||
|
|
||
| def _set_terminal_state( |
There was a problem hiding this comment.
I'm not sure this function adds any value. It might also be adding confusion as it's not obvious (from the sall site) what exactly it is doing. Let's revert it.
There was a problem hiding this comment.
I messed up on this function. I meant to make the ended_at also required, not optional. So that it's clear for any terminal state transitions it requires 1) status and 2) ended time changes.
Do you think this function is helpful then?
Else, do you have any suggestions on how to 1) make it clear to the developers where the terminal transitions are for a job and 2) what required changes are needed when transitioning to a terminal state?
|
Nit: This looks like a fix combined with a refactoring. This increases the diff and makes it harder to see what has changed and what hasn't. The fix changes are hidden among refactoring changes. |
|
|
||
| with ( | ||
| mock.patch( | ||
| "cloud_pipelines_backend.orchestrator_sql._get_current_time", |
There was a problem hiding this comment.
What happens if we make this a real obejct, not a string? This would be more robust if something is renamed.
| info.is_dir = False | ||
| info.total_size = 10 | ||
| info.hashes = {"md5": "abc123"} | ||
| storage.make_uri.return_value.get_reader.return_value.get_info.return_value = ( |
There was a problem hiding this comment.
This looks brittle. It might be easier to creake a mock storage provider (or a mocked GCS storage provider) and give it to the launcher/orchestrator.
| return datetime.datetime(2026, 3, 20, hour, 0, tzinfo=datetime.timezone.utc) | ||
|
|
||
|
|
||
| def _make_execution_node( |
There was a problem hiding this comment.
_make_mock_execution_node?
| return lc | ||
|
|
||
|
|
||
| def _make_mock_session() -> mock.MagicMock: |
There was a problem hiding this comment.
It might be easier to work with a real in-memory Sqlite DB session than mock it. That would also allow testing more end-to-end with less mocking.

Background
Closes https://github.com/Shopify/oasis-frontend/issues/542
Kubernetes Job conditions use
"Complete"for successful jobs and"Failed"for failed jobs. TheLaunchedKubernetesJob.ended_atproperty was checking for"Succeeded"instead of"Complete", which meantended_atwas alwaysNonefor successful K8s Jobs. This caused the API to return no end time for completed executions.Separately, when the orchestrator hit an internal error processing a running execution (SYSTEM_ERROR), it set the status but never recorded
ended_at, making it impossible to tell when the failure occurred.Fix
ended_atnow correctly resolves for successful K8s Jobs by matching the"Complete"condition typeended_atso the API shows when the error occurredstatus,ended_at,exit_code,started_at) are set through a single helper to prevent future inconsistenciesChanges
cloud_pipelines_backend/launchers/kubernetes_launchers.pyJobConditionTypeandConditionStatusenums with K8s API doc references, replacing magic strings throughout the fileended_atto check("Complete", "Failed")instead of("Succeeded", "Failed")cloud_pipelines_backend/orchestrator_sql.py_set_terminal_state()helper that sets all terminal fields on aContainerExecutionin one placeended_at(was previously missing)Test