fix(minion): fail single-segment task when segment upload fails#18813
fix(minion): fail single-segment task when segment upload fails#18813tarun11Mavani wants to merge 3 commits into
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #18813 +/- ##
============================================
+ Coverage 64.76% 64.80% +0.04%
- Complexity 1319 1335 +16
============================================
Files 3392 3392
Lines 210949 210953 +4
Branches 33119 33118 -1
============================================
+ Hits 136611 136716 +105
+ Misses 63323 63208 -115
- Partials 11015 11029 +14
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
xiangfu0
left a comment
There was a problem hiding this comment.
Found one high-signal operability issue; see inline comment.
|
|
||
| if (uploadSuccessful) { | ||
| LOGGER.info("Done executing {} on table: {}, segment: {}", taskType, tableNameWithType, segmentName); | ||
| throw e; |
There was a problem hiding this comment.
In METADATA push mode this rethrow turns controller-side push failures into task retries, but the retry comes back through moveSegmentToOutputPinotFS() with the same <segment>.tar.gz target. overwriteOutput is normally left false in MinionTaskUtils.getPushTaskConfig(), so if the first attempt already copied the tar before failing, the retry now dies on Output file already exists before it can resend metadata. That makes transient metadata-push failures sticky instead of self-healing. Can we make the staged tar idempotent across retries or clean it up before rethrowing?
There was a problem hiding this comment.
Good catch — fixed in 9863fcc.
uploadSegmentWithMetadata now deletes the staged tar from the output PinotFS if the metadata send fails, before rethrowing, so a retry re-stages cleanly instead of hitting "Output file already exists".
I went with cleanup-on-failure rather than forcing overwriteOutput=true, so the existing guard against unrelated collisions in the output dir stays intact — we only remove the tar when it's our own stale partial from a failed attempt.
Added testExecuteTaskCleansUpStagedTarWhenMetadataPushFails, verified it fails without the cleanup.
BaseSingleSegmentConversionExecutor caught upload exceptions, logged and metered them, then returned the conversion result normally -- so the minion task reported SUCCESS even though the converted segment was never uploaded. Helix marked the task COMPLETED and never retried, silently leaving the segment un-refreshed/un-purged/un-compacted. The swallow was an accidental control-flow change introduced in apache#10978 (observability for upload/download failures): the download branch metered, logged, and rethrew, but the upload branch omitted the rethrow. Rethrow the upload exception, mirroring the download path, so the task fails and is retried. Move the tarred-file cleanup into a finally block so it still runs on the failure path, and remove the now-dead uploadSuccessful flag. Affects all single-segment conversion tasks (RealtimeToOffline, Purge, RefreshSegment, UpsertCompaction, etc.). On upload failure these now report task FAILURE (and retry) instead of SUCCESS; operators alerting on task state will see failures that were previously hidden.
…andling Adds BaseSingleSegmentConversionExecutorTest covering executeTask's upload- failure path: a failed segment upload must propagate (task fails and retries) instead of being silently reported as success. The test static-mocks SegmentConversionUtils.uploadSegment and uses a test-only executor that stubs the download/CRC/convert/ZK-modifier hooks so executeTask reaches the upload step without a server, controller, or deep store. Includes a success-path control test. Verified to fail on the pre-fix executor and pass on the fixed one.
In METADATA push mode, uploadSegmentWithMetadata stages the converted tar to the output PinotFS before sending metadata to the controller. With the upload-failure rethrow, a failed metadata send now triggers a task retry whose moveSegmentToOutputPinotFS would fail with "Output file already exists" (overwriteOutput defaults to false), making transient push failures permanently stuck. Delete the staged tar on metadata-push failure before rethrowing so the retry can re-stage it and self-heal. Cleanup-on-failure is used instead of forcing overwriteOutput=true so the existing collision guard stays intact. Add a regression test verifying the staged tar is deleted on metadata-push failure (fails without the cleanup).
9863fcc to
ba5a53b
Compare
Summary
BaseSingleSegmentConversionExecutorcaught segment-upload exceptions, logged and metered them, and then returned the conversion result normally — so the minion task was reported as SUCCESS even though the converted segment was never uploaded. Helix marked the task COMPLETED and never retried it, silently leavingthe segment un-refreshed / un-purged / un-compacted.
This affects all single-segment conversion tasks:
RealtimeToOfflineSegments,PurgeTask,RefreshSegment,UpsertCompaction, etc.Root cause
The swallow was an accidental control-flow change introduced in #10978 ("Add minion observability for segment upload/download failures"). That PR wrapped both the download and the upload calls to add metrics + logging:
So the upload path has silently swallowed failures since June 2023. The download-path asymmetry shows this was an oversight, not an intentional "don't retry on upload failure" design — the PR's stated intent was observability only.
Change
finallyblock so it still runs on the failure path (the file also lives undertempDataDir, which the outerfinallyalready deletes, so cleanup is preserved either way).uploadSuccessfulflag.Testing
Added
BaseSingleSegmentConversionExecutorTest(new file):testExecuteTaskRethrowsWhenUploadFails— static-mocksSegmentConversionUtils.uploadSegmentto throw and assertsexecuteTaskpropagates the exception (the regression guard for this fix).testExecuteTaskSucceedsWhenUploadSucceeds— control test asserting thesuccess path returns the conversion result and the upload is invoked.
The test uses a test-only executor that stubs the download / CRC / convert / ZK-metadata-modifier hooks so
executeTaskreaches the upload step without a server, controller, or deep store, and restores the mutated process-global state in teardown.