Description
Currently, when a worker drops unexpectedly (e.g., due to OOM kills or hardware failures), the scheduler tracks `processing_keys` internally in `remove_worker` but fails to emit them to the logs.
This creates an observability gap: cluster operators can see the worker disconnect, but cannot easily identify which specific task caused the failure without waiting for the task's retry limit to be completely exhausted.
Expected Behavior
The scheduler should emit a targeted warning log containing the `processing_keys` at the exact time of the worker's death. This should maintain consistency with the existing telemetry style of the adjacent `recompute_keys` and `lost_keys` logging blocks, significantly improving failure provenance and debugging workflows for large clusters.
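For illustration, here is a minimal sketch of the kind of warning that could be emitted. The helper name, its signature, and the message wording are assumptions for this issue, not the scheduler's actual API; in practice the change would sit inline in `remove_worker`, next to the existing `recompute_keys` and `lost_keys` log lines.

```python
import logging

logger = logging.getLogger("distributed.scheduler")


def log_processing_keys_on_worker_loss(address: str, processing_keys: set) -> None:
    # Hypothetical helper: after remove_worker has collected the keys the dead
    # worker was still processing, emit them in a single warning so operators
    # can attribute the failure immediately, without waiting for the tasks'
    # retry limits to be exhausted.
    if processing_keys:
        logger.warning(
            "Worker %s was removed while processing %d task(s): %s",
            address,
            len(processing_keys),
            sorted(processing_keys),
        )
```

Logging all keys in one warning (rather than one line per key) keeps the output manageable on large clusters while still giving operators enough to correlate the failure with a specific task.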