[Observability] Log processing keys on unexpected worker disconnects #9263

@prince8273

Description

Currently, when a worker disconnects unexpectedly (e.g., after an OOM kill or a hardware failure), the scheduler collects processing_keys inside remove_worker but never writes them to the logs.

This creates an observability gap: cluster operators can see the worker disconnect, but cannot easily identify which specific task caused the failure without waiting for the task's retry limit to be completely exhausted.

Expected Behavior

The scheduler should log a warning containing the processing_keys at the moment the worker is removed. The message should follow the style of the adjacent recompute_keys and lost_keys logging blocks, so that operators of large clusters can trace a worker failure to the tasks it was running.
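To make the request concrete, here is a minimal standalone sketch of the kind of warning I have in mind. It is not the actual remove_worker code: the helper name log_processing_keys, its expected and max_keys parameters, the logger name, and the message wording are all assumptions for illustration, and the real change would live next to the existing recompute_keys and lost_keys blocks.

```python
import logging

# Hypothetical logger name for this sketch; the scheduler has its own logger.
logger = logging.getLogger("sketch.scheduler")


def log_processing_keys(address, processing_keys, expected=False, max_keys=20):
    """Warn about tasks that were processing on a worker when it was removed.

    Hypothetical helper, not part of distributed. Returns the logged message,
    or None when nothing is logged (expected shutdown or no processing tasks).
    """
    if expected or not processing_keys:
        return None

    # Keys may be strings or tuples, so normalise to str before sorting.
    keys = sorted(str(key) for key in processing_keys)
    shown = keys[:max_keys]
    # Truncate long key lists so a busy worker does not flood the log.
    suffix = "" if len(keys) <= max_keys else f" (+{len(keys) - max_keys} more)"

    message = (
        f"Removing worker {address!r} interrupted "
        f"{len(keys)} processing task(s): {shown}{suffix}"
    )
    logger.warning(message)
    return message
```

Called as log_processing_keys("tcp://10.0.0.5:1234", {"inc-1", "inc-2"}), this would emit a single warning naming both keys. Capping the list at max_keys is a design choice for this sketch only; whether the real log should truncate is open for discussion.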
