
[Observability] Log interrupted processing tasks on unexpected worker…#9261

Open
prince8273 wants to merge 2 commits into dask:main from prince8273:main

Conversation

prince8273 commented May 14, 2026

Description

Fixes #9263

This PR introduces targeted observability for interrupted processing tasks during unexpected worker disconnections (e.g., OOM kills or hardware failures).

Previously, when remove_worker() was triggered with expected=False, the scheduler correctly tracked the processing_keys but failed to emit them to the logs. This created an observability gap where cluster operators could see a worker disconnect, but could not identify which specific task caused the failure without waiting for the task's retry limit to be exhausted.

This change adds a surgical logger.warning to surface the interrupted processing_keys precisely at the time of the worker's death, significantly improving failure provenance and debugging workflows for large clusters.
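
To make the shape of the change concrete, here is a minimal, self-contained sketch. The helper name `log_interrupted_processing` is hypothetical (per the description, the real change is inline in `remove_worker`), and only the `processing_keys` and `expected=False` names come from this PR; the message wording is illustrative.

```python
import logging

logger = logging.getLogger("distributed.scheduler")


def log_interrupted_processing(address: str, processing_keys: set, expected: bool) -> None:
    # Hypothetical helper mirroring the PR's logging: warn about tasks that
    # were mid-flight on a worker that disconnected ungracefully.
    if not expected and processing_keys:
        logger.warning(
            "Worker %s was removed unexpectedly while processing %d task(s): %s",
            address,
            len(processing_keys),
            sorted(processing_keys),
        )


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    # Ungraceful disconnect (e.g. an OOM kill): emits the warning.
    log_interrupted_processing("tcp://10.0.0.5:40123", {"task-a", "task-b"}, expected=False)
    # Graceful retirement: stays silent, so routine scale-down does not spam logs.
    log_interrupted_processing("tcp://10.0.0.5:40123", {"task-a"}, expected=True)
```

Gating on `expected=False` keeps the warning scoped to genuine failures rather than planned worker retirements.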

Implementation Details

  • Added a conditional log in distributed/scheduler.py::remove_worker to print processing_keys when a worker drops unexpectedly (expected=False).

  • Maintained consistency with the existing telemetry style of the adjacent recompute_keys and lost_keys logging blocks.

  • Moved and updated test_log_remove_worker to distributed/tests/test_scheduler.py to correctly test scheduler-level logs and expect the new observability warning during ungraceful shutdowns (a sketch of the test's rough shape follows this list).

  • Tests added / passed

  • Passes pre-commit run --all-files
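
As referenced in the test bullet above, the relocated test plausibly takes roughly the following shape. This is a sketch, not the actual diff: it assumes the `gen_cluster`, `captured_logger`, and `slowinc` helpers from `distributed.utils_test` and a `remove_worker` signature that accepts `stimulus_id` and `expected` keywords; the key name and assertion are illustrative.

```python
import asyncio

from distributed.utils_test import captured_logger, gen_cluster, slowinc


@gen_cluster(client=True)
async def test_log_remove_worker(c, s, a, b):
    # Pin a slow task to worker `a` and keep a reference to the future so
    # the key is still in processing when the worker is torn down.
    fut = c.submit(slowinc, 1, delay=5, key="interrupted-key", workers=[a.address])
    while "interrupted-key" not in s.tasks or not s.tasks["interrupted-key"].processing_on:
        await asyncio.sleep(0.01)

    with captured_logger("distributed.scheduler") as log:
        # expected=False models an ungraceful disconnect (e.g. an OOM kill).
        await s.remove_worker(address=a.address, stimulus_id="test", expected=False)

    # The new warning should name the interrupted key.
    assert "interrupted-key" in log.getvalue()
```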

… death

When a worker drops off the cluster unexpectedly (e.g., due to an OOM kill), the scheduler tracks the processing_keys but previously did not log them to the console. This change surfaces exactly which tasks were interrupted, significantly improving debugging provenance for cluster hangs and memory crashes.
@prince8273 prince8273 requested a review from fjetter as a code owner May 14, 2026 06:39
github-actions bot (Contributor) commented May 14, 2026

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    31 files ±0      31 suites ±0    11h 17m 10s ⏱️ −13m 52s
 4 121 tests ±0    4 014 ✅ +2      105 💤 ±0    2 ❌ −2
59 815 runs  ±0   57 325 ✅ +4    2 488 💤 ±0    2 ❌ −4

For more details on these failures, see this check.

Results for commit 8213721. ± Comparison against base commit cf508b9.

This pull request removes 1 test and adds 1 test. Note that renamed tests count towards both.
distributed.tests.test_worker ‑ test_log_remove_worker
distributed.tests.test_scheduler ‑ test_log_remove_worker

♻️ This comment has been updated with latest results.

prince8273 (Author) commented May 14, 2026

Hi @fjetter! The 3 failing checks are pre-existing flakes unrelated to this PR:

  • test_RetireWorker_with_actor[True] (Ubuntu 3.14) — Worker.close() teardown timeout, a known slow test (~60s)

  • test_simple in test_scheduler_bokeh.py (Windows 3.10) — HTTP dashboard fetch timeout

Neither test touches the files changed in this PR (scheduler.py, test_scheduler.py).
Could you please trigger a re-run?
Thank you! 😊



Development

Successfully merging this pull request may close these issues.

[Observability] Log processing keys on unexpected worker disconnects
