[Observability] Log interrupted processing tasks on unexpected worker death #9261
prince8273 wants to merge 2 commits
When a worker drops off the cluster unexpectedly (e.g., due to an OOM kill), the scheduler tracks the `processing_keys` but previously did not log them to the console. This change surfaces exactly which tasks were interrupted, significantly improving debugging provenance for cluster hangs and memory crashes.
Unit Test Results
See the test report for an extended history of previous test failures; this is useful for diagnosing flaky tests.
31 files ±0, 31 suites ±0, 11h 17m 10s ⏱️ −13m 52s
Results for commit 8213721. ± Comparison against base commit cf508b9.
This pull request removes 1 and adds 1 tests. Note that renamed tests count towards both.
Hi @fjetter! The 3 failing checks are pre-existing flakes unrelated to this PR: `test_RetireWorker_with_actor[True]` (Ubuntu 3.14) — a `Worker.close()` teardown timeout, a known slow test (~60s). Neither test touches the files changed in this PR (`scheduler.py`, `test_scheduler.py`).
Description
Fixes #9263
This PR introduces targeted observability for interrupted processing tasks during unexpected worker disconnections (e.g., OOM kills or hardware failures).
Previously, when `remove_worker()` was triggered with `expected=False`, the scheduler correctly tracked the `processing_keys` but failed to emit them to the logs. This created an observability gap: cluster operators could see a worker disconnect, but could not identify which specific task caused the failure without waiting for the task's retry limit to be exhausted.

This change adds a surgical `logger.warning` to surface the interrupted `processing_keys` precisely at the time of the worker's death, significantly improving failure provenance and debugging workflows for large clusters.

Implementation Details
Added a conditional log in `distributed/scheduler.py::remove_worker` to print `processing_keys` when a worker drops unexpectedly (`expected=False`).

Maintained consistency with the existing telemetry style of the adjacent `recompute_keys` and `lost_keys` logging blocks.

Moved and updated `test_log_remove_worker` to `distributed/tests/test_scheduler.py` to correctly test scheduler-level logs and expect the new observability warning during ungraceful shutdowns.
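The logging behavior described above can be sketched roughly as follows. This is an illustrative, self-contained approximation, not the actual `distributed/scheduler.py` code; the helper name `log_interrupted_tasks` and the message wording are hypothetical.

```python
import logging

logger = logging.getLogger("distributed.scheduler")


def log_interrupted_tasks(processing_keys, expected):
    """Hypothetical helper mirroring the PR's logic: when a worker is
    removed ungracefully (expected=False), warn with the keys of the
    tasks it was still processing."""
    if not expected and processing_keys:
        # Sort the keys so the log line is deterministic and grep-friendly.
        logger.warning(
            "Removing worker that was processing tasks: %s",
            sorted(processing_keys),
        )
```

On a graceful retirement (`expected=True`) or when no tasks were in flight, nothing extra is logged, so routine scale-down events stay quiet.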
Passes
pre-commit run --all-files
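A scheduler-level log assertion like the one described for `test_log_remove_worker` can be sketched in a self-contained way with a list-backed handler. This is illustrative only: the real test runs against an actual scheduler, and the logger name, message text, and task key below are made up.

```python
import logging


def test_warns_on_ungraceful_removal():
    # Illustrative sketch: capture warnings from a scheduler-style logger
    # and assert that interrupted task keys show up in the log output.
    logger = logging.getLogger("sketch.scheduler")
    messages = []

    class ListHandler(logging.Handler):
        def emit(self, record):
            messages.append(record.getMessage())

    logger.addHandler(ListHandler())
    logger.setLevel(logging.WARNING)

    # Simulate the new behavior: warn only on an unexpected removal.
    processing_keys, expected = {"inc-123"}, False
    if not expected and processing_keys:
        logger.warning(
            "Removing worker while tasks were in flight: %s",
            sorted(processing_keys),
        )

    assert any("inc-123" in m for m in messages)
```

The same handler trick lets the test confirm the inverse case: with `expected=True` the list stays empty, matching the quiet path for graceful shutdowns.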