Skip to content

Fix WorkItemDispatcher.DispatchAsync silent termination due to unhandled exceptions#1322

Open
berndverst wants to merge 4 commits intoAzure:mainfrom
berndverst:fix/dispatch-loop-silent-termination
Open

Fix WorkItemDispatcher.DispatchAsync silent termination due to unhandled exceptions#1322
berndverst wants to merge 4 commits intoAzure:mainfrom
berndverst:fix/dispatch-loop-silent-termination

Conversation

@berndverst
Copy link
Member

@berndverst berndverst commented Mar 18, 2026

Summary

Fixes #1320

The activity dispatcher's DispatchAsync loop in WorkItemDispatcher can silently terminate due to unhandled exceptions in code paths outside the fetch try/catch block. Because the dispatch loop runs as a fire-and-forget Task.Run, any unhandled exception is silently swallowed — no error is logged, no DispatcherStopped event is emitted, and no recovery mechanism exists.

When this happens, the activity dispatcher dies permanently while the orchestration dispatcher continues running normally. Orchestrations keep scheduling activities, but no activities are ever picked up from the work-item queue. The only recovery is a full host restart.

Problem

In WorkItemDispatcher.DispatchAsync, the fetch operation is wrapped in a try/catch, but several code paths outside this try/catch can throw unhandled exceptions:

Code Path Exception Risk
concurrencyLock.WaitAsync(TimeSpan.FromSeconds(5)) ObjectDisposedException if the semaphore is disposed
SafeReleaseWorkItem(workItem) Any exception from the release handler
Task.Delay(TimeSpan.FromSeconds(delaySecs)) ObjectDisposedException if the CancellationTokenSource is disposed
concurrencyLock.Release() SemaphoreFullException on double-release

When any of these throw, the DispatchAsync Task transitions to Faulted state and the exception is silently swallowed by .NET's unobserved exception handler. The dispatch loop terminates with zero telemetry — no errors logged, no DispatcherStopped event.

Changes

WorkItemDispatcher.cs

  • Outer try/catch in DispatchAsync: Wrapped the entire while loop body in a catch (Exception exception) when (!Utils.IsFatal(exception)) block that:
    • Logs the error via both LogHelper.DispatcherLoopFailed() and TraceHelper.TraceException() at Error level
    • Releases the concurrency semaphore if acquired but not handed off to ProcessWorkItemAsync (prevents permanent concurrency reduction)
    • Backs off for 10 seconds with a cancellable Task.Delay that respects shutdownCancellationTokenSource.Token (so shutdown isn't delayed by the backoff)
    • Guards against ObjectDisposedException from a disposed CTS during rapid shutdown
    • Allows the loop to continue running instead of silently dying
  • Semaphore leak prevention: Added semaphoreAcquired and scheduledWorkItem tracking at the outer loop scope. The outer catch only releases the semaphore when it was acquired (semaphoreAcquired) but not yet transferred to ProcessWorkItemAsync (!scheduledWorkItem), preventing both leaks and double-releases.
  • ContinueWith on Task.Run in StartAsync: Added .ContinueWith(..., OnlyOnFaulted) as a last-resort safety net that logs at Critical level if the dispatch loop terminates due to a fatal exception (i.e., OutOfMemoryException or StackOverflowException) that bypasses the when (!Utils.IsFatal(exception)) filter

New structured logging infrastructure

Added DispatcherLoopFailed (EventId 30) across all logging layers:

File Change
Logging/EventIds.cs New DispatcherLoopFailed = 30 constant
Logging/LogEvents.cs New DispatcherLoopFailed structured log event class (LogLevel.Error) with Dispatcher and Details fields
Logging/StructuredEventSource.cs New ETW event at EventLevel.Error
Logging/LogHelper.cs New DispatcherLoopFailed(context, exception) method

New test file: test/DurableTask.Core.Tests/WorkItemDispatcherTests.cs

7 unit tests covering the dispatch loop resilience and the new log event:

Test What it verifies
DispatchLoop_SurvivesOuterException_ViaFaultyDelayCallback Outer catch handles exceptions escaping from inner catch callbacks, logs DispatcherLoopFailed, and the loop recovers
DispatchLoop_SurvivesMultipleExceptionTypes_AndContinuesProcessing Dispatch loop survives InvalidOperationException, TimeoutException, and TaskCanceledException in sequence
DispatchLoop_LogsFetchWorkItemFailure_WhenFetchThrows Inner catch logs FetchWorkItemFailure when fetch throws
DispatchLoop_ProcessesWorkItemsSuccessfully Normal work item processing still functions correctly
DispatchLoop_StopsGracefully_WhenNoExceptions DispatcherStopped is logged and DispatcherLoopFailed is NOT logged during graceful shutdown
DispatcherLoopFailed_LogEvent_HasCorrectProperties New log event has correct EventId, LogLevel.Error, and formatted message
DispatcherLoopFailed_LogEvent_ExposesStructuredFields Structured log fields (Dispatcher, Details) are accessible via IReadOnlyDictionary

All tests pass on both net6.0 and net48. Tests use event-driven signaling (no fixed Task.Delay waits) with a shared 10-second timeout for CI reliability.

Diagnostic Improvement

After this fix, the previously-silent failure scenario will now produce:

  1. Error-level structured log: DispatcherLoopFailed with full exception details and dispatcher identity
  2. ETW event: At EventLevel.Error for ETW consumers
  3. TraceHelper trace: At TraceEventType.Error level for classic trace listeners
  4. Automatic recovery: The dispatch loop retries after a cancellable 10-second backoff instead of dying permanently

If a fatal exception somehow bypasses the catch filter, the ContinueWith continuation logs at Critical level so there is always visibility into dispatch loop termination.

Behavioral Notes

  • No breaking changes: All existing public API signatures and event IDs are unchanged
  • Backward compatible: The new EventId (30) fills a gap in the existing 20-29 range for dispatcher events
  • Existing behavior preserved: The inner try/catch for fetch operations is unchanged; this fix only adds an outer safety net
  • Utils.IsFatal filter preserved: OutOfMemoryException and StackOverflowException are not caught, matching the pattern used in ProcessWorkItemAsync
  • Shutdown-friendly: The backoff delay is cancellable via the shutdown token, so StopAsync doesn't wait the full 10 seconds

…led exceptions

Wrap the DispatchAsync loop body in an outer try/catch to prevent
the dispatch loop from silently dying when exceptions occur outside
the inner fetch try/catch block. Add ContinueWith to Task.Run in
StartAsync as a last-resort safety net for fatal exceptions.

Add new DispatcherLoopFailed structured log event (EventId 30) at
Error level with full exception details for telemetry visibility.

Add comprehensive unit tests for the dispatch loop resilience.

Fixes Azure#1320
Copilot AI review requested due to automatic review settings March 18, 2026 22:16
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens WorkItemDispatcher<T>.DispatchAsync against unhandled exceptions that can otherwise silently terminate the fire-and-forget dispatch loop, and adds a new structured/ETW log event for this failure mode.

Changes:

  • Add an outer try/catch around the dispatch loop body to log and retry after unexpected (non-fatal) exceptions.
  • Add a ContinueWith(..., OnlyOnFaulted) safety net to log if the dispatch task faults.
  • Introduce DispatcherLoopFailed (EventId 30) across structured logging + ETW, and add new unit tests.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
src/DurableTask.Core/WorkItemDispatcher.cs Wraps dispatch-loop body in an outer exception handler + adds faulted-task continuation logging.
src/DurableTask.Core/Logging/EventIds.cs Adds new EventId constant DispatcherLoopFailed = 30.
src/DurableTask.Core/Logging/LogEvents.cs Adds structured log event type DispatcherLoopFailed.
src/DurableTask.Core/Logging/StructuredEventSource.cs Adds ETW event for DispatcherLoopFailed.
src/DurableTask.Core/Logging/LogHelper.cs Adds helper method to emit DispatcherLoopFailed.
Test/DurableTask.Core.Tests/WorkItemDispatcherTests.cs Adds tests intended to validate dispatcher resilience and the new log event.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Bernd Verst added 2 commits March 18, 2026 15:26
- Replace hard-coded 30s timeouts with a shared 10s TestTimeout constant
- Rewrite broken SurvivesSafeReleaseWorkItemException test that always hit
  its 15s timeout (safeReleaseThrew was never set to true)  replaced with
  SurvivesMultipleExceptionTypes that tests various exception types and
  completes in <10ms
- Replace fixed Task.Delay(500) + Task.Delay(2000) in StopsGracefully test
  with event-driven signaling: a TaskCompletionSource confirms the loop is
  running before stopping, and polling with 50ms intervals replaces the 2s
  fixed sleep for DispatcherStopped detection
- Total test suite time reduced from ~20s to ~2-3s
- Fix semaphore leak in outer catch: track whether concurrencyLock was
  acquired and release it in the catch to avoid permanently reducing
  available concurrency
- Pass shutdownCancellationTokenSource.Token to Task.Delay in the outer
  catch backoff so shutdown requests don't wait the full 10s interval
- Rewrite test 1 to exercise the actual outer catch path by throwing from
  GetDelayInSecondsAfterOnFetchException (which escapes the inner catch)
  and assert that DispatcherLoopFailed is logged
- Rename test 3 from LogsErrorAndRetries_WhenOuterExceptionOccurs to
  LogsFetchWorkItemFailure_WhenFetchThrows to accurately describe what
  it tests (inner catch, not outer catch)
- Fix InMemoryLogger.BeginScope to return a no-op IDisposable instead
  of null to avoid NREs when scopes are disposed
Copilot AI review requested due to automatic review settings March 18, 2026 23:09
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens WorkItemDispatcher<T>.DispatchAsync so the dispatch loop doesn’t silently die when exceptions escape the existing fetch-only try/catch, adding a new structured/ETW log event (DispatcherLoopFailed) and introducing unit tests intended to validate the new resilience behavior.

Changes:

  • Wrap the dispatch loop body in an outer non-fatal exception handler that logs DispatcherLoopFailed and backs off before retrying.
  • Add a fault-only continuation to the fire-and-forget Task.Run to log critical failures if the dispatch loop faults.
  • Add new logging plumbing (EventId + structured log event + ETW event) and a new unit test file for dispatcher resilience.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/DurableTask.Core/WorkItemDispatcher.cs Adds outer catch/backoff around dispatch loop and a faulted-task continuation for last-resort visibility.
src/DurableTask.Core/Logging/EventIds.cs Introduces DispatcherLoopFailed = 30.
src/DurableTask.Core/Logging/LogEvents.cs Adds structured log event DispatcherLoopFailed with Dispatcher and Details fields.
src/DurableTask.Core/Logging/StructuredEventSource.cs Adds ETW event for DispatcherLoopFailed at Error level.
src/DurableTask.Core/Logging/LogHelper.cs Adds LogHelper.DispatcherLoopFailed(...) helper method.
Test/DurableTask.Core.Tests/WorkItemDispatcherTests.cs Adds new tests for dispatch loop recovery and new log event.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

- Fix double-release risk: move scheduledWorkItem declaration outside the
  inner try block so the outer catch can check whether the semaphore was
  handed off to ProcessWorkItemAsync before releasing it
- Guard CTS.Token against ObjectDisposedException in backoff delay: catch
  ObjectDisposedException alongside OperationCanceledException so a
  disposed CTS during rapid shutdown doesn't fault the dispatch loop
- Move test file from Test/ (uppercase) to test/ (lowercase) to match the
  csproj location and ensure tests compile on case-sensitive file systems
- Rename test from ViaFaultySafeReleaseWorkItem to ViaFaultyDelayCallback
  to accurately describe the mechanism being tested
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

WorkItemDispatcher.DispatchAsync can silently terminate due to unhandled exceptions in fire-and-forget Task.Run

2 participants