Fix WorkItemDispatcher.DispatchAsync silent termination due to unhandled exceptions#1322
Fix WorkItemDispatcher.DispatchAsync silent termination due to unhandled exceptions#1322berndverst wants to merge 4 commits intoAzure:mainfrom
Conversation
…led exceptions Wrap the DispatchAsync loop body in an outer try/catch to prevent the dispatch loop from silently dying when exceptions occur outside the inner fetch try/catch block. Add ContinueWith to Task.Run in StartAsync as a last-resort safety net for fatal exceptions. Add new DispatcherLoopFailed structured log event (EventId 30) at Error level with full exception details for telemetry visibility. Add comprehensive unit tests for the dispatch loop resilience. Fixes Azure#1320
There was a problem hiding this comment.
Pull request overview
This PR hardens WorkItemDispatcher<T>.DispatchAsync against unhandled exceptions that can otherwise silently terminate the fire-and-forget dispatch loop, and adds a new structured/ETW log event for this failure mode.
Changes:
- Add an outer
try/catcharound the dispatch loop body to log and retry after unexpected (non-fatal) exceptions. - Add a
ContinueWith(..., OnlyOnFaulted)safety net to log if the dispatch task faults. - Introduce
DispatcherLoopFailed(EventId 30) across structured logging + ETW, and add new unit tests.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| src/DurableTask.Core/WorkItemDispatcher.cs | Wraps dispatch-loop body in an outer exception handler + adds faulted-task continuation logging. |
| src/DurableTask.Core/Logging/EventIds.cs | Adds new EventId constant DispatcherLoopFailed = 30. |
| src/DurableTask.Core/Logging/LogEvents.cs | Adds structured log event type DispatcherLoopFailed. |
| src/DurableTask.Core/Logging/StructuredEventSource.cs | Adds ETW event for DispatcherLoopFailed. |
| src/DurableTask.Core/Logging/LogHelper.cs | Adds helper method to emit DispatcherLoopFailed. |
| Test/DurableTask.Core.Tests/WorkItemDispatcherTests.cs | Adds tests intended to validate dispatcher resilience and the new log event. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
- Replace hard-coded 30s timeouts with a shared 10s TestTimeout constant - Rewrite broken SurvivesSafeReleaseWorkItemException test that always hit its 15s timeout (safeReleaseThrew was never set to true) replaced with SurvivesMultipleExceptionTypes that tests various exception types and completes in <10ms - Replace fixed Task.Delay(500) + Task.Delay(2000) in StopsGracefully test with event-driven signaling: a TaskCompletionSource confirms the loop is running before stopping, and polling with 50ms intervals replaces the 2s fixed sleep for DispatcherStopped detection - Total test suite time reduced from ~20s to ~2-3s
- Fix semaphore leak in outer catch: track whether concurrencyLock was acquired and release it in the catch to avoid permanently reducing available concurrency - Pass shutdownCancellationTokenSource.Token to Task.Delay in the outer catch backoff so shutdown requests don't wait the full 10s interval - Rewrite test 1 to exercise the actual outer catch path by throwing from GetDelayInSecondsAfterOnFetchException (which escapes the inner catch) and assert that DispatcherLoopFailed is logged - Rename test 3 from LogsErrorAndRetries_WhenOuterExceptionOccurs to LogsFetchWorkItemFailure_WhenFetchThrows to accurately describe what it tests (inner catch, not outer catch) - Fix InMemoryLogger.BeginScope to return a no-op IDisposable instead of null to avoid NREs when scopes are disposed
There was a problem hiding this comment.
Pull request overview
This PR hardens WorkItemDispatcher<T>.DispatchAsync so the dispatch loop doesn’t silently die when exceptions escape the existing fetch-only try/catch, adding a new structured/ETW log event (DispatcherLoopFailed) and introducing unit tests intended to validate the new resilience behavior.
Changes:
- Wrap the dispatch loop body in an outer non-fatal exception handler that logs
DispatcherLoopFailedand backs off before retrying. - Add a fault-only continuation to the fire-and-forget
Task.Runto log critical failures if the dispatch loop faults. - Add new logging plumbing (EventId + structured log event + ETW event) and a new unit test file for dispatcher resilience.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| src/DurableTask.Core/WorkItemDispatcher.cs | Adds outer catch/backoff around dispatch loop and a faulted-task continuation for last-resort visibility. |
| src/DurableTask.Core/Logging/EventIds.cs | Introduces DispatcherLoopFailed = 30. |
| src/DurableTask.Core/Logging/LogEvents.cs | Adds structured log event DispatcherLoopFailed with Dispatcher and Details fields. |
| src/DurableTask.Core/Logging/StructuredEventSource.cs | Adds ETW event for DispatcherLoopFailed at Error level. |
| src/DurableTask.Core/Logging/LogHelper.cs | Adds LogHelper.DispatcherLoopFailed(...) helper method. |
| Test/DurableTask.Core.Tests/WorkItemDispatcherTests.cs | Adds new tests for dispatch loop recovery and new log event. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
- Fix double-release risk: move scheduledWorkItem declaration outside the inner try block so the outer catch can check whether the semaphore was handed off to ProcessWorkItemAsync before releasing it - Guard CTS.Token against ObjectDisposedException in backoff delay: catch ObjectDisposedException alongside OperationCanceledException so a disposed CTS during rapid shutdown doesn't fault the dispatch loop - Move test file from Test/ (uppercase) to test/ (lowercase) to match the csproj location and ensure tests compile on case-sensitive file systems - Rename test from ViaFaultySafeReleaseWorkItem to ViaFaultyDelayCallback to accurately describe the mechanism being tested
Summary
Fixes #1320
The activity dispatcher's
DispatchAsyncloop inWorkItemDispatchercan silently terminate due to unhandled exceptions in code paths outside the fetch try/catch block. Because the dispatch loop runs as a fire-and-forgetTask.Run, any unhandled exception is silently swallowed — no error is logged, noDispatcherStoppedevent is emitted, and no recovery mechanism exists.When this happens, the activity dispatcher dies permanently while the orchestration dispatcher continues running normally. Orchestrations keep scheduling activities, but no activities are ever picked up from the work-item queue. The only recovery is a full host restart.
Problem
In
WorkItemDispatcher.DispatchAsync, the fetch operation is wrapped in a try/catch, but several code paths outside this try/catch can throw unhandled exceptions:concurrencyLock.WaitAsync(TimeSpan.FromSeconds(5))ObjectDisposedExceptionif the semaphore is disposedSafeReleaseWorkItem(workItem)Task.Delay(TimeSpan.FromSeconds(delaySecs))ObjectDisposedExceptionif the CancellationTokenSource is disposedconcurrencyLock.Release()SemaphoreFullExceptionon double-releaseWhen any of these throw, the
DispatchAsyncTask transitions toFaultedstate and the exception is silently swallowed by .NET's unobserved exception handler. The dispatch loop terminates with zero telemetry — no errors logged, noDispatcherStoppedevent.Changes
WorkItemDispatcher.csDispatchAsync: Wrapped the entirewhileloop body in acatch (Exception exception) when (!Utils.IsFatal(exception))block that:LogHelper.DispatcherLoopFailed()andTraceHelper.TraceException()at Error levelProcessWorkItemAsync(prevents permanent concurrency reduction)Task.Delaythat respectsshutdownCancellationTokenSource.Token(so shutdown isn't delayed by the backoff)ObjectDisposedExceptionfrom a disposed CTS during rapid shutdownsemaphoreAcquiredandscheduledWorkItemtracking at the outer loop scope. The outer catch only releases the semaphore when it was acquired (semaphoreAcquired) but not yet transferred toProcessWorkItemAsync(!scheduledWorkItem), preventing both leaks and double-releases.ContinueWithonTask.RuninStartAsync: Added.ContinueWith(..., OnlyOnFaulted)as a last-resort safety net that logs at Critical level if the dispatch loop terminates due to a fatal exception (i.e.,OutOfMemoryExceptionorStackOverflowException) that bypasses thewhen (!Utils.IsFatal(exception))filterNew structured logging infrastructure
Added
DispatcherLoopFailed(EventId 30) across all logging layers:Logging/EventIds.csDispatcherLoopFailed = 30constantLogging/LogEvents.csDispatcherLoopFailedstructured log event class (LogLevel.Error) withDispatcherandDetailsfieldsLogging/StructuredEventSource.csEventLevel.ErrorLogging/LogHelper.csDispatcherLoopFailed(context, exception)methodNew test file:
test/DurableTask.Core.Tests/WorkItemDispatcherTests.cs7 unit tests covering the dispatch loop resilience and the new log event:
DispatchLoop_SurvivesOuterException_ViaFaultyDelayCallbackDispatcherLoopFailed, and the loop recoversDispatchLoop_SurvivesMultipleExceptionTypes_AndContinuesProcessingInvalidOperationException,TimeoutException, andTaskCanceledExceptionin sequenceDispatchLoop_LogsFetchWorkItemFailure_WhenFetchThrowsFetchWorkItemFailurewhen fetch throwsDispatchLoop_ProcessesWorkItemsSuccessfullyDispatchLoop_StopsGracefully_WhenNoExceptionsDispatcherStoppedis logged andDispatcherLoopFailedis NOT logged during graceful shutdownDispatcherLoopFailed_LogEvent_HasCorrectPropertiesDispatcherLoopFailed_LogEvent_ExposesStructuredFieldsDispatcher,Details) are accessible viaIReadOnlyDictionaryAll tests pass on both
net6.0andnet48. Tests use event-driven signaling (no fixedTask.Delaywaits) with a shared 10-second timeout for CI reliability.Diagnostic Improvement
After this fix, the previously-silent failure scenario will now produce:
DispatcherLoopFailedwith full exception details and dispatcher identityEventLevel.Errorfor ETW consumersTraceEventType.Errorlevel for classic trace listenersIf a fatal exception somehow bypasses the catch filter, the
ContinueWithcontinuation logs at Critical level so there is always visibility into dispatch loop termination.Behavioral Notes
Utils.IsFatalfilter preserved:OutOfMemoryExceptionandStackOverflowExceptionare not caught, matching the pattern used inProcessWorkItemAsyncStopAsyncdoesn't wait the full 10 seconds