## Summary
The activity dispatcher's `DispatchAsync` loop in `WorkItemDispatcher<T>` can silently terminate due to unhandled exceptions in code paths outside the fetch try/catch block. Because the dispatch loop runs as a fire-and-forget `Task.Run`, any unhandled exception is silently swallowed — no error is logged, no `DispatcherStopped` event is emitted, and no recovery mechanism exists.
When this happens, the activity dispatcher dies permanently while the orchestration dispatcher continues running normally. Orchestrations keep scheduling activities, but no activities are ever picked up from the work-item queue. The only recovery is a full host restart.
## Observed Behavior
- All activity functions across all function types stop executing simultaneously
- Orchestrations continue running and scheduling new activities (control queue processing is healthy)
- Work-item queue messages accumulate (activities are queued via `SendingMessage` but never consumed via `ReceivedMessage`)
- Zero `FunctionStarting`/`FunctionCompleted` events for any activity
- No `DispatcherStopped` event logged — the loop terminates without reaching the `DispatcherStopped` log at the end of `DispatchAsync`
- No errors logged anywhere in DurableTask-Core, DurableTask-AzureStorage, or the Durable extension
- Issue persists indefinitely until the host is restarted
- At restart, transient `ArgumentNullException: Value cannot be null. (Parameter 'executor')` errors appear in `TaskActivityShim` (race condition during function re-registration; self-resolves)
## Root Cause Analysis
### The dispatch loop (`DispatchAsync`)
In `WorkItemDispatcher.cs`, the `DispatchAsync` method runs a `while (this.isStarted)` loop. The fetch operation is wrapped in a try/catch, but several code paths outside this try/catch can throw unhandled exceptions:
| Code Path | Exception Risk |
|---|---|
| `concurrencyLock.WaitAsync(TimeSpan.FromSeconds(5))` | `ObjectDisposedException` if the semaphore is disposed |
| `SafeReleaseWorkItem(workItem)` | Any exception from the release handler |
| `Task.Delay(TimeSpan.FromSeconds(delaySecs))` | `ObjectDisposedException` if the CancellationTokenSource is disposed |
| `concurrencyLock.Release()` | `SemaphoreFullException` on double-release |
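The two `SemaphoreSlim` hazards from the table are easy to reproduce in isolation. A minimal, self-contained sketch (not DurableTask code) showing the exceptions thrown on double-release and on waiting after disposal:

```csharp
using System;
using System.Threading;

public static class SemaphoreHazards
{
    // Returns the exception type names hit by the two hazardous calls,
    // mirroring the concurrencyLock misuse paths in the table above.
    public static (string doubleRelease, string disposedWait) Demo()
    {
        string doubleRelease = "none", disposedWait = "none";

        var concurrencyLock = new SemaphoreSlim(1, maxCount: 1);

        // Release() without a matching Wait(): count is already at max.
        try { concurrencyLock.Release(); }
        catch (SemaphoreFullException e) { doubleRelease = e.GetType().Name; }

        // WaitAsync() after the semaphore has been disposed.
        concurrencyLock.Dispose();
        try { concurrencyLock.WaitAsync(TimeSpan.FromSeconds(5)).GetAwaiter().GetResult(); }
        catch (ObjectDisposedException e) { disposedWait = e.GetType().Name; }

        return (doubleRelease, disposedWait);
    }
}
```

Either exception, thrown between the fetch try/catch and the top of the loop, escapes `DispatchAsync` entirely.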
### The fire-and-forget pattern
The dispatch loop is started via:

```csharp
Task.Run(() => this.DispatchAsync(context));
```

This `Task` is never awaited or observed. If `DispatchAsync` throws an unhandled exception, the `Task` transitions to the `Faulted` state and the exception is silently swallowed by the `TaskScheduler`'s unobserved-exception handling (which by default does nothing in modern .NET).
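The swallowing behavior is easy to demonstrate outside DurableTask. A minimal sketch, where the hypothetical `Boom` method stands in for a `DispatchAsync` that dies with an unhandled exception:

```csharp
using System;
using System.Threading.Tasks;

public static class FireAndForget
{
    // Stand-in for a DispatchAsync that throws an unhandled exception.
    static void Boom() => throw new InvalidOperationException("dispatch loop died");

    public static string Demo()
    {
        // Fire-and-forget: the returned Task is never awaited or observed.
        Task loop = Task.Run(() => Boom());

        // Give the task time to fault. The exception does NOT propagate here:
        // it sits unobserved on 'loop', and the caller carries on normally.
        Task.Delay(500).Wait();

        return $"caller unaffected; loop status = {loop.Status}";
    }
}
```

The caller reports `loop status = Faulted`, yet no exception ever surfaces on any thread that would log it.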
### `ProcessWorkItemAsync` has better protection, but a gap
The `ProcessWorkItemAsync` method catches all exceptions via `catch (Exception exception) when (!Utils.IsFatal(exception))`, which handles everything except `OutOfMemoryException` and `StackOverflowException`. However, `ProcessWorkItemAsync` is also fire-and-forget (`Task.Run`), so a fatal exception would cause the same silent termination — of the processing task rather than the dispatch loop.
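One way to close both gaps without changing the loop bodies is to observe the fire-and-forget tasks through an awaiting wrapper. A hypothetical sketch; `ObserveFaults` and its callback are illustrative names, not existing DurableTask APIs:

```csharp
using System;
using System.Threading.Tasks;

public static class TaskObservation
{
    // Hypothetical helper: await the fire-and-forget task so any fault
    // (fatal filter aside) reaches a logging callback instead of vanishing.
    public static Task ObserveFaults(Task task, Action<Exception> onFault)
    {
        return ObserveAsync(task, onFault);
    }

    static async Task ObserveAsync(Task task, Action<Exception> onFault)
    {
        try
        {
            await task;
        }
        catch (Exception exception)
        {
            onFault(exception);
        }
    }

    // Small self-check: a faulting "dispatch loop" whose exception is captured.
    public static string Demo()
    {
        string logged = "none";
        Task faulting = Task.Run(() => Fail());
        ObserveFaults(faulting, ex => logged = ex.Message).Wait();
        return logged;
    }

    static void Fail() => throw new InvalidOperationException("dispatch loop terminated");
}
```

In the dispatcher, such a wrapper could cover both the `Task.Run(() => this.DispatchAsync(context))` launch and the `ProcessWorkItemAsync` launch.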
## Diagnostic Pattern
This is how to identify the issue in production telemetry (`DurableFunctionsEvents`):
- Activity Scheduled vs Started: `FunctionScheduled` (activity) continues while `FunctionStarting`/`FunctionCompleted` (activity) drops to zero
- Work-item queue vs control queues: In `DurableTask-AzureStorage` events, `ReceivedMessage` with empty `PartitionId` (work-item queue = activities) drops to zero, while `ReceivedMessage` with `control-XX` `PartitionId` (control queues = orchestrations) continues normally
- No DispatcherStopped event: The `DispatcherStopped` log at the end of `DispatchAsync` is never emitted during the gap
- No errors logged: Zero Warning/Error level events during the entire gap period
- Recovery only on restart: `DispatcherStarting`/`TaskHubWorkerStarted` events appear only when the host is restarted
## Suggested Fix
Wrap the entire `DispatchAsync` body in a try/catch that:

- Logs the exception at Error level
- Emits a `DispatcherStopped` event (or a new `DispatcherFailed` event)
- Either retries the loop iteration (with backoff) or triggers a graceful shutdown
For example:
```csharp
async Task DispatchAsync(WorkItemDispatcherContext context)
{
    string dispatcherId = context.DispatcherId;
    bool logThrottle = true;

    while (this.isStarted)
    {
        try
        {
            // ... existing dispatch loop body ...
        }
        catch (Exception exception) when (!Utils.IsFatal(exception))
        {
            // Log the exception so it's visible in telemetry
            this.LogHelper.DispatcherFailed(context, exception);
            TraceHelper.TraceException(
                TraceEventType.Error,
                "WorkItemDispatcherDispatch-UnhandledException",
                exception,
                this.GetFormattedLog(
                    dispatcherId,
                    "Unhandled exception in dispatch loop. Will retry after backoff."));

            // Back off before retrying the loop
            await Task.Delay(TimeSpan.FromSeconds(10));
        }
    }

    this.LogHelper.DispatcherStopped(context);
}
```

Similarly, consider adding a continuation to the `Task.Run` call to at least log when a dispatch loop terminates unexpectedly:
```csharp
Task.Run(() => this.DispatchAsync(context))
    .ContinueWith(
        t =>
        {
            if (t.IsFaulted)
            {
                TraceHelper.TraceException(
                    TraceEventType.Critical,
                    "WorkItemDispatcherDispatch-FatalTermination",
                    t.Exception,
                    $"Dispatch loop for '{this.name}' terminated fatally!");
            }
        },
        TaskContinuationOptions.OnlyOnFaulted);
```

## Environment
- DurableTask.Core version: 1.8.4 (but the issue exists in the current `main` branch as well — the code pattern hasn't changed)
- Storage backend: Azure Storage
- Hosting: Azure Functions (Dedicated P1v2), Windows, in-process .NET
- Durable Functions Extension: 2.4.1
## Related Issues
- #629 — Handling of OutOfMemoryException in work items (related: `Utils.IsFatal` only covers OOM and StackOverflow)
- #886 — Critical performance degradation due to TypeMissingException (different symptom in `ProcessWorkItemAsync`, but same `WorkItemDispatcher` area)
- #803 — `ArgumentNullException: Value cannot be null. (Parameter 'executor')` in TaskActivityShim (the symptom seen at restart/recovery, not the root cause)