WorkItemDispatcher.DispatchAsync can silently terminate due to unhandled exceptions in fire-and-forget Task.Run #1320

@berndverst

Summary

The activity dispatcher's DispatchAsync loop in WorkItemDispatcher<T> can silently terminate due to unhandled exceptions in code paths outside the fetch try/catch block. Because the dispatch loop runs as a fire-and-forget Task.Run, any unhandled exception is silently swallowed — no error is logged, no DispatcherStopped event is emitted, and no recovery mechanism exists.

When this happens, the activity dispatcher dies permanently while the orchestration dispatcher continues running normally. Orchestrations keep scheduling activities, but no activities are ever picked up from the work-item queue. The only recovery is a full host restart.

Observed Behavior

  • All activity functions across all function types stop executing simultaneously
  • Orchestrations continue running and scheduling new activities (control queue processing is healthy)
  • Work-item queue messages accumulate (activities are queued via SendingMessage but never consumed via ReceivedMessage)
  • Zero FunctionStarting / FunctionCompleted events for any activity
  • No DispatcherStopped event logged — the loop terminates without reaching the DispatcherStopped log at the end of DispatchAsync
  • No errors logged anywhere in DurableTask-Core, DurableTask-AzureStorage, or the Durable extension
  • Issue persists indefinitely until host is restarted
  • At restart, transient ArgumentNullException: Value cannot be null. (Parameter 'executor') errors appear in TaskActivityShim (race condition during function re-registration, self-resolves)

Root Cause Analysis

The dispatch loop (DispatchAsync)

In WorkItemDispatcher.cs, the DispatchAsync method runs a while (this.isStarted) loop. The fetch operation is wrapped in a try/catch, but several code paths outside this try/catch can throw unhandled exceptions:

Code Path                                          | Exception Risk
---------------------------------------------------|---------------------------------------------------------------
concurrencyLock.WaitAsync(TimeSpan.FromSeconds(5)) | ObjectDisposedException if the semaphore is disposed
SafeReleaseWorkItem(workItem)                      | Any exception from the release handler
Task.Delay(TimeSpan.FromSeconds(delaySecs))        | ObjectDisposedException if the CancellationTokenSource is disposed
concurrencyLock.Release()                          | SemaphoreFullException on double-release
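
Two of the semaphore failure modes listed above can be reproduced in isolation. The following is a minimal sketch, assuming concurrencyLock behaves like a SemaphoreSlim created with a maximum count; the class and method names (SemaphoreFailureDemo, DoubleRelease, WaitAfterDisposeAsync) are hypothetical and not part of DurableTask.Core.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-alone demo; not part of DurableTask.Core.
public static class SemaphoreFailureDemo
{
    // Releasing a semaphore that is already at its maximum count throws.
    public static string DoubleRelease()
    {
        using (var gate = new SemaphoreSlim(initialCount: 1, maxCount: 1))
        {
            try
            {
                gate.Release(); // count is already 1 == maxCount
                return "no exception";
            }
            catch (SemaphoreFullException ex)
            {
                return ex.GetType().Name;
            }
        }
    }

    // Waiting on a semaphore after Dispose() throws.
    public static async Task<string> WaitAfterDisposeAsync()
    {
        var gate = new SemaphoreSlim(1, 1);
        gate.Dispose();
        try
        {
            await gate.WaitAsync(TimeSpan.FromSeconds(5));
            return "no exception";
        }
        catch (ObjectDisposedException ex)
        {
            return ex.GetType().Name;
        }
    }
}
```

In the dispatcher neither call is guarded by a try/catch, so either exception would escape the while loop rather than being returned as a string as it is here.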

The fire-and-forget pattern

The dispatch loop is started via:

Task.Run(() => this.DispatchAsync(context));

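This Task is never awaited or observed. If DispatchAsync throws an unhandled exception, the Task transitions to the Faulted state and nothing ever reads its Exception property. The only remaining signal is the TaskScheduler.UnobservedTaskException event, which is raised when the faulted Task is finalized by the garbage collector. By default in modern .NET that event neither logs anything nor terminates the process, so the failure is effectively invisible.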
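
The silent failure can be illustrated outside the dispatcher. This is a minimal sketch, not the dispatcher's code: FaultingLoopAsync is a hypothetical stand-in for DispatchAsync, and the demo keeps a reference to the Task only so that its state can be inspected afterwards.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-alone demo; not part of DurableTask.Core.
public static class FireAndForgetDemo
{
    // Stand-in for DispatchAsync: throws from a path with no try/catch.
    static async Task FaultingLoopAsync()
    {
        await Task.Yield();
        throw new InvalidOperationException("simulated failure outside the fetch try/catch");
    }

    public static async Task<(bool faulted, string message)> RunAsync()
    {
        // Same shape as the dispatcher's Task.Run call.
        Task loop = Task.Run(() => FaultingLoopAsync());

        // WhenAny completes when `loop` completes but does not rethrow its exception,
        // so nothing surfaces here, just as nothing surfaces in the host.
        await Task.WhenAny(loop);

        // Reading Exception observes the failure; the real dispatcher never does this.
        string message = loop.Exception?.InnerException?.Message ?? string.Empty;
        return (loop.IsFaulted, message);
    }
}
```

Calling FireAndForgetDemo.RunAsync() from a Main method shows that the task ends up Faulted while the calling code keeps running, which matches the observed behavior of the orchestration dispatcher staying healthy while the activity dispatcher is gone.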

ProcessWorkItemAsync has better protection but a gap

The ProcessWorkItemAsync method catches all exceptions via catch (Exception exception) when (!Utils.IsFatal(exception)), which handles everything except OutOfMemoryException and StackOverflowException. However, ProcessWorkItemAsync is also fire-and-forget (Task.Run), so fatal exceptions would cause the same silent termination (but of the processing task, not the dispatch loop).

Diagnostic Pattern

To identify the issue in production telemetry (DurableFunctionsEvents), look for the following pattern:

  1. Activity Scheduled vs Started: FunctionScheduled (activity) continues while FunctionStarting/FunctionCompleted (activity) drops to zero
  2. Work-item queue vs control queues: In DurableTask-AzureStorage events, ReceivedMessage with empty PartitionId (work-item queue = activities) drops to zero, while ReceivedMessage with control-XX PartitionId (control queues = orchestrations) continues normally
  3. No DispatcherStopped event: The DispatcherStopped log at the end of DispatchAsync is never emitted during the gap
  4. No errors logged: Zero Warning/Error level events during the entire gap period
  5. Recovery only on restart: DispatcherStarting/TaskHubWorkerStarted events appear only when the host is restarted

Suggested Fix

Wrap the entire DispatchAsync body in a try/catch that:

  1. Logs the exception at Error level
  2. Emits a DispatcherStopped event (or a new DispatcherFailed event)
  3. Either retries the loop iteration (with backoff) or triggers a graceful shutdown

For example:

async Task DispatchAsync(WorkItemDispatcherContext context)
{
    string dispatcherId = context.DispatcherId;
    bool logThrottle = true;
    
    while (this.isStarted)
    {
        try
        {
            // ... existing dispatch loop body ...
        }
        catch (Exception exception) when (!Utils.IsFatal(exception))
        {
            // Log the exception so it's visible in telemetry.
            // DispatcherFailed is the proposed new log event from item 2 above.
            this.LogHelper.DispatcherFailed(context, exception);
            TraceHelper.TraceException(
                TraceEventType.Error,
                "WorkItemDispatcherDispatch-UnhandledException",
                exception,
                this.GetFormattedLog(dispatcherId,
                    "Unhandled exception in dispatch loop. Will retry after backoff."));

            // Backoff before retrying the loop
            await Task.Delay(TimeSpan.FromSeconds(10));
        }
    }
    
    this.LogHelper.DispatcherStopped(context);
}

Similarly, consider adding a continuation to the Task.Run call to at least log when a dispatch loop terminates unexpectedly:

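Task.Run(() => this.DispatchAsync(context))
    .ContinueWith(t =>
    {
        // OnlyOnFaulted means this runs only when the loop faulted,
        // so t.Exception is non-null and no IsFaulted check is needed.
        TraceHelper.TraceException(
            TraceEventType.Critical,
            "WorkItemDispatcherDispatch-FatalTermination",
            t.Exception,
            $"Dispatch loop for '{this.name}' terminated fatally!");
    },
    CancellationToken.None,
    TaskContinuationOptions.OnlyOnFaulted,
    TaskScheduler.Default);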
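
The effect of the per-iteration try/catch can be checked with a small stand-alone loop. This is a sketch under stated assumptions: ResilientLoopDemo and its parameters are hypothetical, the IsFatal helper only mirrors the behavior described above for Utils.IsFatal (OutOfMemoryException and StackOverflowException), and the error list stands in for real logging.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-alone demo; not part of DurableTask.Core.
public static class ResilientLoopDemo
{
    // Stand-in for Utils.IsFatal as described in this issue.
    static bool IsFatal(Exception ex) =>
        ex is OutOfMemoryException || ex is StackOverflowException;

    // Runs `iterations` passes and throws on pass number `failOn`.
    public static async Task<(int processed, List<string> errors)> RunAsync(int iterations, int failOn)
    {
        var errors = new List<string>();
        int processed = 0;

        for (int i = 1; i <= iterations; i++)
        {
            try
            {
                if (i == failOn)
                {
                    throw new InvalidOperationException($"iteration {i} failed");
                }
                processed++;
            }
            catch (Exception ex) when (!IsFatal(ex))
            {
                // Record the failure (stand-in for logging), back off, then continue.
                errors.Add(ex.Message);
                await Task.Delay(TimeSpan.FromMilliseconds(10));
            }
        }

        return (processed, errors);
    }
}
```

With RunAsync(5, 3), the loop records one error and still completes the other four iterations, whereas without the try/catch the third iteration would end the loop for good.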

Environment

  • DurableTask.Core version: 1.8.4 (but the issue exists in the current main branch as well — the code pattern hasn't changed)
  • Storage backend: Azure Storage
  • Hosting: Azure Functions (Dedicated P1v2), Windows, in-process .NET
  • Durable Functions Extension: 2.4.1
