## Summary
The activity dispatcher's `DispatchAsync` loop in `WorkItemDispatcher<T>` can silently terminate due to unhandled exceptions in code paths outside the fetch try/catch block. Because the dispatch loop runs as a fire-and-forget `Task.Run`, any unhandled exception is silently swallowed — no error is logged, no `DispatcherStopped` event is emitted, and no recovery mechanism exists.
When this happens, the activity dispatcher dies permanently while the orchestration dispatcher continues running normally. Orchestrations keep scheduling activities, but no activities are ever picked up from the work-item queue. The only recovery is a full host restart.
## Observed Behavior
- All activity functions across all function types stop executing simultaneously
- Orchestrations continue running and scheduling new activities (control queue processing is healthy)
- Work-item queue messages accumulate (activities are queued via `SendingMessage` but never consumed via `ReceivedMessage`)
- Zero `FunctionStarting`/`FunctionCompleted` events for any activity
- No `DispatcherStopped` event logged — the loop terminates without reaching the `DispatcherStopped` log at the end of `DispatchAsync`
- No errors logged anywhere in DurableTask-Core, DurableTask-AzureStorage, or the Durable extension
- Issue persists indefinitely until the host is restarted
- At restart, transient `ArgumentNullException: Value cannot be null. (Parameter 'executor')` errors appear in `TaskActivityShim` (race condition during function re-registration; self-resolves)
## Root Cause Analysis
### The dispatch loop (`DispatchAsync`)
In `WorkItemDispatcher.cs`, the `DispatchAsync` method runs a `while (this.isStarted)` loop. The fetch operation is wrapped in a try/catch, but several code paths outside this try/catch can throw unhandled exceptions:
| Code Path | Exception Risk |
|---|---|
| `concurrencyLock.WaitAsync(TimeSpan.FromSeconds(5))` | `ObjectDisposedException` if the semaphore is disposed |
| `SafeReleaseWorkItem(workItem)` | Any exception from the release handler |
| `Task.Delay(TimeSpan.FromSeconds(delaySecs))` | `ObjectDisposedException` if the CancellationTokenSource is disposed |
| `concurrencyLock.Release()` | `SemaphoreFullException` on double-release |
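The two `SemaphoreSlim` hazards from the table are easy to reproduce in isolation. A minimal, self-contained sketch (not DurableTask code) showing the exceptions thrown on double-release and on waiting after disposal:

```csharp
using System;
using System.Threading;

public static class SemaphoreHazards
{
    // Returns the exception type names hit by the two hazardous calls,
    // mirroring the concurrencyLock misuse paths in the table above.
    public static (string doubleRelease, string disposedWait) Demo()
    {
        string doubleRelease = "none", disposedWait = "none";

        var concurrencyLock = new SemaphoreSlim(1, maxCount: 1);

        // Release() without a matching Wait(): count is already at max.
        try { concurrencyLock.Release(); }
        catch (SemaphoreFullException e) { doubleRelease = e.GetType().Name; }

        // WaitAsync() after the semaphore has been disposed.
        concurrencyLock.Dispose();
        try { concurrencyLock.WaitAsync(TimeSpan.FromSeconds(5)).GetAwaiter().GetResult(); }
        catch (ObjectDisposedException e) { disposedWait = e.GetType().Name; }

        return (doubleRelease, disposedWait);
    }
}
```

Either exception, thrown between the fetch try/catch and the top of the loop, escapes `DispatchAsync` entirely.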
### The fire-and-forget pattern
The dispatch loop is started via:

```csharp
Task.Run(() => this.DispatchAsync(context));
```

This `Task` is never awaited or observed. If `DispatchAsync` throws an unhandled exception, the `Task` transitions to the `Faulted` state and the exception is silently swallowed by the `TaskScheduler`'s unobserved-exception handling (which by default does nothing in modern .NET).
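The swallowing behavior is easy to demonstrate outside DurableTask. A minimal sketch, where the hypothetical `Boom` method stands in for a `DispatchAsync` that dies with an unhandled exception:

```csharp
using System;
using System.Threading.Tasks;

public static class FireAndForget
{
    // Stand-in for a DispatchAsync that throws an unhandled exception.
    static void Boom() => throw new InvalidOperationException("dispatch loop died");

    public static string Demo()
    {
        // Fire-and-forget: the returned Task is never awaited or observed.
        Task loop = Task.Run(() => Boom());

        // Give the task time to fault. The exception does NOT propagate here:
        // it sits unobserved on 'loop', and the caller carries on normally.
        Task.Delay(500).Wait();

        return $"caller unaffected; loop status = {loop.Status}";
    }
}
```

The caller reports `loop status = Faulted`, yet no exception ever surfaces on any thread that would log it.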
### `ProcessWorkItemAsync` has better protection, but a gap
The `ProcessWorkItemAsync` method catches all exceptions via `catch (Exception exception) when (!Utils.IsFatal(exception))`, which handles everything except `OutOfMemoryException` and `StackOverflowException`. However, `ProcessWorkItemAsync` is also fire-and-forget (`Task.Run`), so a fatal exception would cause the same silent termination — of the processing task rather than the dispatch loop.
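One way to close both gaps without changing the loop bodies is to observe the fire-and-forget tasks through an awaiting wrapper. A hypothetical sketch; `ObserveFaults` and its callback are illustrative names, not existing DurableTask APIs:

```csharp
using System;
using System.Threading.Tasks;

public static class TaskObservation
{
    // Hypothetical helper: await the fire-and-forget task so any fault
    // (fatal filter aside) reaches a logging callback instead of vanishing.
    public static Task ObserveFaults(Task task, Action<Exception> onFault)
    {
        return ObserveAsync(task, onFault);
    }

    static async Task ObserveAsync(Task task, Action<Exception> onFault)
    {
        try
        {
            await task;
        }
        catch (Exception exception)
        {
            onFault(exception);
        }
    }

    // Small self-check: a faulting "dispatch loop" whose exception is captured.
    public static string Demo()
    {
        string logged = "none";
        Task faulting = Task.Run(() => Fail());
        ObserveFaults(faulting, ex => logged = ex.Message).Wait();
        return logged;
    }

    static void Fail() => throw new InvalidOperationException("dispatch loop terminated");
}
```

In the dispatcher, such a wrapper could cover both the `Task.Run(() => this.DispatchAsync(context))` launch and the `ProcessWorkItemAsync` launch.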
## Diagnostic Pattern
This is how to identify the issue in production telemetry (`DurableFunctionsEvents`):
- Activity Scheduled vs Started: `FunctionScheduled` (activity) continues while `FunctionStarting`/`FunctionCompleted` (activity) drops to zero
- Work-item queue vs control queues: In `DurableTask-AzureStorage` events, `ReceivedMessage` with empty `PartitionId` (work-item queue = activities) drops to zero, while `ReceivedMessage` with `control-XX` `PartitionId` (control queues = orchestrations) continues normally
- No DispatcherStopped event: The `DispatcherStopped` log at the end of `DispatchAsync` is never emitted during the gap
- No errors logged: Zero Warning/Error level events during the entire gap period
- Recovery only on restart: `DispatcherStarting`/`TaskHubWorkerStarted` events appear only when the host is restarted
## Suggested Fix
Wrap the entire `DispatchAsync` body in a try/catch that:

- Logs the exception at Error level
- Emits a `DispatcherStopped` event (or a new `DispatcherFailed` event)
- Either retries the loop iteration (with backoff) or triggers a graceful shutdown
For example:
```csharp
async Task DispatchAsync(WorkItemDispatcherContext context)
{
    string dispatcherId = context.DispatcherId;
    bool logThrottle = true;

    while (this.isStarted)
    {
        try
        {
            // ... existing dispatch loop body ...
        }
        catch (Exception exception) when (!Utils.IsFatal(exception))
        {
            // Log the exception so it's visible in telemetry
            this.LogHelper.DispatcherFailed(context, exception);
            TraceHelper.TraceException(
                TraceEventType.Error,
                "WorkItemDispatcherDispatch-UnhandledException",
                exception,
                this.GetFormattedLog(
                    dispatcherId,
                    "Unhandled exception in dispatch loop. Will retry after backoff."));

            // Back off before retrying the loop
            await Task.Delay(TimeSpan.FromSeconds(10));
        }
    }

    this.LogHelper.DispatcherStopped(context);
}
```

Similarly, consider adding a continuation to the `Task.Run` call to at least log when a dispatch loop terminates unexpectedly:
```csharp
Task.Run(() => this.DispatchAsync(context))
    .ContinueWith(
        t =>
        {
            if (t.IsFaulted)
            {
                TraceHelper.TraceException(
                    TraceEventType.Critical,
                    "WorkItemDispatcherDispatch-FatalTermination",
                    t.Exception,
                    $"Dispatch loop for '{this.name}' terminated fatally!");
            }
        },
        TaskContinuationOptions.OnlyOnFaulted);
```

## Environment
- DurableTask.Core version: 1.8.4 (but the issue exists in the current `main` branch as well — the code pattern hasn't changed)
- Storage backend: Azure Storage
- Hosting: Azure Functions (Dedicated P1v2), Windows, in-process .NET
- Durable Functions Extension: 2.4.1
## Related Issues
- #629 — Handling of OutOfMemoryException in work items (related: `Utils.IsFatal` only covers OOM and StackOverflow)
- #886 — Critical performance degradation due to TypeMissingException (different symptom in `ProcessWorkItemAsync`, but same `WorkItemDispatcher` area)
- #803 — `ArgumentNullException: Value cannot be null. (Parameter 'executor')` in TaskActivityShim (the symptom seen at restart/recovery, not the root cause)