Description
Bug
A customer's town spent $100+ in credits over 70 hours doing nothing productive. The container died and the reconciler entered an infinite dispatch-fail-triage-reset loop with no circuit breaker, no backoff, and no spending cap.
Town ID: be20bf4b-e74e-45ae-bf4f-56ab4cf3056d
What Happened
- Mar 24 16:41 UTC — All 4 agent heartbeats stopped simultaneously. The container died (likely eviction or crash).
- Mar 24 16:43 UTC — Reconciler detected working agents with no hooks (invariant [7]). Started dispatching agents.
- Every ~5 minutes for the next 70+ hours — The reconciler loop:
  1. `dispatch_agent` (3-4 per tick) → container start fails → `agent.dispatch_failed`
  2. Reconciler detects the failed agent → `stop_agent` → `create_triage_request`
  3. Triage resets beads → `transition_bead` + `clear_bead_assignee` → bead eligible again
  4. Next tick → same dispatch → same failure → infinite loop
- Mar 25 03:37 UTC — Spiral widened from 1 bead to 5 concurrent beads
- Mar 25 16:41 UTC — Refinery beads started cycling too (22+ MR beads, 10-47 failures each)
- Mar 27 15:06 UTC (now) — Still running. 61 seconds per reconciler tick. 3,612 total dispatch failures.
Diagnostic Evidence
| Metric | Value |
|---|---|
| Total dispatch failures (72h) | 3,612 |
| Polecat failures | 3,121 (86%) |
| Refinery failures | 491 (14%) |
| Top bead failure count | 635 (bead 43ad0152, 59 hours straight) |
| Reconciler ticks (72h) | 24,641 |
| Invariant violations | 1,112 |
| Reconciler wall clock (latest) | 61,182 ms per tick (healthy: <100ms) |
| Error message in dispatch_failed events | empty string — not logged |
| Duration of spiral | 70+ hours and counting |
Root Causes
1. No dispatch attempt cap per bead
The reconciler's reconcileBeads Rule 1 keeps emitting dispatch_agent for open/unassigned beads regardless of how many times dispatch has failed for that bead. A single bead was dispatched 635 times over 59 hours.
2. No backoff on dispatch failures
When applyAction('dispatch_agent') fails (container won't start), the agent is set back to idle and the bead stays in_progress. On the next tick (~5s), the reconciler tries again. There is no exponential backoff, no cooldown, no delay.
3. Triage creates a feedback loop
Each dispatch failure triggers create_triage_request (2 per tick). Triage resets beads and clears assignees, making them eligible for re-dispatch. This creates a closed loop: dispatch → fail → triage → reset → dispatch.
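The closed loop can be seen in a toy simulation (illustrative only, not the real reconciler; the `Bead` shape and `tick` function are hypothetical). With dispatch always failing and triage always resetting the bead, attempts grow without bound, matching the 635-dispatch bead observed above:

```typescript
// Minimal model of the dispatch → fail → triage → reset loop.
// Assumption: dispatch always fails and triage always resets the bead.
interface Bead {
  status: 'open' | 'in_progress';
  dispatchAttempts: number;
}

function tick(bead: Bead): Bead {
  // Rule 1: open/unassigned bead → dispatch (attempt count goes up)
  const dispatched: Bead = {
    ...bead,
    status: 'in_progress',
    dispatchAttempts: bead.dispatchAttempts + 1,
  };
  // Container start fails → triage resets the bead → eligible again
  return { ...dispatched, status: 'open' };
}

let bead: Bead = { status: 'open', dispatchAttempts: 0 };
for (let i = 0; i < 635; i++) bead = tick(bead);
// The bead has now been dispatched 635 times and is still eligible for more.
```

Nothing in the loop body ever inspects `dispatchAttempts`, which is exactly the missing cap from Root Cause 1.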
4. No spending cap enforcement
The reconciler has no awareness of how much money has been spent. It will happily dispatch agents indefinitely, even as the cost climbs past $100. (#1402 tracks the billing integration but it's not implemented yet.)
5. Dispatch failure errors not logged
All 3,612 agent.dispatch_failed AE events have an empty error field. The system doesn't record WHY the dispatch failed, making it impossible to diagnose from telemetry alone.
Fix
Fix 1 (Critical): Per-bead dispatch attempt cap with exponential backoff
After N consecutive dispatch failures for a bead (e.g., 5), stop retrying and fail the bead with a clear error message:
```typescript
// In reconcileBeads Rule 1 / Rule 2:
if (agent.dispatch_attempts >= MAX_DISPATCH_ATTEMPTS) {
  actions.push({
    type: 'transition_bead',
    beadId: bead.bead_id,
    to: 'failed',
    reason: `Max dispatch attempts (${MAX_DISPATCH_ATTEMPTS}) exceeded`,
  });
  actions.push({ type: 'unhook_agent', agent_id: agent.id });
  return actions;
}
```

Between retries, apply exponential backoff: 30s → 1m → 2m → 5m → 10m. Track via `last_dispatch_at` + `dispatch_attempts` on agent metadata.
Fix 2 (Critical): Town-level dispatch circuit breaker
If total dispatch failures across ALL beads in a town exceed a threshold within a time window (e.g., 20 failures in 30 minutes), pause all dispatches and create an escalation:
```typescript
if (recentDispatchFailures(sql, 30 * 60_000) > 20) {
  // Circuit breaker open — stop dispatching
  actions.push({
    type: 'create_escalation',
    message: 'Dispatch circuit breaker: 20+ failures in 30 min. All dispatches paused. Check container health.',
    severity: 'critical',
  });
  return []; // No dispatch actions
}
```

Fix 3 (Critical): Log the actual dispatch error
In applyAction('dispatch_agent'), when the container start fails, log the error to both the AE event and the console:
```typescript
return async () => {
  try {
    const started = await ctx.dispatchAgent(agentId, beadId, rigId);
    if (!started) {
      logDispatchFailure(agentId, beadId, 'container start returned false');
    }
  } catch (err) {
    // `err` is `unknown` in TS catch clauses; narrow before reading .message
    logDispatchFailure(agentId, beadId, err instanceof Error ? err.message : String(err));
  }
};
```

Fix 4 (High): Triage should not reset beads that have exceeded dispatch attempts
The triage agent / create_triage_request action should NOT clear the assignee or reset the bead if dispatch_attempts >= MAX_DISPATCH_ATTEMPTS. The bead should stay failed, not re-enter the dispatch pool.
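A minimal sketch of that guard, assuming the triage path can see the bead's attempt count (`shouldResetBead` and `TriageCandidate` are hypothetical names):

```typescript
// Hypothetical guard in the triage path: skip the reset when the bead has
// already exhausted its dispatch attempts.
const MAX_DISPATCH_ATTEMPTS = 5;

interface TriageCandidate {
  beadId: string;
  dispatchAttempts: number;
}

function shouldResetBead(c: TriageCandidate): boolean {
  // Beads past the cap stay failed instead of re-entering the dispatch pool.
  return c.dispatchAttempts < MAX_DISPATCH_ATTEMPTS;
}
```

Triage would only emit `transition_bead` + `clear_bead_assignee` when this returns true, breaking the reset half of the feedback loop.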
Fix 5 (Medium): Spending cap (tracked in #1402)
The reconciler should check the town's spending against a configurable cap before emitting dispatch_agent actions. When the cap is reached, stop all non-mayor dispatches and notify the user.
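A sketch of the pre-dispatch check (the `TownBudget` shape and `allowDispatch` are assumptions; the real billing data source is what #1402 tracks):

```typescript
// Hypothetical spending-cap gate evaluated before emitting dispatch_agent.
interface TownBudget {
  spentUsd: number; // spend to date, from billing integration
  capUsd: number;   // configurable per-town cap
}

function allowDispatch(budget: TownBudget, isMayor: boolean): boolean {
  // Mayor dispatches stay allowed so the town can still notify the user.
  if (isMayor) return true;
  return budget.spentUsd < budget.capUsd;
}
```

With this gate in place, the 70-hour spiral would have halted as soon as spend crossed the cap, instead of climbing past $100.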
Immediate Actions Needed
- Pause this town's alarm or manually fail all open beads to stop the bleeding
- Investigate why the container won't start for this town — the TownContainerDO may be in a bad state
- Credit the customer for the wasted $100+ in spend
Files
- `src/dos/town/reconciler.ts` — `reconcileBeads` Rule 1/2 (no dispatch cap), `reconcileReviewQueue` Rule 5/6 (no dispatch cap)
- `src/dos/town/actions.ts` — `dispatch_agent` handler (no error logging, no backoff)
- `src/dos/town/agents.ts` — `dispatch_attempts` field exists but is not checked against a max
References
- feat(gastown): Billing integration — usage metering, limits, and cost visibility #1402 — Billing integration (spending cap)
- fix(gastown): Town containers never go idle — mayor holds alarm at 5s, constant health checks reset sleep timer #1450 — Container idle fix (related container lifecycle issues)
- Persist agent conversation across container restarts via AgentDO event reconstruction #1236 — Session persistence across container restarts (graceful eviction)
- fix(gastown): PR-strategy MR beads stuck in_progress when github_token is missing — agents re-dispatched for completed work #1632 — PR polling stuck when token missing (similar "silent failure → infinite loop" pattern)