fix(gastown): No circuit breaker on dispatch failures — dead container causes 70h runaway loop (+ spend) #1653

@jrf0110

Description

Bug

A customer's town spent $100+ in credits over 70 hours doing nothing productive. The container died and the reconciler entered an infinite dispatch-fail-triage-reset loop with no circuit breaker, no backoff, and no spending cap.

Town ID: be20bf4b-e74e-45ae-bf4f-56ab4cf3056d

What Happened

  1. Mar 24 16:41 UTC — All 4 agent heartbeats stopped simultaneously. The container died (likely eviction or crash).
  2. Mar 24 16:43 UTC — Reconciler detected working agents with no hooks (invariant [7]). Started dispatching agents.
  3. Every ~5 minutes for the next 70+ hours — The reconciler loop:
    • `dispatch_agent` (3-4 per tick) → container start fails → `agent.dispatch_failed`
    • Reconciler detects the failed agent → `stop_agent` → `create_triage_request`
    • Triage resets beads → `transition_bead` + `clear_bead_assignee` → bead eligible again
    • Next tick → same dispatch → same failure → infinite loop
  4. Mar 25 03:37 UTC — Spiral widened from 1 bead to 5 concurrent beads
  5. Mar 25 16:41 UTC — Refinery beads started cycling too (22+ MR beads, 10-47 failures each)
  6. Mar 27 15:06 UTC (now) — Still running. 61 seconds per reconciler tick. 3,612 total dispatch failures.

Diagnostic Evidence

| Metric | Value |
| --- | --- |
| Total dispatch failures (72h) | 3,612 |
| Polecat failures | 3,121 (86%) |
| Refinery failures | 491 (14%) |
| Top bead failure count | 635 (bead 43ad0152, 59 hours straight) |
| Reconciler ticks (72h) | 24,641 |
| Invariant violations | 1,112 |
| Reconciler wall clock (latest) | 61,182 ms per tick (healthy: <100 ms) |
| Error message in `dispatch_failed` events | empty string (not logged) |
| Duration of spiral | 70+ hours and counting |

Root Causes

1. No dispatch attempt cap per bead

The reconciler's reconcileBeads Rule 1 keeps emitting dispatch_agent for open/unassigned beads regardless of how many times dispatch has failed for that bead. A single bead was dispatched 635 times over 59 hours.

2. No backoff on dispatch failures

When applyAction('dispatch_agent') fails (container won't start), the agent is set back to idle and the bead stays in_progress. On the next tick (~5s), the reconciler tries again. There is no exponential backoff, no cooldown, no delay.

3. Triage creates a feedback loop

Each dispatch failure triggers create_triage_request (2 per tick). Triage resets beads and clears assignees, making them eligible for re-dispatch. This creates a closed loop: dispatch → fail → triage → reset → dispatch.

4. No spending cap enforcement

The reconciler has no awareness of how much money has been spent. It will happily dispatch agents indefinitely, even as the cost climbs past $100. (#1402 tracks the billing integration but it's not implemented yet.)

5. Dispatch failure errors not logged

All 3,612 agent.dispatch_failed AE events have an empty error field. The system doesn't record WHY the dispatch failed, making it impossible to diagnose from telemetry alone.

Fix

Fix 1 (Critical): Per-bead dispatch attempt cap with exponential backoff

After N consecutive dispatch failures for a bead (e.g., 5), stop retrying and fail the bead with a clear error message:

```ts
// In reconcileBeads Rule 1 / Rule 2:
if (agent.dispatch_attempts >= MAX_DISPATCH_ATTEMPTS) {
  actions.push({
    type: 'transition_bead',
    beadId: bead.bead_id,
    to: 'failed',
    reason: `Max dispatch attempts (${MAX_DISPATCH_ATTEMPTS}) exceeded`,
  });
  actions.push({ type: 'unhook_agent', agent_id: agent.id });
  return actions;
}
```

Between retries, apply exponential backoff: 30s → 1m → 2m → 5m → 10m. Track via last_dispatch_at + dispatch_attempts on agent metadata.
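
As a sketch, that gate can be computed from `dispatch_attempts` and `last_dispatch_at` alone. The `BACKOFF_MS` table and helper names below are illustrative, not existing code:

```ts
// Backoff schedule from this issue: 30s → 1m → 2m → 5m → 10m (capped at 10m).
const BACKOFF_MS = [30_000, 60_000, 120_000, 300_000, 600_000];

// Delay required before the next dispatch, given consecutive failures so far.
function backoffFor(dispatchAttempts: number): number {
  const i = Math.min(dispatchAttempts, BACKOFF_MS.length) - 1;
  return i < 0 ? 0 : BACKOFF_MS[i];
}

// Gate checked on each reconciler tick: skip the bead until the delay elapses.
function canDispatch(lastDispatchAt: number, dispatchAttempts: number, now: number): boolean {
  return now - lastDispatchAt >= backoffFor(dispatchAttempts);
}
```

Clamping to the last entry means attempts beyond the table keep retrying every 10 minutes rather than growing unbounded; Fix 1's cap stops them entirely.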

Fix 2 (Critical): Town-level dispatch circuit breaker

If total dispatch failures across ALL beads in a town exceed a threshold within a time window (e.g., 20 failures in 30 minutes), pause all dispatches and create an escalation:

```ts
if (recentDispatchFailures(sql, 30 * 60_000) > 20) {
  // Circuit breaker open — stop dispatching
  actions.push({
    type: 'create_escalation',
    message: 'Dispatch circuit breaker: 20+ failures in 30 min. All dispatches paused. Check container health.',
    severity: 'critical',
  });
  return []; // No dispatch actions
}
```

Fix 3 (Critical): Log the actual dispatch error

In applyAction('dispatch_agent'), when the container start fails, log the error to both the AE event and the console:

```ts
return async () => {
  try {
    const started = await ctx.dispatchAgent(agentId, beadId, rigId);
    if (!started) {
      logDispatchFailure(agentId, beadId, 'container start returned false');
    }
  } catch (err) {
    // `err` is `unknown` in a TS catch clause; narrow before reading .message
    logDispatchFailure(agentId, beadId, err instanceof Error ? err.message : String(err));
  }
};
```

Fix 4 (High): Triage should not reset beads that have exceeded dispatch attempts

The triage agent / create_triage_request action should NOT clear the assignee or reset the bead if dispatch_attempts >= MAX_DISPATCH_ATTEMPTS. The bead should stay failed, not re-enter the dispatch pool.
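
A minimal sketch of that guard, assuming `dispatch_attempts` is readable on the bead row (the `BeadLike` shape and the cap value of 5 are assumptions carried over from Fix 1):

```ts
const MAX_DISPATCH_ATTEMPTS = 5; // same cap as Fix 1

// Hypothetical shape; the real bead row has more fields.
interface BeadLike {
  bead_id: string;
  dispatch_attempts: number;
}

// Triage may reset a bead only while it is still under the attempt cap;
// past the cap it stays failed instead of re-entering the dispatch pool.
function shouldTriageReset(bead: BeadLike): boolean {
  return bead.dispatch_attempts < MAX_DISPATCH_ATTEMPTS;
}
```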

Fix 5 (Medium): Spending cap (tracked in #1402)

The reconciler should check the town's spending against a configurable cap before emitting dispatch_agent actions. When the cap is reached, stop all non-mayor dispatches and notify the user.
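
One way to sketch the gate, assuming the reconciler can read the town's spend once the #1402 integration lands (the `Action` shape, the `agent_role` field, and the helper name are illustrative, not existing code):

```ts
interface Action {
  type: string;
  agent_role?: string; // hypothetical field used here to exempt the mayor
}

// Once spend reaches the cap, drop every non-mayor dispatch_agent action
// and leave all other reconciler actions untouched.
function enforceSpendingCap(actions: Action[], spendUsd: number, capUsd: number): Action[] {
  if (spendUsd < capUsd) return actions;
  return actions.filter(
    (a) => a.type !== 'dispatch_agent' || a.agent_role === 'mayor',
  );
}
```

Filtering only `dispatch_agent` keeps bead transitions, escalations, and cleanup actions flowing, so the town can still wind down cleanly after the cap is hit.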

Immediate Actions Needed

  1. Pause this town's alarm or manually fail all open beads to stop the bleeding
  2. Investigate why the container won't start for this town — the TownContainerDO may be in a bad state
  3. Credit the customer for the wasted $100+ in spend

Files

  • `src/dos/town/reconciler.ts` — `reconcileBeads` Rule 1/2 (no dispatch cap), `reconcileReviewQueue` Rule 5/6 (no dispatch cap)
  • `src/dos/town/actions.ts` — `dispatch_agent` handler (no error logging, no backoff)
  • `src/dos/town/agents.ts` — `dispatch_attempts` field exists but is not checked against a max

Labels

`P0` (blocks soft launch), `bug`, `gt:billing` (usage tracking, cost attribution, limits, metering), `gt:container` (container management, agent processes, SDK, heartbeat), `gt:core` (reconciler, state machine, bead lifecycle, convoy flow), `kilo-auto-fix`, `kilo-triaged`
