Description
Bug
A customer's town spent $100+ in credits over 70 hours doing nothing productive. The container died and the reconciler entered an infinite dispatch-fail-triage-reset loop with no circuit breaker, no backoff, and no spending cap.
Town ID: be20bf4b-e74e-45ae-bf4f-56ab4cf3056d
What Happened
- Mar 24 16:41 UTC — All 4 agent heartbeats stopped simultaneously. The container died (likely eviction or crash).
- Mar 24 16:43 UTC — Reconciler detected working agents with no hooks (invariant [7]). Started dispatching agents.
- Every ~5 minutes for the next 70+ hours — The reconciler loop:
  1. `dispatch_agent` (3-4 per tick) → container start fails → `agent.dispatch_failed`
  2. Reconciler detects the failed agent → `stop_agent` → `create_triage_request`
  3. Triage resets beads → `transition_bead` + `clear_bead_assignee` → bead eligible again
  4. Next tick → same dispatch → same failure → infinite loop
- Mar 25 03:37 UTC — Spiral widened from 1 bead to 5 concurrent beads
- Mar 25 16:41 UTC — Refinery beads started cycling too (22+ MR beads, 10-47 failures each)
- Mar 27 15:06 UTC (now) — Still running. 61 seconds per reconciler tick. 3,612 total dispatch failures.
Diagnostic Evidence
| Metric | Value |
|---|---|
| Total dispatch failures (72h) | 3,612 |
| Polecat failures | 3,121 (86%) |
| Refinery failures | 491 (14%) |
| Top bead failure count | 635 (bead 43ad0152, 59 hours straight) |
| Reconciler ticks (72h) | 24,641 |
| Invariant violations | 1,112 |
| Reconciler wall clock (latest) | 61,182 ms per tick (healthy: <100ms) |
| Error message in dispatch_failed events | empty string — not logged |
| Duration of spiral | 70+ hours and counting |
Root Causes
1. No dispatch attempt cap per bead
The reconciler's reconcileBeads Rule 1 keeps emitting dispatch_agent for open/unassigned beads regardless of how many times dispatch has failed for that bead. A single bead was dispatched 635 times over 59 hours.
2. No backoff on dispatch failures
When applyAction('dispatch_agent') fails (container won't start), the agent is set back to idle and the bead stays in_progress. On the next tick (~5s), the reconciler tries again. There is no exponential backoff, no cooldown, no delay.
3. Triage creates a feedback loop
Each dispatch failure triggers create_triage_request (2 per tick). Triage resets beads and clears assignees, making them eligible for re-dispatch. This creates a closed loop: dispatch → fail → triage → reset → dispatch.
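The closed loop can be seen in a toy simulation (illustrative only, not the real reconciler; the `Bead` shape and `tick` function are hypothetical). With dispatch always failing and triage always resetting the bead, attempts grow without bound, matching the 635-dispatch bead observed above:

```typescript
// Minimal model of the dispatch → fail → triage → reset loop.
// Assumption: dispatch always fails and triage always resets the bead.
interface Bead {
  status: 'open' | 'in_progress';
  dispatchAttempts: number;
}

function tick(bead: Bead): Bead {
  // Rule 1: open/unassigned bead → dispatch (attempt count goes up)
  const dispatched: Bead = {
    ...bead,
    status: 'in_progress',
    dispatchAttempts: bead.dispatchAttempts + 1,
  };
  // Container start fails → triage resets the bead → eligible again
  return { ...dispatched, status: 'open' };
}

let bead: Bead = { status: 'open', dispatchAttempts: 0 };
for (let i = 0; i < 635; i++) bead = tick(bead);
// The bead has now been dispatched 635 times and is still eligible for more.
```

Nothing in the loop body ever inspects `dispatchAttempts`, which is exactly the missing cap from Root Cause 1.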
4. No spending cap enforcement
The reconciler has no awareness of how much money has been spent. It will happily dispatch agents indefinitely, even as the cost climbs past $100. (#1402 tracks the billing integration but it's not implemented yet.)
5. Dispatch failure errors not logged
All 3,612 agent.dispatch_failed AE events have an empty error field. The system doesn't record WHY the dispatch failed, making it impossible to diagnose from telemetry alone.
Fix
Fix 1 (Critical): Per-bead dispatch attempt cap with exponential backoff
After N consecutive dispatch failures for a bead (e.g., 5), stop retrying and fail the bead with a clear error message:
```typescript
// In reconcileBeads Rule 1 / Rule 2:
if (agent.dispatch_attempts >= MAX_DISPATCH_ATTEMPTS) {
  actions.push({
    type: 'transition_bead',
    beadId: bead.bead_id,
    to: 'failed',
    reason: `Max dispatch attempts (${MAX_DISPATCH_ATTEMPTS}) exceeded`,
  });
  actions.push({ type: 'unhook_agent', agent_id: agent.id });
  return actions;
}
```

Between retries, apply exponential backoff: 30s → 1m → 2m → 5m → 10m. Track via `last_dispatch_at` + `dispatch_attempts` on agent metadata.
Fix 2 (Critical): Town-level dispatch circuit breaker
If total dispatch failures across ALL beads in a town exceed a threshold within a time window (e.g., 20 failures in 30 minutes), pause all dispatches and create an escalation:
```typescript
if (recentDispatchFailures(sql, 30 * 60_000) > 20) {
  // Circuit breaker open — stop dispatching
  actions.push({
    type: 'create_escalation',
    message: 'Dispatch circuit breaker: 20+ failures in 30 min. All dispatches paused. Check container health.',
    severity: 'critical',
  });
  return []; // No dispatch actions
}
```

Fix 3 (Critical): Log the actual dispatch error
In applyAction('dispatch_agent'), when the container start fails, log the error to both the AE event and the console:
```typescript
return async () => {
  try {
    const started = await ctx.dispatchAgent(agentId, beadId, rigId);
    if (!started) {
      logDispatchFailure(agentId, beadId, 'container start returned false');
    }
  } catch (err) {
    // `err` is `unknown` in TS catch clauses; narrow before reading .message
    logDispatchFailure(agentId, beadId, err instanceof Error ? err.message : String(err));
  }
};
```

Fix 4 (High): Triage should not reset beads that have exceeded dispatch attempts
The triage agent / create_triage_request action should NOT clear the assignee or reset the bead if dispatch_attempts >= MAX_DISPATCH_ATTEMPTS. The bead should stay failed, not re-enter the dispatch pool.
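A minimal sketch of that guard, assuming the triage path can see the bead's attempt count (`shouldResetBead` and `TriageCandidate` are hypothetical names):

```typescript
// Hypothetical guard in the triage path: skip the reset when the bead has
// already exhausted its dispatch attempts.
const MAX_DISPATCH_ATTEMPTS = 5;

interface TriageCandidate {
  beadId: string;
  dispatchAttempts: number;
}

function shouldResetBead(c: TriageCandidate): boolean {
  // Beads past the cap stay failed instead of re-entering the dispatch pool.
  return c.dispatchAttempts < MAX_DISPATCH_ATTEMPTS;
}
```

Triage would only emit `transition_bead` + `clear_bead_assignee` when this returns true, breaking the reset half of the feedback loop.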
Fix 5 (Medium): Spending cap (tracked in #1402)
The reconciler should check the town's spending against a configurable cap before emitting dispatch_agent actions. When the cap is reached, stop all non-mayor dispatches and notify the user.
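A sketch of the pre-dispatch check (the `TownBudget` shape and `allowDispatch` are assumptions; the real billing data source is what #1402 tracks):

```typescript
// Hypothetical spending-cap gate evaluated before emitting dispatch_agent.
interface TownBudget {
  spentUsd: number; // spend to date, from billing integration
  capUsd: number;   // configurable per-town cap
}

function allowDispatch(budget: TownBudget, isMayor: boolean): boolean {
  // Mayor dispatches stay allowed so the town can still notify the user.
  if (isMayor) return true;
  return budget.spentUsd < budget.capUsd;
}
```

With this gate in place, the 70-hour spiral would have halted as soon as spend crossed the cap, instead of climbing past $100.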
Immediate Actions Needed
- Pause this town's alarm or manually fail all open beads to stop the bleeding
- Investigate why the container won't start for this town — the TownContainerDO may be in a bad state
- Credit the customer for the wasted $100+ in spend
Files
- `src/dos/town/reconciler.ts` — `reconcileBeads` Rule 1/2 (no dispatch cap), `reconcileReviewQueue` Rule 5/6 (no dispatch cap)
- `src/dos/town/actions.ts` — `dispatch_agent` handler (no error logging, no backoff)
- `src/dos/town/agents.ts` — `dispatch_attempts` field exists but is not checked against a max
References
- feat(gastown): Billing integration — usage metering, limits, and cost visibility #1402 — Billing integration (spending cap)
- fix(gastown): Town containers never go idle — mayor holds alarm at 5s, constant health checks reset sleep timer #1450 — Container idle fix (related container lifecycle issues)
- Persist agent conversation across container restarts via AgentDO event reconstruction #1236 — Session persistence across container restarts (graceful eviction)
- fix(gastown): PR-strategy MR beads stuck in_progress when github_token is missing — agents re-dispatched for completed work #1632 — PR polling stuck when token missing (similar "silent failure → infinite loop" pattern)