+When a `chat.agent` run dies mid-stream (the user cancels, the worker OOMs, an unhandled exception kills the process), the next continuation run now reconstructs the conversation context automatically. Follow-ups like "keep going" continue the partial response; fresh follow-ups like "scrap that, what's 7+8?" abandon it and answer the new question. No customer code required.
+
+Under the hood: the boot now reads BOTH stream tails — `session.out` for any partial assistant the dead run was streaming, `session.in` for any user messages it never acknowledged — and splices `[firstInFlightUser, partialAssistant]` onto the chain when both are present. The model sees full prior context plus the latest user message.
+
+For policies different from "preserve context" — drop the partial entirely, synthesize tool results for an interrupted tool call, emit a recovery banner to the UI — register the new `onRecoveryBoot` hook:
+streamText({ model, messages, abortSignal: signal }),
+});
+```
+
+The hook receives `settledMessages`, `inFlightUsers`, `partialAssistant`, `pendingToolCalls`, `previousRunId`, `cause`, and a lazy `writer`. Return any of `chain`, `recoveredTurns`, or `beforeBoot` to override the default. Agents using `hydrateMessages` skip the hook — customer-owned persistence is the source of truth.
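To make the override path concrete, here is a minimal sketch of a "drop the partial" policy. The types are local stand-ins that only mirror the field names listed above; the real `TUIMessage` type, the exact shape of the `chain` return value, and how the handler is registered on `chat.agent` are assumptions, not something this changeset confirms. Whether in-flight user messages are still dispatched as fresh turns when `chain` omits them is also not specified here.

```typescript
// Hypothetical stand-ins for the SDK types; only the field names come from the changeset.
interface Msg {
  id: string;
  role: "user" | "assistant";
  text: string;
}

interface RecoveryBootEvent {
  settledMessages: Msg[];
  inFlightUsers: Msg[];
  partialAssistant: Msg | undefined;
  previousRunId: string;
  cause: "cancelled" | "crashed" | "unknown";
}

// Assumption: returning `chain` replaces the chain the smart default would build.
interface RecoveryBootResult {
  chain?: Msg[];
}

// Policy: ignore the partial assistant and boot from settled history only.
function dropPartialOnRecovery(event: RecoveryBootEvent): RecoveryBootResult {
  if (!event.partialAssistant) {
    // Nothing to drop, so return no overrides and let the default apply.
    return {};
  }
  // Copy so later mutations of the event do not leak into the chain.
  return { chain: [...event.settledMessages] };
}
```

Returning an empty object when there is no partial keeps the handler a pure override: it only changes behavior in the one case the policy cares about.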
+
+Also retracts the OOM resilience caveat: model context on retry is no longer "incomplete" without `hydrateMessages`. The smart default reconstructs full context from `session.out` replay.
+
+See [Recovery boot](/ai-chat/patterns/recovery-boot) for the full guide.
docs/ai-chat/how-it-works.mdx: 1 addition & 1 deletion (+1 -1)
@@ -48,7 +48,7 @@ The engine restores the suspended run from its checkpoint. The same JS process p
### Continuation (after exit)

-If the run has fully exited (because it hit `maxTurns`, the customer called `chat.endRun()` or `chat.requestUpgrade()`, or it was cancelled or crashed), the next user message can't resume it — there is nothing to resume. Instead, the server triggers a brand-new run with `continuation: true`. The new run does a cold boot but reads the prior conversation's S3 snapshot and replays any `.out` chunks after the snapshot cursor, so the new run starts with the full message history already accumulated. Then it enters **Streaming** with `turn === 0` of the new run but `messageCount > 0`.
+If the run has fully exited (because it hit `maxTurns`, the customer called `chat.endRun()` or `chat.requestUpgrade()`, or it was cancelled or crashed), the next user message can't resume it — there is nothing to resume. Instead, the server triggers a brand-new run with `continuation: true`. The new run does a cold boot, reads the prior conversation's S3 snapshot, replays any `.out` chunks after the snapshot cursor, AND replays any `.in` records past the last `turn-complete` cursor (the user messages a dead run never acknowledged). If the predecessor died mid-stream and left a partial assistant response in `.out`, the smart default splices `[firstInFlightUser, partialAssistant]` onto the chain so any follow-up has full context — see [Recovery boot](/ai-chat/patterns/recovery-boot). The new run then enters **Streaming** with `turn === 0` of the new run but `messageCount > 0`.
docs/ai-chat/lifecycle-hooks.mdx: 44 additions & 1 deletion (+44 -1)
@@ -14,11 +14,12 @@ description: "Hook into every stage of a chat agent's run: preload, turn start,
**Suspend / resume:** `onChatSuspend` fires when the run transitions from idle to suspended (waiting on the next message); `onChatResume` fires on wake.

-**Three scopes to keep straight:**
+**Four scopes to keep straight:**

| Scope | Fires when | Use for |
| --- | --- | --- |
| **Process** ([`onBoot`](#onboot)) | Every fresh worker boots — initial, preloaded, and reactive continuation (post-cancel/crash/`endRun`/upgrade). | Initialize `chat.local`, open per-process resources, re-hydrate state from your DB on continuation. |
+| **Recovery** ([`onRecoveryBoot`](#onrecoveryboot)) | Continuation boot where the dead run left state behind — a partial assistant on `session.out` or in-flight users on `session.in`. | Override the smart default — drop the partial, synthesize tool results, emit a recovery banner. |
| **Chat** ([`onChatStart`](#onchatstart)) | First message of a chat's lifetime. Does NOT fire on continuation runs or OOM retries. | One-time DB rows for the chat, resources tied to the chat's lifetime. |
`onBoot` and `onChatStart` are complementary — keep DB-row creation in `onChatStart` (it only needs to happen once per chat) and put process-level setup (`chat.local`, connections, caches) in `onBoot` (it needs to happen on every fresh worker).
</Tip>

+## onRecoveryBoot
+
+Fires once on a continuation boot when the dead predecessor left state behind — a partial assistant on `session.out`, in-flight user messages on `session.in`, or both. The runtime reconstructs context automatically via a smart default; this hook is the override path for policies that need something different.
+
+The hook does NOT fire when there's nothing to recover (clean continuation after `chat.endRun()`, fresh chat, OOM retry on top of a complete snapshot). It does NOT fire when [`hydrateMessages`](#hydratemessages) is registered (the customer owns persistence).
+| `runId` | `string` | The Trigger.dev run ID for this run boot |
+| `previousRunId` | `string` | Public id of the prior run that died |
+| `cause` | `"cancelled" \| "crashed" \| "unknown"` | Best-effort cause. Currently always `"unknown"` — don't branch on it |
+| `settledMessages` | `TUIMessage[]` | The chain persisted by the predecessor's last `onTurnComplete` |
+| `inFlightUsers` | `TUIMessage[]` | User messages on `session.in` past the cursor — the message(s) the predecessor never acknowledged |
+| `partialAssistant` | `TUIMessage \| undefined` | The trailing assistant message whose stream never received `finish` |
+| `pendingToolCalls` | `Array<{ toolCallId, toolName, input, partIndex }>` | Tool calls in `input-available` state extracted from `partialAssistant` |
+| `writer` | `ChatWriter` | Lazy `session.out` writer — write a recovery banner / signal here |
+
+Returns `{ chain?, recoveredTurns?, beforeBoot? }` — every field optional. Omitted fields fall through to the smart default. See [Recovery boot](/ai-chat/patterns/recovery-boot) for the full guide, examples (drop partial, synthesize tool results, persist before boot), and interaction notes.
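One of the policies named above, synthesizing tool results for an interrupted tool call, could look roughly like the sketch below. It uses `pendingToolCalls[].partIndex` to locate each pending part in `partialAssistant` and marks it as failed. The part shape (`state: "output-error"` with an `errorText` field) is an assumption modeled on AI SDK-style tool parts, and the local types are stand-ins rather than the SDK's `TUIMessage`.

```typescript
// Hypothetical, simplified message parts; the real TUIMessage parts may differ.
interface ToolPart {
  type: string;
  toolCallId: string;
  state: "input-available" | "output-available" | "output-error";
  input: unknown;
  errorText?: string;
}
interface TextPart {
  type: "text";
  text: string;
}
type Part = ToolPart | TextPart;

interface AssistantMsg {
  id: string;
  role: "assistant";
  parts: Part[];
}

// Field names taken from the `pendingToolCalls` row in the table above.
interface PendingToolCall {
  toolCallId: string;
  toolName: string;
  input: unknown;
  partIndex: number;
}

// Returns a copy of the partial assistant with every pending tool call closed as an error.
function synthesizeToolErrors(
  partial: AssistantMsg,
  pending: PendingToolCall[],
  reason: string,
): AssistantMsg {
  const parts = partial.parts.slice();
  for (const call of pending) {
    const part = parts[call.partIndex];
    // Only rewrite the part if it really is the tool call we expect at that index.
    if (part !== undefined && "toolCallId" in part && part.toolCallId === call.toolCallId) {
      parts[call.partIndex] = { ...part, state: "output-error", errorText: reason };
    }
  }
  return { ...partial, parts };
}
```

Checking `toolCallId` before rewriting guards against a stale `partIndex`; the original message is left untouched so the default chain is still available if the policy decides not to override.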
+
+<Tip>
+Don't put `chat.local` initialization in `onRecoveryBoot` — use [`onBoot`](#onboot). `onRecoveryBoot` is for recovery decisions, not per-process setup. `onBoot` fires first.
+</Tip>
+
## onPreload

Fires when a **preloaded run** starts, before any messages arrive. Use it to eagerly create chat-scoped DB rows (the Chat row, the ChatSession row) while the user is still typing — so the very first message lands fast.
docs/ai-chat/patterns/oom-resilience.mdx: 6 additions & 3 deletions (+6 -3)
@@ -79,9 +79,11 @@ If your agent uses [`hydrateMessages`](/ai-chat/lifecycle-hooks#hydratemessages)
## Without `hydrateMessages`

-The retry filter still prevents duplicate processing — turns 1..N-1 aren't re-run — but the OOM'd turn's accumulator is whatever the chat.agent's default flow can rebuild from `payload.messages` (typically just the first user message of the chat). The model context is **incomplete**: it doesn't see prior assistant responses. The conversation continues but a multi-turn OOM'd recovery may produce a less coherent reply.
+Recovery boot reconstructs context automatically. The boot reads both the durable `session.out` snapshot (settled turns) and the `session.out` tail past the snapshot cursor (the partial assistant chunks the OOM'd turn streamed before dying). When the new attempt processes the OOM'd user message, the model sees the full prior conversation **plus** the partial assistant that was cut off — so a "keep going" follow-up continues naturally, and any other follow-up has the same context the original turn had.

-If conversation continuity matters, use `hydrateMessages`.
+`hydrateMessages` is still the right choice if you want a single source of truth in your own database (branching conversations, message-level access control, etc.). It's no longer required for OOM continuity.
+
+For full control over recovery — drop the partial, synthesize tool results for an interrupted tool call, emit a recovery banner to the UI — register [`onRecoveryBoot`](/ai-chat/patterns/recovery-boot).
+- [Recovery boot](/ai-chat/patterns/recovery-boot) — the underlying hook + smart default that gives OOM recovery its full-context behavior
 - [Lifecycle hooks](/ai-chat/lifecycle-hooks) — `onChatResume` fires on every retry attempt with `phase: "preload"` or `"turn"`
-- [Database persistence](/ai-chat/patterns/database-persistence) — the `hydrateMessages` pattern this builds on for full continuity
+- [Database persistence](/ai-chat/patterns/database-persistence) — the `hydrateMessages` pattern for branching, ACL, and DB-as-source-of-truth scenarios
docs/ai-chat/patterns/persistence-and-replay.mdx: 21 additions & 11 deletions (+21 -11)
@@ -71,13 +71,23 @@ A new run boots when the user sends `u2`. Run 1 has long since exited. Run 2 has
GET the JSON blob. On 404 (no snapshot yet — first-ever turn) or read error or version mismatch, treat as empty and continue. Snapshot misses are non-fatal — replay alone may still be sufficient.
</Step>
<Step title="Replay session.out tail">
-Subscribe to `session.out` with `wait=0` starting from `snapshot.lastOutEventId`. Drain whatever's there and close. In the steady state this returns empty (the snapshot already captured everything from turn 1). In a crash-recovery state — Run 1 emitted chunks but never wrote a snapshot — replay catches them.
+Subscribe to `session.out` with `wait=0` starting from `snapshot.lastOutEventId`. Drain whatever's there and close. Returns:
+- **Settled messages** — closed assistant turns past the snapshot cursor (the chunks of a turn that completed after the snapshot was written but before the run exited cleanly).
+- **A partial assistant** — the trailing message if its stream never received a `finish` chunk. The dead run was mid-response when it died. `cleanupAbortedParts` has already stripped streaming-in-progress fragments.
+
+In the steady state this returns empty. In recovery, it returns whatever the dead run was in the middle of.
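The split between settled messages and a trailing partial can be illustrated with a small sketch. The chunk model here (`start`, `text-delta`, `finish` keyed by `messageId`) is a deliberately simplified, hypothetical stand-in for the real `session.out` records, and the sketch does not reproduce what `cleanupAbortedParts` does.

```typescript
// Hypothetical, simplified chunk and message shapes for illustration only.
interface Chunk {
  messageId: string;
  type: "start" | "text-delta" | "finish";
  delta?: string;
}
interface ReplayedMessage {
  id: string;
  text: string;
  finished: boolean;
}
interface ReplayResult {
  settled: ReplayedMessage[];
  partial: ReplayedMessage | undefined;
}

// Folds drained chunks into messages, then separates closed turns from an unfinished tail.
function splitReplay(chunks: Chunk[]): ReplayResult {
  const order: string[] = [];
  const byId = new Map<string, ReplayedMessage>();
  for (const chunk of chunks) {
    let msg = byId.get(chunk.messageId);
    if (msg === undefined) {
      msg = { id: chunk.messageId, text: "", finished: false };
      byId.set(chunk.messageId, msg);
      order.push(chunk.messageId);
    }
    if (chunk.type === "text-delta") msg.text += chunk.delta ?? "";
    if (chunk.type === "finish") msg.finished = true;
  }
  const messages: ReplayedMessage[] = [];
  for (const id of order) {
    const msg = byId.get(id);
    if (msg !== undefined) messages.push(msg);
  }
  // Assumption: only the trailing message can be unfinished; earlier unfinished ones are ignored.
  const settled = messages.filter((m) => m.finished);
  const last = messages[messages.length - 1];
  const partial = last !== undefined && !last.finished ? last : undefined;
  return { settled, partial };
}
```

With an empty drain, both `settled` and `partial` come back empty, which corresponds to the steady state described above.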
</Step>
-<Step title="Merge by id, replay wins">
-Snapshot messages and replayed messages are merged by `id`. On collision, the replayed copy wins — `session.out` is the freshest representation of any assistant message.
+<Step title="Replay session.in tail">
+GET `session.in` records past the last `turn-complete`'s `session-in-event-id` cursor. Returns the user messages the dead run hadn't acknowledged — typically the message that triggered the cancelled / crashed turn, plus anything the customer typed after.
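The cursor filter in this step amounts to "everything after the last acknowledged event". A minimal sketch, assuming numeric, monotonically increasing event ids (the real ids may be opaque strings) and a hypothetical record shape:

```typescript
// Hypothetical record shape; only the idea of an event-id cursor comes from the step above.
interface InRecord {
  eventId: number;
  text: string;
}

// Returns the user messages the dead run never acknowledged.
function unacknowledgedUsers(
  records: InRecord[],
  lastAckedEventId: number | undefined,
): InRecord[] {
  // No turn-complete was ever written, so the cursor never advanced: everything is in flight.
  if (lastAckedEventId === undefined) return records.slice();
  return records.filter((r) => r.eventId > lastAckedEventId);
}
```

Treating a missing cursor as "nothing acknowledged" matches the crash-on-first-turn case, where no `turn-complete` exists yet.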
</Step>
-<Step title="Clean up partial trailing assistants">
-If the trailing assistant message has no `finish-step` (a turn that crashed mid-stream), `cleanupAbortedParts` truncates the partial parts. If nothing remains, drop the message entirely.
+<Step title="Reconstruct the chain (smart default)">
+Snapshot messages merge with the settled replay (replay wins on `id` collision). Then:
+
+- If there's a partial assistant **and** at least one in-flight user message, splice `[firstInFlightUser, partialAssistant]` onto the end of the chain. The model sees the prior turn's incomplete attempt and can continue, abandon, or pivot based on the next user message.
+- Remaining in-flight users dispatch as fresh turns after the recovered first one.
+- If there's no partial OR no in-flight users, the chain is just the settled chain and any in-flight users dispatch normally.
+
+Customers can override this entirely via [`onRecoveryBoot`](/ai-chat/patterns/recovery-boot).
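The rules in this step can be sketched as two small pure functions: a merge by `id` where the replayed copy wins, followed by the splice. This is an illustration of the described behavior under a simplified, hypothetical message type, not the runtime's actual implementation; `pendingUsers` is a made-up name for the in-flight users left to dispatch as fresh turns.

```typescript
// Hypothetical minimal message shape for illustration.
interface ChatMsg {
  id: string;
  role: "user" | "assistant";
  text: string;
}

// Merge snapshot and settled replay by id; on collision the replayed copy wins.
function mergeById(snapshot: ChatMsg[], replayed: ChatMsg[]): ChatMsg[] {
  const merged = snapshot.slice();
  const indexById = new Map<string, number>();
  merged.forEach((m, i) => indexById.set(m.id, i));
  for (const m of replayed) {
    const at = indexById.get(m.id);
    if (at === undefined) {
      indexById.set(m.id, merged.length);
      merged.push(m);
    } else {
      merged[at] = m;
    }
  }
  return merged;
}

// Apply the smart default: splice [firstInFlightUser, partialAssistant] when both exist.
function reconstructChain(
  settled: ChatMsg[],
  inFlightUsers: ChatMsg[],
  partialAssistant: ChatMsg | undefined,
): { chain: ChatMsg[]; pendingUsers: ChatMsg[] } {
  const [first, ...rest] = inFlightUsers;
  if (partialAssistant !== undefined && first !== undefined) {
    return { chain: [...settled, first, partialAssistant], pendingUsers: rest };
  }
  // No partial or no in-flight users: settled chain only, all in-flight users dispatch normally.
  return { chain: settled.slice(), pendingUsers: inFlightUsers.slice() };
}
```

For example, with settled `[u1, a1]`, in-flight `[u2, u3]`, and a partial `a2`, the chain becomes `[u1, a1, u2, a2]` and `u3` is left to dispatch as a fresh turn.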
</Step>
<Step title="Append the new wire message">
Append `u2` from the wire payload, exactly as on turn 1.
@@ -88,15 +98,15 @@ The model now sees `[u1, a1, u2]` and produces `a2`. After `onTurnComplete`, the
### Crash mid-turn — replay carries the load

-Suppose Run 1's turn 1 streams partial assistant chunks to `session.out` and then OOMs before `onTurnComplete` fires. No snapshot was written. The next attempt boots and:
+Suppose Run 1's turn 1 streams partial assistant chunks to `session.out` and then crashes (OOM, exception, server-side cancel) before `onTurnComplete` fires. No snapshot was written. The next run boots and:

1. Snapshot read returns 404 → empty.
-2. Replay reads from seq 0, picks up the partial assistant chunks emitted before the crash.
-3. `cleanupAbortedParts` strips the in-flight parts (or drops the whole message if nothing usable remains).
-4. The accumulator is empty (or has just the prior user message — depends on what the runtime wrote before the crash).
-5. The new attempt re-runs the OOM'd turn from scratch, on the [larger `oomMachine`](/ai-chat/patterns/oom-resilience).
+2. `session.out` tail replay picks up the partial assistant chunks emitted before the crash. `cleanupAbortedParts` strips streaming-in-progress fragments but keeps the cleaned trailing message as the `partialAssistant`.
+3. `session.in` tail replay finds the user message the dead run was answering (no `turn-complete` was written, so the cursor never advanced past it).
+4. Smart default splices `[firstInFlightUser, partialAssistant]` onto the chain. Any later user messages (including the customer's follow-up) dispatch as fresh turns.
+5. The model sees full prior context and responds in kind — continuing a cut-off essay on "keep going", answering a fresh question on "actually, what's 7+8?", abandoning the prior work on "scrap that, do X instead".

-Replay-only is correct but slower. The snapshot is an optimization layered on top, not a correctness boundary.
+Replay carries the conversation across the crash boundary with zero customer code. For policies different from "preserve context" — drop the partial entirely, synthesize tool results for an interrupted tool call, write a recovery banner to the UI — register [`onRecoveryBoot`](/ai-chat/patterns/recovery-boot).