triggerdotdev
diff --git a/docs/ai-chat/changelog.mdx b/docs/ai-chat/changelog.mdx
Lines changed: 36 additions & 0 deletions
diff --git a/docs/ai-chat/how-it-works.mdx b/docs/ai-chat/how-it-works.mdx
Lines changed: 1 addition & 1 deletion
diff --git a/docs/ai-chat/lifecycle-hooks.mdx b/docs/ai-chat/lifecycle-hooks.mdx
Lines changed: 44 additions & 1 deletion
diff --git a/docs/ai-chat/patterns/oom-resilience.mdx b/docs/ai-chat/patterns/oom-resilience.mdx
Lines changed: 6 additions & 3 deletions
diff --git a/docs/ai-chat/patterns/persistence-and-replay.mdx b/docs/ai-chat/patterns/persistence-and-replay.mdx
Lines changed: 21 additions & 11 deletions
@@ -4,6 +4,42 @@ sidebarTitle: "Changelog"
 description: "Pre-release updates for AI chat agents."
 ---
 
+<Update label="May 19, 2026" tags={["SDK"]}>
+
+## Recovery boot — context-preserving continuation after cancel / crash / OOM
+
+When a `chat.agent` run dies mid-stream (the user cancels, the worker OOMs, an unhandled exception kills the process), the next continuation run now reconstructs the conversation context automatically. Follow-ups like "keep going" continue the partial response; fresh follow-ups like "scrap that, what's 7+8?" abandon it and answer the new question. No customer code required.
+
+Under the hood: the boot now reads BOTH stream tails — `session.out` for any partial assistant the dead run was streaming, `session.in` for any user messages it never acknowledged — and splices `[firstInFlightUser, partialAssistant]` onto the chain when both are present. The model sees full prior context plus the latest user message.
+
+For policies different from "preserve context" — drop the partial entirely, synthesize tool results for an interrupted tool call, emit a recovery banner to the UI — register the new `onRecoveryBoot` hook:
+
+```ts
+import { chat } from "@trigger.dev/sdk/ai";
+import { streamText } from "ai";
+import { openai } from "@ai-sdk/openai";
+
+export const myChat = chat.agent({
+  id: "my-chat",
+  onRecoveryBoot: async ({ partialAssistant, inFlightUsers, writer, cause, previousRunId }) => {
+    writer.write({
+      type: "data-chat-recovery",
+      data: { cause, previousRunId, partialPresent: partialAssistant !== undefined },
+      transient: true,
+    });
+    // return nothing → smart default applies
+  },
+  run: async ({ messages, signal }) =>
+    streamText({ model: openai("gpt-4o"), messages, abortSignal: signal }),
+});
+```
+
+The hook receives `settledMessages`, `inFlightUsers`, `partialAssistant`, `pendingToolCalls`, `previousRunId`, `cause`, and a lazy `writer`. Return any of `chain`, `recoveredTurns`, or `beforeBoot` to override the default. Agents using `hydrateMessages` skip the hook — customer-owned persistence is the source of truth.
+
+This release also retracts the OOM resilience caveat: model context on retry is no longer "incomplete" without `hydrateMessages`. The smart default reconstructs full context from `session.out` replay.
+
+See [Recovery boot](/ai-chat/patterns/recovery-boot) for the full guide.
+
+</Update>
+
 <Update label="May 16, 2026" description="0.0.0-chat-prerelease-20260519091352" tags={["SDK", "Breaking"]}>
 
 ## `session.out` is now bounded — header-form control records + per-turn trim
 
@@ -48,7 +48,7 @@ The engine restores the suspended run from its checkpoint. The same JS process p
 
 ### Continuation (after exit)
 
-If the run has fully exited (because it hit `maxTurns`, the customer called `chat.endRun()` or `chat.requestUpgrade()`, or it was cancelled or crashed), the next user message can't resume it — there is nothing to resume. Instead, the server triggers a brand-new run with `continuation: true`. The new run does a cold boot but reads the prior conversation's S3 snapshot and replays any `.out` chunks after the snapshot cursor, so the new run starts with the full message history already accumulated. Then it enters **Streaming** with `turn === 0` of the new run but `messageCount > 0`.
+If the run has fully exited (because it hit `maxTurns`, the customer called `chat.endRun()` or `chat.requestUpgrade()`, or it was cancelled or crashed), the next user message can't resume it — there is nothing to resume. Instead, the server triggers a brand-new run with `continuation: true`. The new run does a cold boot, reads the prior conversation's S3 snapshot, replays any `.out` chunks after the snapshot cursor, AND replays any `.in` records past the last `turn-complete` cursor (the user messages a dead run never acknowledged). If the predecessor died mid-stream and left a partial assistant response in `.out`, the smart default splices `[firstInFlightUser, partialAssistant]` onto the chain so any follow-up has full context — see [Recovery boot](/ai-chat/patterns/recovery-boot). The new run then enters **Streaming** with `turn === 0` of the new run but `messageCount > 0`.
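
The smart-default splice described above can be sketched as a pure function. This is a minimal illustration under stated assumptions, not the SDK's implementation: `Msg`, `Reconstruction`, and `reconstructChain` are hypothetical names, and the real `TUIMessage` carries more fields than the simplified shape used here.

```typescript
// Hypothetical, simplified message shape (assumption, not the SDK type).
interface Msg {
  id: string;
  role: "user" | "assistant";
  text: string;
}

interface Reconstruction {
  chain: Msg[]; // prior context the model sees
  pendingTurns: Msg[]; // in-flight user messages still to dispatch as fresh turns
}

// Splice [firstInFlightUser, partialAssistant] onto the settled chain only
// when BOTH a partial assistant and at least one in-flight user exist.
function reconstructChain(
  settled: Msg[],
  inFlightUsers: Msg[],
  partialAssistant?: Msg,
): Reconstruction {
  const [firstInFlightUser, ...rest] = inFlightUsers;
  if (partialAssistant !== undefined && firstInFlightUser !== undefined) {
    return {
      chain: [...settled, firstInFlightUser, partialAssistant],
      pendingTurns: rest,
    };
  }
  // No partial, or no in-flight users: the chain is just the settled chain.
  return { chain: [...settled], pendingTurns: [...inFlightUsers] };
}

const settled: Msg[] = [
  { id: "u1", role: "user", text: "Write an essay" },
  { id: "a1", role: "assistant", text: "Here is the essay." },
];
const result = reconstructChain(
  settled,
  [
    { id: "u2", role: "user", text: "Now a longer one" },
    { id: "u3", role: "user", text: "keep going" },
  ],
  { id: "a2", role: "assistant", text: "Once upon a" },
);
console.log(result.chain.map((m) => m.id)); // [ 'u1', 'a1', 'u2', 'a2' ]
console.log(result.pendingTurns.map((m) => m.id)); // [ 'u3' ]
```

In this sketch the follow-up `u3` is not part of the spliced chain; it dispatches as a fresh turn on top of the recovered context, which is what lets "keep going" continue the cut-off response.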
 
 ### Closed
 
 
@@ -14,11 +14,12 @@ description: "Hook into every stage of a chat agent's run: preload, turn start,
 
 **Suspend / resume:** `onChatSuspend` fires when the run transitions from idle to suspended (waiting on the next message); `onChatResume` fires on wake.
 
-**Three scopes to keep straight:**
+**Four scopes to keep straight:**
 
 | Scope | Fires when | Use for |
 | --- | --- | --- |
 | **Process** ([`onBoot`](#onboot)) | Every fresh worker boots — initial, preloaded, and reactive continuation (post-cancel/crash/`endRun`/upgrade). | Initialize `chat.local`, open per-process resources, re-hydrate state from your DB on continuation. |
+| **Recovery** ([`onRecoveryBoot`](#onrecoveryboot)) | Continuation boot where the dead run left state behind — a partial assistant on `session.out` or in-flight users on `session.in`. | Override the smart default — drop the partial, synthesize tool results, emit a recovery banner. |
 | **Chat** ([`onChatStart`](#onchatstart)) | First message of a chat's lifetime. Does NOT fire on continuation runs or OOM retries. | One-time DB rows for the chat, resources tied to the chat's lifetime. |
 | **Turn** ([`onTurnStart`](#onturnstart), [`onTurnComplete`](#onturncomplete), etc.) | Every turn. | Persist messages, post-process responses. |
 
@@ -83,6 +84,48 @@ export const myChat = chat.agent({
   `onBoot` and `onChatStart` are complementary — keep DB-row creation in `onChatStart` (it only needs to happen once per chat) and put process-level setup (`chat.local`, connections, caches) in `onBoot` (it needs to happen on every fresh worker).
 </Tip>
 
+## onRecoveryBoot
+
+Fires once on a continuation boot when the dead predecessor left state behind — a partial assistant on `session.out`, in-flight user messages on `session.in`, or both. The runtime reconstructs context automatically via a smart default; this hook is the override path for policies that need something different.
+
+The hook does NOT fire when there's nothing to recover (clean continuation after `chat.endRun()`, fresh chat, OOM retry on top of a complete snapshot). It does NOT fire when [`hydrateMessages`](#hydratemessages) is registered (the customer owns persistence).
+
+```ts
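+import { chat } from "@trigger.dev/sdk/ai";
+import { streamText } from "ai";
+import { openai } from "@ai-sdk/openai";
+
+export const myChat = chat.agent({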
+  id: "my-chat",
+  onRecoveryBoot: async ({ partialAssistant, inFlightUsers, writer, cause, previousRunId }) => {
+    writer.write({
+      type: "data-chat-recovery",
+      data: { cause, previousRunId, partialPresent: partialAssistant !== undefined },
+      transient: true,
+    });
+    // Return nothing → fall through to the smart default
+    // (splice partial + first user into chain, dispatch the rest).
+  },
+  run: async ({ messages, signal }) =>
+    streamText({ model: openai("gpt-4o"), messages, abortSignal: signal }),
+});
+```
+
+| Field              | Type                                                              | Description                                                                                          |
+| ------------------ | ----------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------- |
+| `ctx`              | `TaskRunContext`                                                  | Full task run context                                                                                |
+| `chatId`           | `string`                                                          | Chat session ID                                                                                      |
+| `runId`            | `string`                                                          | The Trigger.dev run ID of the current run                                                            |
+| `previousRunId`    | `string`                                                          | Public id of the prior run that died                                                                 |
+| `cause`            | `"cancelled" \| "crashed" \| "unknown"`                           | Best-effort cause. Currently always `"unknown"` — don't branch on it                                 |
+| `settledMessages`  | `TUIMessage[]`                                                    | The chain persisted by the predecessor's last `onTurnComplete`                                       |
+| `inFlightUsers`    | `TUIMessage[]`                                                    | User messages on `session.in` past the cursor — the message(s) the predecessor never acknowledged    |
+| `partialAssistant` | `TUIMessage \| undefined`                                         | The trailing assistant message whose stream never received `finish`                                  |
+| `pendingToolCalls` | `Array<{ toolCallId, toolName, input, partIndex }>`               | Tool calls in `input-available` state extracted from `partialAssistant`                              |
+| `writer`           | `ChatWriter`                                                      | Lazy session.out writer — write a recovery banner / signal here                                      |
+
+Returns `{ chain?, recoveredTurns?, beforeBoot? }` — every field optional. Omitted fields fall through to the smart default. See [Recovery boot](/ai-chat/patterns/recovery-boot) for the full guide, examples (drop partial, synthesize tool results, persist before boot), and interaction notes.
+
+<Tip>
+  Don't put `chat.local` initialization in `onRecoveryBoot` — use [`onBoot`](#onboot). `onRecoveryBoot` is for recovery decisions, not per-process setup. `onBoot` fires first.
+</Tip>
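
To illustrate what `pendingToolCalls` holds, here is a rough sketch of how tool calls in the `input-available` state could be pulled out of a partial assistant's parts. It assumes the AI SDK convention of tool parts typed `tool-<name>`; the `Part` types and `extractPendingToolCalls` are hypothetical names for illustration, and the runtime's actual extraction may differ.

```typescript
// Assumed part shapes, loosely modelled on AI SDK UI message parts.
interface TextPart {
  type: "text";
  text: string;
}
interface ToolPart {
  type: `tool-${string}`;
  toolCallId: string;
  state: "input-streaming" | "input-available" | "output-available" | "output-error";
  input?: unknown;
}
type Part = TextPart | ToolPart;

interface PendingToolCall {
  toolCallId: string;
  toolName: string;
  input: unknown;
  partIndex: number;
}

function isToolPart(part: Part): part is ToolPart {
  return part.type.startsWith("tool-");
}

// Collect tool calls whose input was fully received but which never
// produced an output before the run died.
function extractPendingToolCalls(parts: Part[]): PendingToolCall[] {
  const pending: PendingToolCall[] = [];
  parts.forEach((part, partIndex) => {
    if (isToolPart(part) && part.state === "input-available") {
      pending.push({
        toolCallId: part.toolCallId,
        toolName: part.type.slice("tool-".length),
        input: part.input,
        partIndex,
      });
    }
  });
  return pending;
}

const parts: Part[] = [
  { type: "text", text: "Let me check the weather." },
  { type: "tool-getWeather", toolCallId: "c1", state: "input-available", input: { city: "Oslo" } },
  { type: "tool-lookup", toolCallId: "c2", state: "output-available", input: { q: "x" } },
];
console.log(extractPendingToolCalls(parts).map((c) => `${c.toolName}@${c.partIndex}`));
// [ 'getWeather@1' ]
```

A recovery policy that synthesizes tool results would typically walk this list and decide, per call, whether re-executing the tool is safe.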
+
 ## onPreload
 
 Fires when a **preloaded run** starts, before any messages arrive. Use it to eagerly create chat-scoped DB rows (the Chat row, the ChatSession row) while the user is still typing — so the very first message lands fast.
 
@@ -79,9 +79,11 @@ If your agent uses [`hydrateMessages`](/ai-chat/lifecycle-hooks#hydratemessages)
 
 ## Without `hydrateMessages`
 
-The retry filter still prevents duplicate processing — turns 1..N-1 aren't re-run — but the OOM'd turn's accumulator is whatever the chat.agent's default flow can rebuild from `payload.messages` (typically just the first user message of the chat). The model context is **incomplete**: it doesn't see prior assistant responses. The conversation continues but a multi-turn OOM'd recovery may produce a less coherent reply.
+Recovery boot reconstructs context automatically. The boot reads both the durable `session.out` snapshot (settled turns) and the `session.out` tail past the snapshot cursor (the partial assistant chunks the OOM'd turn streamed before dying). When the new attempt processes the OOM'd user message, the model sees the full prior conversation **plus** the partial assistant that was cut off — so a "keep going" follow-up continues naturally, and any other follow-up has the same context the original turn had.
 
-If conversation continuity matters, use `hydrateMessages`.
+`hydrateMessages` is still the right choice if you want a single source of truth in your own database (branching conversations, message-level access control, etc.). It's no longer required for OOM continuity.
+
+For full control over recovery — drop the partial, synthesize tool results for an interrupted tool call, emit a recovery banner to the UI — register [`onRecoveryBoot`](/ai-chat/patterns/recovery-boot).
 
 ## Tool execute idempotency
 
@@ -109,5 +111,6 @@ export const sendEmail = tool({
 
 ## See also
 
+- [Recovery boot](/ai-chat/patterns/recovery-boot) — the underlying hook + smart default that gives OOM recovery its full-context behavior
 - [Lifecycle hooks](/ai-chat/lifecycle-hooks) — `onChatResume` fires on every retry attempt with `phase: "preload"` or `"turn"`
-- [Database persistence](/ai-chat/patterns/database-persistence) — the `hydrateMessages` pattern this builds on for full continuity
+- [Database persistence](/ai-chat/patterns/database-persistence) — the `hydrateMessages` pattern for branching, ACL, and DB-as-source-of-truth scenarios
@@ -71,13 +71,23 @@ A new run boots when the user sends `u2`. Run 1 has long since exited. Run 2 has
     GET the JSON blob. On 404 (no snapshot yet — first-ever turn) or read error or version mismatch, treat as empty and continue. Snapshot misses are non-fatal — replay alone may still be sufficient.
   </Step>
   <Step title="Replay session.out tail">
-    Subscribe to `session.out` with `wait=0` starting from `snapshot.lastOutEventId`. Drain whatever's there and close. In the steady state this returns empty (the snapshot already captured everything from turn 1). In a crash-recovery state — Run 1 emitted chunks but never wrote a snapshot — replay catches them.
+    Subscribe to `session.out` with `wait=0` starting from `snapshot.lastOutEventId`. Drain whatever's there and close. Returns:
+    - **Settled messages** — closed assistant turns past the snapshot cursor (the chunks of a turn that completed after the snapshot was written but before the run exited cleanly).
+    - **A partial assistant** — the trailing message if its stream never received a `finish` chunk. The dead run was mid-response when it died. `cleanupAbortedParts` has already stripped streaming-in-progress fragments.
+
+    In the steady state this returns empty. In recovery, it returns whatever the dead run was in the middle of.
   </Step>
-  <Step title="Merge by id, replay wins">
-    Snapshot messages and replayed messages are merged by `id`. On collision, the replayed copy wins — `session.out` is the freshest representation of any assistant message.
+  <Step title="Replay session.in tail">
+    GET `session.in` records past the last `turn-complete`'s `session-in-event-id` cursor. Returns the user messages the dead run hadn't acknowledged: typically the message that triggered the cancelled / crashed turn, plus anything the user typed after.
   </Step>
-  <Step title="Clean up partial trailing assistants">
-    If the trailing assistant message has no `finish-step` (a turn that crashed mid-stream), `cleanupAbortedParts` truncates the partial parts. If nothing remains, drop the message entirely.
+  <Step title="Reconstruct the chain (smart default)">
+    Snapshot messages merge with the settled replay (replay wins on `id` collision). Then:
+
+    - If there's a partial assistant **and** at least one in-flight user message, splice `[firstInFlightUser, partialAssistant]` onto the end of the chain. The model sees the prior turn's incomplete attempt and can continue, abandon, or pivot based on the next user message.
+    - Remaining in-flight users dispatch as fresh turns after the recovered first one.
+    - If there's no partial OR no in-flight users, the chain is just the settled chain and any in-flight users dispatch normally.
+
+    Customers can override this entirely via [`onRecoveryBoot`](/ai-chat/patterns/recovery-boot).
   </Step>
   <Step title="Append the new wire message">
     Append `u2` from the wire payload, exactly as on turn 1.
@@ -88,15 +98,15 @@ The model now sees `[u1, a1, u2]` and produces `a2`. After `onTurnComplete`, the
 
 ### Crash mid-turn — replay carries the load
 
-Suppose Run 1's turn 1 streams partial assistant chunks to `session.out` and then OOMs before `onTurnComplete` fires. No snapshot was written. The next attempt boots and:
+Suppose Run 1's turn 1 streams partial assistant chunks to `session.out` and then crashes (OOM, exception, server-side cancel) before `onTurnComplete` fires. No snapshot was written. The next run boots and:
 
 1. Snapshot read returns 404 → empty.
-2. Replay reads from seq 0, picks up the partial assistant chunks emitted before the crash.
-3. `cleanupAbortedParts` strips the in-flight parts (or drops the whole message if nothing usable remains).
-4. The accumulator is empty (or has just the prior user message — depends on what the runtime wrote before the crash).
-5. The new attempt re-runs the OOM'd turn from scratch, on the [larger `oomMachine`](/ai-chat/patterns/oom-resilience).
+2. `session.out` tail replay picks up the partial assistant chunks emitted before the crash. `cleanupAbortedParts` strips streaming-in-progress fragments but keeps the cleaned trailing message as the `partialAssistant`.
+3. `session.in` tail replay finds the user message the dead run was answering (no `turn-complete` was written, so the cursor never advanced past it).
+4. Smart default splices `[firstInFlightUser, partialAssistant]` onto the chain. Any later user messages (including the user's follow-up) dispatch as fresh turns.
+5. The model sees full prior context and responds in kind — continuing a cut-off essay on "keep going", answering a fresh question on "actually, what's 7+8?", abandoning the prior work on "scrap that, do X instead".
 
-Replay-only is correct but slower. The snapshot is an optimization layered on top, not a correctness boundary.
+Replay carries the conversation across the crash boundary with zero customer code. For policies different from "preserve context" — drop the partial entirely, synthesize tool results for an interrupted tool call, write a recovery banner to the UI — register [`onRecoveryBoot`](/ai-chat/patterns/recovery-boot).
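
The `session.out` tail replay and the merge with the snapshot can be sketched roughly as follows. The chunk shape is a simplified assumption (a `start` chunk carrying the message id, text deltas, and a closing `finish`), and `splitReplay` / `mergeById` are hypothetical helpers, not SDK functions.

```typescript
// Hypothetical, simplified chunk and message shapes (assumptions).
type Chunk =
  | { type: "start"; messageId: string }
  | { type: "text-delta"; delta: string }
  | { type: "finish" };

interface ReplayedMsg {
  id: string;
  text: string;
}

interface ReplayResult {
  settled: ReplayedMsg[];
  partial?: ReplayedMsg;
}

// Group chunks into messages. A message whose stream never received
// `finish` is returned as the trailing partial rather than as settled.
function splitReplay(chunks: Chunk[]): ReplayResult {
  const settled: ReplayedMsg[] = [];
  let current: ReplayedMsg | undefined;
  for (const chunk of chunks) {
    if (chunk.type === "start") {
      current = { id: chunk.messageId, text: "" };
    } else if (chunk.type === "text-delta") {
      if (current) current.text += chunk.delta;
    } else if (current) {
      settled.push(current); // `finish` closes the message
      current = undefined;
    }
  }
  return current ? { settled, partial: current } : { settled };
}

// Merge snapshot and replayed messages by id; the replayed copy wins.
// A Map keeps the position of the first insertion for each id.
function mergeById(snapshot: ReplayedMsg[], replayed: ReplayedMsg[]): ReplayedMsg[] {
  const byId = new Map<string, ReplayedMsg>();
  for (const m of snapshot) byId.set(m.id, m);
  for (const m of replayed) byId.set(m.id, m);
  return [...byId.values()];
}

const replay = splitReplay([
  { type: "start", messageId: "a1" },
  { type: "text-delta", delta: "Hello" },
  { type: "finish" },
  { type: "start", messageId: "a2" },
  { type: "text-delta", delta: "Once upon" },
]);
console.log(replay.partial); // { id: 'a2', text: 'Once upon' }
console.log(mergeById([{ id: "a1", text: "Hel" }], replay.settled));
// [ { id: 'a1', text: 'Hello' } ]
```

In the crash scenario above, the snapshot is empty, so the merge yields only what replay recovered, and the trailing partial becomes the `partialAssistant` handed to the smart default.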
 
 ## OOM-retry interaction