Commit 537cef2

docs(ai-chat): recovery boot — onRecoveryBoot hook + smart context-preserving default
1 parent 4a623c3 commit 537cef2

10 files changed

Lines changed: 426 additions & 17 deletions

docs/ai-chat/changelog.mdx

Lines changed: 36 additions & 0 deletions
@@ -4,6 +4,42 @@ sidebarTitle: "Changelog"
44
description: "Pre-release updates for AI chat agents."
55
---
66

7+
<Update label="May 19, 2026" tags={["SDK"]}>
8+
9+
## Recovery boot — context-preserving continuation after cancel / crash / OOM
10+
11+
When a `chat.agent` run dies mid-stream (the user cancels, the worker OOMs, an unhandled exception kills the process), the next continuation run now reconstructs the conversation context automatically. Follow-ups like "keep going" continue the partial response; fresh follow-ups like "scrap that, what's 7+8?" abandon it and answer the new question. No customer code required.
12+
13+
Under the hood: the boot now reads BOTH stream tails — `session.out` for any partial assistant the dead run was streaming, `session.in` for any user messages it never acknowledged — and splices `[firstInFlightUser, partialAssistant]` onto the chain when both are present. The model sees full prior context plus the latest user message.
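The splice described above can be sketched as a pure function. This is a minimal illustration under assumptions, not the SDK's implementation: the `Msg` shape and the `reconstructChain` name are hypothetical, and the real runtime works with `TUIMessage` values.

```ts
// Hypothetical message shape; the SDK's real type is TUIMessage.
type Msg = { id: string; role: "user" | "assistant"; text: string };

type Reconstructed = {
  chain: Msg[]; // context the model sees before the next turn
  pendingUsers: Msg[]; // in-flight users still to dispatch as fresh turns
};

// Sketch of the smart default: splice only when BOTH a partial assistant
// and at least one in-flight user message are present.
function reconstructChain(
  settled: Msg[],
  inFlightUsers: Msg[],
  partialAssistant?: Msg,
): Reconstructed {
  const first = inFlightUsers[0];
  if (partialAssistant !== undefined && first !== undefined) {
    return {
      chain: [...settled, first, partialAssistant],
      pendingUsers: inFlightUsers.slice(1),
    };
  }
  // No partial, or no in-flight users: keep the settled chain as-is.
  return { chain: [...settled], pendingUsers: [...inFlightUsers] };
}
```

With an in-flight "write an essay" message, a cut-off assistant reply, and a later "keep going", the chain ends with the essay request and the partial reply, and "keep going" is left to dispatch as a fresh turn.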
14+
15+
For policies different from "preserve context" — drop the partial entirely, synthesize tool results for an interrupted tool call, emit a recovery banner to the UI — register the new `onRecoveryBoot` hook:
16+
17+
```ts
import { chat } from "@trigger.dev/sdk/ai";
import { streamText } from "ai";
import { openai } from "@ai-sdk/openai";

// Any AI SDK model works here; gpt-4o matches the lifecycle-hooks example.
const model = openai("gpt-4o");

export const myChat = chat.agent({
  id: "my-chat",
  onRecoveryBoot: async ({ partialAssistant, inFlightUsers, writer, cause, previousRunId }) => {
    // Emit a transient recovery banner for the UI.
    writer.write({
      type: "data-chat-recovery",
      data: { cause, previousRunId, partialPresent: partialAssistant !== undefined },
      transient: true,
    });
    // return nothing → smart default applies
  },
  run: async ({ messages, signal }) =>
    streamText({ model, messages, abortSignal: signal }),
});
```
34+
35+
The hook receives `settledMessages`, `inFlightUsers`, `partialAssistant`, `pendingToolCalls`, `previousRunId`, `cause`, and a lazy `writer`. Return any of `chain`, `recoveredTurns`, or `beforeBoot` to override the default. Agents using `hydrateMessages` skip the hook — customer-owned persistence is the source of truth.
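As one illustration of an override, a "drop the partial" policy could be sketched as below. The types are simplified stand-ins, and it is an assumption that `chain` is the message array the new run boots with; how in-flight users are handled once `chain` is overridden is not specified here.

```ts
// Hypothetical, simplified shapes; the SDK passes TUIMessage values.
type Msg = { id: string; role: "user" | "assistant"; text: string };

type RecoveryEvent = {
  settledMessages: Msg[];
  inFlightUsers: Msg[];
  partialAssistant?: Msg;
};

// Assumption: `chain` replaces the chain the smart default would build.
type RecoveryResult = { chain?: Msg[] };

// Policy: never show the model the cut-off response.
function dropPartialPolicy(event: RecoveryEvent): RecoveryResult {
  if (event.partialAssistant === undefined) {
    return {}; // nothing to drop, fall through to the smart default
  }
  return { chain: [...event.settledMessages] };
}
```

In an agent this might be wired as `onRecoveryBoot: async (event) => dropPartialPolicy(event)`, assuming the hook accepts a result of this shape.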
36+
37+
This release also retracts the OOM resilience caveat: model context on retry is no longer "incomplete" without `hydrateMessages`. The smart default reconstructs full context from `session.out` replay.
38+
39+
See [Recovery boot](/ai-chat/patterns/recovery-boot) for the full guide.
40+
41+
</Update>
42+
743
<Update label="May 16, 2026" description="0.0.0-chat-prerelease-20260519091352" tags={["SDK", "Breaking"]}>
844

945
## `session.out` is now bounded — header-form control records + per-turn trim

docs/ai-chat/how-it-works.mdx

Lines changed: 1 addition & 1 deletion
@@ -48,7 +48,7 @@ The engine restores the suspended run from its checkpoint. The same JS process p
4848

4949
### Continuation (after exit)
5050

51-
If the run has fully exited (because it hit `maxTurns`, the customer called `chat.endRun()` or `chat.requestUpgrade()`, or it was cancelled or crashed), the next user message can't resume it — there is nothing to resume. Instead, the server triggers a brand-new run with `continuation: true`. The new run does a cold boot but reads the prior conversation's S3 snapshot and replays any `.out` chunks after the snapshot cursor, so the new run starts with the full message history already accumulated. Then it enters **Streaming** with `turn === 0` of the new run but `messageCount > 0`.
51+
If the run has fully exited (because it hit `maxTurns`, the customer called `chat.endRun()` or `chat.requestUpgrade()`, or it was cancelled or crashed), the next user message can't resume it — there is nothing to resume. Instead, the server triggers a brand-new run with `continuation: true`. The new run does a cold boot, reads the prior conversation's S3 snapshot, replays any `.out` chunks after the snapshot cursor, AND replays any `.in` records past the last `turn-complete` cursor (the user messages a dead run never acknowledged). If the predecessor died mid-stream and left a partial assistant response in `.out`, the smart default splices `[firstInFlightUser, partialAssistant]` onto the chain so any follow-up has full context — see [Recovery boot](/ai-chat/patterns/recovery-boot). The new run then enters **Streaming** with `turn === 0` of the new run but `messageCount > 0`.
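The `.in` replay amounts to a cursor filter. The sketch below assumes numeric, monotonically increasing event ids and a hypothetical `InRecord` shape; the actual record format and id type are not shown in this document.

```ts
// Hypothetical record shape for entries on session.in.
type InRecord = {
  eventId: number; // assumed numeric and increasing
  userMessage: { id: string; text: string };
};

// Return the user messages written after the last acknowledged event.
// `lastAckedEventId` stands in for the cursor stored on the last
// `turn-complete`; undefined means no turn ever completed.
function unacknowledgedUsers(
  records: InRecord[],
  lastAckedEventId: number | undefined,
): Array<{ id: string; text: string }> {
  return records
    .filter((r) => lastAckedEventId === undefined || r.eventId > lastAckedEventId)
    .sort((a, b) => a.eventId - b.eventId)
    .map((r) => r.userMessage);
}
```

When no `turn-complete` was ever written, the cursor is absent and every record counts as in-flight, which matches the crash-on-first-turn case.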
5252

5353
### Closed
5454

docs/ai-chat/lifecycle-hooks.mdx

Lines changed: 44 additions & 1 deletion
@@ -14,11 +14,12 @@ description: "Hook into every stage of a chat agent's run: preload, turn start,
1414

1515
**Suspend / resume:** `onChatSuspend` fires when the run transitions from idle to suspended (waiting on the next message); `onChatResume` fires on wake.
1616

17-
**Three scopes to keep straight:**
17+
**Four scopes to keep straight:**
1818

1919
| Scope | Fires when | Use for |
| --- | --- | --- |
| **Process** ([`onBoot`](#onboot)) | Every fresh worker boots — initial, preloaded, and reactive continuation (post-cancel/crash/`endRun`/upgrade). | Initialize `chat.local`, open per-process resources, re-hydrate state from your DB on continuation. |
| **Recovery** ([`onRecoveryBoot`](#onrecoveryboot)) | Continuation boot where the dead run left state behind — a partial assistant on `session.out` or in-flight users on `session.in`. | Override the smart default — drop the partial, synthesize tool results, emit a recovery banner. |
| **Chat** ([`onChatStart`](#onchatstart)) | First message of a chat's lifetime. Does NOT fire on continuation runs or OOM retries. | One-time DB rows for the chat, resources tied to the chat's lifetime. |
| **Turn** ([`onTurnStart`](#onturnstart), [`onTurnComplete`](#onturncomplete), etc.) | Every turn. | Persist messages, post-process responses. |
2425

@@ -83,6 +84,48 @@ export const myChat = chat.agent({
8384
`onBoot` and `onChatStart` are complementary — keep DB-row creation in `onChatStart` (it only needs to happen once per chat) and put process-level setup (`chat.local`, connections, caches) in `onBoot` (it needs to happen on every fresh worker).
8485
</Tip>
8586

87+
## onRecoveryBoot
88+
89+
Fires once on a continuation boot when the dead predecessor left state behind — a partial assistant on `session.out`, in-flight user messages on `session.in`, or both. The runtime reconstructs context automatically via a smart default; this hook is the override path for policies that need something different.
90+
91+
The hook does NOT fire when there's nothing to recover (clean continuation after `chat.endRun()`, fresh chat, OOM retry on top of a complete snapshot). It does NOT fire when [`hydrateMessages`](#hydratemessages) is registered (the customer owns persistence).
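The firing conditions can be summarized as a predicate. This is a sketch of the rules as stated here, with a hypothetical `BootState` shape; "in-flight" means messages the predecessor never acknowledged, and the runtime's actual checks may differ.

```ts
// Hypothetical summary of what the boot knows before deciding.
type BootState = {
  isContinuation: boolean;
  hasHydrateMessages: boolean; // agent registered hydrateMessages
  hasPartialAssistant: boolean; // trailing message never got `finish`
  inFlightUserCount: number; // unacknowledged users on session.in
};

function shouldFireRecoveryHook(state: BootState): boolean {
  if (!state.isContinuation) return false; // fresh chat
  if (state.hasHydrateMessages) return false; // customer owns persistence
  // Fires only when the dead predecessor left something behind.
  return state.hasPartialAssistant || state.inFlightUserCount > 0;
}
```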
92+
93+
```ts
import { chat } from "@trigger.dev/sdk/ai";
import { streamText } from "ai";
import { openai } from "@ai-sdk/openai";

export const myChat = chat.agent({
  id: "my-chat",
  onRecoveryBoot: async ({ partialAssistant, inFlightUsers, writer, cause, previousRunId }) => {
    // Emit a transient recovery banner for the UI.
    writer.write({
      type: "data-chat-recovery",
      data: { cause, previousRunId, partialPresent: partialAssistant !== undefined },
      transient: true,
    });
    // Return nothing → fall through to the smart default
    // (splice partial + first user into chain, dispatch the rest).
  },
  run: async ({ messages, signal }) =>
    streamText({ model: openai("gpt-4o"), messages, abortSignal: signal }),
});
```
109+
110+
| Field | Type | Description |
| --- | --- | --- |
| `ctx` | `TaskRunContext` | Full task run context |
| `chatId` | `string` | Chat session ID |
| `runId` | `string` | The Trigger.dev run ID for this run boot |
| `previousRunId` | `string` | Public id of the prior run that died |
| `cause` | `"cancelled" \| "crashed" \| "unknown"` | Best-effort cause. Currently always `"unknown"` — don't branch on it |
| `settledMessages` | `TUIMessage[]` | The chain persisted by the predecessor's last `onTurnComplete` |
| `inFlightUsers` | `TUIMessage[]` | User messages on `session.in` past the cursor — the message(s) the predecessor never acknowledged |
| `partialAssistant` | `TUIMessage \| undefined` | The trailing assistant message whose stream never received `finish` |
| `pendingToolCalls` | `Array<{ toolCallId, toolName, input, partIndex }>` | Tool calls in `input-available` state extracted from `partialAssistant` |
| `writer` | `ChatWriter` | Lazy `session.out` writer — write a recovery banner / signal here |
122+
123+
Returns `{ chain?, recoveredTurns?, beforeBoot? }` — every field optional. Omitted fields fall through to the smart default. See [Recovery boot](/ai-chat/patterns/recovery-boot) for the full guide, examples (drop partial, synthesize tool results, persist before boot), and interaction notes.
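One way to "synthesize tool results" is to rewrite each pending tool part into an errored state so the model does not see a dangling call. The sketch below uses simplified local types; the `output-error` state and `errorText` field follow the AI SDK's UI tool-part convention as an assumption, and treating `partIndex` as an index into `partialAssistant.parts` is also assumed.

```ts
// Simplified stand-ins for the SDK's message and part types.
type Part = { type: string; [key: string]: unknown };
type AssistantMsg = { id: string; role: "assistant"; parts: Part[] };
type PendingToolCall = {
  toolCallId: string;
  toolName: string;
  input: unknown;
  partIndex: number; // assumed to index into AssistantMsg.parts
};

// Return a copy of the partial with every pending tool call marked failed.
function synthesizeToolErrors(
  partial: AssistantMsg,
  pending: PendingToolCall[],
  errorText = "Interrupted before the tool finished",
): AssistantMsg {
  const parts = [...partial.parts];
  for (const call of pending) {
    const original = parts[call.partIndex];
    if (original === undefined) continue; // stale index, skip
    parts[call.partIndex] = { ...original, state: "output-error", errorText };
  }
  return { ...partial, parts };
}
```

The result could then be placed in the returned `chain` in place of the original partial, assuming `chain` accepts the rewritten message.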
124+
125+
<Tip>
126+
Don't put `chat.local` initialization in `onRecoveryBoot` — use [`onBoot`](#onboot). `onRecoveryBoot` is for recovery decisions, not per-process setup. `onBoot` fires first.
127+
</Tip>
128+
86129
## onPreload
87130

88131
Fires when a **preloaded run** starts, before any messages arrive. Use it to eagerly create chat-scoped DB rows (the Chat row, the ChatSession row) while the user is still typing — so the very first message lands fast.

docs/ai-chat/patterns/oom-resilience.mdx

Lines changed: 6 additions & 3 deletions
@@ -79,9 +79,11 @@ If your agent uses [`hydrateMessages`](/ai-chat/lifecycle-hooks#hydratemessages)
7979

8080
## Without `hydrateMessages`
8181

82-
The retry filter still prevents duplicate processing — turns 1..N-1 aren't re-run — but the OOM'd turn's accumulator is whatever the chat.agent's default flow can rebuild from `payload.messages` (typically just the first user message of the chat). The model context is **incomplete**: it doesn't see prior assistant responses. The conversation continues but a multi-turn OOM'd recovery may produce a less coherent reply.
82+
Recovery boot reconstructs context automatically. The boot reads both the durable `session.out` snapshot (settled turns) and the `session.out` tail past the snapshot cursor (the partial assistant chunks the OOM'd turn streamed before dying). When the new attempt processes the OOM'd user message, the model sees the full prior conversation **plus** the partial assistant that was cut off — so a "keep going" follow-up continues naturally, and any other follow-up has the same context the original turn had.
8383

84-
If conversation continuity matters, use `hydrateMessages`.
84+
`hydrateMessages` is still the right choice if you want a single source of truth in your own database (branching conversations, message-level access control, etc.). It's no longer required for OOM continuity.
85+
86+
For full control over recovery — drop the partial, synthesize tool results for an interrupted tool call, emit a recovery banner to the UI — register [`onRecoveryBoot`](/ai-chat/patterns/recovery-boot).
8587

8688
## Tool execute idempotency
8789

@@ -109,5 +111,6 @@ export const sendEmail = tool({
109111

110112
## See also
111113

114+
- [Recovery boot](/ai-chat/patterns/recovery-boot) — the underlying hook + smart default that gives OOM recovery its full-context behavior
112115
- [Lifecycle hooks](/ai-chat/lifecycle-hooks): `onChatResume` fires on every retry attempt with `phase: "preload"` or `"turn"`
113-
- [Database persistence](/ai-chat/patterns/database-persistence) — the `hydrateMessages` pattern this builds on for full continuity
116+
- [Database persistence](/ai-chat/patterns/database-persistence) — the `hydrateMessages` pattern for branching, ACL, and DB-as-source-of-truth scenarios

docs/ai-chat/patterns/persistence-and-replay.mdx

Lines changed: 21 additions & 11 deletions
@@ -71,13 +71,23 @@ A new run boots when the user sends `u2`. Run 1 has long since exited. Run 2 has
7171
GET the JSON blob. On 404 (no snapshot yet — first-ever turn) or read error or version mismatch, treat as empty and continue. Snapshot misses are non-fatal — replay alone may still be sufficient.
7272
</Step>
7373
<Step title="Replay session.out tail">
74-
Subscribe to `session.out` with `wait=0` starting from `snapshot.lastOutEventId`. Drain whatever's there and close. In the steady state this returns empty (the snapshot already captured everything from turn 1). In a crash-recovery state — Run 1 emitted chunks but never wrote a snapshot — replay catches them.
74+
Subscribe to `session.out` with `wait=0` starting from `snapshot.lastOutEventId`. Drain whatever's there and close. Returns:
75+
- **Settled messages** — closed assistant turns past the snapshot cursor (the chunks of a turn that completed after the snapshot was written but before the run exited cleanly).
76+
- **A partial assistant** — the trailing message if its stream never received a `finish` chunk. The dead run was mid-response when it died. `cleanupAbortedParts` has already stripped streaming-in-progress fragments.
77+
78+
In the steady state this returns empty. In recovery, it returns whatever the dead run was in the middle of.
7579
</Step>
76-
<Step title="Merge by id, replay wins">
77-
Snapshot messages and replayed messages are merged by `id`. On collision, the replayed copy wins — `session.out` is the freshest representation of any assistant message.
80+
<Step title="Replay session.in tail">
81+
GET `session.in` records past the last `turn-complete`'s `session-in-event-id` cursor. Returns the user messages the dead run hadn't acknowledged — typically the message that triggered the cancelled / crashed turn, plus anything the customer typed after.
7882
</Step>
79-
<Step title="Clean up partial trailing assistants">
80-
If the trailing assistant message has no `finish-step` (a turn that crashed mid-stream), `cleanupAbortedParts` truncates the partial parts. If nothing remains, drop the message entirely.
83+
<Step title="Reconstruct the chain (smart default)">
84+
Snapshot messages merge with the settled replay (replay wins on `id` collision). Then:
85+
86+
- If there's a partial assistant **and** at least one in-flight user message, splice `[firstInFlightUser, partialAssistant]` onto the end of the chain. The model sees the prior turn's incomplete attempt and can continue, abandon, or pivot based on the next user message.
87+
- Remaining in-flight users dispatch as fresh turns after the recovered first one.
88+
- If there's no partial OR no in-flight users, the chain is just the settled chain and any in-flight users dispatch normally.
89+
90+
Customers can override this entirely via [`onRecoveryBoot`](/ai-chat/patterns/recovery-boot).
8191
</Step>
8292
<Step title="Append the new wire message">
8393
Append `u2` from the wire payload, exactly as on turn 1.
@@ -88,15 +98,15 @@ The model now sees `[u1, a1, u2]` and produces `a2`. After `onTurnComplete`, the
8898

8999
### Crash mid-turn — replay carries the load
90100

91-
Suppose Run 1's turn 1 streams partial assistant chunks to `session.out` and then OOMs before `onTurnComplete` fires. No snapshot was written. The next attempt boots and:
101+
Suppose Run 1's turn 1 streams partial assistant chunks to `session.out` and then crashes (OOM, exception, server-side cancel) before `onTurnComplete` fires. No snapshot was written. The next run boots and:
92102

93103
1. Snapshot read returns 404 → empty.
94-
2. Replay reads from seq 0, picks up the partial assistant chunks emitted before the crash.
95-
3. `cleanupAbortedParts` strips the in-flight parts (or drops the whole message if nothing usable remains).
96-
4. The accumulator is empty (or has just the prior user message — depends on what the runtime wrote before the crash).
97-
5. The new attempt re-runs the OOM'd turn from scratch, on the [larger `oomMachine`](/ai-chat/patterns/oom-resilience).
104+
2. `session.out` tail replay picks up the partial assistant chunks emitted before the crash. `cleanupAbortedParts` strips streaming-in-progress fragments but keeps the cleaned trailing message as the `partialAssistant`.
105+
3. `session.in` tail replay finds the user message the dead run was answering (no `turn-complete` was written, so the cursor never advanced past it).
106+
4. Smart default splices `[firstInFlightUser, partialAssistant]` onto the chain. Any later user messages (including the customer's follow-up) dispatch as fresh turns.
107+
5. The model sees full prior context and responds in kind — continuing a cut-off essay on "keep going", answering a fresh question on "actually, what's 7+8?", abandoning the prior work on "scrap that, do X instead".
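Step 2 above hinges on telling settled messages from a partial one by whether a `finish` chunk arrived. A minimal sketch of that classification follows, using a hypothetical three-variant chunk type; the real stream carries more chunk kinds, and `cleanupAbortedParts` is not modeled here.

```ts
// Hypothetical, reduced chunk vocabulary for the session.out tail.
type Chunk =
  | { type: "start"; messageId: string }
  | { type: "text-delta"; delta: string }
  | { type: "finish" };

type Assembled = { id: string; text: string };

// Walk the tail: messages closed by `finish` are settled; a trailing
// message with no `finish` is the partial assistant.
function splitTail(chunks: Chunk[]): { settled: Assembled[]; partial?: Assembled } {
  const settled: Assembled[] = [];
  let current: Assembled | undefined;
  for (const chunk of chunks) {
    if (chunk.type === "start") {
      current = { id: chunk.messageId, text: "" };
    } else if (chunk.type === "text-delta") {
      if (current) current.text += chunk.delta;
    } else if (current) {
      settled.push(current);
      current = undefined;
    }
  }
  return current ? { settled, partial: current } : { settled };
}
```

An empty tail yields no settled messages and no partial, which corresponds to the steady state.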
98108

99-
Replay-only is correct but slower. The snapshot is an optimization layered on top, not a correctness boundary.
109+
Replay carries the conversation across the crash boundary with zero customer code. For policies different from "preserve context" — drop the partial entirely, synthesize tool results for an interrupted tool call, write a recovery banner to the UI — register [`onRecoveryBoot`](/ai-chat/patterns/recovery-boot).
100110

101111
## OOM-retry interaction
102112
