Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
155 changes: 155 additions & 0 deletions .blog/durable-execution-layer.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,155 @@
# Workflow DevKit makes Agents durable

## Thesis

Production AI agents are not single HTTP requests. They are long-running programs that plan, call tools, wait on external systems, and keep internal state across dozens of decisions.

Stateless compute fights that shape. A cold start or a timeout resets the process mid-loop. A retry replays side effects unless you build your own idempotency ledger. Teams end up rebuilding durable execution out of database rows, queues, and scheduled jobs.

Workflow DevKit turns that pile of infrastructure back into code. You write an Agent as a workflow function. The runtime persists progress as an event log and deterministically replays the workflow to reconstruct state after failures, cold starts, or scale events.

## Current state

Most "production agent" stacks ship the same diagram with different logos:

* A `agent_runs` table that stores conversation state, tool history, and a cursor.
* A queue that re-invokes the agent after every tool call.
* A cron job that scans for stuck runs, retries failed calls, and advances timers.
* Idempotency keys everywhere to avoid double-charging, double-emailing, or double-writing.

This works, but it costs engineering time forever. Every tool integration becomes a mini state machine. Every new failure mode adds another column: `attempt`, `next_run_at`, `last_error`, `lock_owner`. The "agent" ends up split across handlers that must agree on invariants.

Here's the pattern in code.

**Before: DB row + queue for an Agent tool-calling loop**

```ts
import { sql } from "./db";
import { queue } from "./queue";

export async function runAgent(runId: string) {
const run = await sql`SELECT * FROM agent_runs WHERE id=${runId}`;
try {
const next = await llmPlan(run.state);
const toolOut = await callTool(next.tool, next.args, {
idempotencyKey: `${runId}:${run.step}`,
});
await sql`UPDATE agent_runs SET state=${toolOut.state}, step=${run.step + 1}
WHERE id=${runId}`;
await queue.add("agent", { runId }, { jobId: `${runId}:${run.step + 1}` });
} catch (err) {
await sql`UPDATE agent_runs SET retries=${run.retries + 1}, last_error=${String(err)}
WHERE id=${runId}`;
await queue.add("agent", { runId }, { delay: backoff(run.retries) });
}
}
```

The code above "works" until it doesn't. You now own locking, exactly-once semantics, backoff, and recovery. Any bug that advances `step` at the wrong time corrupts the run. Any mismatch between the stored cursor and the tool history produces duplicated tool calls.

## The shift

Durable execution flips the control plane. Instead of persisting *state* and reconstructing control flow, you persist *control flow* and reconstruct state.

Workflow DevKit records every side effect boundary as an event. When the workflow restarts, the runtime replays the workflow from the top in a deterministic sandbox and feeds it the same event stream. Completed steps return their recorded results. Pending steps suspend the workflow and get scheduled. The workflow code stays readable because it is still just async TypeScript.

**After: the same Agent as a workflow with steps**

```ts
type AgentState = { messages: string[]; done: boolean };

async function llmPlan(state: AgentState) {
'use step';
return decideNextAction(state.messages);
}
async function callTool(name: string, args: unknown) {
'use step';
return tools[name](args);
}

export async function agentLoop(initial: AgentState) {
'use workflow';
let state = initial;
while (state.done === false) {
const plan = await llmPlan(state);
state = await callTool(plan.tool, plan.args);
}
return state;
}
```

The pain disappears because you stopped simulating a runtime in tables. The workflow function is the state machine. The durable log is the source of truth. Retries stop being a cross-cutting concern you re-implement for every tool.

## The vision

Agents need four things that plain serverless does not provide:

1. **State across tool calls.** The agent has to remember what already happened.
2. **Selective retries.** A transient failure should retry one tool call, not the entire run.
3. **Parallel execution.** Agents fan out: retrieval + enrichment + verification.
4. **Long waits.** Human-in-the-loop and external systems do not fit in a 10-60 second timeout.

Workflow DevKit maps those directly onto existing JavaScript primitives:

* Use local variables for state. The runtime reconstructs them by replay.
* Use `FatalError` and `RetryableError` inside steps to control retry and backoff.
* Use `Promise.all()` and `Promise.race()` in workflows for fanout and competition.
* Use `sleep()` for durable delays and hooks to pause until an external event arrives.

That last pair matters for agents because "waiting" is normal. A workflow can suspend while it waits for a webhook, a human approval, or an upstream batch job. The runtime resumes the workflow when the event shows up, without you writing a scheduler.

Retries are the other place teams burn weeks. The usual solution is a cron-driven state machine that retries failed calls and advances a `next_retry_at` timestamp.

**Before: cron + state machine retry for flaky API calls**

```ts
import { sql } from "./db";

export async function retryCron() {
const jobs = await sql`SELECT * FROM api_calls
WHERE status='retry' AND run_at < now()
LIMIT 100`;
for (const job of jobs.rows) {
const res = await fetch(job.url, { method: "POST", body: job.body });
const status = res.status < 500 ? "done" : "retry";
await sql`UPDATE api_calls SET status=${status}, attempts=${job.attempts + 1},
run_at=${nextRunAt(job.attempts)} WHERE id=${job.id}`;
}
}
```

That code turns "retry an HTTP call" into an operational subsystem. The database becomes a task scheduler. The cron job becomes a reliability layer.

**After: RetryableError inside a step**

```ts
import { FatalError, RetryableError } from "workflow";

async function postInvoice(id: string) {
'use step';
const origin = process.env.INVOICE_API_ORIGIN ?? "";
const res = await fetch(`${origin}/invoices/${id}`, { method: "POST" });
if (res.status >= 500) throw new RetryableError("invoice API 5xx", { retryAfter: "30s" });
if (res.ok === false) throw new FatalError(`invoice API ${res.status}`);
return res.json();
}

export async function invoiceAgent(id: string) {
'use workflow';
return await postInvoice(id);
}
```

The step throws a structured error. The runtime persists that failure, schedules a retry with backoff, and replays the workflow without re-running completed work.

## Next steps

Treat "Agent" as a workflow boundary, not a request handler. Keep the workflow deterministic and push I/O into steps. If a piece of code needs the network, the filesystem, or a timer, it belongs in a step.

Start small. Pick one agent loop that currently writes state to a database and triggers itself via a queue. Move the loop into a workflow function. Wrap each tool call in a step function. Replace cron-based retry with `RetryableError` and durable `sleep()`.

Run the workflow locally, then inspect the event log and step timeline.

```bash
npx -y -p @workflow/cli wf inspect runs
```
161 changes: 161 additions & 0 deletions .blog/how-deterministic-replay-works-for-ai-agents.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,161 @@
# How Workflow DevKit executes Agents with deterministic replay

## Problem

An Agent that calls tools is a distributed system in a single function body. It crosses process boundaries every time it waits on the network, hits a timeout, or gets retried by the platform. Stateless retries re-run code, not intent.

The usual mitigation is "checkpoint everything." After every tool call you write a cursor and a blob of state to durable storage. On restart you read the checkpoint and try to reconstruct what happened. This approach turns agent code into a database-backed interpreter.

## Approach

Workflow DevKit splits Agent code into two execution models:

* **Workflow functions** (`'use workflow'`) run inside a sandboxed VM. They orchestrate control flow, hold state in local variables, and stay deterministic.
* **Step functions** (`'use step'`) run with full Node.js access. They perform side effects: network calls, SDKs, file I/O, crypto, and timers.

The runtime persists every step boundary as an event in an append-only log. When the workflow runs again, it replays the workflow from the top, feeds it the same event stream, and returns recorded results for completed steps. Only missing or failed steps execute.

That design targets the failure modes that break agents in production: cold starts mid-conversation, platform timeouts, partial success in parallel fanout, and flaky tool calls.

## Implementation details

### Build-time split: workflow bundle vs step bundle

A workflow file contains both orchestrator code and side-effecting code. Workflow DevKit's build pipeline uses an SWC transform to recognize the `'use workflow'` / `'use step'` directives and split them into separate bundles.

That split is what makes the runtime model crisp: orchestrators run in a deterministic VM, and steps run in normal Node.js. You still write a single file.

### Determinism in the workflow VM

The workflow VM runs under constraints that make replay reliable:

* `Math.random()` is seeded per workflow run.
* `Date.now()` is fixed and advanced based on event timestamps during replay.
* `crypto.getRandomValues()` and `crypto.randomUUID()` are deterministic.
* `process.env` is copied and frozen.
* Timer APIs (`setTimeout`, `setInterval`, `setImmediate`) throw. Use durable `sleep()` instead.
* Global `fetch` is blocked in workflows. Put network I/O in steps.

This matters for agents because non-determinism breaks replay. If the orchestrator reads "now" or random data to decide which tool to call, it must see the same values on every replay.

### Event log + suspension

A workflow run consumes an ordered event stream. When the workflow hits an awaited step, it looks for events with the step's correlation id:

* `step_created` confirms the step exists.
* `step_started`, `step_retrying`, `step_completed`, `step_failed` drive resolution.
* `wait_created` / `wait_completed` back durable `sleep()`.
* `hook_created` and hook completion events back external resumes.

When an awaited step has no matching event yet, the workflow throws a `WorkflowSuspension`. The suspension carries a queue of pending invocations (steps, waits, hooks). The runtime handler persists the missing `*_created` events and enqueues step executions with an idempotency key equal to the correlation id.

The workflow stops at that point. Step workers run, append completion or retry events, and re-enqueue the workflow. On the next replay, the workflow re-runs the same code and picks up exactly where it left off.

### Built-in retries at the step boundary

Step execution owns retries. A step can fail in three ways:

* Throw `FatalError` to fail the step and bubble the error to the workflow.
* Throw `RetryableError` to retry with an explicit `retryAfter`.
* Throw any other error to retry with the default policy, up to `maxRetries` (default is 3).

Retries do not re-run completed steps. The event log preserves the successful work and the orchestrator replays it.

## Code patterns

### Crash recovery without checkpoints

**Before: manual checkpoint writes and cursor recovery**

```ts
import { sql } from "./db";

export async function agentHandler(runId: string) {
const run = await sql`SELECT cursor, state FROM agent_runs WHERE id=${runId}`;
let { cursor, state } = run.rows[0];
while (cursor < state.plan.length) {
const out = await tools[state.plan[cursor]](state);
cursor += 1;
state = { ...state, out };
await sql`UPDATE agent_runs SET cursor=${cursor}, state=${state} WHERE id=${runId}`;
}
return state;
}
```

This is a checkpointed interpreter. Every loop iteration writes to storage so the next invocation can reconstruct progress.

**After: deterministic replay, no explicit checkpoints**

```ts
async function runTool(name: string, input: unknown) {
'use step';
return tools[name](input);
}

export async function agentRun(plan: { name: string }[], initial: unknown) {
'use workflow';
let state = initial;
for (const action of plan) state = await runTool(action.name, state);
return state;
}
```

The workflow stores state in local variables. The runtime reconstructs those variables on replay by feeding recorded step results back into the same loop.

### Parallel fanout without bespoke orchestration

Agents fan out to keep latency bounded: search + fetch + summarize in parallel. The hard part is partial success. One branch can succeed while another fails, and a stateless retry re-executes both unless you persist per-branch outputs.

**Before: custom fanout bookkeeping to avoid redoing work**

```ts
import { sql } from "./db";

export async function fanout(runId: string) {
await sql`UPDATE runs SET status='running' WHERE id=${runId}`;
const [a, b] = await Promise.allSettled([callA(), callB()]);
if (a.status === "fulfilled") await sql`UPDATE runs SET a=${a.value} WHERE id=${runId}`;
if (b.status === "fulfilled") await sql`UPDATE runs SET b=${b.value} WHERE id=${runId}`;
if (a.status === "rejected" || b.status === "rejected") throw new Error("retry later");
return { a: a.value, b: b.value };
}
```

You persist intermediate results because the platform does not.

**After: Promise.all over durable steps**

```ts
async function fetchA() {
'use step';
return callA();
}
async function fetchB() {
'use step';
return callB();
}

export async function fanoutWorkflow() {
'use workflow';
const [a, b] = await Promise.all([fetchA(), fetchB()]);
return { a, b };
}
```

Each step has its own event history and retry policy. If `fetchB()` fails and retries, `fetchA()` replays from its `step_completed` event without re-executing.

## Results

Workflow DevKit moves agent reliability into the runtime instead of your app code:

* Cold starts and timeouts resume from the event log, not from ad hoc checkpoints.
* Tool-call retries are selective. Completed steps return recorded results.
* Parallel fanout uses ordinary `Promise.all()` with independent step retries.
* Long waits become first-class via durable `sleep()` and hook-based resume.

The operational surface area shrinks. You stop maintaining a queue protocol, a scheduler, and a state machine schema per agent.

```bash
npx -y -p @workflow/cli wf inspect runs --limit 10
```
Loading
Loading