Declarative retry / backoff / on_error / sub-second sleep missing from DSL (resilience gap)

## Summary

The `df.*` composable-futures DSL has no declarative retry, backoff, on-error callback, sub-second sleep, set-style fan-out, or timeout combinator. Real-world durable workloads, especially rate-limited LLM/embed/chat patterns that must retry on transient HTTP 429 responses, have to hand-roll a `df.loop` + `df.if` + `df.sleep` + `df.break` graph per job, plus a plpgsql token-bucket function and an external reconciler for terminal-failure propagation.

Reported originally against v0.1.1 during a bug bash (2026-05-14). Re-verified on `main` (v0.2.1 in development): no fix has shipped since v0.1.1, and no existing issue tracks declarative retry/backoff/on_error.

## Concrete impact

A 30-job rate-limited embed batch with "retry on denial with 1s backoff" requires ~57 lines of SQL today (per-job `df.loop`/`df.if`/`df.sleep`/`df.break` graph + `bucket` table + `try_acquire(...)` plpgsql function with `FOR UPDATE` row lock + audit table). A declarative retry primitive would reduce this to ~9 lines and eliminate the plpgsql + bucket table entirely (~85% reduction).

### Verbatim workaround as it must be written today

```sql
v_iid := df.start(
  df.loop(
    df.if(
      df.sql(format('SELECT bba3.try_acquire(%L, 1)', i)),
      df.seq(
        df.sql(format($q$UPDATE bba3.s5_jobs
                          SET embedding   = azure_openai.create_embeddings('default-embedding', body)::vector(1536),
                              status      = 'done',
                              finished_at = clock_timestamp()
                          WHERE id = %L$q$, i)),
        df.break('done')
      ),
      df.sleep(1)  -- minimum 1 second; df.sleep takes bigint seconds only
    ),
    df.sql('SELECT TRUE')
  ),
  's5-job-' || i::text,
  'postgres'
);
```

Wall-clock time is ~9.6s for 30 jobs, dominated by the 1-second `df.sleep` granularity rather than by the actual rate-limit math.

### Hypothetical clean shape

**Option A — retry kwargs on `df.start` / `df.sql`:**

```sql
v_iid := df.start(
  df.sql(format($q$UPDATE bba3.s5_jobs
                    SET embedding = azure_openai.create_embeddings('default-embedding', body)::vector(1536),
                        status = 'done',
                        finished_at = clock_timestamp()
                    WHERE id = %L$q$, i)),
  's5-job-' || i::text,
  'postgres',
  retry        => 'on_429',
  backoff      => 'exponential(base=>200ms, cap=>30s, jitter=>0.2)',
  max_attempts => 100
);
```

**Option B — `df.retry` combinator:**

```sql
v_iid := df.start(
  df.retry(
    df.sql(format($q$UPDATE ... azure_openai.create_embeddings(...) ... WHERE id = %L$q$, i)),
    policy       => 'rate_limited',
    backoff      => df.exponential('200ms', '30s', 0.2),
    max_attempts => 100
  ),
  's5-job-' || i::text,
  'postgres'
);
```
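To make the intended meaning of `exponential(base=>200ms, cap=>30s, jitter=>0.2)` concrete, here is a minimal Rust sketch of the delay schedule such a policy might compute. The `Backoff` type and its methods are hypothetical and do not exist in pg_durable. The jitter sample is passed in by the caller, so the sketch needs no RNG crate and stays deterministic.

```rust
use std::time::Duration;

/// Hypothetical policy mirroring `exponential(base=>200ms, cap=>30s, jitter=>0.2)`.
/// None of these names exist in pg_durable today.
#[derive(Debug, Clone, Copy)]
pub struct Backoff {
    pub base: Duration,
    pub cap: Duration,
    /// Fraction of the delay to vary by, e.g. 0.2 means +/-20%.
    pub jitter: f64,
}

impl Backoff {
    /// Delay before retry `attempt` (0-based) with no jitter:
    /// min(cap, base * 2^attempt).
    pub fn raw_delay(&self, attempt: u32) -> Duration {
        // Saturate instead of overflowing for very large attempt counts.
        let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
        self.base.checked_mul(factor).unwrap_or(self.cap).min(self.cap)
    }

    /// Jittered delay, never exceeding `cap`. `unit` is a caller-supplied
    /// sample in [-1.0, 1.0] (for example from an RNG).
    /// Assumes `cap` is far below `Duration::MAX`.
    pub fn delay(&self, attempt: u32, unit: f64) -> Duration {
        let raw = self.raw_delay(attempt);
        let spread = self.jitter.clamp(0.0, 1.0) * unit.clamp(-1.0, 1.0);
        if spread.is_nan() {
            return raw; // ignore jitter on NaN input rather than panic
        }
        raw.mul_f64(1.0 + spread).min(self.cap)
    }
}
```

With a 200ms base and a 30s cap, the un-jittered delays run 200ms, 400ms, 800ms, and so on up to 25.6s, then stay at 30s from the ninth retry onward. The first three delays are below one second, which is why this schedule cannot be expressed on top of a `df.sleep` that accepts whole seconds only.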

## Verification on main (HEAD past v0.2.1 in development)

- No `df.retry` combinator and no `retry`/`backoff`/`max_attempts`/`on_error` kwargs on `df.start` or `df.sql` (`src/dsl.rs`, `src/lib.rs`).
- `df.sleep` still takes `bigint` seconds only — `pub fn sleep(seconds: i64)` in `src/dsl.rs`, SQL signature `df.sleep(bigint)` in `src/lib.rs`.
- No `df.catch` / `df.on_error` / `df.timeout`. No set-style fan-out: only pairwise `df.join` / `df.join3` / `df.race`.
- `CHANGELOG.md` shows no resilience-related changes between v0.1.1 and v0.2.1.

## Scope (sub-items to consider together)

This issue intentionally bundles the related resilience gaps surfaced by the same workload, since a coherent design likely addresses several at once:

1. **Declarative retry/backoff/on_error** — `df.retry(...)` combinator and/or kwargs on `df.start`/`df.sql`. Must support: policy/error-class predicate (e.g. transient vs permanent, 429-like), exponential backoff with cap + jitter, max attempts, optional on-error callback or non-retryable error classes.
2. **Sub-second sleep** — accept interval or fractional seconds in `df.sleep` so backoff isn't floored to 1s.
3. **`df.timeout(future, duration)`** — first-class, so users don't have to compose `df.race` with `df.sleep` manually.
4. **Set-style fan-out** — `df.parallel(futures[])` / `df.join_all(...)`, instead of forcing pairwise `df.join` / `df.join3`.
5. **`df.catch` / `df.on_error`** — for failure-handling branches without hand-rolled `df.if` envelopes.
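For item 2, one possible way to accept fractional seconds is to validate the argument once at the DSL boundary and carry a `Duration` from there. The sketch below is an assumption about how that could look, not the existing code: `sleep_duration` is a hypothetical helper, and whether a zero-length sleep should be accepted is left open.

```rust
use std::time::Duration;

/// Hypothetical replacement for the integer-only `sleep(seconds: i64)` argument.
/// Converts fractional seconds into a `Duration`, returning an error for
/// negative, NaN, infinite, or out-of-range input instead of panicking.
pub fn sleep_duration(seconds: f64) -> Result<Duration, String> {
    Duration::try_from_secs_f64(seconds)
        .map_err(|e| format!("invalid sleep of {seconds} s: {e}"))
}
```

Under this sketch, `sleep_duration(0.25)` yields a 250ms duration, while `sleep_duration(-1.0)` is rejected with an error. An `interval`-typed SQL overload would need a separate conversion, which is not shown here.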

## Prior art

| Runtime | Shape |
|---|---|
| Temporal | `ActivityOptions.RetryPolicy { initial_interval, backoff_coefficient, maximum_interval, maximum_attempts, non_retryable_error_types }` |
| Azure Durable Functions | `context.df.callActivityWithRetry(activity, retryOptions, input)` with `RetryOptions(firstRetryInterval, maxNumberOfAttempts)` plus optional `backoffCoefficient`, `maxRetryInterval`, `retryTimeout`, and a retry handler |
| AWS Step Functions | `Retry` blocks in state JSON: `{ ErrorEquals, IntervalSeconds, BackoffRate, MaxAttempts, JitterStrategy }` |

All three expose first-class declarative retry on every activity invocation.
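The three shapes in the table share two elements: an attempt cap and a way to mark some errors as non-retryable. A rough Rust sketch of those semantics follows. Every name in it (`ErrorClass`, `Outcome`, `run_with_retry`) is hypothetical and only illustrates the control flow. A durable runtime would schedule a timer in `wait` rather than block.

```rust
/// How a hypothetical retry policy classifies a failure.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorClass {
    Transient, // e.g. a 429-like rate-limit denial: worth retrying
    Permanent, // e.g. invalid input: retrying cannot help
}

/// Final result of running an operation under the policy.
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome<T, E> {
    Succeeded { value: T, attempts: u32 },
    Exhausted { last_error: E, attempts: u32 },
    NonRetryable { error: E, attempts: u32 },
}

/// Runs `op` until it succeeds, fails permanently, or has been tried
/// `max_attempts` times (it always runs at least once).
/// `classify` plays the role of the error-class predicate; `wait` receives
/// the 0-based retry index so the caller can plug in any backoff schedule.
pub fn run_with_retry<T, E>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, E>,
    classify: impl Fn(&E) -> ErrorClass,
    mut wait: impl FnMut(u32),
) -> Outcome<T, E> {
    let mut attempts = 0;
    loop {
        attempts += 1;
        match op() {
            Ok(value) => return Outcome::Succeeded { value, attempts },
            Err(error) => {
                if classify(&error) == ErrorClass::Permanent {
                    return Outcome::NonRetryable { error, attempts };
                }
                if attempts >= max_attempts {
                    return Outcome::Exhausted { last_error: error, attempts };
                }
                wait(attempts - 1);
            }
        }
    }
}
```

Returning a distinct `Exhausted` outcome, rather than looping indefinitely, is what would let terminal failures propagate without the external reconciler mentioned in the summary.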

## Related / not duplicates

- #123 — status casing mismatch between `df.status()` and `df.list_instances()`. Surfaced in the same bug bash; **already tracked, not duplicated here.**
- #134 (closed) — zero-duration sleep busy-loop. Different problem.
- #139 — rate-limiting `df.start()` itself for DoS protection. Different layer (admission control, not per-future retry).

## Origin

Filed from a bug report originally against v0.1.1 (bug bash 2026-05-14, `bug-candidate/aibugbash/phase-3/pg_durable-resilience-gap`). Re-verified on `main` 2026-05-19.
