Declarative retry / backoff / on_error / sub-second sleep missing from DSL (resilience gap) #155

@pinodeca

Summary

The df.* composable-futures DSL has no declarative retry, backoff, on-error callback, sub-second sleep, set-style fan-out, or timeout combinator. Real-world durable workloads — especially rate-limited LLM/embed/chat patterns that need to retry on transient 429 — must hand-roll df.loop + df.if + df.sleep + df.break per job, plus a plpgsql token-bucket function, plus an external reconciler for terminal-failure propagation.

Reported originally against v0.1.1 during a bug bash (2026-05-14). Re-verified on main (v0.2.1 in development): no fix has shipped since v0.1.1, and no existing issue tracks declarative retry/backoff/on_error.

Concrete impact

A 30-job rate-limited embed batch with "retry on denial with 1s backoff" requires ~57 lines of SQL today (per-job df.loop/df.if/df.sleep/df.break graph + bucket table + try_acquire(...) plpgsql function with FOR UPDATE row lock + audit table). A declarative retry primitive would reduce this to ~9 lines and eliminate the plpgsql + bucket table entirely (~85% reduction).
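
The issue names the try_acquire(...) function but does not show its body. As a rough illustration of the token-bucket logic such a function has to carry, here is a minimal in-memory sketch in Python. The field names (capacity, refill_per_sec) and the refill rule are assumptions for illustration, not the actual plpgsql; the real version would keep this state in the bucket table and hold the row with FOR UPDATE while updating it.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """In-memory model of a token bucket (hypothetical; the issue's lives in a table)."""
    capacity: float        # maximum tokens the bucket can hold
    refill_per_sec: float  # tokens added per second of elapsed time
    tokens: float          # tokens currently available
    last_refill: float     # timestamp (seconds) of the last refill

    def try_acquire(self, cost: float, now: float) -> bool:
        """Refill for the elapsed time, then take `cost` tokens if available."""
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # denied: the caller must wait and try again
```

Every denied call is what forces the df.loop + df.sleep envelope in the workaround: the bucket only answers yes or no, so the waiting has to be built around it per job.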

Verbatim workaround as it must be written today

v_iid := df.start(
  df.loop(
    df.if(
      df.sql(format('SELECT bba3.try_acquire(%L, 1)', i)),
      df.seq(
        df.sql(format($q$UPDATE bba3.s5_jobs
                          SET embedding   = azure_openai.create_embeddings('default-embedding', body)::vector(1536),
                              status      = 'done',
                              finished_at = clock_timestamp()
                          WHERE id = %L$q$, i)),
        df.break('done')
      ),
      df.sleep(1)  -- minimum 1 second; df.sleep takes bigint seconds only
    ),
    df.sql('SELECT TRUE')
  ),
  's5-job-' || i::text,
  'postgres'
);

Wall-clock time is ~9.6s for 30 jobs, dominated by the 1-second granularity of df.sleep rather than by the actual rate-limit math.

Hypothetical clean shape

Option A — retry kwargs on df.start / df.sql:

v_iid := df.start(
  df.sql(format($q$UPDATE bba3.s5_jobs
                    SET embedding = azure_openai.create_embeddings('default-embedding', body)::vector(1536),
                        status = 'done',
                        finished_at = clock_timestamp()
                    WHERE id = %L$q$, i)),
  's5-job-' || i::text,
  'postgres',
  retry        => 'on_429',
  backoff      => 'exponential(base=>200ms, cap=>30s, jitter=>0.2)',
  max_attempts => 100
);

Option B — df.retry combinator:

v_iid := df.start(
  df.retry(
    df.sql(format($q$UPDATE ... azure_openai.create_embeddings(...) ... WHERE id = %L$q$, i)),
    policy       => 'rate_limited',
    backoff      => df.exponential('200ms', '30s', 0.2),
    max_attempts => 100
  ),
  's5-job-' || i::text,
  'postgres'
);
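
To make the backoff parameters in both options concrete, here is a small Python sketch of the delay schedule that something like exponential(base=>200ms, cap=>30s, jitter=>0.2) might produce. The doubling factor, the 0-based attempt index, and reading jitter as a symmetric +/- fraction are all assumptions; the issue does not define these semantics.

```python
import random


def backoff_delay(attempt, base=0.2, cap=30.0, jitter=0.2, rng=random.random):
    """Delay in seconds before retry number `attempt` (0-based). Hypothetical semantics."""
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    raw = min(cap, base * 2.0 ** min(attempt, 60))
    # Spread the capped delay uniformly within +/- `jitter` of its value.
    spread = 1.0 + jitter * (2.0 * rng() - 1.0)
    return raw * spread


if __name__ == "__main__":
    for n in range(10):
        print(n, round(backoff_delay(n), 3))
```

Note that in this sketch jitter is applied after the cap, so a delay can exceed the cap by up to the jitter fraction; applying the cap last is an equally reasonable choice. Either way, the early delays are well under one second, which is why the schedule is only useful if df.sleep accepts sub-second values.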

Verification on main (HEAD past v0.2.1 in development)

  • No df.retry combinator and no retry/backoff/max_attempts/on_error kwargs on df.start or df.sql (src/dsl.rs, src/lib.rs).
  • df.sleep still takes bigint seconds only — pub fn sleep(seconds: i64) in src/dsl.rs, SQL signature df.sleep(bigint) in src/lib.rs.
  • No df.catch / df.on_error / df.timeout. No set-style fan-out: only pairwise df.join / df.join3 / df.race.
  • CHANGELOG.md shows no resilience-related changes between v0.1.1 and v0.2.1.

Scope (sub-items to consider together)

This issue intentionally bundles the related resilience gaps surfaced by the same workload, since a coherent design likely addresses several at once:

  1. Declarative retry/backoff/on_error: df.retry(...) combinator and/or kwargs on df.start/df.sql. Must support: policy/error-class predicate (e.g. transient vs permanent, 429-like), exponential backoff with cap + jitter, max attempts, optional on-error callback or non-retryable error classes.
  2. Sub-second sleep: accept interval or fractional seconds in df.sleep so backoff isn't floored to 1s.
  3. df.timeout(future, duration): first-class, so users don't have to compose df.race with df.sleep manually.
  4. Set-style fan-out: df.parallel(futures[]) / df.join_all(...), instead of forcing pairwise df.join / df.join3.
  5. df.catch / df.on_error: for failure-handling branches without hand-rolled df.if envelopes.
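
The semantics asked for in item 1 can be sketched in a few lines of Python. This is only a model of the requested behavior (error-class predicate, max attempts, optional on-error callback, fractional delays), not a proposal for the DSL's implementation; the names run_with_retry, is_retryable, delay_for, and TransientError are hypothetical.

```python
import time


class TransientError(Exception):
    """Stands in for a retryable failure such as a 429-like denial."""


def run_with_retry(action, *, is_retryable, max_attempts, delay_for,
                   on_error=None, sleep=time.sleep):
    """Call `action` until it succeeds, hits a permanent error, or runs out of attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return action()
        except Exception as exc:
            if on_error is not None:
                on_error(attempt, exc)  # observe every failure, retryable or not
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_retryable(exc):
                raise  # terminal failure propagates to the caller
            sleep(delay_for(attempt))  # fractional seconds are allowed here
```

A call such as run_with_retry(job, is_retryable=lambda e: isinstance(e, TransientError), max_attempts=100, delay_for=lambda n: min(30.0, 0.2 * 2 ** n)) covers what the hand-rolled df.loop/df.if/df.sleep/df.break graph does today, and the re-raise on the final attempt is the terminal-failure propagation that currently needs an external reconciler.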

Prior art

Runtime | Shape
Temporal | ActivityOptions.RetryPolicy { initial_interval, backoff_coefficient, maximum_interval, maximum_attempts, non_retryable_error_types }
Azure Durable Functions | context.df.callActivityWithRetry(activity, retryOptions, input) with RetryOptions(firstRetryInterval, maxAttempts) + backoffCoefficient, maxRetryInterval, retryTimeout, handler
AWS Step Functions | Retry blocks in state JSON: { ErrorEquals, IntervalSeconds, BackoffRate, MaxAttempts, JitterStrategy }

All three expose first-class declarative retry on every activity invocation.

Related / not duplicates

Origin

Filed from a bug report originally against v0.1.1 (bug bash 2026-05-14, bug-candidate/aibugbash/phase-3/pg_durable-resilience-gap). Re-verified on main 2026-05-19.

Metadata

Labels: bug, enhancement
