Summary
The df.* composable-futures DSL has no declarative retry, backoff, on-error callback, sub-second sleep, set-style fan-out, or timeout combinator. Real-world durable workloads, especially rate-limited LLM/embed/chat patterns that must retry on transient HTTP 429 responses, have to hand-roll df.loop + df.if + df.sleep + df.break per job, add a plpgsql token-bucket function, and run an external reconciler for terminal-failure propagation.
Reported originally against v0.1.1 during a bug bash (2026-05-14). Re-verified on main (v0.2.1 in development): no fix has shipped since v0.1.1, and no existing issue tracks declarative retry/backoff/on_error.
Concrete impact
A 30-job rate-limited embed batch with "retry on denial with 1s backoff" requires ~57 lines of SQL today (per-job df.loop/df.if/df.sleep/df.break graph + bucket table + try_acquire(...) plpgsql function with FOR UPDATE row lock + audit table). A declarative retry primitive would reduce this to ~9 lines and eliminate the plpgsql + bucket table entirely (~85% reduction).
Verbatim workaround as it must be written today
v_iid := df.start(
    df.loop(
        df.if(
            df.sql(format('SELECT bba3.try_acquire(%L, 1)', i)),
            df.seq(
                df.sql(format($q$UPDATE bba3.s5_jobs
                    SET embedding = azure_openai.create_embeddings('default-embedding', body)::vector(1536),
                        status = 'done',
                        finished_at = clock_timestamp()
                    WHERE id = %L$q$, i)),
                df.break('done')
            ),
            df.sleep(1) -- minimum 1 second; df.sleep takes bigint seconds only
        ),
        df.sql('SELECT TRUE')
    ),
    's5-job-' || i::text,
    'postgres'
);
Wall-clock time is ~9.6s for 30 jobs, dominated by df.sleep granularity (whole seconds only) rather than the actual rate-limit math.
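To make the granularity cost concrete, here is a minimal Rust sketch of the arithmetic only, not of pg_durable itself. The bucket numbers (0.5 tokens on hand, refill of 2 tokens per second) and the helper names wait_for_token and rounded_to_whole_seconds are illustrative assumptions; the bucket configuration behind the ~9.6s figure is not stated in this report.

```rust
use std::time::Duration;

/// Time until a token bucket currently holding `tokens` (fractional) has at
/// least one whole token, refilling at `refill_per_sec` tokens per second.
/// Both parameters are assumptions for illustration.
fn wait_for_token(tokens: f64, refill_per_sec: f64) -> Duration {
    assert!(refill_per_sec > 0.0, "refill rate must be positive");
    if tokens >= 1.0 {
        return Duration::ZERO;
    }
    Duration::from_secs_f64((1.0 - tokens) / refill_per_sec)
}

/// The same wait once it has to be expressed in whole seconds, as a
/// seconds-only sleep requires: round up, with a 1-second minimum.
fn rounded_to_whole_seconds(wait: Duration) -> Duration {
    let secs = wait.as_secs() + u64::from(wait.subsec_nanos() > 0);
    Duration::from_secs(secs.max(1))
}
```

With these assumed numbers a denied job needs only 250 ms for the next token, yet a seconds-only sleep makes it wait a full second, four times longer than the rate limit requires.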
Hypothetical clean shape
Option A — retry kwargs on df.start / df.sql:
v_iid := df.start(
    df.sql(format($q$UPDATE bba3.s5_jobs
        SET embedding = azure_openai.create_embeddings('default-embedding', body)::vector(1536),
            status = 'done',
            finished_at = clock_timestamp()
        WHERE id = %L$q$, i)),
    's5-job-' || i::text,
    'postgres',
    retry => 'on_429',
    backoff => 'exponential(base=>200ms, cap=>30s, jitter=>0.2)',
    max_attempts => 100
);
Option B — df.retry combinator:
v_iid := df.start(
    df.retry(
        df.sql(format($q$UPDATE ... azure_openai.create_embeddings(...) ... WHERE id = %L$q$, i)),
        policy => 'rate_limited',
        backoff => df.exponential('200ms', '30s', 0.2),
        max_attempts => 100
    ),
    's5-job-' || i::text,
    'postgres'
);
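Both options name an exponential backoff with base 200ms, cap 30s, and jitter 0.2. One plausible reading of those parameters is sketched below in Rust. The Backoff type, the doubling factor, and the choice to apply jitter before the cap are assumptions made for illustration, not a proposed pg_durable implementation. The random sample is passed in as unit so the sketch needs no external crate.

```rust
use std::time::Duration;

/// Hypothetical backoff parameters mirroring exponential(base, cap, jitter).
struct Backoff {
    base: Duration,
    cap: Duration,
    /// Fraction in [0, 1]; 0.2 means the delay varies by up to +/-20%.
    jitter: f64,
}

impl Backoff {
    /// Delay before retry number `attempt` (0-based).
    /// `unit` is a caller-supplied sample in [0, 1), typically from an RNG.
    fn delay(&self, attempt: u32, unit: f64) -> Duration {
        // Assumed growth: the base doubles on every attempt.
        let factor = 2f64.powi(attempt.min(64) as i32);
        let raw = self.base.as_secs_f64() * factor;
        // Map `unit` from [0, 1) onto a multiplier in [1 - jitter, 1 + jitter).
        let spread = 1.0 + self.jitter * (2.0 * unit - 1.0);
        // Jitter first, then clamp, so the result never exceeds the cap.
        let jittered = (raw * spread).max(0.0).min(self.cap.as_secs_f64());
        Duration::from_secs_f64(jittered)
    }
}
```

Under these assumptions the un-jittered schedule runs 200 ms, 400 ms, 800 ms, 1.6 s and so on, and reaches the 30 s cap at the ninth retry (attempt index 8), so most of the first few retries would fall below the current 1-second sleep floor.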
Verification on main (HEAD past v0.2.1 in development)
- No df.retry combinator and no retry/backoff/max_attempts/on_error kwargs on df.start or df.sql (src/dsl.rs, src/lib.rs).
- df.sleep still takes bigint seconds only: pub fn sleep(seconds: i64) in src/dsl.rs, SQL signature df.sleep(bigint) in src/lib.rs.
- No df.catch / df.on_error / df.timeout. No set-style fan-out: only pairwise df.join / df.join3 / df.race.
- CHANGELOG.md shows no resilience-related changes between v0.1.1 and v0.2.1.
Scope (sub-items to consider together)
This issue intentionally bundles the related resilience gaps surfaced by the same workload, since a coherent design likely addresses several at once:
- Declarative retry/backoff/on_error: df.retry(...) combinator and/or kwargs on df.start/df.sql. Must support: policy/error-class predicate (e.g. transient vs permanent, 429-like), exponential backoff with cap + jitter, max attempts, optional on-error callback or non-retryable error classes.
- Sub-second sleep: accept interval or fractional seconds in df.sleep so backoff isn't floored to 1s.
- df.timeout(future, duration): first-class, so users don't have to compose df.race with df.sleep manually.
- Set-style fan-out: df.parallel(futures[]) / df.join_all(...), instead of forcing pairwise df.join / df.join3.
- df.catch / df.on_error: for failure-handling branches without hand-rolled df.if envelopes.
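The retry requirements in the first item (an error-class predicate, a maximum attempt count, non-retryable classes) can be sketched as a small decision function. This is an illustration in Rust under assumed names (ErrorClass, Decision, RetryPolicy, run_with_retry); none of them exist in pg_durable today, and the sketch deliberately omits the backoff sleep between attempts.

```rust
/// Hypothetical error classes a retry predicate could distinguish.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ErrorClass {
    Transient,
    RateLimited, // e.g. a 429-like denial
    Permanent,
}

#[derive(Debug, PartialEq)]
enum Decision {
    Retry,
    GiveUp,
}

/// Hypothetical declarative policy: an attempt budget plus a predicate
/// selecting which error classes are worth retrying.
struct RetryPolicy {
    max_attempts: u32,
    retry_on: fn(ErrorClass) -> bool,
}

impl RetryPolicy {
    /// Decide what to do after `attempts_so_far` attempts have failed,
    /// the latest one with `err`.
    fn decide(&self, attempts_so_far: u32, err: ErrorClass) -> Decision {
        if attempts_so_far >= self.max_attempts || !(self.retry_on)(err) {
            Decision::GiveUp
        } else {
            Decision::Retry
        }
    }
}

/// Calls `op` with the 1-based attempt number until it succeeds or the
/// policy gives up, returning the last error in the latter case.
fn run_with_retry<T>(
    policy: &RetryPolicy,
    mut op: impl FnMut(u32) -> Result<T, ErrorClass>,
) -> Result<T, ErrorClass> {
    let mut attempt = 0;
    loop {
        attempt += 1;
        match op(attempt) {
            Ok(value) => return Ok(value),
            Err(err) => {
                if policy.decide(attempt, err) == Decision::GiveUp {
                    return Err(err);
                }
            }
        }
    }
}
```

Keeping the decision separate from the driver loop is one way a runtime could surface terminal failures itself: a GiveUp result is the point where an on-error callback or failure branch would fire, instead of relying on an external reconciler.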
Prior art
| Runtime | Shape |
|---|---|
| Temporal | ActivityOptions.RetryPolicy { initial_interval, backoff_coefficient, maximum_interval, maximum_attempts, non_retryable_error_types } |
| Azure Durable Functions | context.df.callActivityWithRetry(activity, retryOptions, input) with RetryOptions(firstRetryInterval, maxAttempts) + backoffCoefficient, maxRetryInterval, retryTimeout, handler |
| AWS Step Functions | Retry blocks in state JSON: { ErrorEquals, IntervalSeconds, BackoffRate, MaxAttempts, JitterStrategy } |
All three expose first-class declarative retry on every activity invocation.
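Related / not duplicates
- […] df.status() and df.list_instances(). Surfaced in the same bug bash; already tracked, not duplicated here.
- […] df.start() itself for DoS protection. Different layer (admission control, not per-future retry).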
Origin
Filed from a bug report originally against v0.1.1 (bug bash 2026-05-14, bug-candidate/aibugbash/phase-3/pg_durable-resilience-gap). Re-verified on main 2026-05-19.