Add exponential+jitter retry helpers and apply to transient LLM/agent/fs/observation operations #111

Open
danielbdyer wants to merge 1 commit into main from codex/add-reusable-retry-helpers-with-exponential-backoff

Conversation

@danielbdyer
Owner

Motivation

  • Transient failures (network hiccups, ENOENT races, temporary Playwright navigation errors) should be retried with bounded exponential backoff and jitter, whereas deterministic validation errors must never be retried.
  • Centralize retry policy to keep retry semantics consistent and emit retry metadata into rationale/observations for provenance and debugging.

Description

  • Added reusable retry primitives in lib/application/effect.ts: exponentialJitterSchedule, a pipeable retryWithBackoff(config), and retryWithBackoffResult(effectFactory, config) which returns retry-attempt metadata (RetryResult).
  • Wrapped translation LLM calls (lib/application/translation-provider.ts for llm-api and copilot) with retryWithBackoffResult, gated by a transient error predicate that excludes validation-error, and appended retry counts into translation rationale via withRetryMetadataRationale.
  • Wrapped agent interpreter LLM/session calls (lib/application/agent-interpreter-provider.ts) with retryWithBackoffResult, excluded validation-error from retries, and propagated retry metadata into both rationale and observation.detail.retryAttempts using addRetryDetail.
  • Added selective filesystem race retries in lib/infrastructure/fs/local-fs.ts (read/write/json/stat/list/ensure) by wrapping tryFileSystem effects with a file-race predicate (checks FileSystemError.cause for ENOENT/EEXIST) and a small backoff.
  • Added optional transient retry wrapping to Playwright screen observation (lib/infrastructure/observation/playwright-screen-observer.ts) using a predicate that excludes validation errors and inspects the cause message for navigation/timeouts.
  • Retry defaults used at call sites: translation providers baseDelayMs: 150, maxRecurs: 2; agent providers baseDelayMs: 200, maxRecurs: 2; filesystem races baseDelayMs: 40, maxRecurs: 2; Playwright observation baseDelayMs: 120, maxRecurs: 1.
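The retry behavior described in the list above can be sketched roughly as follows. This is a minimal, Promise-based illustration and not the code in lib/application/effect.ts, which is not shown here. The names `retryWithBackoffResultSketch`, `jitteredDelayMs`, `isFileRace`, the `isTransient`, `random`, and `sleep` fields, and the choice of full jitter are all assumptions made for the example. Only `baseDelayMs`, `maxRecurs`, and the `RetryResult` name come from the description.

```typescript
// Hypothetical config shape; baseDelayMs and maxRecurs mirror the PR text.
interface RetryConfig<E> {
  baseDelayMs: number;
  maxRecurs: number; // number of retries allowed after the first attempt
  isTransient: (error: E) => boolean; // return false for validation errors
  random?: () => number; // injectable for deterministic tests (assumption)
  sleep?: (ms: number) => Promise<void>; // injectable for tests (assumption)
}

// Value plus how many retries were needed, for rationale/observation metadata.
interface RetryResult<A> {
  value: A;
  retryAttempts: number;
}

// Full jitter (an assumed strategy): uniform in [0, baseDelayMs * 2^retryIndex).
function jitteredDelayMs(
  baseDelayMs: number,
  retryIndex: number,
  random: () => number = Math.random,
): number {
  return Math.floor(random() * baseDelayMs * 2 ** retryIndex);
}

async function retryWithBackoffResultSketch<A, E = unknown>(
  run: () => Promise<A>,
  config: RetryConfig<E>,
): Promise<RetryResult<A>> {
  const random = config.random ?? Math.random;
  const sleep =
    config.sleep ??
    ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  let retryAttempts = 0;
  for (;;) {
    try {
      const value = await run();
      return { value, retryAttempts };
    } catch (error) {
      // Stop when the retry budget is spent or the error is not transient.
      const exhausted = retryAttempts >= config.maxRecurs;
      if (exhausted || !config.isTransient(error as E)) throw error;
      await sleep(jitteredDelayMs(config.baseDelayMs, retryAttempts, random));
      retryAttempts += 1;
    }
  }
}

// Hypothetical file-race predicate: inspects error.cause.code for ENOENT/EEXIST.
function isFileRace(error: unknown): boolean {
  const cause = (error as { cause?: { code?: string } } | null)?.cause;
  return cause?.code === "ENOENT" || cause?.code === "EEXIST";
}
```

With the translation defaults listed above (baseDelayMs: 150, maxRecurs: 2), this sketch would make at most three attempts in total, and the returned `retryAttempts` is the count a caller could append to a rationale string or store in an observation detail.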

Testing

  • Ran npm run build (and npm run types) to verify compilation; the build failed because of pre-existing, unrelated TypeScript issues in lib/application/benchmark.ts and lib/playwright/state-topology.ts, not because of the retry changes.
  • Verified locally that retry helpers type-check and that no additional compile errors were introduced by the retry changes beyond the unrelated failures reported above.
  • No runtime test suite was changed or executed for this change; further integration testing is recommended for LLM/agent backends and concurrent FS scenarios to validate retry behavior and rationale/observation metadata emission.

Codex Task
