feat(#4560): default transient-network retry in shared HTTP transport #230

Merged
Skobeltsyn merged 1 commit into main from feat/4560-transport-retry on Jun 17, 2026

@Skobeltsyn

Summary

Transient network failures now retry by default across all HTTP providers. This addresses the live-LLM review finding (the OpenRouter blip) and follows the principle that a dropped connection or a 503 is the textbook retryable case. The shared non-streaming transport, HttpModelClientSupport.sendBounded (used by Claude, OpenAI + DeepSeek/Kimi/OpenRouter/Perplexity, Gemini, and Ollama), now retries:

  • connection-level IOException (reset, refused, no-route, unexpected EOF), and
  • transient HTTP statuses (408/429/500/502/503/504).

Each request gets up to 3 attempts with exponential backoff between them (250ms, then 500ms), which matches the OpenAI SDK default. Previously only Ollama retried.
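
The policy above can be sketched roughly as follows. This is a minimal illustration, not the actual body of `sendBounded`: `Reply`, `sendWithRetry`, `backoffMillis`, and the constants are hypothetical names, and returning the last response when a transient status persists is an assumption drawn from the "final-body-preserved" test name. The sketch deliberately leaves out the `HttpTimeoutException` exclusion described under "Two deliberate exclusions".

```kotlin
import java.io.IOException

// Hypothetical names throughout; a sketch of the policy, not the project's code.
val TRANSIENT_STATUSES = setOf(408, 429, 500, 502, 503, 504)
val MAX_ATTEMPTS = 3
val BASE_DELAY_MS = 250L

/** Delay after failed attempt [attempt] (1-based): 250ms, then 500ms. */
fun backoffMillis(attempt: Int): Long = BASE_DELAY_MS shl (attempt - 1)

/** Minimal stand-in for an HTTP response. */
data class Reply(val status: Int, val body: String)

fun sendWithRetry(
    sleep: (Long) -> Unit = { Thread.sleep(it) }, // injectable so tests need not wait
    send: () -> Reply,
): Reply {
    var lastReply: Reply? = null
    var lastError: IOException? = null
    for (attempt in 1..MAX_ATTEMPTS) {
        try {
            val reply = send()
            // Success or a non-transient status (e.g. 400, 404) is returned at once.
            if (reply.status !in TRANSIENT_STATUSES) return reply
            lastReply = reply
            lastError = null
        } catch (e: IOException) {
            // Connection-level failure: remember it and try again.
            lastError = e
            lastReply = null
        }
        // Back off only between attempts, never after the final one.
        if (attempt < MAX_ATTEMPTS) sleep(backoffMillis(attempt))
    }
    // Attempts exhausted: surface whatever the final attempt produced.
    lastError?.let { throw it }
    return lastReply ?: error("unreachable when MAX_ATTEMPTS >= 1")
}

fun main() {
    val replies = ArrayDeque(listOf(Reply(503, "busy"), Reply(502, "bad gateway"), Reply(200, "ok")))
    val delays = mutableListOf<Long>()
    val result = sendWithRetry(sleep = { delays.add(it) }) { replies.removeFirst() }
    println(result) // Reply(status=200, body=ok)
    println(delays) // [250, 500]
}
```

Passing `sleep` as a parameter is one way to keep such a loop testable without real delays; whether the project does the same is not stated in this PR.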

Two deliberate exclusions

  • HttpTimeoutException is not retried: the per-request timeout is the caller's total budget, and retrying would silently triple it (this keeps OllamaClientTimeoutTest's ~250ms elapsed bound intact).
  • Original exception type preserved on exhaustion (rethrown, not wrapped) so the agent-level onLLMError can still match e is ConnectException.
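
Both exclusions come down to how the catch clauses are ordered and what is thrown at the end. The sketch below is one way this could look, assuming a hypothetical helper named `retryConnectionErrors` (backoff omitted for brevity); it is not the project's `sendBounded`. Since `HttpTimeoutException` is a subclass of `IOException`, it has to be caught first to stay out of the retry path.

```kotlin
import java.io.IOException
import java.net.ConnectException
import java.net.http.HttpTimeoutException

// Hypothetical helper illustrating the two exclusions; backoff is omitted.
fun <T> retryConnectionErrors(maxAttempts: Int = 3, call: () -> T): T {
    var last: IOException? = null
    for (attempt in 1..maxAttempts) {
        try {
            return call()
        } catch (e: HttpTimeoutException) {
            // Exclusion 1: the timeout is the caller's total budget, so fail at once.
            throw e
        } catch (e: IOException) {
            // Other connection-level failures are retried.
            last = e
        }
    }
    // Exclusion 2: rethrow the same instance, unwrapped, so its type survives.
    throw last ?: IllegalStateException("maxAttempts must be >= 1")
}

fun main() {
    var connectCalls = 0
    val refused = runCatching {
        retryConnectionErrors<String> { connectCalls++; throw ConnectException("refused") }
    }
    println(refused.exceptionOrNull() is ConnectException) // true
    println(connectCalls)                                  // 3

    var timeoutCalls = 0
    val timedOut = runCatching {
        retryConnectionErrors<String> { timeoutCalls++; throw HttpTimeoutException("timed out") }
    }
    println(timedOut.exceptionOrNull() is HttpTimeoutException) // true
    println(timeoutCalls)                                       // 1
}
```

Had the final failure been wrapped in a new exception type, a check such as `e is ConnectException` in a higher-level handler would stop matching, which is the case the second exclusion avoids.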

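The retry sits below onLLMError/LlmErrorDecision: the transport rides out blips, and the handler sees only what survives. Streaming is not retried, because re-issuing a request mid-stream would duplicate tokens.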

Gates

6 tests (scripted fake HttpClient: conn-exception retry, transient-status retry, exhaustion rethrows original, no-retry-on-timeout, no-retry-on-4xx, final-body-preserved). Full ./gradlew build green. docs/error-recovery.md + CHANGELOG updated. Closes #4560.
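
For readers unfamiliar with the "scripted fake" pattern, here is a rough sketch of the idea. The PR's tests script a fake `HttpClient`; this simplified version scripts a plain `send()` function instead, and `Step`, `Respond`, `Fail`, and `ScriptedTransport` are hypothetical names, not taken from the test suite.

```kotlin
import java.io.IOException
import java.net.ConnectException

// Hypothetical test double: each call to send() plays the next scripted step.
sealed interface Step
data class Respond(val status: Int, val body: String = "") : Step
data class Fail(val error: IOException) : Step

class ScriptedTransport(vararg steps: Step) {
    private val script = ArrayDeque(steps.toList())

    /** Number of send() calls so far, so a test can assert the attempt count. */
    var calls = 0
        private set

    fun send(): Respond {
        calls++
        return when (val step = script.removeFirst()) {
            is Respond -> step
            is Fail -> throw step.error
        }
    }
}

fun main() {
    // Script: one refused connection, then a successful response.
    val transport = ScriptedTransport(Fail(ConnectException("refused")), Respond(200, "ok"))
    val first = runCatching { transport.send() }
    println(first.exceptionOrNull() is ConnectException) // true
    println(transport.send().body)                       // ok
    println(transport.calls)                             // 2
}
```

A script of this kind makes each listed case deterministic: the test fixes the sequence of failures and responses up front, then asserts on the result and on how many attempts were made.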

CodeQL java-kotlin expected-red on Kotlin 2.4 (codeql#21938); build is the gate.

🤖 Generated with Claude Code

… (sendBounded)

Network blips deserve a retry policy by default, not opt-in. The shared non-streaming
transport (used by Claude, OpenAI + DeepSeek/Kimi/OpenRouter/Perplexity, Gemini, Ollama)
now retries transient failures:
- connection-level IOException (reset, refused, no-route, unexpected EOF), and
- transient HTTP statuses (408/429/500/502/503/504).
3 attempts, exponential backoff (250ms->500ms). Matches OpenAI-SDK default behavior.

Two deliberate exclusions:
- HttpTimeoutException is NOT retried — the per-request timeout is the caller's TOTAL
  budget; retrying would silently multiply it (kept OllamaClientTimeoutTest's ~250ms
  elapsed bound intact).
- the ORIGINAL exception type is preserved on exhaustion (rethrown, not wrapped) so
  onLLMError can still match `e is ConnectException`, etc.

Sits below onLLMError/LlmErrorDecision (transport rides out blips; handler sees what
survives). Streaming not retried (mid-stream re-issue would duplicate tokens). 6 tests
(scripted fake HttpClient). docs/error-recovery.md + CHANGELOG updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Skobeltsyn merged commit aad83e2 into main on Jun 17, 2026
3 of 4 checks passed