fix(agent): reuse a pooled SGLang client across turns and retry once on pre-connect connector errors #2069
Open
EazyReal wants to merge 1 commit into
… pre-connect errors

call_sglang_generate opens a fresh aiohttp.ClientSession (and TCP connection) for every /generate turn, plus another short-lived session per /abort_request. Under concurrent multi-turn agent rollouts this churns sockets and intermittently surfaces as aiohttp.ClientConnectorError, failing the trajectory even though the request never reached SGLang.

- Register one pooled aiohttp.ClientSession per adapter app via an aiohttp cleanup_ctx (sglang_client_context), built over a shared TCPConnector (limit=0, ttl_dns_cache=300, keepalive_timeout=60) and closed on app shutdown. BaseAdapter.__init__ appends it to app.cleanup_ctx, so every consumer that runs the app through a runner gets it for free.
- Retry the /generate POST exactly once on aiohttp.ClientConnectorError, reusing the SAME rid. A connector error is raised before any request bytes are written, so the server never saw the rid and the retry cannot double-generate.
- Errors after the request may have reached the server (disconnects, timeouts, HTTP errors) are NOT retried; they still flow through the existing abort-by-rid path, now over the pooled client.

Tests pin the invariants: stable client identity across turns; retry happens exactly once and reuses the rid; mid-flight failures are never retried and abort by rid; a second consecutive connector error propagates after exactly one retry.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
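For readers unfamiliar with aiohttp's cleanup_ctx, the pooling described in the first bullet might look roughly like the sketch below. It is not the PR's code: the app key SGLANG_CLIENT_KEY, the build_app helper (standing in for BaseAdapter.__init__), and the demo function are hypothetical names used only for illustration. The connector parameters are the ones quoted in the commit message.

```python
import asyncio
from typing import AsyncIterator

import aiohttp
from aiohttp import web

# Hypothetical app key; the key the PR actually uses is not shown.
SGLANG_CLIENT_KEY = "sglang_client"


async def sglang_client_context(app: web.Application) -> AsyncIterator[None]:
    """Create one pooled ClientSession at startup and close it at shutdown."""
    connector = aiohttp.TCPConnector(
        limit=0,               # unbounded, matching the previous per-call behavior
        ttl_dns_cache=300,     # cache DNS lookups for 300 seconds
        keepalive_timeout=60,  # keep idle connections open for 60 seconds
    )
    app[SGLANG_CLIENT_KEY] = aiohttp.ClientSession(connector=connector)
    yield  # the app serves requests while this generator is suspended
    await app[SGLANG_CLIENT_KEY].close()  # also closes the owned connector


def build_app() -> web.Application:
    """Hypothetical stand-in for BaseAdapter.__init__ registering the context."""
    app = web.Application()
    app.cleanup_ctx.append(sglang_client_context)
    return app


async def demo() -> tuple[bool, bool]:
    app = build_app()
    runner = web.AppRunner(app)
    await runner.setup()  # runs startup hooks; no port is bound here
    first = app[SGLANG_CLIENT_KEY]
    second = app[SGLANG_CLIENT_KEY]
    same_client = first is second  # every turn sees the same session object
    await runner.cleanup()  # runs the code after `yield`
    return same_client, first.closed
```

Because the session is created inside the startup hook, it is bound to the running event loop, and any consumer that drives the app through a runner gets the teardown without extra wiring.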
Problem. call_sglang_generate opens a fresh aiohttp.ClientSession (and TCP connector) for every /generate call, one per agent turn, plus another short-lived session per /abort_request. Under concurrent multi-turn agent rollouts this churns sockets (accumulating TIME_WAIT connections) and intermittently surfaces as aiohttp.ClientConnectorError, which fails the whole turn, and with it the trajectory, even though the request never reached SGLang.

Before. Each turn: new session → new TCP connection → request → teardown. A transient connect-time failure (ephemeral-port/socket churn, brief router unavailability) raises ClientConnectorError with no recovery: a 200-turn rollout batch dies on a single failed connect() even though no request was ever in flight.

After. Each adapter app owns one pooled ClientSession, registered via aiohttp's cleanup_ctx in BaseAdapter.__init__ and closed on app shutdown, so turns reuse warm keep-alive connections instead of reconnecting per call. A ClientConnectorError is retried exactly once with the same rid: connector errors are raised before any request bytes are written, so the server never saw the rid and the retry cannot double-generate. Errors raised after the request may have reached the server (disconnects, timeouts, HTTP >= 400) are NOT retried; they still go through the existing abort-by-rid path, now over the pooled client.

Why this fix. The root cause is per-call connection churn, so the fix pools at the adapter-app boundary where the lifecycle hook already exists (cleanup_ctx), rather than papering over symptoms with a blanket retry. The retry is restricted to the only failure class that is provably idempotent (pre-connect), preserving the invariant that any request that may have reached SGLang is aborted by rid, never re-issued. Request body, headers, sampling params, rid semantics, and the call_sglang_generate signature are unchanged; both provider adapters (openai/anthropic) get the pooled client for free through BaseAdapter. The connector uses limit=0 to match the previous (unbounded, per-call) behavior, so concurrency stays governed by the rollout scheduler.

Tests (tests/test_agent_adapters.py, CPU-only, already in CI): … a mid-flight failure (ServerDisconnectedError) is never retried and aborts by rid;

Each test was mutation-checked: regenerating the rid on retry, bypassing the pooled client, widening the retry to aiohttp.ClientError, or removing the retry bound each fails at least one test.

🤖 Generated with Claude Code
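The retry rule can be illustrated without a network, as in the transport-agnostic sketch below. It is not the PR's implementation: PreConnectError and MidFlightError are hypothetical stand-ins for aiohttp.ClientConnectorError and errors such as aiohttp.ServerDisconnectedError, and generate_with_retry, the post/abort callables, and the uuid-based rid are assumptions. Skipping the abort when both attempts fail before connecting is also an assumption; the description only states that the second error propagates.

```python
import asyncio
import uuid
from typing import Awaitable, Callable


class PreConnectError(Exception):
    """Stand-in for aiohttp.ClientConnectorError (raised before bytes are sent)."""


class MidFlightError(Exception):
    """Stand-in for errors such as aiohttp.ServerDisconnectedError."""


PostFn = Callable[[dict], Awaitable[dict]]
AbortFn = Callable[[str], Awaitable[None]]


async def generate_with_retry(post: PostFn, abort: AbortFn, payload: dict) -> dict:
    """POST, retrying exactly once on a pre-connect failure with the same rid."""
    rid = uuid.uuid4().hex  # chosen once and reused on the retry (assumed scheme)
    body = {**payload, "rid": rid}
    try:
        try:
            return await post(body)
        except PreConnectError:
            # Nothing was written to the wire, so the server never saw this rid.
            return await post(body)  # a second PreConnectError propagates
    except PreConnectError:
        raise  # assumed: the server was never reached, so there is nothing to abort
    except Exception:
        await abort(rid)  # the request may be in flight: abort by rid, never retry
        raise


async def demo() -> tuple[list, list]:
    seen_rids: list = []
    aborted: list = []
    failures = [PreConnectError("connect failed")]  # fail the first attempt only

    async def fake_post(body: dict) -> dict:
        seen_rids.append(body["rid"])
        if failures:
            raise failures.pop(0)
        return {"text": "ok"}

    async def fake_abort(rid: str) -> None:
        aborted.append(rid)

    await generate_with_retry(fake_post, fake_abort, {"text": "hi"})
    return seen_rids, aborted
```

Running demo() records two POST attempts carrying an identical rid and no abort calls. Widening the inner except clause to a broader error class would re-issue requests that may already have reached the server, which is the mutation the description says the tests reject.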