Skip to content

Fix ConnectionErrorRetryTimeout to act as a hard deadline during connection attempts#711

Closed
rhysparry wants to merge 2 commits into
mainfrom
rhys/eft-3214/fix-connection-error-retry-timeout-hard-deadline
Closed

Fix ConnectionErrorRetryTimeout to act as a hard deadline during connection attempts#711
rhysparry wants to merge 2 commits into
mainfrom
rhys/eft-3214/fix-connection-error-retry-timeout-hard-deadline

Conversation

@rhysparry
Copy link
Copy Markdown
Contributor

Summary

  • ConnectionErrorRetryTimeout was only checked at the start of each retry loop iteration in SecureListeningClient, so a single blocked connection attempt (slow TCP connect or frozen SSL handshake) could exceed the deadline by up to one full per-attempt timeout before the loop could exit.
  • Adds a CancellationTokenSource armed at ConnectionErrorRetryTimeout whose token is passed into AcquireConnectionAsync, propagating through both ConnectWithTimeoutAsync and AuthenticateAsClientEnforcingTimeout. This makes the deadline a true hard ceiling on each individual attempt.
  • Deadline-triggered OperationCanceledException is converted to TimeoutException in the catch block so HandleError surfaces it as HalibutClientException rather than re-throwing as OperationCanceledException.
  • Includes a regression test using a paused port forwarder (TCP connects immediately but data is frozen, causing SSL handshake to block until stream timeout). With ConnectionErrorRetryTimeout=3s and per-attempt stream timeout of 8s, the test verifies the call fails in under 6s.

Implications for existing users

Production defaults (ConnectionErrorRetryTimeout=5min, TcpClientConnectTimeout=60s, TcpClientTimeout=10min): no practical change. Retries complete (or exhaust RetryCountLimit) long before the 5-minute deadline fires.

Short ConnectionErrorRetryTimeout with long per-attempt timeouts: behaviour visibly changes. Previously a single stuck TCP connect or SSL handshake could outlast the deadline by up to TcpClientConnectTimeout or TcpClientTimeout.receive. Now the attempt is interrupted at the deadline. This is the intended semantics of the setting.

Exception type: the deadline case previously accidentally produced OperationCanceledException (via HandleError's switch case). It now produces HalibutClientException, which is the correct type for RPC timeout errors in Halibut.

Background TCP socket: when the deadline fires inside ConnectWithTimeoutAsync, the underlying ConnectAsync task is orphaned and the socket is force-closed via DisposeClient() — the same behaviour as when TcpClientConnectTimeout fires normally.

net48 SSL handshake: AuthenticateAsClientAsync on .NET Framework ignores cancellation tokens (noted in TcpConnectionFactory.cs with a // TODO: ASYNC ME UP! comment). The deadline CTS can still interrupt the TCP connect phase, but has no effect inside the SSL handshake on net48; the stream's ReceiveTimeout still controls that phase.

Test plan

  • New test ConnectionErrorRetryTimeout_IsAHardDeadlineEvenWhenIndividualAttemptsBlockLonger confirmed failing before fix (~10s, wrong exception type) and passing after fix (~3.7s, HalibutClientException)
  • All 5 ListeningConnectRetryFixture tests pass (including existing timeout and retry-count tests)

🤖 Generated with Claude Code

…ection attempts

Previously the timeout was only checked at the start of each retry loop
iteration, so a blocked SSL handshake could exceed it by up to one full
per-attempt timeout. Now a linked CancellationToken fires at the deadline,
interrupting AcquireConnectionAsync mid-attempt. Deadline cancellations are
surfaced as HalibutClientException rather than OperationCanceledException.

Includes a regression test using a paused port forwarder to demonstrate the
before/after: the overall wait is now bounded by ConnectionErrorRetryTimeout
(~3s) rather than the per-attempt stream timeout (~8s).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@rhysparry rhysparry requested a review from a team as a code owner April 29, 2026 05:09
… deadline

AuthenticateAsClientAsync on .NET 4.8 does not accept a CancellationToken,
so the ConnectionErrorRetryTimeout deadline had no effect during the TLS
handshake — the attempt ran until the socket ReceiveTimeout instead.

Register a callback on the cancellation token that closes the underlying
TcpClient, which unblocks AuthenticateAsClientAsync immediately. The
resulting exception is re-thrown as OperationCanceledException so the
existing handling in SecureListeningClient routes it correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@rhysparry
Copy link
Copy Markdown
Contributor Author

We want to avoid this behavioral change.

@rhysparry rhysparry closed this Apr 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant