Skip to content

Fix network bridge silent death when transport exception occurs during broker info handshake#1865

Open
TylerMurali wants to merge 3 commits intoapache:mainfrom
tyler-technologies-oss:fix-bridge-reconnect-1864
Open

Fix network bridge silent death when transport exception occurs during broker info handshake#1865
TylerMurali wants to merge 3 commits intoapache:mainfrom
tyler-technologies-oss:fix-bridge-reconnect-1864

Conversation

@TylerMurali
Copy link
Copy Markdown
Contributor

@TylerMurali TylerMurali commented Apr 1, 2026

Summary

Fixes #1864

  • Bug: In DemandForwardingBridgeSupport.start(), the onException() handlers for both local and remote transports return early when futureBrokerInfo is not done, skipping serviceLocalException()/serviceRemoteException(). This prevents the reconnection chain (fireBridgeFailed()discoveryAgent.serviceFailed()) from firing, causing the bridge to silently die without reconnection.
  • Fix: Remove the two return statements so the exception service methods always fire. The future cancellation is preserved. This is a 2-line change.
  • Evidence: The duplexInboundLocalBroker handler in the same file unconditionally calls serviceLocalException(), confirming the intended behavior.

Production Validation

Tested in production with 15 network connectors over 3.5 months. The bug path was hit 12 times (transport exceptions during broker info handshake), and all 12 resulted in successful reconnection after the fix. Prior to the fix, these were permanent connector losses requiring broker restarts.

Test Plan

  • testBridgeReconnectsAfterHandshakeFailureReproduces the exact bug path: uses a fake server socket that accepts TCP connections but never sends BrokerInfo, so onException fires while futureBrokerInfo is NOT done. Fails without the fix, passes with it.
  • testBridgeReconnectsAfterRemoteBrokerRestart — General reconnection test: single remote broker stop/restart cycle
  • Existing tests pass: DemandForwardingBridgeTest, DemandForwardingBridgeSupportTest, NetworkReconnectTest

TylerMurali and others added 2 commits March 31, 2026 20:39
…g broker info handshake (apache#1864)

Remove early return in onException() handlers for local and remote
transports in DemandForwardingBridgeSupport. When futureBrokerInfo
was not done, the handler cancelled the future and returned without
calling serviceLocalException()/serviceRemoteException(), preventing
the reconnection chain from firing. The bridge would silently die
and never re-establish.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add testBridgeReconnectsAfterHandshakeFailure which uses a fake server
socket that accepts connections but never sends BrokerInfo, ensuring
onException fires while futureBrokerInfo is pending. This test fails
without the fix and passes with it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tion propagation

Address reviewer feedback from apache#1865:
- Remove Thread.sleep(2000) in favor of condition-based Wait.waitFor()
- Disable InactivityMonitor (maxInactivityDuration=0) to isolate the
  onException code path from secondary exception sources

Add exception-type verification using a custom BridgeFactory that tracks
calls to serviceRemoteException(). The test now asserts that the original
IOException reaches serviceRemoteException (from onException handler),
not only a TimeoutException (from the collectBrokerInfos fallback).
Without the fix, the early return in onException prevents the IOException
from being propagated — the test fails with "Exceptions received:
TimeoutException".

The test still performs the full end-to-end handshake failure simulation:
fake server (no BrokerInfo) -> socket close -> real broker starts ->
bridge reconnects -> messages flow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@TylerMurali TylerMurali requested review from jbonofre and mattrpav April 1, 2026 14:41
TylerMurali added a commit to tyler-technologies-oss/activemq that referenced this pull request Apr 1, 2026
This test belongs to the upstream PR (apache#1865) on the
fix-bridge-reconnect-1864 branch. It requires JDK 17+ and the Jakarta
namespace which are not available on the 5.18.3 release branch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cshannon cshannon self-requested a review April 2, 2026 14:41

boolean hasIOException = remoteExceptions.stream()
.anyMatch(ex -> ex instanceof IOException);
assertTrue(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still investigating this, but the first observation is the test case here might need some improvement. I ran this test without the included fix and I tried commenting out the assertion here and the tests continued and both passed and the bridges seemed to reconnect so it's not really demonstrating the issue of silently not reconnecting as described from what I can tell.

@cshannon
Copy link
Copy Markdown
Contributor

cshannon commented Apr 2, 2026

So I am not saying there is not an issue here, but I'm trying to figure out what I am missing because the reconnect logic still seems to be fired even without this fix because the TimeoutException is still being passed along to bridge failed. Maybe there is a race condition.

Here is what I am seeing when I run the test case and step through things without the fix:

  1. EOFException does happen first and is caught here and aborts as noted
  2. However, after that happens the call to collectBrokerInfos() is called here.
  3. collectBrokerInfos() triggers a timeout exception and is caught here
  4. bridgeFailedError(timeoutException) is called here which passes in that timeout exception and called bridgeFailed() on the listener
  5. discoveryServiceAgent.sericeFailed(event) is invoked here which in turn kicks off the reconnect logic

@cshannon
Copy link
Copy Markdown
Contributor

cshannon commented Apr 2, 2026

So digging in a bit more I see that when the exception happens it doesn't pass the IOException obviously, but futureRemoteBrokerInfo.cancel(true); is called which triggers the TimeoutException. That does get passed to serviceRemoteException() in collectBrokerInfos() which should trigger the reconnect.

So the part that I am confused about is your saying that returning early prevents fireBridgeFailed() from firing, but from what I can see returning early first calls cancel on the futureRemoteBrokerInfo which in turn cause the TimeoutException which is caught later and triggers fireBridgeFailed()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Network bridge silently dies without reconnection when transport exception occurs during broker info handshake

4 participants