Skip to content

Add support for server selection's deprioritized servers to all topologies.#1860

Open
vbabanin wants to merge 47 commits intomongodb:mainfrom
vbabanin:JAVA-6021
Open

Add support for server selection's deprioritized servers to all topologies.#1860
vbabanin wants to merge 47 commits intomongodb:mainfrom
vbabanin:JAVA-6021

Conversation

@vbabanin
Copy link
Member

@vbabanin vbabanin commented Jan 12, 2026

Relevant specification changes:

JAVA-6021,
JAVA-6074,
JAVA-6105,
JAVA-6114


protected void updateDescription(final ClusterDescription newDescription) {
@VisibleForTesting(otherwise = PROTECTED)
public void updateDescription(final ClusterDescription newDescription) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems we can avoid making this method public by overriding it in the subclass. We won't even need to make it public in the subclass, because the test code using the method will be in the same package as the subclass. However, this requires the subclass to not be anonymous (if we could use var, the proposed approach would have worked even for an anonymous subclass).

@vbabanin vbabanin self-assigned this Jan 15, 2026
Comment on lines +120 to +125
} catch (MongoTimeoutException mongoTimeoutException) {
List<ServerDescription> inLatencyWindowServers = buildServerDescriptions(definition.getArray("in_latency_window"));
assertTrue("Expected emtpy but was " + inLatencyWindowServers.size() + " " + definition.toJson(
JsonWriterSettings.builder()
.indent(true).build()), inLatencyWindowServers.isEmpty());
return;
Copy link
Member Author

@vbabanin vbabanin Jan 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we now perform the full server-selection path (via BaseCluster), the behavior of "no servers selected: is observed differently than before.

Previously, these tests invoked the Selector directly, got an empty list, and asserted on that result. With BaseCluster, server selection runs the normal selection loop: it will retry until either a server becomes selectable or the selection timeout elapses.

In this setup, a server selection timeout is the expected signal that no servers are available/eligible for selection. The timeout is set to 200ms to keep the test fast, while giving enough headroom to avoid any flakiness.

List<ServerDescription> latencyBasedSelectedServers = latencyBasedServerSelector.select(clusterDescription);
assertServers(latencyBasedSelectedServers, inLatencyWindowServers);
assertNotNull(serverTuple);
assertTrue(inLatencyWindowServers.stream().anyMatch(s -> s.getAddress().equals(serverTuple.getServerDescription().getAddress())));
Copy link
Member Author

@vbabanin vbabanin Jan 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There’s an ambiguity in the server selection spec docs about what drivers must assert for in_latency_window:

tests/README.md says to verify the set of servers in in_latency_window.

"Drivers implementing server selection MUST test that their implementation correctly returns the set of servers in in_latency_window."

server-selection-tests.mdsays to verify the selection returns one of the servers in in_latency_window.

"Drivers implementing server selection MUST test that their implementations correctly return one of the servers in in_latency_window."

This test follows server-selection-tests.md, so asserting that the selected server is within the expected in_latency_window set is consistent with the requirements in the spec.

P.S: Both files were created in this single commit with contradicting requirements:

Feb 6, 2015 - Commit 6b63123a - "Add Server Selection Spec"

 File 1: server-selection-tests.rst
 Drivers implementing server selection MUST test that their implementations
 correctly return one of the servers in ``in_latency_window``.

 File 2: tests/README.rst
 Drivers implementing server selection MUST test that their implementations
 correctly return the set of servers in ``in_latency_window``.

# Conflicts:
#	driver-core/src/test/resources/specifications
#	driver-core/src/test/unit/com/mongodb/connection/ServerSelectionSelectionTest.java
Copy link
Member

@stIncMale stIncMale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last reviewed commit is f20b482.

The files I haven't yet reviewed:

  • ServerDeprioritizationTest
  • ServerSelectionSelectionTest

@vbabanin I am proposing to resolve all the outstanding comments before proceeding with the changes required by DRIVERS-3404 (for details, see the description of this PR).

new ReadConcernAwareNoOpSessionContext(ReadConcern.DEFAULT),
new TimeoutContext(TIMEOUT_SETTINGS),
getServerApi());
public static OperationContext getOperationContext() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for noticing the issue with OperationContext being mutable (it was mutable before this PR), and fixing it.

The methods returning OperationContext in this class are a mess:

  • getOperationContext()
  • createOperationContext(TimeoutSettings timeoutSettings)
  • createNewOperationContext(TimeoutSettings timeoutSettings)
  • getOperationContext(ReadPreference readPreference)
  1. Let's name the methods consistently. I think, all of the aforementioned methods should use the "create" prefix.
    1.1. Let's do this automatically via IDE in a separate commit, and express in the commit message that the commit was done via automatic refactoring, so that reviewers know not to review it.
  2. Let's remove the weirdly named and trivial createNewOperationContext method, and inline it where it is used (fortunately, it is used only in two places in ClusterFixture, and nowhere else).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in:

  1. 103c4b0
  2. 021c262

def source = getBinding().getReadConnectionSource(OPERATION_CONTEXT)
def connection = source.getConnection(OPERATION_CONTEXT)
def source = getBinding().getReadConnectionSource(getOperationContext())
def connection = source.getConnection(getOperationContext())
Copy link
Member

@stIncMale stIncMale Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, the OperationContext used here should be the same. There may be many more changed places in this PR where this is the case, and given how many changes there are, it may not be too easy to identify them all.

For brevity, I'll be marking such places with just the "same context" comment. For now I am not sure if it is even practically achievable to identify them all.

The problem is exacerbated by ClusterFixture creating a new OperationContext each time it needs one, which means that the ClusterFixture.getBinding never uses the same context as the one used by the test calling ClusterFixture.getBinding.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in: 4310894
Identified one more place: c736e6a

def source = getBinding().getReadConnectionSource(OPERATION_CONTEXT)
def connection = source.getConnection(OPERATION_CONTEXT)
def source = getBinding().getReadConnectionSource(getOperationContext())
def connection = source.getConnection(getOperationContext())
Copy link
Member

@stIncMale stIncMale Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in: 4310894

Comment on lines +85 to +101
@@ -98,7 +98,7 @@ protected void applyApplicationError(final BsonDocument applicationError) {
switch (type) {
case "command":
exception = getCommandFailureException(applicationError.getDocument("response"), serverAddress,
OPERATION_CONTEXT.getTimeoutContext());
getOperationContext().getTimeoutContext());
Copy link
Member

@stIncMale stIncMale Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in: 4310894

Comment on lines +157 to +159
Timeout serverSelectionTimeout = getOperationContext().getTimeoutContext().computeServerSelectionTimeout();
DefaultServer server = (DefaultServer) getCluster()
.getServersSnapshot(serverSelectionTimeout, OPERATION_CONTEXT.getTimeoutContext())
.getServersSnapshot(serverSelectionTimeout, getOperationContext().getTimeoutContext())
Copy link
Member

@stIncMale stIncMale Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in: 4310894

final ClusterableServerFactory serverFactory,
final ClientMetadata clientMetadata) {
@VisibleForTesting(otherwise = PRIVATE)
protected BaseCluster(final ClusterId clusterId,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value of otherwise is incorrect here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in: 08b3466

getRaceConditionPreFilteringSelector(serversSnapshot),
serverSelector,
serverDeprioritization.getServerSelector(),
serverDeprioritization.applyDeprioritization(serverSelector),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's rename the method to apply. The name of the type (ServerDeprioritization) of the object whose instance method we call tells what is being applied. There is no need to duplicate that in the names of methods.

Let's do this automatically via IDE in a separate commit, and express in the commit message that the commit was done via automatic refactoring, so that reviewers know not to review it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in: 98caa34

Comment on lines +286 to 293
if (serverDescriptions.size() == 1 || deprioritized.isEmpty()) {
return wrappedSelector.select(clusterDescription);
}

List<ServerDescription> nonDeprioritizedServerDescriptions = serverDescriptions
.stream()
.filter(serverDescription -> !deprioritized.contains(serverDescription.getAddress()))
.collect(toList());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave a TODO-JAVA-XXXX comment linking this code (the if optimization as well as the use of Stream) to the existing performance ticket about server selection. We should also mention in the description of that ticket that there are TODO-JAVA-XXXX comments that needs to be addressed.

The open questions here are:

  • whether the if optimization is worth itl
  • whether we should use a loop instead of using Stream.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new ClusterDescription(clusterDescription.getConnectionMode(), clusterDescription.getType(),
nonDeprioritizedServerDescriptions,
clusterDescription.getClusterSettings(),
clusterDescription.getServerSettings()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use the

public ClusterDescription(final ClusterConnectionMode connectionMode, final ClusterType type,
                          @Nullable final MongoException srvResolutionException,
                          final List<ServerDescription> serverDescriptions,
                          @Nullable final ClusterSettings clusterSettings,
                          @Nullable final ServerSettings serverSettings) {

constructor.

@stIncMale
Copy link
Member

I thought I already proposed this, but looks like I didn't: let's not implement any new spec changes in the PR (there were spec changes) until we resolve all outstanding discussions.

Copy link
Member

@stIncMale stIncMale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last reviewed commit is f20b482.

  • I reviewed ServerDeprioritizationTest.java.
  • ServerSelectionSelectionTest.java is yet to be reviewed.

of(Named.of(generateArgumentName(ALL_SERVERS), ALL_SERVERS))
);
}
@ParameterizedTest
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An empty line is missing after the selectNoneDeprioritized method and the @ParameterizedTest annotation on the next method.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in: 2bc1b2e

Comment on lines +185 to +186
ServerSelector selector = createAssertingSelector(ALL_SERVERS, singletonList(SERVER_A));
assertEquals(singletonList(SERVER_A), serverDeprioritization.applyDeprioritization(selector).select(SHARDED_CLUSTER));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's return ALL_SERVERS from selector in onAttemptFailureIgnoresIfPoolClearedException. This way, we will not only check that the input to selector has all servers (that is, serverDeprioritization does not filter anything out at that point), but also that the serverDeprioritization does not have any unexpected logic of filtering the servers selected by the wrapped selector. If it had such an unexpected logic, it would have been more likely to leave SERVER_A be when it is the only server selected by the wrapped selector, and less likely to leave SERVER_A be when it is one of the servers selected by the wrapped selector. Thus, implementing selector such that it selects more than just SERVER_A, makes the test stronger.

Suggested change
ServerSelector selector = createAssertingSelector(ALL_SERVERS, singletonList(SERVER_A));
assertEquals(singletonList(SERVER_A), serverDeprioritization.applyDeprioritization(selector).select(SHARDED_CLUSTER));
ServerSelector selector = createAssertingSelector(ALL_SERVERS, ALL_SERVERS);
assertEquals(ALL_SERVERS, serverDeprioritization.applyDeprioritization(selector).select(SHARDED_CLUSTER));

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in: 8a8bf0a.

Copy link
Member

@stIncMale stIncMale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last reviewed commit is f20b482.

Everything has been reviewed.

@@ -59,6 +80,14 @@ public class ServerSelectionSelectionTest {
private final ClusterDescription clusterDescription;
private final long heartbeatFrequencyMS;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[this comment is not about the current PR]

Let's remove the heartbeatFrequencyMS field and make it a local variable, as Idea suggests?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in: cf5c484

try {
serverSelector = getServerSelector();
selectedServers = serverSelector.select(clusterDescription);
assumeTrue(!description.endsWith("/max-staleness/tests/ReplicaSetWithPrimary/MaxStalenessWithModePrimary.json"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this line changed? It seemed fine before. It is still OK now, but it now has a negation that is not necessary.

Copy link
Member Author

@vbabanin vbabanin Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this wasn’t intentional. The assumeTrue(!...) form is the older variant (it exists on 5.5.x/5.6.x and pre-cleanup main). We later switched main to assumeFalse(...) in 41ca107. This branch was created before that change, and I likely reintroduced the old form during a merge of main into this PR.

I have updated it to assumeFalse(...) here to match main and remove the negation. Commit: 32609f6

import static org.junit.Assert.fail;
import static org.junit.Assume.assumeFalse;
import static org.junit.Assume.assumeTrue;
import static org.junit.jupiter.api.Assertions.assertNotNull;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's either keep using JUnit 4, or rewrite this class to use JUnit 5.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in: 7d3d862

import static org.junit.jupiter.api.Assertions.assertNotNull;
import static org.mockito.Mockito.when;

// See https://github.com/mongodb/specifications/tree/master/source/server-selection/tests
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[this comment is not about the current PR]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in: eff9e23

private final ClusterDescription clusterDescription;
private final long heartbeatFrequencyMS;
private final boolean error;
private final List<ServerAddress> deprioritizedServerAddresses;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not seem that we need this field. We can call extractDeprioritizedServerAddresses from createOperationContext, which is simpler than using the deprioritizedServerAddresses instance field to pass the result of extractDeprioritizedServerAddresses to createOperationContext.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in: 78d685d

private OperationContext createOperationContext() {
OperationContext operationContext =
OperationContext.simpleOperationContext(
new TimeoutContext(TIMEOUT_SETTINGS.withServerSelectionTimeoutMS(SERVER_SELECTION_TIMEOUT_MS)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use 0 ("immediate" timeout) and remove the SERVER_SELECTION_TIMEOUT_MS constant:

Suggested change
new TimeoutContext(TIMEOUT_SETTINGS.withServerSelectionTimeoutMS(SERVER_SELECTION_TIMEOUT_MS)));
new TimeoutContext(TIMEOUT_SETTINGS.withServerSelectionTimeoutMS(0)));

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in: b16bfb0

OperationContext.ServerDeprioritization serverDeprioritization = operationContext.getServerDeprioritization();
for (ServerAddress address : deprioritizedServerAddresses) {
serverDeprioritization.updateCandidate(address);
serverDeprioritization.onAttemptFailure(new MongoConfigurationException("test"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While not being important, MongoConfigurationException is quite a weird exception here: why would we deprioritize a server due to a client configuration exception? Of course, in production it does not matter, because such an exception does not trigger a retry. But for the sake of making this less surprising, let's use something else, like MongoException?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in: a7fc3fc

Comment on lines +320 to +322
private BaseCluster createTestCluster(final ClusterDescription clusterDescription, final Cluster.ServersSnapshot serversSnapshot) {
return new TestCluster(clusterDescription, serversSnapshot);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of this method?

P.S. If it is really needed, it should be static.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in: 1d15a73

@Override
public void onChange(final ServerDescriptionChangedEvent event) {
// We do not expect this to be called during server selection.
Assertions.fail();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Does this comment communicate something that is not already obvious? It seems the assertion makes it clear that we don't expect the method to be called.
  2. We should call com.mongodb.assertions.Assertions.fail here instead of a JUnit assertion method: we do not check that the onChange is not called, we simply do not expect the onChange method to be called on this test double, so we don't care to implement it, and that's why we simply fail (it's the same behavior we have in MongoMockito, and we use com.mongodb.assertions.Assertions.fail there, see InsufficientStubbingDetector.answer).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in: 3e2c0cd

@vbabanin vbabanin requested a review from stIncMale March 5, 2026 04:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants