noncebalancer: use endpointsharding, ignore ready status by jsha · Pull Request #8679 · letsencrypt/boulder

jsha · 2026-03-14T00:06:34Z

The old noncebalancer only saw READY SubConns, which was a problem during the brief periods when a SubConn was reconnecting (for instance due to a GOAWAY from the server), since nonce redemption requests are not fungible between backends. Unfortunately, READY SubConns are all that the balancer interface provides. And we can't get that interface to pass non-READY SubConns to our picker without reimplementing or copying all its SubConn management logic.

Luckily, grpc provides the endpointsharding balancer implementation that does exactly what we want. It maintains a collection of child balancers each owning a single endpoint (note: for our setup an endpoint is equivalent to a single address, though it can be one-to-many). It also lets us query the state of each child, including the endpoint it's responsible for.

This allows us to construct a picker that is aware of all available backends, even those that aren't currently READY. That, in turn, prevents us from temporarily serving errors while a given nonce redemption backend is reconnecting.
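A minimal sketch of the idea, using only the standard library so it runs without grpc. Every type and name here (`childPicker`, `noncePicker`, `lastOctetPrefix`, the two error values) is a hypothetical stand-in, not the code in this PR. It only illustrates why indexing non-READY children matters: a prefix whose backend is reconnecting is still recognized, so it is not confused with an unknown prefix.

```go
package main

import (
	"errors"
	"fmt"
)

// connState is a hypothetical stand-in for gRPC's connectivity.State.
type connState int

const (
	stateReady connState = iota
	stateConnecting
)

// Stand-ins for the two errors this sketch needs to tell apart.
var (
	errNoBackendsMatchPrefix = errors.New("no backends match nonce prefix")
	errNoSubConnAvailable    = errors.New("no SubConn available yet")
)

// childPicker models the picker of one single-endpoint child balancer.
type childPicker struct {
	addr  string
	state connState
}

// pick returns the backend address when the child is READY; otherwise it
// returns the stand-in for "not connected yet".
func (c childPicker) pick() (string, error) {
	if c.state != stateReady {
		return "", errNoSubConnAvailable
	}
	return c.addr, nil
}

// noncePicker indexes every known child by nonce prefix, READY or not.
type noncePicker struct {
	prefixToPicker map[string]childPicker
}

// newNoncePicker builds the index. prefixFor is a hypothetical function that
// derives a nonce prefix from a backend address.
func newNoncePicker(children []childPicker, prefixFor func(string) string) noncePicker {
	m := make(map[string]childPicker, len(children))
	for _, c := range children {
		m[prefixFor(c.addr)] = c
	}
	return noncePicker{prefixToPicker: m}
}

// pick delegates to the child that owns the prefix, if one is known.
func (p noncePicker) pick(prefix string) (string, error) {
	child, ok := p.prefixToPicker[prefix]
	if !ok {
		return "", errNoBackendsMatchPrefix
	}
	return child.pick()
}

// lastOctetPrefix is a toy prefix function used only for this demo.
func lastOctetPrefix(addr string) string { return addr[len(addr)-1:] }

func main() {
	p := newNoncePicker([]childPicker{
		{addr: "10.0.0.1", state: stateReady},
		{addr: "10.0.0.2", state: stateConnecting},
	}, lastOctetPrefix)
	for _, prefix := range []string{"1", "2", "9"} {
		addr, err := p.pick(prefix)
		fmt.Printf("prefix %s: addr=%q err=%v\n", prefix, addr, err)
	}
}
```

In this model, a picker built only from READY children would have no entry for prefix `2` and would report it as unknown, which is the failure mode described above.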

To see another example of endpointsharding in use, see the customroundrobin implementation.

For more context on how endpointsharding came to be implemented, see gRFC A61: IPv4 and IPv6 Dualstack Backend Support.

If you're curious how endpointsharding passes around the information about non-READY SubConns, it uses a type assertion from a balancer.Picker to its internal type.
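The type-assertion pattern can be sketched roughly like this. This is a standalone model with hypothetical names (`picker`, `shardingPicker`, `childStatesFromPicker`), not the grpc-go source: a helper accepts the public interface, asserts it back to the concrete type that carries the extra per-child state, and returns nil for any other picker.

```go
package main

import (
	"errors"
	"fmt"
)

// picker is a hypothetical stand-in for the public picker interface.
type picker interface {
	pick() (string, error)
}

// childState is a hypothetical per-child record carried by the picker.
type childState struct {
	endpoint string
	ready    bool
}

// shardingPicker models an aggregating balancer's internal picker type,
// which holds the state of every child, READY or not.
type shardingPicker struct {
	children []childState
}

// pick returns the first READY child, just so the type satisfies picker.
func (p *shardingPicker) pick() (string, error) {
	for _, c := range p.children {
		if c.ready {
			return c.endpoint, nil
		}
	}
	return "", errors.New("no ready children")
}

// otherPicker is an unrelated picker implementation with no child states.
type otherPicker struct{}

func (otherPicker) pick() (string, error) { return "elsewhere", nil }

// childStatesFromPicker recovers the child states through a type assertion.
// Pickers of any other concrete type yield nil.
func childStatesFromPicker(p picker) []childState {
	sp, ok := p.(*shardingPicker)
	if !ok {
		return nil
	}
	return sp.children
}

func main() {
	var p picker = &shardingPicker{children: []childState{
		{endpoint: "10.0.0.1:9101", ready: true},
		{endpoint: "10.0.0.2:9101", ready: false},
	}}
	for _, c := range childStatesFromPicker(p) {
		fmt.Printf("%s ready=%t\n", c.endpoint, c.ready)
	}
	fmt.Println(len(childStatesFromPicker(otherPicker{})))
}
```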

Alternative to #8672. Fixes #8662.

This edits noncebalancer.go in place for ease of diffing, and also copies the original grpc/noncebalancer (with no edits) to grpc/noncebalancerv1. But don't take my word for it:

diff <(git show origin/main:grpc/noncebalancer/noncebalancer.go) grpc/noncebalancerv1/noncebalancer.go
diff <(git show origin/main:grpc/noncebalancer/noncebalancer_test.go) grpc/noncebalancerv1/noncebalancer_test.go

The old noncebalancer only saw READY SubConns, which was a problem during the brief periods when a SubConn needed to reconnect (for instance due to a GOAWAY from the server). Unfortunately, that's all the balancer interface provides. And we can't get it to pass non-READY SubConns to our picker without reimplementing or copying all its SubConn management logic.

Luckily, grpc provides the [`endpointsharding`] balancer implementation that does exactly what we want. It maintains a collection of child balancers each owning a single endpoint (note: for our purposes an endpoint is equivalent to a single address, though it can be one-to-many). It also lets us query the [state] of each child, including the endpoint it's responsible for.

This allows us to construct a picker that is aware of all available backends, even those that aren't currently READY. That, in turn, prevents us from temporarily serving errors while a given nonce redemption backend reconnects.

To see an example of `endpointsharding` in use, see the [`customroundrobin`] implementation.

For more context on how `endpointsharding` came to be implemented, see [gRFC A61: IPv4 and IPv6 Dualstack Backend Support][a61].

[`endpointsharding`]: https://pkg.go.dev/google.golang.org/grpc/balancer/endpointsharding
[state]: https://pkg.go.dev/google.golang.org/grpc/balancer/endpointsharding#ChildState
[a61]: https://github.com/grpc/proposal/blob/master/A61-IPv4-IPv6-dualstack-backends.md
[`customroundrobin`]: https://github.com/grpc/grpc-go/blob/99f36d4a0c28bc967a8d3fe23ebc2a264b322070/examples/features/customloadbalancer/client/customroundrobin/customroundrobin.go

jsha · 2026-03-17T17:11:20Z

Back in draft because I'm currently implementing the config-based switching between implementations.

Set maxConnectionAge to 1s, and make nonce_test.go collect 300 nonces, then redeem them one at a time, separated by 10ms. This creates a high likelihood of a redemption request occurring during a reconnect.
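The shape of that test can be sketched as follows. This is not the code in nonce_test.go: `collectThenRedeem`, its `get` and `redeem` callbacks, and `errFakeReconnect` are hypothetical names, and the demo in `main` uses small numbers so it finishes quickly. The point is only the two-phase structure: gather all nonces first, then redeem them spaced out in time so that some redemptions are likely to overlap a reconnect.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errFakeReconnect simulates a redemption that failed during a reconnect.
var errFakeReconnect = errors.New("backend reconnecting")

// collectThenRedeem fetches n nonces up front, then redeems them one at a
// time with gap between attempts. It returns how many redemptions failed.
func collectThenRedeem(
	n int,
	gap time.Duration,
	get func() (string, error),
	redeem func(string) error,
) (int, error) {
	nonces := make([]string, 0, n)
	for i := 0; i < n; i++ {
		nonce, err := get()
		if err != nil {
			return 0, fmt.Errorf("getting nonce %d: %w", i, err)
		}
		nonces = append(nonces, nonce)
	}

	failures := 0
	for i, nonce := range nonces {
		if i > 0 {
			time.Sleep(gap) // spread redemptions across connection lifetimes
		}
		if err := redeem(nonce); err != nil {
			failures++
		}
	}
	return failures, nil
}

func main() {
	next := 0
	get := func() (string, error) {
		next++
		return fmt.Sprintf("nonce-%d", next), nil
	}
	// Pretend the third nonce was redeemed while its backend reconnected.
	redeem := func(nonce string) error {
		if nonce == "nonce-3" {
			return errFakeReconnect
		}
		return nil
	}
	// The description above uses 300 nonces and a 10ms gap.
	failures, err := collectThenRedeem(5, time.Millisecond, get, redeem)
	fmt.Println(failures, err)
}
```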

jsha · 2026-03-17T20:00:02Z

Ready for review. The new noncebalancer is selectable by setting in wfe2.json:

			"srvResolver": "nonce-srv-v2",

beautifulentropy

Great work on this! Using endpointsharding is a really clean way to get visibility into non-READY backends without reimplementing SubConn management. I have just one optional comment, let me know what you think.

beautifulentropy · 2026-03-18T20:39:46Z

grpc/noncebalancer/noncebalancer.go

 		return balancer.PickResult{}, ErrNoBackendsMatchPrefix.Err()
 	}
-	return balancer.PickResult{SubConn: sc}, nil
+	return childPicker.Pick(info)


This is a genuinely rare case and not likely to matter much during normal operations:

If the child picker returns an error (e.g. a pickfirst child in TRANSIENT_FAILURE during a network partition), gRPC will immediately terminate the RPC. That error is then surfaced to the Subscriber as a 500 instead of badNonce.

Alternatively, since DNS recently resolved this backend, we know it did exist, so this error may truly be transient. If we return ErrNoSubConnAvailable here instead, this would queue the RPC until the SubConn's state changes and a new picker is built, or until the RPC's context deadline expires.

If the backend recovers, the queued RPC succeeds. If it's truly gone, DNS re-resolution will eventually drop it, the child will be removed, and the prefixToPicker check will result in ErrNoBackendsMatchPrefix.
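To make the two options concrete, here is a simplified standalone model. All names (`classify`, `passThrough`, `requeueOnError`, the error values) are hypothetical, and the model deliberately ignores details of real gRPC error handling such as status errors versus other errors and wait-for-ready RPCs. It only contrasts returning the child's error as-is with replacing it by the "no SubConn available" sentinel.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-ins for the sentinel that asks the channel to wait for a new picker,
// and for whatever error a child in TRANSIENT_FAILURE might return.
var (
	errNoSubConnAvailable = errors.New("no SubConn available")
	errChildUnavailable   = errors.New("child in TRANSIENT_FAILURE")
)

// outcome is what this model says happens to the RPC after a pick.
type outcome string

const (
	outcomeSent   outcome = "sent"
	outcomeQueued outcome = "queued"
	outcomeFailed outcome = "failed"
)

// classify models the channel's reaction to a pick error: nil sends the RPC,
// the sentinel queues it until a new picker arrives, anything else fails it.
func classify(err error) outcome {
	switch {
	case err == nil:
		return outcomeSent
	case errors.Is(err, errNoSubConnAvailable):
		return outcomeQueued
	default:
		return outcomeFailed
	}
}

// passThrough models returning the child picker's result unchanged.
func passThrough(childErr error) error { return childErr }

// requeueOnError models the alternative: hide every child error behind the
// sentinel so the RPC waits for the next picker or its own deadline.
func requeueOnError(childErr error) error {
	if childErr != nil {
		return errNoSubConnAvailable
	}
	return nil
}

func main() {
	fmt.Println("pass through:", classify(passThrough(errChildUnavailable)))
	fmt.Println("requeue:     ", classify(requeueOnError(errChildUnavailable)))
	fmt.Println("healthy:     ", classify(passThrough(nil)))
}
```

Note that `requeueOnError` queues on every child error, not just transient ones, which is the trade-off weighed in this thread.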

Playing around with this on my own I don't see an error we could safely treat as an indicator of a backend in TRANSIENT_FAILURE. Rather than requeuing all RPCs that get error responses from childPicker.Pick(), it might be better to accept that 500s may happen in this instance.

I think "we know your nonce prefix is real, and know what server to talk to in order to redeem it, but we can't talk to that server" is a reasonable 500. It truly is an internal server error.

github-actions · 2026-03-18T20:47:34Z

@jsha, this PR appears to contain configuration and/or SQL schema changes. Please ensure that a corresponding deployment ticket has been filed with the new values.

aarongable

LGTM with nits.

aarongable · 2026-03-19T00:07:22Z

grpc/internal/resolver/dns/dns_resolver.go

 	"context"
 	"errors"
 	"fmt"
+	"google.golang.org/grpc/serviceconfig"


Not sure why this import got moved up, it should stay below with the other google.golang.org imports.

aarongable · 2026-03-19T00:11:28Z

grpc/noncebalancer/noncebalancer.go

 		return balancer.PickResult{}, ErrNoBackendsMatchPrefix.Err()
 	}
-	return balancer.PickResult{SubConn: sc}, nil
+	return childPicker.Pick(info)


I think "we know your nonce prefix is real, and know what server to talk to in order to redeem it, but we can't talk to that server" is a reasonable 500. It truly is an internal server error.

aarongable · 2026-03-19T00:12:05Z

grpc/noncebalancer/noncebalancer.go

+// builder builds a nonceBalancer, which internally uses `endpointsharding.NewBalancer`.
+//
+// The embedded `endpointsharding` balancer manages a set of child pickers that all use
+// `pickfirst` on an endpoint that consists of a single IP address (because our `"nonce-srv"`


I think:

Suggested change

-// `pickfirst` on an endpoint that consists of a single IP address (because our `"nonce-srv"`
+// `pickfirst` on an endpoint that consists of a single IP address (because our `"nonce-srv-v2"`

aarongable · 2026-03-19T00:14:44Z

test/config/wfe2.json

 				}
 			],
-			"srvResolver": "nonce-srv",
+			"srvResolver": "nonce-srv-v2",


I feel like we should leave this as nonce-srv, even though it leads to occasional CI flakes, until prod updates to use the new resolver.

If that makes the new integration test fail (I imagine it does), then gate the test on config-next.

jsha marked this pull request as ready for review March 16, 2026 16:55

jsha requested a review from a team as a code owner March 16, 2026 16:55

jsha requested a review from beautifulentropy March 16, 2026 16:55

jsha marked this pull request as draft March 17, 2026 17:10

aarongable mentioned this pull request Mar 17, 2026

grpc: Add noncebalancer that tracks non-READY backends #8672

Closed

jsha added 3 commits March 17, 2026 12:35

Re-add noncebalancerv1

4ac9a53

noncebalancer: integration test reconnects

8fa3ca9

Set maxConnectionAge to 1s, and make nonce_test.go collect 300 nonces, then redeem them one at a time, separated by 10ms. This creates a high likelihood of a redemption request occurring during a reconnect.

Fix import grouping

6726cfc

jsha marked this pull request as ready for review March 17, 2026 19:57

jsha mentioned this pull request Mar 17, 2026

noncebalancer: integration test reconnects #8680

Closed

beautifulentropy approved these changes Mar 18, 2026

beautifulentropy requested a review from aarongable March 18, 2026 20:47

aarongable approved these changes Mar 19, 2026

jsha wants to merge 4 commits into main from noncebalancer-endpointsharding

3 participants