Skip to content

noncebalancer: use endpointsharding, ignore ready status#8679

Open
jsha wants to merge 4 commits intomainfrom
noncebalancer-endpointsharding
Open

noncebalancer: use endpointsharding, ignore ready status#8679
jsha wants to merge 4 commits intomainfrom
noncebalancer-endpointsharding

Conversation

@jsha
Copy link
Contributor

@jsha jsha commented Mar 14, 2026

The old noncebalancer only saw READY SubConns, which was a problem during the brief periods when a SubConn was reconnecting (for instance due to a GOAWAY from the server), since nonce redemption requests are not fungible between backends. Unfortunately, READY SubConns are all that the balancer interface provides. And we can't get that interface to pass non-READY SubConns to our picker without reimplementing or copying all its SubConn management logic.

Luckily, grpc provides the endpointsharding balancer implementation that does exactly what we want. It maintains a collection of child balancers each owning a single endpoint (note: for our setup an endpoint is equivalent to a single address, though it can be one-to-many). It also lets us query the state of each child, including the endpoint it's responsible for.

This allows us to construct a picker that is aware of all available backends, even those that aren't currently READY. That, in turn, prevents us from temporarily serving errors while a given nonce redemption backend is reconnecting.

To see another example of endpointsharding in use, see the customroundrobin implementation.

For more context on how endpointsharding came to be implemented, see gRFC A61: IPv4 and IPv6 Dualstack Backend Support.

If you're curious how endpointsharding passes around the information about non-READY SubConns, it uses a type assertion from a balancer.Picker to its internal type.

Alternative to #8672. Fixes #8662.

This edits noncebalancer.go in place for ease of diffing, and also copies the original grpc/noncebalancer (with no edits) to grpc/noncebalancerv1. But don't take my word for it:

diff <(git show origin/main:grpc/noncebalancer/noncebalancer.go) grpc/noncebalancerv1/noncebalancer.go
diff <(git show origin/main:grpc/noncebalancer/noncebalancer_test.go) grpc/noncebalancerv1/noncebalancer_test.go

The old noncebalancer only saw READY SubConns, which was a problem during the
brief periods when a SubConn needed to reconnect (for instance due to a GOAWAY
from the server). Unfortunately, that's all the balancer interface provides.
And we can't get it to pass non-READY SubConns to our picker without
reimplementing or copying all its SubConn management logic.

Luckily, grpc provides the [`endpointsharding`] balancer implementation
that does exactly what we want.  It maintains a collection of child
balancers each owning a single endpoint (note: for our purposes an
endpoint is equivalent to addresses, though it can be one-to-many).
It also lets us query the [state] of each child, including the
endpoint it's responsible for us.

This allows us to construct a picker that is aware of all available backends,
even those that aren't currently READY. That, in turn, prevents us from
temporarily serving errors while a given nonce redemption backend reconnects.

To see an example of `endpointsharding` in use, see the [`customroundrobin`]
implementation.

For more context on how `endpointsharding` came to be implemented, see
[gRFC A61: IPv4 and IPv6 Dualstack Backend Support](a61).

[`endpointsharding`]: https://pkg.go.dev/google.golang.org/grpc/balancer/endpointsharding
[state]: https://pkg.go.dev/google.golang.org/grpc/balancer/endpointsharding#ChildState
[a61]: https://github.com/grpc/proposal/blob/master/A61-IPv4-IPv6-dualstack-backends.md
[`customroundrobin`]: https://github.com/grpc/grpc-go/blob/99f36d4a0c28bc967a8d3fe23ebc2a264b322070/examples/features/customloadbalancer/client/customroundrobin/customroundrobin.go
@jsha jsha marked this pull request as ready for review March 16, 2026 16:55
@jsha jsha requested a review from a team as a code owner March 16, 2026 16:55
@jsha jsha requested a review from beautifulentropy March 16, 2026 16:55
@jsha jsha marked this pull request as draft March 17, 2026 17:10
@jsha
Copy link
Contributor Author

jsha commented Mar 17, 2026

Back in draft because I'm currently implementing the config-based switching between implementations.

jsha added 3 commits March 17, 2026 12:35
Set maxConnectionAge to 1s, and make nonce_test.go collect 300 nonces, then
redeem them one at a time, separated by 10ms. This creates a high likelihood of
a redemption request occuring during a reconnect.
@jsha jsha marked this pull request as ready for review March 17, 2026 19:57
@jsha
Copy link
Contributor Author

jsha commented Mar 17, 2026

Ready for review. The new noncebalancer is selectable by setting in wfe2.json:

			"srvResolver": "nonce-srv-v2",

Copy link
Member

@beautifulentropy beautifulentropy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work on this! Using endpointsharding is a really clean way to get visibility into non-READY backends without reimplementing SubConn management. I have just one optional comment, let me know what you think.

return balancer.PickResult{}, ErrNoBackendsMatchPrefix.Err()
}
return balancer.PickResult{SubConn: sc}, nil
return childPicker.Pick(info)
Copy link
Member

@beautifulentropy beautifulentropy Mar 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a genuinely rare case and not likely to matter much during normal operations:

If the child picker returns an error (e.g. a pickfirst child in TRANSIENT_FAILURE during a network partition), gRPC will immediately terminate the RPC. That error is then surfaced to the Subscriber as a 500 instead of badNonce.

Alternatively, since DNS recently resolved this backend, we know it did exist, so this error may truly be transient. If we return ErrNoSubConnAvailable here instead, this would queue the RPC until the SubConn's state changes and a new picker is built, or until the RPC's context deadline expires.

If the backend recovers, the queued RPC succeeds. If it's truly gone, DNS re-resolution will eventually drop it, the child will be removed, and the prefixToPicker check will result in ErrNoBackendsMatchPrefix.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Playing around with this on my own I don't see an error we could safely treat as an indicator of a backend in TRANSIENT_FAILURE. Rather than requeuing all RPCs that get error responses from childPicker.Pick(), it might be better to accept that 500s may happen in this instance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "we know your nonce prefix is real, and know what server to talk to in order to redeem it, but we can't talk to that server" is a reasonable 500. It truly is an internal server error.

@github-actions
Copy link
Contributor

@jsha, this PR appears to contain configuration and/or SQL schema changes. Please ensure that a corresponding deployment ticket has been filed with the new values.

Copy link
Contributor

@aarongable aarongable left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with nits.

"context"
"errors"
"fmt"
"google.golang.org/grpc/serviceconfig"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why this import got moved up, it should stay below with the other google.golang.org imports.

return balancer.PickResult{}, ErrNoBackendsMatchPrefix.Err()
}
return balancer.PickResult{SubConn: sc}, nil
return childPicker.Pick(info)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "we know your nonce prefix is real, and know what server to talk to in order to redeem it, but we can't talk to that server" is a reasonable 500. It truly is an internal server error.

// builder builds a nonceBalancer, which internally uses `endpointsharding.NewBalancer`.
//
// The embedded `endpointsharding` balancer manages a set of child pickers that all use
// `pickfirst` on an endpoint that consists of a single IP address (because our `"nonce-srv"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think:

Suggested change
// `pickfirst` on an endpoint that consists of a single IP address (because our `"nonce-srv"`
// `pickfirst` on an endpoint that consists of a single IP address (because our `"nonce-srv-v2"`

}
],
"srvResolver": "nonce-srv",
"srvResolver": "nonce-srv-v2",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like we should leave this as nonce-srv, even though it leads to occasional CI flakes, until prod updates to use the new resolver.

If that makes the new integration test fail (I imagine it does), then gate the test on config-next.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fix badNonce CI flake

3 participants