connmgr: Limit total overall normal connections. #3697
This adds a new context-aware semaphore type with Acquire and Release methods for use in upcoming changes that aim to simplify connection limiting by making use of semaphores for blocking until permits become available.
This adds tests for the new context-aware semaphore to ensure the acquire, release, and context cancel semantics work as expected.
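A minimal sketch of how a context-aware semaphore with these semantics might look, assuming a buffered channel is used to hold the permits. The `Semaphore` type name, the `NewSemaphore` constructor, and the `demo` function are hypothetical; only the `Acquire` and `Release` method names and the boolean result of `Acquire` are taken from this change set.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Semaphore is a hypothetical context-aware counting semaphore. Each slot in
// the buffered channel represents one permit.
type Semaphore struct {
	permits chan struct{}
}

// NewSemaphore returns a semaphore with max permits available.
func NewSemaphore(max int) *Semaphore {
	return &Semaphore{permits: make(chan struct{}, max)}
}

// Acquire blocks until a permit is available or ctx is done. It returns true
// when a permit was acquired and false when the context ended first.
func (s *Semaphore) Acquire(ctx context.Context) bool {
	// Prefer reporting cancellation when the context is already done.
	select {
	case <-ctx.Done():
		return false
	default:
	}

	select {
	case s.permits <- struct{}{}:
		return true
	case <-ctx.Done():
		return false
	}
}

// Release returns a previously acquired permit. Releasing without holding a
// permit is ignored so that misuse does not block the caller.
func (s *Semaphore) Release() {
	select {
	case <-s.permits:
	default:
	}
}

// demo takes the only permit, shows that a second Acquire gives up once its
// context times out, and that Acquire succeeds again after a Release.
func demo() (first, second, third bool) {
	sem := NewSemaphore(1)
	first = sem.Acquire(context.Background())

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	second = sem.Acquire(ctx) // No permits left, so this waits for the timeout.

	sem.Release()
	third = sem.Acquire(context.Background())
	return first, second, third
}

func main() {
	fmt.Println(demo())
}
```

Returning a boolean rather than an error keeps call sites short, at the cost of not distinguishing cancellation from deadline expiry. Callers that need the distinction can inspect `ctx.Err()`.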
The existing connection manager code was written well before contexts
were introduced. Further, due to the old async model that has now been
converted to a synchronous model, it is based around connection requests
that have their state atomically updated asynchronously as various
things happen.
While it has undoubtedly worked well enough for over a decade, it has
always been a challenge to add new functionality to it and requires the
use of a lot of less than ideal and highly outdated techniques such as
polling for state changes. It is also rather brittle in terms of
requiring outbound connections to be manually disconnected in the
connection manager after they've been closed to avoid things like
leaking goroutines and failing to update target outbound counts.
Moreover, it only tracks outgoing connections which ultimately forces a
lot of connection-related tasks to be split across different layers
instead of residing in the connection manager itself where they more
naturally belong. Notably, that split, for all intents and purposes,
prevents implementing some desirable more advanced features such as
immediate connection shedding, different connection types, and listeners
tied to specific network types.
With the primary goal of addressing all of the aforementioned points and
providing a solid base to work on for adding new features moving
forward, this significantly reworks the connection manager to completely
get rid of the notion of exposed connection requests in favor of a new
custom connection type that wraps the underlying net.Conn.
The new wrapped connections automatically handle cleanup when closed and
have an associated connection type enum that allows easily
distinguishing inbound, outbound, and manual connections as well as
supporting new connection types in the future.
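As a rough illustration of the idea, the sketch below wraps a net.Conn with a connection type enum and a Close method that runs cleanup exactly once. All names here (`ConnType`, `Conn`, `onClose`, `demo`) are hypothetical and are not taken from the actual connmgr package.

```go
package main

import (
	"fmt"
	"net"
	"sync"
)

// ConnType is a hypothetical enum distinguishing how a connection was made.
type ConnType uint8

const (
	ConnTypeInbound ConnType = iota
	ConnTypeOutbound
	ConnTypeManual
)

// String returns a human-readable name for the connection type.
func (t ConnType) String() string {
	switch t {
	case ConnTypeInbound:
		return "inbound"
	case ConnTypeOutbound:
		return "outbound"
	case ConnTypeManual:
		return "manual"
	}
	return "unknown"
}

// Conn wraps a net.Conn and runs a cleanup callback the first time it is
// closed. Later calls to Close return the original result.
type Conn struct {
	net.Conn
	connType  ConnType
	closeOnce sync.Once
	closeErr  error
	onClose   func() // Assumed hook for manager-side bookkeeping.
}

// Type returns the connection type assigned when the wrapper was created.
func (c *Conn) Type() ConnType { return c.connType }

// Close closes the underlying connection and invokes the cleanup hook once.
func (c *Conn) Close() error {
	c.closeOnce.Do(func() {
		c.closeErr = c.Conn.Close()
		if c.onClose != nil {
			c.onClose()
		}
	})
	return c.closeErr
}

// demo closes a wrapped connection twice and reports how many times the
// cleanup hook ran along with the connection type name.
func demo() (cleanups int, kind string) {
	client, server := net.Pipe()
	defer server.Close()

	conn := &Conn{
		Conn:     client,
		connType: ConnTypeOutbound,
		onClose:  func() { cleanups++ },
	}
	conn.Close()
	conn.Close()
	return cleanups, conn.Type().String()
}

func main() {
	fmt.Println(demo())
}
```

Tying cleanup to Close means callers no longer need a separate disconnect call in the manager after closing the connection.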
Another nice feature of the new wrapped connections is they provide
efficient access to concrete parsed address types which paves the way
for avoiding a lot of constant reparsing, repeated host/port splitting
and joining, and generally much more ergonomic immutable address types.
Since changing to wrapped connections basically required a rather
significant rewrite of large portions of the connection manager anyway,
this also takes the opportunity to improve several other aspects of the
connection manager in the process such as implementing full context
support, full tracking of all connection types by the manager itself,
much more robust semaphore-based automatic connection limiting, cleaner
persistent connection handling with independent limits, prevention of
multiple connections of any type to the same address:port, more useful
debug logging, and cleanly closing all connections during shutdown.
It is also important to note that the following overall semantics have
intentionally been changed versus the existing connection manager:
- A maximum of 8 persistent connections is now imposed and they no
  longer count toward the configured target number of automatic outbound
  peers to maintain
- Duplicate addresses (host:port) are now rejected by the connection
  manager for all types (inbound, outbound, manual, persistent)
  - Note that inbound conns from the same IP will necessarily have
    different ports, so the same max IP limits apply in that case
- RPC 'node connect' for all connection attempts now:
  - Supports the RPC connection and server contexts
  - Properly handles duplicate address rejection including pending
    attempts
- RPC 'node connect' for non-persistent conn attempts now:
  - Waits for the connection attempt result before returning
  - Returns an error if the connection attempt fails
  - Cancels the connection attempt if the RPC connection is closed
    before it succeeds
- RPC 'node remove' now supports removing a pending connection by its
  persistent connection ID (since no peer ID exists before a valid
  connection is established)
- It is no longer possible for state transitions to allow things like
  duplicate addresses or failed cancellation
The max retry duration is currently an unexported global variable that the tests override at init time. At least one of the tests also overrides it again for that specific test. While this works, it is somewhat brittle and prevents the tests from being run in parallel. This improves the situation by making the max retry duration a field on the connection manager instead of a global variable and adding a test helper for creating a new connection manager that overrides it by default. Then any tests that need a different value can simply override it on their local instance. It also makes the tests parallel since they can no longer clobber one another.
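The pattern described above could look roughly like the following sketch. The field name `maxRetryDuration`, the default value, the `newTestConnManager` helper, and the `retryDelay` method are all assumptions made for illustration and may not match the real code.

```go
package main

import (
	"fmt"
	"time"
)

// defaultMaxRetryDuration is an illustrative default, not the real value.
const defaultMaxRetryDuration = 5 * time.Minute

// ConnManager is a pared-down stand-in for the connection manager that keeps
// the retry cap as per-instance state instead of a package-level variable.
type ConnManager struct {
	maxRetryDuration time.Duration
}

// New returns a manager that uses the default retry cap.
func New() *ConnManager {
	return &ConnManager{maxRetryDuration: defaultMaxRetryDuration}
}

// newTestConnManager is a hypothetical test helper that shortens the retry
// cap by default. Individual tests may override the field on their own
// instance without affecting tests running in parallel.
func newTestConnManager() *ConnManager {
	cm := New()
	cm.maxRetryDuration = 2 * time.Millisecond
	return cm
}

// retryDelay returns a linear backoff for the given attempt number, capped at
// the manager's own maximum.
func (cm *ConnManager) retryDelay(attempt int, base time.Duration) time.Duration {
	delay := time.Duration(attempt) * base
	if delay > cm.maxRetryDuration {
		delay = cm.maxRetryDuration
	}
	return delay
}

func main() {
	prod, test := New(), newTestConnManager()
	fmt.Println(prod.retryDelay(10, time.Second), test.retryDelay(10, time.Second))
}
```

Because each manager carries its own cap, two tests can use different values at the same time without a data race on shared state.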
This updates the test for checking the connection manager cleanly shuts down with failed conns to actually test what it is intended to. Manual connections do not automatically retry; only persistent connections do.
This adds tests to ensure closing a connection multiple times works as intended.
This adds tests to ensure duplicate connections are rejected for all possible states.
This adds tests to ensure attempts to add more than the maximum allowed number of persistent connections are rejected.
This adds tests to ensure the Disconnect method properly disconnects pending and established connections for both non-persistent and persistent connections.
This adds tests to ensure the Remove method properly disconnects and removes pending and established connections for both non-persistent and persistent connections.
This updates the connmgr package README.md to match the new design and capabilities.
This adds a couple of test helpers for asserting the internal state of the connection manager and updates all tests to call the new helpers throughout. The first one asserts the internal maps are all coherent and do not violate any preconditions. The second one asserts clean shutdown.
Currently the whitelisting logic happens in the server which makes it inaccessible to the connection manager. In order to pave the way for supporting various connection-related logic that currently happens in the server, but ideally should be happening in the connection manager, this adds basic support for whitelisting CIDR prefixes to the connection manager. The connection manager config struct now accepts a slice of prefixes and a new method named IsWhitelisted is added. Note that this only adds support. It does not update anything to use the new functionality yet.
This adds tests to ensure the new whitelist detection method works as expected.
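A minimal sketch of CIDR prefix whitelisting, assuming the prefixes are stored as `netip.Prefix` values from the standard library. The `Whitelists` field name, the `IsWhitelisted` signature, and the helper functions are assumptions. The change set only states that the config accepts a slice of prefixes and that a method named IsWhitelisted exists.

```go
package main

import (
	"fmt"
	"net/netip"
)

// Config is a hypothetical stand-in for the connection manager config.
type Config struct {
	Whitelists []netip.Prefix // Assumed field name and element type.
}

// ConnManager is a pared-down stand-in for the connection manager.
type ConnManager struct {
	cfg Config
}

// IsWhitelisted reports whether addr falls within any configured prefix.
func (cm *ConnManager) IsWhitelisted(addr netip.Addr) bool {
	// Prefix.Contains does not match IPv4-mapped IPv6 addresses against IPv4
	// prefixes, so convert them to plain IPv4 first.
	addr = addr.Unmap()
	for _, prefix := range cm.cfg.Whitelists {
		if prefix.Contains(addr) {
			return true
		}
	}
	return false
}

// isWhitelistedStr is an example-only helper that parses a textual IP and
// treats unparsable input as not whitelisted.
func (cm *ConnManager) isWhitelistedStr(ip string) bool {
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return false
	}
	return cm.IsWhitelisted(addr)
}

// newExampleManager builds a manager with one IPv4 and one IPv6 prefix.
func newExampleManager() *ConnManager {
	return &ConnManager{cfg: Config{Whitelists: []netip.Prefix{
		netip.MustParsePrefix("10.0.0.0/8"),
		netip.MustParsePrefix("2001:db8::/32"),
	}}}
}

func main() {
	cm := newExampleManager()
	for _, ip := range []string{"10.1.2.3", "192.168.1.1", "2001:db8::1"} {
		fmt.Println(ip, cm.isWhitelistedStr(ip))
	}
}
```

Parsing the prefixes once at configuration time means each lookup is a simple containment check with no string handling.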
This modifies the server to pass in the parsed whitelist entries to the connection manager config and updates the relevant code to make use of the new method it exposes. Finally, it removes the no longer used local isWhitelisted method.
    // Wait for a permit to make another overall connection. This limits
    // the total number of normal connections while the previous limits the
    // total number of automatic outbound connections.
    if !cm.totalNormalConnsSem.Acquire(ctx) {
Just to check my understanding, assuming default limits, having this here means you could end up stuck with 125 inbound connections and 0 outbound? Is that a change from current semantics?
Your understanding is correct. It's a maximum upper bound on the total overall number of connections, regardless of whether they are inbound or outbound.
It is not a change from the current semantics as it's also true today.
In fact, along the same lines, it's also the case that today persistent connections count against the total limit too which means inbound+outbound similarly can technically crowd out desired persistent connections. That particular case is no longer true with this series of PRs (as you're probably aware since you've been reviewing them) since persistent connections have been given special priority in terms of having their own limits.
I think you have a great point here though. This is, in fact, as I mentioned, matching the current semantics, but given outbound connections are generally preferred for various things, it would likely be better to instead go ahead and change the semantics a bit to make "normal" connections exclude automatic outbound connections and adjust the limits accordingly.
I'm considering the following semantics:
- Calculate the total allowed normal conns as Config.MaxNormalConns - Config.TargetOutbound (after normalizing and checking values, clamping to 0, etc)
- If total allowed normal conns is 0, clear any provided Config.Listeners since it would never allow inbound conns at that point anyway
- Allow the target outbound to remain limited by its own semaphore and remove the new semaphore from that path
That will effectively give outbound conns dedicated slots which I think would be a nice change.
It would then still treat inbound and manual outbound conns with equal weight and limit them to the total.
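The arithmetic in the proposal above can be sketched as follows. The field names come from the comment, while the field types, the `normalizeLimits` function, and the example values are assumptions chosen only for illustration.

```go
package main

import "fmt"

// Config is a hypothetical stand-in holding only the fields the proposal
// mentions. The field types are assumed.
type Config struct {
	MaxNormalConns uint32
	TargetOutbound uint32
	Listeners      []string // Simplified to listen addresses for the sketch.
}

// normalizeLimits returns the number of normal connection slots that remain
// after reserving dedicated slots for the automatic outbound target. The
// result is clamped to 0, and listeners are cleared when no inbound
// connection could ever be accepted.
func normalizeLimits(cfg *Config) uint32 {
	var allowedNormal uint32
	if cfg.MaxNormalConns > cfg.TargetOutbound {
		allowedNormal = cfg.MaxNormalConns - cfg.TargetOutbound
	}
	if allowedNormal == 0 {
		cfg.Listeners = nil
	}
	return allowedNormal
}

func main() {
	roomy := &Config{MaxNormalConns: 125, TargetOutbound: 8,
		Listeners: []string{"127.0.0.1:9108"}}
	fmt.Println(normalizeLimits(roomy), len(roomy.Listeners))

	tight := &Config{MaxNormalConns: 5, TargetOutbound: 8,
		Listeners: []string{"127.0.0.1:9108"}}
	fmt.Println(normalizeLimits(tight), len(tight.Listeners))
}
```

Comparing before subtracting avoids unsigned underflow when the outbound target exceeds the configured maximum.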
Sounds like a good idea for a future improvement. I think it should probably be a separate PR though, rather than reworking this one. Tbh that's how I thought it already worked!
Fair enough. I was going to just rework this one, but doing it in a separate PR since this one is already working, tested, and has several more building on it is probably better.
This adds a new TryAcquire method to the context-aware semaphore. As the name implies, the method supports conditionally acquiring the semaphore only when resources are immediately available. In other words, it will not block when there are no resources immediately available.
This adds tests for the new TryAcquire method on the context-aware semaphore to ensure the semantics work as expected.
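The non-blocking behavior can be sketched with a select statement that has a default case, again assuming a buffered channel holds the permits. The `Semaphore` type, the `NewSemaphore` constructor, and the exact `TryAcquire` signature are assumptions. Only the method name and its non-blocking semantics come from the description.

```go
package main

import "fmt"

// Semaphore is a hypothetical counting semaphore backed by a buffered
// channel where each slot represents one permit.
type Semaphore struct {
	permits chan struct{}
}

// NewSemaphore returns a semaphore with max permits available.
func NewSemaphore(max int) *Semaphore {
	return &Semaphore{permits: make(chan struct{}, max)}
}

// TryAcquire takes a permit only when one is immediately available. It never
// blocks and returns false when all permits are in use.
func (s *Semaphore) TryAcquire() bool {
	select {
	case s.permits <- struct{}{}:
		return true
	default:
		return false
	}
}

// Release returns a previously acquired permit. Releasing without holding a
// permit is ignored so that misuse does not block the caller.
func (s *Semaphore) Release() {
	select {
	case <-s.permits:
	default:
	}
}

// demo exhausts a two-permit semaphore, attempts a third acquire, then frees
// one permit and tries again.
func demo() []bool {
	sem := NewSemaphore(2)
	results := []bool{sem.TryAcquire(), sem.TryAcquire(), sem.TryAcquire()}
	sem.Release()
	return append(results, sem.TryAcquire())
}

func main() {
	fmt.Println(demo())
}
```

A non-blocking acquire suits paths such as accepting inbound connections, where it is typically better to reject immediately than to hold a socket open while waiting for a permit.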
The current overall total connection limits are enforced by the server rather than the connection manager. This is not ideal for many reasons, but one of the most important consequences is that it makes DoS attacks easier. Another example of some less than ideal behavior that it allows is that some rare combinations of events can lead to temporary extra connection churn. It is much more robust and natural to perform the limiting in the connection manager itself via semaphores. That approach not only significantly hardens the server against DoS attacks and solves various edge cases present in the current code, it also paves the way for even more advanced features such as traffic shaping in the future. To that end, this adds semaphore-based limiting for the total overall number of normal connections to the connection manager and removes the relevant current limiting for it from the server. Normal connections are the automatic outbound, manual outbound, and inbound connections. Persistent connections, on the other hand, are not subject to the limit since they have their own limiting. This is consistent with them not being subject to the automatic target outbound limit either.
This adds tests to ensure that the new max normal connection limiting properly enforces the limit including automatic outbound, manual outbound, and inbound connections. It also ensures that it is not applied to persistent connections.
This requires #3695 and #3696.
The current overall total connection limits are enforced by the server rather than the connection manager. This is not ideal for many reasons, but one of the most important consequences is that it makes DoS attacks easier. Another example of some less than ideal behavior that it allows is that some rare combinations of events can lead to temporary extra connection churn.
It is much more robust and natural to perform the limiting in the connection manager itself via semaphores. That approach not only significantly hardens the server against DoS attacks and solves various edge cases present in the current code, it also paves the way for even more advanced features such as traffic shaping in the future.
To that end, this adds semaphore-based limiting for the total overall number of normal connections to the connection manager and removes the relevant current limiting for it from the server.
Normal connections are the automatic outbound, manual outbound, and inbound connections. Persistent connections, on the other hand, are not subject to the limit since they have their own limiting. This is consistent with them not being subject to the automatic target outbound limit either.
It also adds tests to ensure that the new max normal connection limiting properly enforces the limit including automatic outbound, manual outbound, and inbound connections and that it is not applied to persistent connections.