Skip to content

[cleanup][test] Remove PortManager and use kernel-assigned ports everywhere#25694

Merged
merlimat merged 8 commits intoapache:masterfrom
merlimat:cleanup/remove-port-manager
May 7, 2026
Merged

[cleanup][test] Remove PortManager and use kernel-assigned ports everywhere#25694
merlimat merged 8 commits intoapache:masterfrom
merlimat:cleanup/remove-port-manager

Conversation

@merlimat
Copy link
Copy Markdown
Contributor

@merlimat merlimat commented May 6, 2026

Motivation

PortManager (in pulsar-common) and the inner BasePortManager of LocalBookkeeperEnsemble allocated "free" ports via ServerSocket(0) and tracked them in a JVM-level lock set. The lock was JVM-only, so the OS-level port could still be grabbed by another process between allocation and bind, producing flaky BindExceptions in tests (most recently ShadowTopicRealBkTest.setup).

The class has outlived its usefulness — modern BookKeeper identifies bookies by bookieId, not host:port, so kernel-assigned ports (port 0) work across the board. The "lock" never actually prevented OS-level collisions anyway.

Modifications

Removed

  • pulsar-common/.../PortManager.java and its test.
  • pulsar-package-management/.../test/PortManager.java (unused duplicate).
  • BasePortManager inner class in LocalBookkeeperEnsemble.
  • All bkBasePort constructor variants of LocalBookkeeperEnsemble.
  • The Supplier<Integer> portManager parameter from LocalBookkeeperEnsemble — every caller was passing () -> 0. Bookies now always bind to kernel-assigned ports internally.
  • --bookkeeper-port CLI option from pulsar standalone (and its setter, getter, builder method, plus the test calls). Multi-bookie standalone gets kernel-assigned ports.
  • useDynamicBrokerPorts() toggle in MultiBrokerBaseTest (now always dynamic).
  • bkPort field in BKCluster.BKClusterConf.

Refactored

  • BKCluster: bookies bind to port 0; bookieId is derived from the data dir hash so multiple cluster instances on a shared metadata store don't collide on cookie validation, while restarts on the same dir keep cookies stable. The unused cookie-parsing port-recovery logic is dropped.
  • MultiBrokerBaseTest: mainBrokerPort and additionalBrokerPorts are populated post-start from pulsar.getBrokerListenPort() instead of pre-allocated.
  • PulsarTestContext.handlePreallocatePorts: inlined ServerSocket(0) allocation. Still needed by tests that build advertised-listener URLs before the broker starts (those URLs require a known port up front).
  • 17 test files updated: PortManager.nextLockedFreePort() calls replaced with 0 (let kernel pick) or inline ServerSocket(0) (only where pre-allocation is genuinely required, e.g. for advertised-listener URLs).
  • PulsarStandaloneTest.testStandaloneWithRocksDB: assertion switched from getBookiePort() (now returns kernel-assigned ports that change per restart) to getBookieId() (the persistent BK identity).

Behavior change

--bookkeeper-port is gone from pulsar standalone. Bookies get kernel-assigned ports — no impact for normal usage since clients only connect to the broker, not directly to bookies.

Verifying this change

Locally exercised the affected tests:

  • LocalBookkeeperEnsembleTest (2/2)
  • ShadowTopicRealBkTest (1/1)
  • PulsarStandaloneTest (5/5)
  • EndToEndTest for BKCluster (10 passed, 1 skipped)
  • AdvertisedListenersTest, LoadManagerFailFastTest, AdvertisedListenersMultiBrokerLeaderElectionTest

Full compileJava and compileTestJava clean.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API: removes --bookkeeper-port CLI option from pulsar standalone
  • The schema: no
  • The default values of configurations: no
  • The threading model: no
  • The binary protocol: no
  • The REST endpoints: no
  • The admin CLI options: --bookkeeper-port removed from pulsar standalone
  • The metrics: no
  • Anything that affects deployment: no

merlimat added 2 commits May 5, 2026 18:36
…ywhere

`PortManager` (in pulsar-common) and the inner `BasePortManager` of
`LocalBookkeeperEnsemble` allocated "free" ports via `ServerSocket(0)` and
tracked them in a JVM-level lock set. The lock was JVM-only, so the OS-level
port could still be grabbed by another process between allocation and bind,
producing flaky `BindException`s in tests. The class has outlived its
usefulness — modern BookKeeper identifies bookies by `bookieId`, not host:port,
so port 0 (kernel-assigned) is fine across the board.

Removed:
- `pulsar-common/...PortManager.java` and its test.
- `pulsar-package-management/...test/PortManager.java` (unused duplicate).
- `BasePortManager` inner class in `LocalBookkeeperEnsemble`.
- `bkBasePort` constructor variants of `LocalBookkeeperEnsemble` (no callers
  left after the standalone change).
- `--bookkeeper-port` CLI option from `PulsarStandalone` (and its setter,
  getter, builder method, plus 3 test calls). Multi-bookie standalone now
  always uses kernel-assigned ports — bookies are identified by `bookieId`.
- `useDynamicBrokerPorts()` toggle in `MultiBrokerBaseTest` (now always true
  in spirit; ports come from `getBrokerListenPort()` after start).
- `bkPort` field in `BKCluster.BKClusterConf`.

Refactored:
- `BKCluster`: bookies bind to port 0; `bookieId` is derived from the data dir
  hash so multiple cluster instances on a shared metadata store don't collide
  on cookie validation, while restarts on the same dir keep cookies stable.
  The unused cookie-parsing port-recovery logic is dropped.
- `MultiBrokerBaseTest`: `mainBrokerPort` and `additionalBrokerPorts` are
  populated post-start from `pulsar.getBrokerListenPort()` instead of
  pre-allocated.
- `PulsarTestContext.handlePreallocatePorts`: inlined `ServerSocket(0)`
  allocation. Still needed by tests that build advertised-listener URLs
  before the broker starts.
- 17 test files: `PortManager.nextLockedFreePort()` calls replaced with
  `0` (let kernel pick) or inline `ServerSocket(0)` (only where pre-allocation
  is genuinely needed, e.g. for advertised-listener URLs).
- `PulsarStandaloneTest.testStandaloneWithRocksDB`: assertion switched from
  `getBookiePort()` (now returns kernel-assigned ports that change per restart)
  to `getBookieId()` (the persistent identity).

Behavior change: `--bookkeeper-port` is gone from `pulsar standalone`. Bookies
get kernel-assigned ports — no impact for normal usage since clients only
connect to the broker.
Now that all callers pass () -> 0, the supplier is dead weight. Remove it
from LocalBookkeeperEnsemble entirely; bookies always bind to port 0.
Updated all 41 call sites accordingly.
Copy link
Copy Markdown
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's still a need for PortManager in cases where ephemeral ports (0) can't be used. Where they can, ephemeral is obviously preferred.

To avoid collisions for assigned ports across multiple processes, PortManager should allocate from outside the ephemeral range. On Linux the default emphemeral range is 32768–60999:

❯ sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 32768    60999

Proposal: reserve a block per JVM starting at a fixed offset. Each instance claims block n by binding a server socket on 20000 + (n * 1000), giving it 999 usable ports in that range. That handles coordination across multiple test JVMs without a shared registry. If ports run out, a new block could be allocated.

On release, PortManager should verify the port is actually free before returning it to the pool — sockets can linger in TIME_WAIT after close, so the free-list update needs to handle async release rather than marking the port available immediately.

Address review feedback: PortManager is still useful for tests that need to
know a port BEFORE binding (advertised-listener URLs, znode pre-creation,
unavailable-node URL synthesis). The old implementation suffered from the
JVM-level lock not preventing OS-level collisions, since it allocated inside
the kernel's ephemeral range.

The restored PortManager addresses both problems:
- Allocates from outside the ephemeral range (Linux default 32768-60999) by
  reserving a 1000-port block per JVM. Each JVM claims a block by binding a
  lock ServerSocket on the block's base port for the JVM's lifetime; other
  JVMs that hit the same range observe the bind failure and pick the next.
- Released ports do not return to the pool immediately. They sit in a
  pending-release set until a fresh bind to the port succeeds, which avoids
  handing out a port that's still in TIME_WAIT.

Wired PortManager back into the call sites that genuinely need pre-allocation:
- ServiceUrlQuarantineTest (broker port + 40 unavailable-node ports).
- ExtensibleLoadManagerImplWithAdvertisedListenersTest (advertised listeners).
- ModularLoadManagerImplTest.testOwnBrokerZnodeByMultipleBroker (znode pre-creation).
- AdvertisedListenersTest (advertised listeners per broker).
- SimpleProtocolHandlerTestsBase / SimpleProxyExtensionTestBase (handler listener address).
- PulsarTestContext.handlePreallocatePorts (preallocatePorts(true) preserved).

Everything else still uses port 0 with kernel assignment.
@merlimat
Copy link
Copy Markdown
Contributor Author

merlimat commented May 6, 2026

@lhotari thanks — restored PortManager and addressed both improvements you suggested.

The new implementation in b0e844dbab0:

  • Allocates from a 1000-port block outside the ephemeral range. Each JVM claims its block by binding a "lock" ServerSocket on the block's base port for the JVM's lifetime; other JVMs that hit the same range observe the bind failure and walk to the next block. Default range is [20000, 33000) which gives 13 blocks before exhaustion — should be plenty for parallel test JVMs.
  • releaseLockedPort now moves the port to a pending-release set rather than returning it to the pool immediately. The next allocation rebinds and only reclaims pending-release ports that bind successfully, so TIME_WAIT sockets don't get handed out.

Wired it back into the call sites that genuinely need pre-allocation: ServiceUrlQuarantineTest, ExtensibleLoadManagerImplWithAdvertisedListenersTest, ModularLoadManagerImplTest.testOwnBrokerZnodeByMultipleBroker, AdvertisedListenersTest, SimpleProtocolHandlerTestsBase, SimpleProxyExtensionTestBase, and PulsarTestContext.handlePreallocatePorts (the preallocatePorts(true) opt-in is preserved). Everything else still uses port 0 + read-back via getBrokerListenPort() / getZookeeperPort().

merlimat added 2 commits May 6, 2026 08:42
…-manager

# Conflicts:
#	pulsar-broker/src/test/java/org/apache/pulsar/broker/service/persistent/ShadowTopicRealBkTest.java
Copy link
Copy Markdown
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

merlimat added 2 commits May 6, 2026 13:37
Two distinct issues surfaced on https://github.com/apache/pulsar/actions/runs/25449522387:

1. BookKeeperClusterTestCase (3 copies in pulsar-metadata, managed-ledger,
   pulsar-package-management) now starts bookies on port 0. Multiple bookies
   ended up sharing the cookie identity 127.0.0.1:0, breaking cookie validation
   on the second bookie. This code path (unlike BKCluster, which uses bookieId
   in metadata) needs the bookie to be reachable by host:port for the test
   client's address resolver to work — bookieId-only registration leads to
   UnknownHostException when clients try to dial the bookie.

   Fix: route through PortManager.nextLockedFreePort() again for these cases.
   This is exactly the use case lhotari called out in the PR review.

2. PulsarFunctionTlsTest configured both brokers with WebServicePortTls=0.
   PulsarService.initializeWorkerConfigFromBrokerConfig builds the workerId
   from "{cluster}-fw-{host}-{configuredPort}", which means both brokers got
   the same workerId, the function-worker membership manager never elected a
   leader, and the test timed out waiting for one.

   Fix: pre-allocate WebServicePortTls via PortManager so each broker gets
   a unique configured port (and therefore a unique workerId).
@merlimat merlimat merged commit 8cd2f38 into apache:master May 7, 2026
116 of 121 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants