Skip to content

moq-relay: restore gossip-style cluster discovery via --cluster-node#1504

Open
kixelated wants to merge 3 commits into
mainfrom
claude/funny-euler-64507a
Open

moq-relay: restore gossip-style cluster discovery via --cluster-node#1504
kixelated wants to merge 3 commits into
mainfrom
claude/funny-euler-64507a

Conversation

@kixelated
Copy link
Copy Markdown
Collaborator

Summary

Fixes #1499. Restores gossip-style peer discovery for moq-relay clusters by re-introducing --cluster-node <self-url>, while keeping the new hop-routed origin and loop detection from #1322.

  • New: --cluster-node <self-url> publishes a placeholder broadcast at .internal/origins/<url> so peers reached via --cluster-connect discover and dial this relay automatically. Pairs with --cluster-connect for the bootstrap; multiple nodes form a full mesh after a single rendezvous.
  • New: .block(prefix) view on OriginProducer / OriginConsumer (symmetric with scope and with_root) that refuses publishes and hides announces under a prefix. Applied to non-mTLS sessions in Cluster::publisher / Cluster::subscriber so JWT and anonymous clients can never see or publish into .internal/*, no matter how broad their declared scope.
  • Removed flag handling: --cluster-root / cluster.root is kept as a hidden Option<String> solely so the relay can bail at startup with a migration message pointing at --cluster-connect and --cluster-node. Catches old configs that were valid pre-91ea43c5.
  • Docs: rewrote doc/bin/relay/cluster.md to cover both the static and gossip modes, the .internal/* access rule, and a migration table. Updated doc/bin/relay/config.md, rs/moq-relay/README.md to match.
  • Demos: switched demo/relay/leaf{0,1}.toml from a hardcoded connect = [root, leaf0] list to gossip (connect = [root] + node) so the existing cluster-loop-detection exercise still runs, but via discovery.

Why

Without gossip, every operator had to enumerate every peer URL in every relay's config, and adding a node required editing every existing node. The hop-based loop detection added in #1322 makes the old gossip trick safe to bring back without the special-case cluster: bool auth flag it used to need.

Reviewer notes

  • The .internal/ prefix matches the existing dotted-infrastructure convention (.stats/... from moq-lite/moq-relay: hop-based clustering #1322). Choosing it (versus internal/origins from the pre-rewrite code) gives a path that's also conventionally hidden from app-level subscribers.
  • Loop detection: when a relay's own registration is gossiped back through a peer, OriginProducer::publish_broadcast already refuses it because the hop chain contains this relay's OriginId. No special case needed.
  • Dial dedup: the gossip discovery loop and static --cluster-connect share a HashMap<String, AbortHandle> so a peer reached via both paths only opens one session.
  • mTLS-only access: AuthToken::unrestricted() (mTLS) sets internal: true and skips the .block(".internal") wrap; JWT sessions always get the block, even with subscribe = [""].

Test plan

  • cargo test -p moq-net — 332 pass, including 5 new tests for the block() view (announce filter, publish refusal, get_broadcast filter, stacking, with_root interaction).
  • cargo test -p moq-relay — 82 pass, including 5 new tests: --cluster-root errors with migration message (CLI and TOML paths), .internal/* invisible to broad-scope JWT, visible to mTLS, cluster.node config round-trips.
  • cargo clippy -p moq-net -p moq-relay --all-targets --all-features -- -D warnings — clean.
  • cargo fmt --all --check and bun biome check — clean.
  • cargo check --workspace --exclude moq-gst — clean. (moq-gst fails on this machine due to missing gstreamer-1.0 system lib, unrelated.)
  • Manual e2e: bring up demo/relay/root.toml + leaf0 + leaf1, publish through leaf0, subscribe from leaf1; verify it flows without leaf1 having leaf0 in its static connect list. (Not run locally.)

🤖 Generated with Claude Code

The 91ea43c rewrite collapsed clustering onto a single hop-routed
origin and removed the old `--cluster-root`/`--cluster-node` flags
that let leaves discover each other through a rendezvous. The
replacement `--cluster-connect` requires every operator to enumerate
every peer URL in every relay's config, which is O(N^2) configuration
and bites anyone adding a node. Issue #1499 hit this after merging
main.

Hop-based loop detection now lives on every broadcast, so the old
"publish a registration broadcast to a well-known path" trick can
return without the `cluster: bool` auth flag that used to gate it.

Bring back `--cluster-node <self-url>`: when set the relay publishes
a placeholder broadcast at `.internal/origins/<url>` on its own
origin, and watches that prefix for other peers' registrations. Each
newly-announced URL is dialed with the same exponential-backoff loop
as `--cluster-connect`, deduplicated by URL so static and gossip
peers don't double-dial.

Scope the `.internal/*` namespace to mTLS sessions only. Add a small
`OriginProducer::block(prefix)` / `OriginConsumer::block(prefix)`
view to moq-net (symmetric with `scope`/`with_root`) that refuses
publishes and hides announces under a prefix. `Cluster::publisher`
and `Cluster::subscriber` apply `.block(".internal")` for any token
where `internal == false`, so even a broad-scope JWT can't see or
publish into the mesh namespace.

`--cluster-root` / `cluster.root` is kept as a hidden Option<String>
field that bails at startup with a migration message pointing at
`--cluster-connect` and `--cluster-node`.

Rewrite doc/bin/relay/cluster.md to describe both the static and
gossip modes, the `.internal/*` access rule, and a migration table.
Update doc/bin/relay/config.md, rs/moq-relay/README.md to match.
Switch demo/relay/leaf{0,1}.toml from static `connect = [root,
leaf]` to gossip (`connect = [root]` + `node`), so the loop-detection
exercise still runs but through discovery.

Closes #1499

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 25, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5db117db-2cbe-4282-96bd-241ec12a5c06

📥 Commits

Reviewing files that changed from the base of the PR and between ea8678f and 0524a10.

📒 Files selected for processing (1)
  • rs/moq-relay/src/cluster.rs

Walkthrough

This PR adds per-handle absolute path-prefix blocking to OriginProducer/OriginConsumer and uses that to enforce .internal/* access restrictions. Cluster behavior is extended with an optional node to self-advertise via a gossip mesh under .internal/origins/<node> while still supporting static peers in connect. Legacy cluster.root is retained as a hidden migration-only option that causes startup errors. Documentation and demo configs are updated for the static vs gossip clustering model and migration guidance.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically identifies the main change: restoring gossip-style cluster discovery via the --cluster-node flag, which is the primary objective and feature addition in this PR.
Description check ✅ Passed The description is directly related to the changeset, providing clear context about the gossip-style peer discovery restoration, the new --cluster-node flag, the .block() feature, migration handling for --cluster-root, and comprehensive documentation updates.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
✨ Simplify code
  • Create PR with simplified code
  • Commit simplified code in branch claude/funny-euler-64507a

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@doc/bin/relay/cluster.md`:
- Line 83: The doc incorrectly claims a "cluster JWT must include the cluster
privileges that bypass" the `.internal/*` filter; update the paragraph about
"JWT-authenticated cluster sessions" to remove or correct that bypass claim and
clearly state that `.internal/*` endpoints are mTLS-only (JWT sessions are
tagged external and are filtered from `.internal/*` by default), and recommend
using mTLS for internal access; adjust the sentence referencing "cluster JWT" so
it no longer implies JWT can bypass `.internal/*` (or explicitly note any
supported exception if truly implemented).

In `@rs/moq-relay/src/cluster.rs`:
- Around line 196-205: The active map currently conflates statically configured
peers and gossip-discovered peers so an unannounce can abort static reconnect
loops; change the tracking so static peers are not aborted by gossip unannounces
by either (a) replacing active: Arc<Mutex<HashMap<String, AbortHandle>>> with a
map value that includes a provenance flag (e.g., enum or struct { handle:
AbortHandle, static: bool }) and only abort/remove entries when an unannounce
targets a non-static entry, or (b) maintain two maps (e.g., static_active and
discovered_active) and have spawn_dial, the initial loop that seeds from
self.config.connect, and the unannounce handler use the appropriate map; update
spawn_dial, any code that inserts/looks-up in active, and the unannounce/remove
logic to honor the static marker so static peers (seeded via
self.config.connect) keep their reconnect loop running.
- Around line 132-143: The current subscriber and publisher methods build a view
then call view.block(".internal") for non-internal tokens, which blocks
".internal" relative to the rooted view and allows bypass when token.root ==
".internal"; instead, for non-internal tokens apply block(".internal") to
self.origin before calling with_root(...) and scope(...). Update both
subscriber(&self, token: &AuthToken) and publisher(&self, token: &AuthToken) to
compute a base_origin = if token.internal { &self.origin } else {
&self.origin.block(".internal") } (or equivalent) and then call
base_origin.with_root(&token.root)?.scope(...) so the internal subtree is always
blocked for non-internal tokens while preserving original behavior for internal
tokens.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7e87d291-01d5-481b-bcb8-5b3e5a7e3ee0

📥 Commits

Reviewing files that changed from the base of the PR and between 7a8616d and 83b84a4.

📒 Files selected for processing (7)
  • demo/relay/leaf0.toml
  • demo/relay/leaf1.toml
  • doc/bin/relay/cluster.md
  • doc/bin/relay/config.md
  • rs/moq-net/src/model/origin.rs
  • rs/moq-relay/README.md
  • rs/moq-relay/src/cluster.rs

Comment thread doc/bin/relay/cluster.md Outdated
Comment thread rs/moq-relay/src/cluster.rs Outdated
Comment thread rs/moq-relay/src/cluster.rs
…hurn

Three review fixes for #1504.

1. block-before-root for non-internal tokens. Previously the relay
   built a view as `with_root(token.root).scope(token.subscribe).consume()`
   and *then* called `.block(".internal")` on it. The block stores
   absolute paths, computed at call time as `self.root.join(prefix)`.
   For a token with `root == ".internal"` the block resolved to
   absolute `.internal/.internal`, and the real `.internal/origins/*`
   mesh paths leaked through. Apply `.block(".internal")` to
   `self.origin` *before* `with_root` so the block is anchored at the
   true absolute root regardless of the token's `root` claim.

2. `OriginProducer::consume()` now propagates `blocked` to the derived
   consumer. The view is meaningfully a (publish-refuse, announce-hide)
   pair; the consumer half was unfiltered, so the producer's publish
   block left the announce side wide open. The fix in #1 exposed this:
   the regression test for token.root=".internal" wouldn't have failed
   if the consumer had been blocked. Now `.block().consume()` returns
   a consumer that also filters announces.

3. Static peers survive gossip unannounces. `active: HashMap<String,
   AbortHandle>` used to be shared between `--cluster-connect` (static)
   peers and gossip-discovered peers. A peer that appeared in both
   would have its static reconnect loop aborted as soon as its
   `.internal/origins/<peer>` registration was unannounced (e.g. peer
   restart). Replace the value with `DialEntry { handle, is_static }`
   and only abort discovered entries on unannounce.

Doc: rewrote the JWT subsection in `cluster.md` to drop the incorrect
claim that a "cluster JWT" can bypass `.internal/*` filtering.
`.internal/*` is mTLS-only; JWT cluster peers can relay user traffic
but cannot participate in gossip discovery.

Tests:
- `internal_paths_invisible_when_token_root_is_internal` exercises
  the block-before-root path.
- `gossip_unannounce_preserves_static_peer` exercises the new
  `DialEntry.is_static` branch by inserting both a static and a
  discovered entry into the active map and asserting only the
  discovered one is removed.

Extract `Cluster::handle_gossip_unannounce` so the test can hit the
exact same code path the discovery loop uses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
rs/moq-relay/src/cluster.rs (1)

187-196: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep --cluster-node usable as passive rendezvous mode.

run() currently treats node.is_some() as outbound dial work, so a node-only relay now fails without with_client() and also starts discovery that dials inbound-announced peers. That breaks the documented --cluster-node-only behavior where the relay should just advertise itself and accept inbound cluster connections.

Proposed fix
-		let has_work = !self.config.connect.is_empty() || self.config.node.is_some();
+		let has_static_dials = !self.config.connect.is_empty();
+		let has_work = has_static_dials || self.config.node.is_some();
 		if !has_work {
 			tracing::info!("no cluster peers configured; running standalone");
 			return Ok(());
 		}
 
-		anyhow::ensure!(
-			self.client.is_some(),
-			"cluster peers configured but no QUIC client attached (call Cluster::with_client)"
-		);
+		if has_static_dials {
+			anyhow::ensure!(
+				self.client.is_some(),
+				"cluster peers configured but no QUIC client attached (call Cluster::with_client)"
+			);
+		}
@@
-		if let Some(self_url) = self.config.node.clone() {
+		if has_static_dials {
+			if let Some(self_url) = self.config.node.clone() {
 			let this = self.clone();
 			let token = token.clone();
 			let active = active.clone();
 			tasks.spawn(async move {
 				this.run_discovery(self_url, token, active).await;
 			});
+			}
 		}

This contradicts the documented node-only rendezvous behavior in rs/moq-relay/README.md:67-76.

Also applies to: 231-239

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rs/moq-relay/src/cluster.rs` around lines 187 - 196, In run(), treat a
configured node as passive rendezvous (advertise/accept) and only require a QUIC
client when there is outbound dial work: replace the single has_work check with
two checks (e.g., has_any_config = !self.config.connect.is_empty() ||
self.config.node.is_some() and has_outbound = !self.config.connect.is_empty()),
and change the anyhow::ensure! that checks self.client to only fire when
has_outbound is true; apply the same split/logic to the second occurrence noted
around the 231-239 block so node-only mode no longer demands
Cluster::with_client and does not initiate outbound dialing.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@rs/moq-relay/src/cluster.rs`:
- Around line 187-196: In run(), treat a configured node as passive rendezvous
(advertise/accept) and only require a QUIC client when there is outbound dial
work: replace the single has_work check with two checks (e.g., has_any_config =
!self.config.connect.is_empty() || self.config.node.is_some() and has_outbound =
!self.config.connect.is_empty()), and change the anyhow::ensure! that checks
self.client to only fire when has_outbound is true; apply the same split/logic
to the second occurrence noted around the 231-239 block so node-only mode no
longer demands Cluster::with_client and does not initiate outbound dialing.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4f02c04d-4e2b-4716-b2fd-6d042b971a7a

📥 Commits

Reviewing files that changed from the base of the PR and between 83b84a4 and ea8678f.

📒 Files selected for processing (3)
  • doc/bin/relay/cluster.md
  • rs/moq-net/src/model/origin.rs
  • rs/moq-relay/src/cluster.rs
✅ Files skipped from review due to trivial changes (1)
  • doc/bin/relay/cluster.md

The README and cluster docs both describe `--cluster-node` alone as a
valid "rendezvous" mode: the relay advertises its URL on the cluster
origin and accepts inbound cluster sessions, but does not dial out.
The actual `run()` implementation was inconsistent with this:

- It required `Cluster::with_client` whenever any cluster work was
  configured, including node-only mode. A relay without outbound dials
  has no need for a QUIC client.
- It spawned the gossip discovery task whenever `node` was set, even
  with no `connect` peers. A passive rendezvous can't usefully dial
  peers it learns about (they already have inbound sessions to us;
  dialing back creates duplicates).
- The `_self_registration` BroadcastProducer was held in a local. If
  no dial tasks were spawned, the `while tasks.join_next()` loop
  exited immediately, dropping the broadcast and unannouncing the
  relay before any peer could see it.

Fix:

- Split `has_work` into `has_outbound = !connect.is_empty()` and
  `has_work = has_outbound || node.is_some()`.
- `with_client` is only required when `has_outbound`.
- Discovery is only spawned when `has_outbound` AND `node.is_some()`.
- When no tasks are spawned (passive rendezvous), park forever with
  `std::future::pending().await` so the self-registration broadcast
  stays alive. `cluster.run()` is one arm of a `tokio::select!` in
  `main.rs`, so the process still exits cleanly via the other arms.

Tests:
- `passive_rendezvous_runs_without_client_and_advertises_self` builds
  a Cluster with only `node` set, spawns `run()`, asserts the
  self-registration is visible on the origin, and confirms `run()`
  parks (doesn't return) so the registration stays published.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kixelated
Copy link
Copy Markdown
Collaborator Author

@coderabbitai, fixed in 0524a10 — split has_work into has_outbound = !connect.is_empty() and a node-only check. The with_client requirement and the discovery spawn are now both gated behind has_outbound. Also fixed a related bug where the _self_registration broadcast would be dropped immediately in passive-rendezvous mode (no tasks to wait on → while tasks.join_next() exited): run() now parks with std::future::pending() when no dial tasks are spawned, keeping the registration alive. Regression test: passive_rendezvous_runs_without_client_and_advertises_self. (Written by Claude)

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 25, 2026

Tip

For best results, initiate chat on the files or code changes.

@kixelated Thanks for the detailed write-up! The has_outbound split cleanly separates the two concerns, and parking with std::future::pending() is the right call to keep the registration broadcast alive in passive-rendezvous mode. The regression test covering that path is a nice addition too. Looks good!


If you found this review helpful, would you consider giving us a shout-out on X?

Thank you for using CodeRabbit!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

old leaf discovery strategy gone?

1 participant