moq-relay: restore gossip-style cluster discovery via --cluster-node#1504
moq-relay: restore gossip-style cluster discovery via --cluster-node#1504kixelated wants to merge 3 commits into
Conversation
The 91ea43c rewrite collapsed clustering onto a single hop-routed origin and removed the old `--cluster-root`/`--cluster-node` flags that let leaves discover each other through a rendezvous. The replacement `--cluster-connect` requires every operator to enumerate every peer URL in every relay's config, which is O(N^2) configuration and bites anyone adding a node. Issue #1499 hit this after merging main. Hop-based loop detection now lives on every broadcast, so the old "publish a registration broadcast to a well-known path" trick can return without the `cluster: bool` auth flag that used to gate it. Bring back `--cluster-node <self-url>`: when set the relay publishes a placeholder broadcast at `.internal/origins/<url>` on its own origin, and watches that prefix for other peers' registrations. Each newly-announced URL is dialed with the same exponential-backoff loop as `--cluster-connect`, deduplicated by URL so static and gossip peers don't double-dial. Scope the `.internal/*` namespace to mTLS sessions only. Add a small `OriginProducer::block(prefix)` / `OriginConsumer::block(prefix)` view to moq-net (symmetric with `scope`/`with_root`) that refuses publishes and hides announces under a prefix. `Cluster::publisher` and `Cluster::subscriber` apply `.block(".internal")` for any token where `internal == false`, so even a broad-scope JWT can't see or publish into the mesh namespace. `--cluster-root` / `cluster.root` is kept as a hidden Option<String> field that bails at startup with a migration message pointing at `--cluster-connect` and `--cluster-node`. Rewrite doc/bin/relay/cluster.md to describe both the static and gossip modes, the `.internal/*` access rule, and a migration table. Update doc/bin/relay/config.md, rs/moq-relay/README.md to match. Switch demo/relay/leaf{0,1}.toml from static `connect = [root, leaf]` to gossip (`connect = [root]` + `node`), so the loop-detection exercise still runs but through discovery. Closes #1499 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
WalkthroughThis PR adds per-handle absolute path-prefix blocking to OriginProducer/OriginConsumer and uses that to enforce 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches✨ Simplify code
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@doc/bin/relay/cluster.md`:
- Line 83: The doc incorrectly claims a "cluster JWT must include the cluster
privileges that bypass" the `.internal/*` filter; update the paragraph about
"JWT-authenticated cluster sessions" to remove or correct that bypass claim and
clearly state that `.internal/*` endpoints are mTLS-only (JWT sessions are
tagged external and are filtered from `.internal/*` by default), and recommend
using mTLS for internal access; adjust the sentence referencing "cluster JWT" so
it no longer implies JWT can bypass `.internal/*` (or explicitly note any
supported exception if truly implemented).
In `@rs/moq-relay/src/cluster.rs`:
- Around line 196-205: The active map currently conflates statically configured
peers and gossip-discovered peers so an unannounce can abort static reconnect
loops; change the tracking so static peers are not aborted by gossip unannounces
by either (a) replacing active: Arc<Mutex<HashMap<String, AbortHandle>>> with a
map value that includes a provenance flag (e.g., enum or struct { handle:
AbortHandle, static: bool }) and only abort/remove entries when an unannounce
targets a non-static entry, or (b) maintain two maps (e.g., static_active and
discovered_active) and have spawn_dial, the initial loop that seeds from
self.config.connect, and the unannounce handler use the appropriate map; update
spawn_dial, any code that inserts/looks-up in active, and the unannounce/remove
logic to honor the static marker so static peers (seeded via
self.config.connect) keep their reconnect loop running.
- Around line 132-143: The current subscriber and publisher methods build a view
then call view.block(".internal") for non-internal tokens, which blocks
".internal" relative to the rooted view and allows bypass when token.root ==
".internal"; instead, for non-internal tokens apply block(".internal") to
self.origin before calling with_root(...) and scope(...). Update both
subscriber(&self, token: &AuthToken) and publisher(&self, token: &AuthToken) to
compute a base_origin = if token.internal { &self.origin } else {
&self.origin.block(".internal") } (or equivalent) and then call
base_origin.with_root(&token.root)?.scope(...) so the internal subtree is always
blocked for non-internal tokens while preserving original behavior for internal
tokens.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 7e87d291-01d5-481b-bcb8-5b3e5a7e3ee0
📒 Files selected for processing (7)
demo/relay/leaf0.tomldemo/relay/leaf1.tomldoc/bin/relay/cluster.mddoc/bin/relay/config.mdrs/moq-net/src/model/origin.rsrs/moq-relay/README.mdrs/moq-relay/src/cluster.rs
…hurn Three review fixes for #1504. 1. block-before-root for non-internal tokens. Previously the relay built a view as `with_root(token.root).scope(token.subscribe).consume()` and *then* called `.block(".internal")` on it. The block stores absolute paths, computed at call time as `self.root.join(prefix)`. For a token with `root == ".internal"` the block resolved to absolute `.internal/.internal`, and the real `.internal/origins/*` mesh paths leaked through. Apply `.block(".internal")` to `self.origin` *before* `with_root` so the block is anchored at the true absolute root regardless of the token's `root` claim. 2. `OriginProducer::consume()` now propagates `blocked` to the derived consumer. The view is meaningfully a (publish-refuse, announce-hide) pair; the consumer half was unfiltered, so the producer's publish block left the announce side wide open. The fix in #1 exposed this: the regression test for token.root=".internal" wouldn't have failed if the consumer had been blocked. Now `.block().consume()` returns a consumer that also filters announces. 3. Static peers survive gossip unannounces. `active: HashMap<String, AbortHandle>` used to be shared between `--cluster-connect` (static) peers and gossip-discovered peers. A peer that appeared in both would have its static reconnect loop aborted as soon as its `.internal/origins/<peer>` registration was unannounced (e.g. peer restart). Replace the value with `DialEntry { handle, is_static }` and only abort discovered entries on unannounce. Doc: rewrote the JWT subsection in `cluster.md` to drop the incorrect claim that a "cluster JWT" can bypass `.internal/*` filtering. `.internal/*` is mTLS-only; JWT cluster peers can relay user traffic but cannot participate in gossip discovery. Tests: - `internal_paths_invisible_when_token_root_is_internal` exercises the block-before-root path. - `gossip_unannounce_preserves_static_peer` exercises the new `DialEntry.is_static` branch by inserting both a static and a discovered entry into the active map and asserting only the discovered one is removed. Extract `Cluster::handle_gossip_unannounce` so the test can hit the exact same code path the discovery loop uses. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
rs/moq-relay/src/cluster.rs (1)
187-196:⚠️ Potential issue | 🟠 Major | ⚡ Quick winKeep
--cluster-nodeusable as passive rendezvous mode.
run()currently treatsnode.is_some()as outbound dial work, so a node-only relay now fails withoutwith_client()and also starts discovery that dials inbound-announced peers. That breaks the documented--cluster-node-only behavior where the relay should just advertise itself and accept inbound cluster connections.Proposed fix
- let has_work = !self.config.connect.is_empty() || self.config.node.is_some(); + let has_static_dials = !self.config.connect.is_empty(); + let has_work = has_static_dials || self.config.node.is_some(); if !has_work { tracing::info!("no cluster peers configured; running standalone"); return Ok(()); } - anyhow::ensure!( - self.client.is_some(), - "cluster peers configured but no QUIC client attached (call Cluster::with_client)" - ); + if has_static_dials { + anyhow::ensure!( + self.client.is_some(), + "cluster peers configured but no QUIC client attached (call Cluster::with_client)" + ); + } @@ - if let Some(self_url) = self.config.node.clone() { + if has_static_dials { + if let Some(self_url) = self.config.node.clone() { let this = self.clone(); let token = token.clone(); let active = active.clone(); tasks.spawn(async move { this.run_discovery(self_url, token, active).await; }); + } }This contradicts the documented node-only rendezvous behavior in
rs/moq-relay/README.md:67-76.Also applies to: 231-239
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@rs/moq-relay/src/cluster.rs` around lines 187 - 196, In run(), treat a configured node as passive rendezvous (advertise/accept) and only require a QUIC client when there is outbound dial work: replace the single has_work check with two checks (e.g., has_any_config = !self.config.connect.is_empty() || self.config.node.is_some() and has_outbound = !self.config.connect.is_empty()), and change the anyhow::ensure! that checks self.client to only fire when has_outbound is true; apply the same split/logic to the second occurrence noted around the 231-239 block so node-only mode no longer demands Cluster::with_client and does not initiate outbound dialing.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@rs/moq-relay/src/cluster.rs`:
- Around line 187-196: In run(), treat a configured node as passive rendezvous
(advertise/accept) and only require a QUIC client when there is outbound dial
work: replace the single has_work check with two checks (e.g., has_any_config =
!self.config.connect.is_empty() || self.config.node.is_some() and has_outbound =
!self.config.connect.is_empty()), and change the anyhow::ensure! that checks
self.client to only fire when has_outbound is true; apply the same split/logic
to the second occurrence noted around the 231-239 block so node-only mode no
longer demands Cluster::with_client and does not initiate outbound dialing.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 4f02c04d-4e2b-4716-b2fd-6d042b971a7a
📒 Files selected for processing (3)
doc/bin/relay/cluster.mdrs/moq-net/src/model/origin.rsrs/moq-relay/src/cluster.rs
✅ Files skipped from review due to trivial changes (1)
- doc/bin/relay/cluster.md
The README and cluster docs both describe `--cluster-node` alone as a valid "rendezvous" mode: the relay advertises its URL on the cluster origin and accepts inbound cluster sessions, but does not dial out. The actual `run()` implementation was inconsistent with this: - It required `Cluster::with_client` whenever any cluster work was configured, including node-only mode. A relay without outbound dials has no need for a QUIC client. - It spawned the gossip discovery task whenever `node` was set, even with no `connect` peers. A passive rendezvous can't usefully dial peers it learns about (they already have inbound sessions to us; dialing back creates duplicates). - The `_self_registration` BroadcastProducer was held in a local. If no dial tasks were spawned, the `while tasks.join_next()` loop exited immediately, dropping the broadcast and unannouncing the relay before any peer could see it. Fix: - Split `has_work` into `has_outbound = !connect.is_empty()` and `has_work = has_outbound || node.is_some()`. - `with_client` is only required when `has_outbound`. - Discovery is only spawned when `has_outbound` AND `node.is_some()`. - When no tasks are spawned (passive rendezvous), park forever with `std::future::pending().await` so the self-registration broadcast stays alive. `cluster.run()` is one arm of a `tokio::select!` in `main.rs`, so the process still exits cleanly via the other arms. Tests: - `passive_rendezvous_runs_without_client_and_advertises_self` builds a Cluster with only `node` set, spawns `run()`, asserts the self-registration is visible on the origin, and confirms `run()` parks (doesn't return) so the registration stays published. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@coderabbitai, fixed in 0524a10 — split |
|
Tip For best results, initiate chat on the files or code changes.
If you found this review helpful, would you consider giving us a shout-out on X? Thank you for using CodeRabbit! |
Summary
Fixes #1499. Restores gossip-style peer discovery for
moq-relayclusters by re-introducing--cluster-node <self-url>, while keeping the new hop-routed origin and loop detection from #1322.--cluster-node <self-url>publishes a placeholder broadcast at.internal/origins/<url>so peers reached via--cluster-connectdiscover and dial this relay automatically. Pairs with--cluster-connectfor the bootstrap; multiple nodes form a full mesh after a single rendezvous..block(prefix)view onOriginProducer/OriginConsumer(symmetric withscopeandwith_root) that refuses publishes and hides announces under a prefix. Applied to non-mTLS sessions inCluster::publisher/Cluster::subscriberso JWT and anonymous clients can never see or publish into.internal/*, no matter how broad their declared scope.--cluster-root/cluster.rootis kept as a hiddenOption<String>solely so the relay can bail at startup with a migration message pointing at--cluster-connectand--cluster-node. Catches old configs that were valid pre-91ea43c5.doc/bin/relay/cluster.mdto cover both the static and gossip modes, the.internal/*access rule, and a migration table. Updateddoc/bin/relay/config.md,rs/moq-relay/README.mdto match.demo/relay/leaf{0,1}.tomlfrom a hardcodedconnect = [root, leaf0]list to gossip (connect = [root]+node) so the existing cluster-loop-detection exercise still runs, but via discovery.Why
Without gossip, every operator had to enumerate every peer URL in every relay's config, and adding a node required editing every existing node. The hop-based loop detection added in #1322 makes the old gossip trick safe to bring back without the special-case
cluster: boolauth flag it used to need.Reviewer notes
.internal/prefix matches the existing dotted-infrastructure convention (.stats/...from moq-lite/moq-relay: hop-based clustering #1322). Choosing it (versusinternal/originsfrom the pre-rewrite code) gives a path that's also conventionally hidden from app-level subscribers.OriginProducer::publish_broadcastalready refuses it because the hop chain contains this relay'sOriginId. No special case needed.--cluster-connectshare aHashMap<String, AbortHandle>so a peer reached via both paths only opens one session.AuthToken::unrestricted()(mTLS) setsinternal: trueand skips the.block(".internal")wrap; JWT sessions always get the block, even withsubscribe = [""].Test plan
cargo test -p moq-net— 332 pass, including 5 new tests for theblock()view (announce filter, publish refusal, get_broadcast filter, stacking, with_root interaction).cargo test -p moq-relay— 82 pass, including 5 new tests:--cluster-rooterrors with migration message (CLI and TOML paths),.internal/*invisible to broad-scope JWT, visible to mTLS,cluster.nodeconfig round-trips.cargo clippy -p moq-net -p moq-relay --all-targets --all-features -- -D warnings— clean.cargo fmt --all --checkandbun biome check— clean.cargo check --workspace --exclude moq-gst— clean. (moq-gst fails on this machine due to missinggstreamer-1.0system lib, unrelated.)demo/relay/root.toml+ leaf0 + leaf1, publish through leaf0, subscribe from leaf1; verify it flows without leaf1 having leaf0 in its staticconnectlist. (Not run locally.)🤖 Generated with Claude Code