feat: periodic DNS re-resolution for client-side service discovery#513

Open
cgreeno wants to merge 11 commits into elixir-grpc:master from cgreeno:feature/periodic-dns-re-resolution

Conversation

@cgreeno cgreeno commented Mar 30, 2026

Summary

Adds periodic DNS re-resolution to the gRPC client, enabling automatic discovery of new backends and removal of stale ones. This is essential for Kubernetes deployments where pods scale up/down behind headless services, matching the behaviour of grpc-go and grpc-java.

What's included

New module: GRPC.Client.DnsResolver — a dedicated GenServer linked to each Connection that owns the entire DNS resolve lifecycle:

  • Periodic re-resolution on a configurable interval (default 30s)
  • Exponential backoff on DNS failure (doubles per failure, caps at 5min, resets on success)
  • Rate limiting to prevent DNS flooding from resolve_now/1 calls (default 5s floor)
  • Telemetry events: [:grpc, :client, :resolve, :stop] on success, [:grpc, :client, :resolve, :error] on failure
  • Sends {:dns_result, result} to the Connection process for channel reconciliation
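
The backoff and reset rules above are simple to state precisely. A minimal sketch of the arithmetic (module and function names are illustrative, not the PR's actual implementation; the defaults match the connect/2 options described below):

```elixir
defmodule BackoffSketch do
  # Illustrative only: mirrors the described behaviour, not the PR's code.
  @base_interval 30_000   # :resolve_interval default (30s)
  @max_interval 300_000   # :max_resolve_interval default (5min)

  # Each DNS failure doubles the interval, capped at the maximum.
  def next_interval(current), do: min(current * 2, @max_interval)

  # A successful resolution resets the interval to the base.
  def reset_interval, do: @base_interval
end
```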

Changes to GRPC.Client.Connection:

  • Channel reconciliation on re-resolution: new backends get channels created, removed backends get disconnected, unchanged backends are left as-is
  • Load balancer re-initialized with updated address list after reconciliation
  • Fallback to any healthy channel when LB picks a failed one
  • Retry previously failed channels on subsequent re-resolution cycles
  • resolve_now/1 public API for on-demand re-resolution (e.g. from health checks)
  • Fixed :refresh handler to handle {:error, _} channels without crashing
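
The reconciliation step amounts to a set difference between the current channel map and the freshly resolved address list. A hedged sketch (the module and function names here are hypothetical, used only to illustrate the split into connect/disconnect/keep):

```elixir
defmodule ReconcileSketch do
  # Given the current channels (address => channel) and the newly
  # resolved addresses, decide what to connect, disconnect, and keep.
  def diff(current_channels, new_addresses) do
    old = MapSet.new(Map.keys(current_channels))
    new = MapSet.new(new_addresses)

    %{
      to_connect: new |> MapSet.difference(old) |> Enum.sort(),
      to_disconnect: old |> MapSet.difference(new) |> Enum.sort(),
      unchanged: old |> MapSet.intersection(new) |> Enum.sort()
    }
  end
end
```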

Opt-in via dns:// scheme — only DNS targets trigger re-resolution. Static targets (ipv4:, ipv6:, unix:) are unaffected. No existing behaviour changes for http://, https://, or bare host:port targets.

New options for connect/2

| Option | Default | Description |
|--------|---------|-------------|
| `:resolve_interval` | 30000 | Base interval between DNS re-resolutions (ms) |
| `:max_resolve_interval` | 300000 | Backoff cap (ms) |
| `:min_resolve_interval` | 5000 | Rate-limit floor for `resolve_now/1` (ms) |
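
Opting in might look like this (illustrative target and values; the dns:// scheme and option names are the ones introduced by this PR):

```elixir
# Opt-in: only dns:// targets get periodic re-resolution.
{:ok, channel} =
  GRPC.Stub.connect("dns://my-service.default.svc.cluster.local:50051",
    resolve_interval: 10_000,       # re-resolve every 10s
    max_resolve_interval: 120_000,  # backoff cap at 2min
    min_resolve_interval: 2_000     # resolve_now/1 floor of 2s
  )
```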

Note on persistent_term

The existing codebase uses :persistent_term for zero-copy channel passing to client stubs. This works well for infrequent updates, but persistent_term updates trigger a global garbage collection pass across all BEAM processes (Erlang docs). With periodic re-resolution, channel reconciliation runs every 30s+ per connection.

To mitigate this, we guard persistent_term.put calls to only write when the picked channel actually changes, skipping redundant writes on no-op re-resolution cycles. A future improvement would be migrating to ETS with read_concurrency: true, which has no global GC cost on writes while maintaining the same (or close) read performance for pick_channel.
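
The guard is a read-compare-write. A minimal sketch (the key shape and module name are illustrative, not the library's actual storage key):

```elixir
defmodule ChannelStoreSketch do
  # Only write when the value actually changed: every
  # :persistent_term.put/2 triggers a global scan/GC pass across
  # processes, so no-op re-resolution cycles should skip the write.
  def put_if_changed(key, channel) do
    case :persistent_term.get(key, :undefined) do
      ^channel ->
        :unchanged

      _other ->
        :persistent_term.put(key, channel)
        :updated
    end
  end
end
```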

Test coverage (28 tests)

| # | Category | Test |
|---|----------|------|
| 1-4 | Core | Scale-up, scale-down, no-op, complete replacement |
| 5-8 | Errors | DNS failure, recovery, empty addresses, recovery after empty |
| 9-10 | Stability | pick_channel during/after re-resolution |
| 11 | Timer | Repeated cycles accumulate changes |
| 12 | Edge | Non-DNS targets skip re-resolution |
| 13 | Edge | Disconnect kills linked resolver cleanly |
| 14 | Edge | LB error during re-resolution; connection survives |
| 15 | Edge | Port change on same host detected |
| 16 | Backoff | Doubles on failure, resets on success, caps at max |
| 17 | Rate limit | resolve_now calls coalesced within min_resolve_interval |
| 18 | Telemetry | :stop on success, :error on failure/empty |
| 19-20 | Persistence | Fallback to healthy channel, retry failed channels |
| 21 | Async | Connection stays responsive during slow DNS |
| 22 | Refresh | GenServer survives :refresh picking a failed channel |

Test plan

  • All 28 re-resolution tests pass (10 consecutive runs, 0 flakes)
  • Full test suite: 244 tests, 0 failures
  • End-to-end tested with Docker Compose (scale up/down/outage/recovery)
  • End-to-end tested on Kubernetes with minikube + headless service + CoreDNS

cgreeno and others added 10 commits March 27, 2026 16:46
… telemetry

The gRPC client now periodically re-resolves DNS targets to discover
new backends and remove stale ones, matching the behaviour of grpc-go
and grpc-java. This is essential for Kubernetes deployments where pods
scale up/down behind headless services.

Key changes in GRPC.Client.Connection:

- Periodic re-resolution: DNS targets are re-resolved every 30s
  (configurable via :resolve_interval). Static targets (ipv4:, ipv6:,
  unix:) are excluded.

- Channel reconciliation: On re-resolution, new backends get channels
  created, removed backends get disconnected, unchanged backends are
  kept as-is. The load balancer is re-initialized with the updated
  address list.

- Exponential backoff: On DNS failure or empty results, the interval
  doubles (capped at :max_resolve_interval, default 5min). Resets to
  base on success.

- Rate limiting: A :min_resolve_interval (default 5s) prevents rapid
  resolve_now() calls from flooding DNS. Intended for future health
  check / heartbeat integration.

- Telemetry: Emits [:grpc, :client, :resolve, :stop] on success and
  [:grpc, :client, :resolve, :error] on failure, with duration, target,
  and address_count measurements.

- resolve_now/1: Public API to trigger on-demand re-resolution
  (subject to rate limiting).

New options for connect/2:
  :resolve_interval, :max_resolve_interval, :min_resolve_interval

Test suite: 22 test cases modelled after grpc-go's dns_resolver_test.go
and grpc-java's DnsNameResolverTest.java, covering scale-up, scale-down,
complete replacement, DNS failure, empty addresses, recovery, backoff,
rate limiting, telemetry, pick_channel stability, and edge cases.
Addresses code review feedback:

1. Timer leak in resolve_now/1: Each resolve_now call was spawning a new
   concurrent timer loop. Now schedule_re_resolve/1 cancels any existing
   timer via Process.cancel_timer before scheduling a new one.

2. Stale persistent_term after re-resolution: When the LB picked an
   unhealthy channel ({:error, _}), persistent_term was left holding a
   dead channel. Now falls back to any healthy channel in the pool, or
   clears persistent_term if none exist so pick_channel returns
   {:error, :no_connection}.

3. Bare hostname normalization: "my-service:50051" was normalized to
   "ipv4:my-service:50051" which skipped DNS resolution and re-resolution.
   Now normalizes to "dns://my-service:50051" matching the Resolver docs:
   "If no scheme is specified, dns is assumed". Removed dead is_nil check
   from dns_target?/1.

Added FailingClientAdapter test helper and 2 new tests for the stale
persistent_term fix. Replaced counter-based test stubs with phase-based
stubs to eliminate race conditions.
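
The timer-leak fix in item 1 boils down to cancel-before-schedule. A sketch under assumed names (the state key and message atom are illustrative):

```elixir
defmodule TimerSketch do
  # Cancel any in-flight timer before scheduling the next resolve, so
  # repeated resolve_now/1 calls never leave concurrent timer loops.
  def schedule_re_resolve(state, interval) do
    if ref = state[:resolve_timer_ref], do: Process.cancel_timer(ref)
    ref = Process.send_after(self(), :re_resolve, interval)
    Map.put(state, :resolve_timer_ref, ref)
  end
end
```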
Addresses remaining code review feedback:

1. Retry previously failed channels: If an address was in DNS but failed
   to connect initially ({:error, _}), subsequent re-resolutions now
   retry the connection instead of leaving it permanently broken.

2. DNS resolution timeout: resolver.resolve() is now called in a spawned
   process with a configurable timeout (:resolve_timeout, default 10s).
   If DNS hangs, the GenServer recovers after the timeout and triggers
   backoff instead of blocking indefinitely.

New connect/2 option:
  :resolve_timeout — max time for a single DNS resolution (default 10s)

Added 2 new tests: retry-on-reconnect and resolve timeout.
The :refresh handler only matched nil and {:ok, channel} when the LB
picked a channel. If round-robin picked an address with a failed
connection ({:error, reason}), it would crash with CaseClauseError.

Now uses a catch-all that logs a warning and keeps the current
virtual_channel, preventing GenServer crash loops when some backends
are temporarily unreachable.
DNS re-resolution now runs in an async task spawned via
Task.Supervisor.async_nolink, so the GenServer stays fully responsive
to :disconnect, :refresh, and pick_channel calls during slow DNS
queries.

- Added GRPC.Client.ResolveSupervisor (Task.Supervisor) to the
  application supervision tree
- :re_resolve handler spawns a task instead of blocking
- Task result handled via {ref, result} message pattern
- Task crash handled via {:DOWN, ...} message, triggers backoff
- Duplicate resolve prevented: skips if resolve_task already in flight
- Removed resolve_timeout option (no longer needed — task lifecycle
  managed by OTP supervision)
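
The async pattern described here can be sketched as follows (the supervisor is started inline for illustration; in the PR it lives in the application tree as GRPC.Client.ResolveSupervisor, and the anonymous function stands in for a real DNS lookup):

```elixir
# Start a Task.Supervisor inline (the PR registers one in the app tree).
{:ok, sup} = Task.Supervisor.start_link()

# async_nolink: a crash or a slow DNS query never blocks or takes down
# the caller; in a GenServer the result arrives as a {ref, result}
# message and a crash as a {:DOWN, ref, :process, pid, reason} message.
task = Task.Supervisor.async_nolink(sup, fn -> {:ok, ["10.0.0.1"]} end)

# Here we simply await; Connection instead handles those messages.
{:ok, addresses} = Task.await(task)
```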
http://my-service:50051 and https://my-service:50051 were being
normalized to ipv4: which skipped DNS resolution and re-resolution.
Now coerced to dns:// like bare host:port targets, so Kubernetes
users with protocol-prefixed targets get periodic re-resolution.
Reverted the dns:// normalization for http/https/bare targets back to
the original ipv4: behavior. This ensures full backwards compatibility
— no existing behavior changes for any target format.

Re-resolution is now opt-in: users pass dns://my-service:50051 to get
periodic DNS re-resolution. All other target formats (http://, https://,
bare host:port, ipv4:, unix:) work exactly as before.

Also removed verify_on_exit! from re_resolve_test.exs to eliminate a
flaky interaction between async resolve tasks and Mox ownership cleanup.
Tests verify correctness through state assertions instead.
Moved the DNS resolve loop, backoff, rate limiting, and telemetry out
of the Connection GenServer into a new GRPC.Client.DnsResolver process.
The resolver is linked to Connection and owns the entire retry state
machine, sending {:dns_result, result} messages back when it resolves.

This removes 8 state fields, 3 handle_info clauses, and 5 private
functions from Connection, replacing them with a single {:dns_result, _}
handler. The Task.Supervisor (ResolveSupervisor) is no longer needed
since DnsResolver calls the resolver synchronously in its own process.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Avoid redundant persistent_term.put calls during re-resolution when
the picked channel hasn't changed. persistent_term updates trigger a
global GC pass across all BEAM processes, so we skip writes on no-op
cycles. Added a comment noting that a future migration to ETS with
read_concurrency: true would eliminate this concern entirely.

Also fixed the rate limiting test to be deterministic by using a
60s min_resolve_interval (guarantees only 1 resolve per burst) and
waiting for DnsResolver to drain its mailbox via :sys.get_state
instead of relying on sleep timing. Fixed disconnect_and_wait to
ensure the linked DnsResolver process is fully stopped before
test cleanup, preventing Mox ownership leaks between tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- resolve_now uses GenServer.cast instead of call (fire-and-forget)
- Rename dns_resolver → dns_resolver_pid, resolver → resolver_module
  for clarity between pid and module references
- Remove duplicate defaults — Keyword.validate! already sets them,
  DnsResolver.start_link reads from connect_opts directly
- Document resolver contract in DnsResolver moduledoc (must implement
  GRPC.Client.Resolver behaviour)
- Add resolve_interval/max/min to connect/2 options docs
- Add missing @impl annotation on handle_call for disconnect

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@polvalente polvalente left a comment


Gave this a very cursory pass. @sleipnir is probably the best to review this one in more depth, but I'll take another look later!


cgreeno commented Mar 31, 2026

@polvalente did you want me to look at changing the persistent_term to ETS as well, or is that not something you're worried about? Looking at the DockYard blog, it looks like performance would take a hit, but not a massive one. The reason I ask: this will probably become problematic for DNS as well as for service mesh use, since both will be more write-heavy than what is currently happening.

WDYT?

- Rename DnsResolver → DNSResolver (prefer non-downcased acronyms)
- resolve_now uses via(ref) instead of :global.whereis_name (matches
  upstream pattern, cast is fire-and-forget so no error handling needed)
- Rename target → resolver_target (too broad for Connection struct)
- Channel state tuples: {:ok, ch} → {:connected, ch} and
  {:error, reason} → {:failed, reason} to distinguish from Map.fetch
  return values and make the intent clear
- Extract reconcile_channels into disconnect_removed_channels,
  connect_new_channels, and rebalance_after_reconcile defps
@polvalente

@cgreeno I think it would be a good change to move to ETS!

I think the current persistent_term approach is prone to memory leaks, which is a bit more of a concern to me than read/write performance.


cgreeno commented Apr 2, 2026

@polvalente @sleipnir - any update?


sleipnir commented Apr 2, 2026

Hi guys @cgreeno @polvalente, sorry I've been away for a few days.
Let me understand what this does before reviewing, okay? My first question: why are all these changes necessary, when it would be enough to have the LB's pick call the resolver that already exists to update the channel list? I think that would be much simpler, wouldn't it?

@@ -0,0 +1,157 @@
defmodule GRPC.Client.DNSResolver do
Collaborator


I have a small concern about this module, mainly because it seems to introduce some overlap with responsibilities that were previously handled in a stateless way.

From what I understand, we might not need to introduce a new resolver here, and possibly not even a stateful approach. The Connection module already manages state and has access to both resolvers and load balancing, which seems aligned with the original design.

Initially, resolvers were intended to remain stateless, with Connection acting as the orchestrator—handling inputs, outputs, and coordination. It feels like a good portion of the logic being added here might already fit naturally within that existing structure.

Before moving forward with this PR, it might be worth revisiting this design choice to see if we can keep things simpler and more consistent with the original architecture.


@cgreeno cgreeno Apr 2, 2026


Thanks for the feedback @sleipnir. I appreciate you looking at this with the original architecture in mind. I actually started with exactly what you're describing: the resolve loop, backoff, rate limiting, and timer management all lived inside Connection as handle_info clauses. That was the first iteration.

The problem I ran into was that Connection's GenServer was accumulating a lot of state and responsibility that had nothing to do with its core job of managing channels and load balancing. The re-resolution feature added 8 new state fields (resolve_interval, base_resolve_interval, max_resolve_interval, min_resolve_interval, last_resolve_at, resolve_timer_ref, resolve_task, resolve_start_time), 3 new handle_info clauses (for the timer, task completion, and task crashes), and 5 private functions (backoff, reset_backoff, rate_limited?, schedule_re_resolve, handle_resolve_result with telemetry). Connection went from being a clean orchestrator to also being a DNS retry state machine.

You can see that version here: cgreeno@c5f10d1 (it still needs a bit of work, but you get the idea of what I was trying).

Moving it to a linked process made Connection simpler: it went back to handling one message type ({:dns_result, result}) and doing what it's good at, reconciling channels and updating the LB. The resolver itself stays stateless. DNSResolver doesn't replace the Resolver behaviour; it just wraps it with the periodic retry loop. The actual DNS lookup still goes through the existing stateless Resolver.resolve/1 contract.

The benefit, at least to me, is separation of concerns: DNSResolver owns "when and how often to call resolve" (timers, backoff, rate limiting, telemetry). Connection owns "what to do with the result" (channel reconciliation, LB re-init, persistent_term). Neither needs to know about the other's internals, and because they're linked, lifecycle management is automatic.

That said, I'm happy to discuss further or adjust the approach if you feel strongly. I can rollback the change to my previous commit.

Collaborator


What do you think about a mixed approach using delegation?

We could add a DNS attribute to the connection state map and handle it similarly to how we already deal with the load balancing module. In other words, within the Connection GenServer loop, we would delegate to a stateless DNS module.

My concern with introducing another process linked to Connection is that it creates an additional point of failure and adds more complexity in terms of supervision and lifecycle management. By using delegation instead, we avoid that overhead, don’t need to handle potential failures of a separate DNS process, and keep the overall design simpler.

Curious to hear your thoughts on this approach.
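
For concreteness, a rough sketch of the delegation idea (module and field names are hypothetical, not a proposed API):

```elixir
defmodule DnsDelegateSketch do
  # Stateless DNS module: pure functions over a plain state map that
  # Connection would carry in its struct, mirroring the lb_mod pattern.
  def init(opts) do
    %{interval: Keyword.get(opts, :resolve_interval, 30_000), failures: 0}
  end

  # Connection delegates resolve results here and stores the new state.
  def handle_result(dns_state, {:ok, _addresses}), do: %{dns_state | failures: 0}

  def handle_result(dns_state, {:error, _reason}) do
    Map.update!(dns_state, :failures, &(&1 + 1))
  end
end

# Inside the Connection loop this would read from the connection struct,
# e.g. state.dns_mod.handle_result(state.dns_state, result):
dns_state = DnsDelegateSketch.init([])
dns_state = DnsDelegateSketch.handle_result(dns_state, {:error, :nxdomain})
```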


@cgreeno cgreeno Apr 3, 2026


Thanks @sleipnir, I like the delegation pattern and I can see how it mirrors the lb_mod pattern. One thing I'd note: either way, I think we'd need a separate process for the DNS I/O to avoid blocking Connection. That was actually one of the original reasons I moved to an async approach; sorry, I should have mentioned this more clearly in my initial reply above. The very first version had everything inline in Connection, and during testing with slow/unreachable DNS, the GenServer became unresponsive.

Now, whether that's a Task per resolve call or a long-lived linked process is certainly debatable. I went with the linked process because it felt cleaner to have one stable process own the full lifecycle (timer, backoff, rate limiting, I/O) rather than having Connection manage short-lived Tasks and track their refs alongside the delegation state.

Keeping this in mind: DNS re-resolution is just the first piece. The gRPC spec also defines:

  • SRV record resolution — _grpc-lb._tcp. lookups for port/priority/weight discovery
  • Health checking — periodic grpc.health.v1.Health/Check RPCs to each backend, removing unhealthy ones between DNS cycles
  • xDS resolver — a persistent gRPC stream to the control plane (Istio/Envoy/Traffic Director) for real-time endpoint updates

I planned to have a stab at implementing these as well. Sidebar: Fresha is (possibly) planning to move to a service mesh, and I just wanted to be ahead of the curve if we decide to go that way. :)

With the delegation pattern, all of these would add their own state maps and handle_info clauses to Connection. Health checking alone needs per-backend timers and RPC calls; xDS needs a persistent streaming connection. Delegation works for DNS but doesn't generalize to the other features in the gRPC spec, so we'd end up with mixed patterns. At least, that's how I'm thinking about it: once the other three are implemented, you'd end up with some annoyingly complex routing logic.

With the linked process pattern, each concern is its own process sending a common {:resolve_result, addresses} message back to Connection. Connection stays focused on channel reconciliation and LB, it doesn't care whether addresses came from DNS, SRV, health checks, or xDS. And if a DNS lookup hangs or an xDS stream crashes, Connection keeps serving traffic from its existing channels.

I haven't given it a huge amount of thought, but roughly this is what Connection could look like after DNS + SRV + health checking + xDS are added with delegation:

defstruct virtual_channel: nil,
          real_channels: %{},
          lb_mod: nil,
          lb_state: nil,
          resolver: nil,
          adapter: GRPC.Client.Adapters.Gun,
          # DNS re-resolution state
          dns_state: %{},
          dns_task: nil,
          dns_timer_ref: nil,
          # SRV resolution state
          srv_state: %{},
          srv_task: nil,
          srv_timer_ref: nil,
          # Health checking state (per backend)
          health_state: %{},
          health_tasks: %{},      # one per backend
          health_timer_refs: %{}, # one per backend
          # xDS control plane state
          xds_state: %{},
          xds_stream_pid: nil,
          xds_subscriptions: %{}
                                                                                                                                                                                
And the handle_info clauses:

# --- Existing upstream ---
def handle_info(:refresh, state)

# --- DNS re-resolution ---
def handle_info(:dns_resolve, state)           # timer fires
def handle_info({ref, result}, state)          # dns task completed
  when ref == state.dns_task.ref
def handle_info({:DOWN, ref, ...}, state)      # dns task crashed
  when ref == state.dns_task.ref

# --- SRV resolution ---
def handle_info(:srv_resolve, state)           # timer fires
def handle_info({ref, result}, state)          # srv task completed
  when ref == state.srv_task.ref
def handle_info({:DOWN, ref, ...}, state)      # srv task crashed
  when ref == state.srv_task.ref

# --- Health checking (per backend) ---
def handle_info({:health_check, address}, state)  # timer fires for one backend
def handle_info({ref, result}, state)             # health task completed
  when Map.has_key?(state.health_tasks, ref)
def handle_info({:DOWN, ref, ...}, state)         # health task crashed
  when Map.has_key?(state.health_tasks, ref)

# --- xDS control plane ---
def handle_info(:xds_connect, state)                       # initial connect
def handle_info({:xds_update, resource_type, data}, state) # stream update
def handle_info({:xds_error, reason}, state)               # stream error
def handle_info(:xds_reconnect, state)                     # reconnect timer

# --- Catch-all ---
def handle_info({:DOWN, _ref, ...}, state)  # now you need to figure out WHICH task/stream this belongs to

That's ~15 handle_info clauses, ~20 state fields, and the {ref, result} / {:DOWN, ...} handlers need to disambiguate which subsystem the task belongs to. Every time you add a feature, Connection gets wider.

(As I said, I haven't given this a huge amount of thought, so take it with a HUGE grain of salt; I'm using this example mainly to make my point. 😄)

Compare that to my proposal of a separate process:

defstruct virtual_channel: nil,
          real_channels: %{},
          lb_mod: nil,
          lb_state: nil,
          resolver: nil,
          adapter: GRPC.Client.Adapters.Gun,
          dns_resolver_pid: nil,
          srv_resolver_pid: nil,
          health_checker_pid: nil,
          xds_resolver_pid: nil

# Connection only handles results; it doesn't care who sent them
def handle_info({:resolve_result, addresses}, state)
def handle_info({:health_update, address, :healthy | :unhealthy}, state)

Two handle_info clauses. Four pid fields. Each subsystem manages its own timers, tasks, backoff, and failure recovery internally.

But I'm not looking at the wider project, only at what I feel is best in this instance, and I'm happy to take your lead on the long-term architecture. I'm open to whatever direction you think is best for the project, so honestly happy with either approach. ❤️

Author


FYI - going on holiday tomorrow for a week so if I don't reply it is because my wife took away my laptop ;)
