feat: periodic DNS re-resolution for client-side service discovery#513
cgreeno wants to merge 11 commits into elixir-grpc:master
Conversation
… telemetry

The gRPC client now periodically re-resolves DNS targets to discover new backends and remove stale ones, matching the behaviour of grpc-go and grpc-java. This is essential for Kubernetes deployments where pods scale up/down behind headless services.

Key changes in GRPC.Client.Connection:
- Periodic re-resolution: DNS targets are re-resolved every 30s (configurable via :resolve_interval). Static targets (ipv4:, ipv6:, unix:) are excluded.
- Channel reconciliation: On re-resolution, new backends get channels created, removed backends get disconnected, unchanged backends are kept as-is. The load balancer is re-initialized with the updated address list.
- Exponential backoff: On DNS failure or empty results, the interval doubles (capped at :max_resolve_interval, default 5min). Resets to base on success.
- Rate limiting: A :min_resolve_interval (default 5s) prevents rapid resolve_now() calls from flooding DNS. Intended for future health check / heartbeat integration.
- Telemetry: Emits [:grpc, :client, :resolve, :stop] on success and [:grpc, :client, :resolve, :error] on failure, with duration, target, and address_count measurements.
- resolve_now/1: Public API to trigger on-demand re-resolution (subject to rate limiting).

New options for connect/2: :resolve_interval, :max_resolve_interval, :min_resolve_interval

Test suite: 22 test cases modelled after grpc-go's dns_resolver_test.go and grpc-java's DnsNameResolverTest.java, covering scale-up, scale-down, complete replacement, DNS failure, empty addresses, recovery, backoff, rate limiting, telemetry, pick_channel stability, and edge cases.
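The doubling-and-reset backoff policy described above can be sketched as a pure function. This is only a sketch of the behaviour described in the commit message; the module and function names are illustrative, not the PR's actual implementation:

```elixir
defmodule BackoffSketch do
  @moduledoc false
  # Re-resolution backoff as described in the commit message:
  # double the current interval on failure (capped at max),
  # reset to the base interval on success.

  def next_interval(:ok, _current, base, _max), do: base
  def next_interval(:error, current, _base, max), do: min(current * 2, max)
end
```

With the defaults above (30s base, 5min cap), three consecutive failures yield 60s, 120s, and 240s, and the next failure pins the interval at 300s.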
Addresses code review feedback:
1. Timer leak in resolve_now/1: Each resolve_now call was spawning a new
concurrent timer loop. Now schedule_re_resolve/1 cancels any existing
timer via Process.cancel_timer before scheduling a new one.
2. Stale persistent_term after re-resolution: When the LB picked an
unhealthy channel ({:error, _}), persistent_term was left holding a
dead channel. Now falls back to any healthy channel in the pool, or
clears persistent_term if none exist so pick_channel returns
{:error, :no_connection}.
3. Bare hostname normalization: "my-service:50051" was normalized to
"ipv4:my-service:50051" which skipped DNS resolution and re-resolution.
Now normalizes to "dns://my-service:50051" matching the Resolver docs:
"If no scheme is specified, dns is assumed". Removed dead is_nil check
from dns_target?/1.
Added FailingClientAdapter test helper and 2 new tests for the stale
persistent_term fix. Replaced counter-based test stubs with phase-based
stubs to eliminate race conditions.
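The timer-leak fix (item 1 above) can be illustrated with a minimal GenServer. The real Connection carries far more state; these names merely sketch the cancel-before-reschedule pattern:

```elixir
defmodule ResolveTimerSketch do
  @moduledoc false
  use GenServer

  def start_link(interval), do: GenServer.start_link(__MODULE__, interval)

  @impl true
  def init(interval), do: {:ok, %{interval: interval, timer_ref: nil}}

  @impl true
  def handle_cast(:resolve_now, state) do
    # ...resolution would happen here, then we re-arm the timer...
    {:noreply, schedule_re_resolve(state)}
  end

  # Cancel any in-flight timer before arming a new one, so repeated
  # resolve_now casts cannot accumulate concurrent timer loops.
  defp schedule_re_resolve(%{timer_ref: ref} = state) do
    if ref, do: Process.cancel_timer(ref)
    timer_ref = Process.send_after(self(), :re_resolve, state.interval)
    %{state | timer_ref: timer_ref}
  end
end
```

After two rapid casts only the second timer is live; `Process.cancel_timer/1` returns `false` for the first (already cancelled) reference.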
Addresses remaining code review feedback:
1. Retry previously failed channels: If an address was in DNS but failed
to connect initially ({:error, _}), subsequent re-resolutions now
retry the connection instead of leaving it permanently broken.
2. DNS resolution timeout: resolver.resolve() is now called in a spawned
process with a configurable timeout (:resolve_timeout, default 10s).
If DNS hangs, the GenServer recovers after the timeout and triggers
backoff instead of blocking indefinitely.
New connect/2 option:
:resolve_timeout — max time for a single DNS resolution (default 10s)
Added 2 new tests: retry-on-reconnect and resolve timeout.
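A hedged sketch of the timeout guard described in item 2: the PR calls resolver.resolve() in a spawned process, and here a `Task.yield`/`Task.shutdown` pair plays that role. Names are illustrative only:

```elixir
defmodule ResolveTimeoutSketch do
  @moduledoc false

  # `fun` is a zero-arity wrapper around the resolver call;
  # `timeout` corresponds to the PR's :resolve_timeout (default 10s).
  def resolve_with_timeout(fun, timeout) do
    task = Task.async(fun)

    case Task.yield(task, timeout) || Task.shutdown(task, :brutal_kill) do
      # The spawned resolver finished in time: pass its result through.
      {:ok, result} -> result
      # DNS hung: kill the task and let the caller trigger backoff.
      nil -> {:error, :timeout}
    end
  end
end
```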
The :refresh handler only matched nil and {:ok, channel} when the LB
picked a channel. If round-robin picked an address with a failed
connection ({:error, reason}), it would crash with CaseClauseError.
Now uses a catch-all that logs a warning and keeps the current
virtual_channel, preventing GenServer crash loops when some backends
are temporarily unreachable.
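The hardened clause can be sketched as follows; `handle_refresh/2` and `pick_result` are illustrative stand-ins for the real `handle_info(:refresh, state)` and the load balancer's pick:

```elixir
defmodule RefreshSketch do
  @moduledoc false
  require Logger

  def handle_refresh(pick_result, state) do
    case pick_result do
      nil ->
        {:noreply, state}

      {:ok, channel} ->
        {:noreply, %{state | virtual_channel: channel}}

      other ->
        # Catch-all: a backend may be temporarily unreachable ({:error, _}).
        # Log and keep the current virtual_channel instead of letting a
        # CaseClauseError crash-loop the GenServer.
        Logger.warning("refresh picked unusable channel: #{inspect(other)}")
        {:noreply, state}
    end
  end
end
```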
DNS re-resolution now runs in an async task spawned via
Task.Supervisor.async_nolink, so the GenServer stays fully responsive
to :disconnect, :refresh, and pick_channel calls during slow DNS
queries.
- Added GRPC.Client.ResolveSupervisor (Task.Supervisor) to the
application supervision tree
- :re_resolve handler spawns a task instead of blocking
- Task result handled via {ref, result} message pattern
- Task crash handled via {:DOWN, ...} message, triggers backoff
- Duplicate resolve prevented: skips if resolve_task already in flight
- Removed resolve_timeout option (no longer needed — task lifecycle
managed by OTP supervision)
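The async task pattern above can be sketched with a minimal GenServer. `ResolveSup` stands in for the PR's GRPC.Client.ResolveSupervisor, and the state shape is illustrative:

```elixir
defmodule AsyncResolveSketch do
  @moduledoc false
  use GenServer
  # Assumes a Task.Supervisor registered as ResolveSup is already running.

  def start_link(resolver_fun), do: GenServer.start_link(__MODULE__, resolver_fun)

  @impl true
  def init(fun), do: {:ok, %{fun: fun, resolve_task: nil, last_result: nil}}

  @impl true
  def handle_info(:re_resolve, %{resolve_task: nil} = state) do
    # async_nolink: a task crash sends :DOWN instead of killing us.
    task = Task.Supervisor.async_nolink(ResolveSup, state.fun)
    {:noreply, %{state | resolve_task: task}}
  end

  # Duplicate resolve prevented: skip if a task is already in flight.
  def handle_info(:re_resolve, state), do: {:noreply, state}

  # Task completed: result arrives as {ref, result}.
  def handle_info({ref, result}, %{resolve_task: %Task{ref: ref}} = state) do
    Process.demonitor(ref, [:flush])
    {:noreply, %{state | resolve_task: nil, last_result: result}}
  end

  # Task crashed: clear the slot (the real code also triggers backoff here).
  def handle_info({:DOWN, ref, :process, _pid, _reason}, %{resolve_task: %Task{ref: ref}} = state) do
    {:noreply, %{state | resolve_task: nil, last_result: :error}}
  end
end
```

Because the task runs outside the GenServer, slow DNS never blocks other calls such as pick_channel.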
http://my-service:50051 and https://my-service:50051 were being normalized to ipv4:, which skipped DNS resolution and re-resolution. They are now coerced to dns:// like bare host:port targets, so Kubernetes users with protocol-prefixed targets get periodic re-resolution.
Reverted the dns:// normalization for http/https/bare targets back to the original ipv4: behavior. This ensures full backwards compatibility — no existing behavior changes for any target format. Re-resolution is now opt-in: users pass dns://my-service:50051 to get periodic DNS re-resolution. All other target formats (http://, https://, bare host:port, ipv4:, unix:) work exactly as before.

Also removed verify_on_exit! from re_resolve_test.exs to eliminate a flaky interaction between async resolve tasks and Mox ownership cleanup. Tests verify correctness through state assertions instead.
Moved the DNS resolve loop, backoff, rate limiting, and telemetry out
of the Connection GenServer into a new GRPC.Client.DnsResolver process.
The resolver is linked to Connection and owns the entire retry state
machine, sending {:dns_result, result} messages back when it resolves.
This removes 8 state fields, 3 handle_info clauses, and 5 private
functions from Connection, replacing them with a single {:dns_result, _}
handler. The Task.Supervisor (ResolveSupervisor) is no longer needed
since DnsResolver calls the resolver synchronously in its own process.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Avoid redundant persistent_term.put calls during re-resolution when the picked channel hasn't changed. persistent_term updates trigger a global GC pass across all BEAM processes, so we skip writes on no-op cycles. Added a comment noting that a future migration to ETS with read_concurrency: true would eliminate this concern entirely.

Also fixed the rate limiting test to be deterministic by using a 60s min_resolve_interval (guarantees only 1 resolve per burst) and waiting for DnsResolver to drain its mailbox via :sys.get_state instead of relying on sleep timing. Fixed disconnect_and_wait to ensure the linked DnsResolver process is fully stopped before test cleanup, preventing Mox ownership leaks between tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
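A minimal sketch of the guarded write described above. The key shape and names are illustrative, not the library's actual storage key:

```elixir
defmodule ChannelStoreSketch do
  @moduledoc false
  # persistent_term updates trigger a global GC pass across all BEAM
  # processes, so we skip the put when the picked channel is unchanged.

  def maybe_store(key, channel) do
    case :persistent_term.get(key, :none) do
      ^channel ->
        :unchanged

      _ ->
        :persistent_term.put(key, channel)
        :updated
    end
  end
end
```

On a no-op re-resolution cycle the second call returns `:unchanged` without touching persistent_term.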
- resolve_now uses GenServer.cast instead of call (fire-and-forget)
- Rename dns_resolver → dns_resolver_pid, resolver → resolver_module for clarity between pid and module references
- Remove duplicate defaults — Keyword.validate! already sets them; DnsResolver.start_link reads from connect_opts directly
- Document resolver contract in DnsResolver moduledoc (must implement GRPC.Client.Resolver behaviour)
- Add resolve_interval/max/min to connect/2 options docs
- Add missing @impl annotation on handle_call for disconnect

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
polvalente
left a comment
Gave this a very cursory pass. @sleipnir is probably the best to review this one in more depth, but I'll take another look later!
@polvalente did you want me to look at changing the persistent_term to ETS as well, or is that not something you're worried about? Looking at the DockYard blog, it looks like performance would take a hit, but not massively. The reason I ask: this will probably become problematic for DNS as well as service mesh, as they will be more write-heavy than what is currently happening. WDYT?
- Rename DnsResolver → DNSResolver (prefer non-downcased acronyms)
- resolve_now uses via(ref) instead of :global.whereis_name (matches
upstream pattern, cast is fire-and-forget so no error handling needed)
- Rename target → resolver_target (too broad for Connection struct)
- Channel state tuples: {:ok, ch} → {:connected, ch} and
{:error, reason} → {:failed, reason} to distinguish from Map.fetch
return values and make the intent clear
- Extract reconcile_channels into disconnect_removed_channels,
connect_new_channels, and rebalance_after_reconcile defps
@cgreeno I think it would be a good change to move onto an ETS! The current persistent term approach I think is prone to memory leaks, which is a bit more of a concern to me than read-write performance.
@polvalente @sleipnir - any update?
Hi guys @cgreeno @polvalente, sorry I've been away for a few days.
@@ -0,0 +1,157 @@
defmodule GRPC.Client.DNSResolver do
I have a small concern about this module, mainly because it seems to introduce some overlap with responsibilities that were previously handled in a stateless way.
From what I understand, we might not need to introduce a new resolver here, and possibly not even a stateful approach. The Connection module already manages state and has access to both resolvers and load balancing, which seems aligned with the original design.
Initially, resolvers were intended to remain stateless, with Connection acting as the orchestrator—handling inputs, outputs, and coordination. It feels like a good portion of the logic being added here might already fit naturally within that existing structure.
Before moving forward with this PR, it might be worth revisiting this design choice to see if we can keep things simpler and more consistent with the original architecture.
Thanks for the feedback @sleipnir. I appreciate you looking at this with the original architecture in mind. I actually started with exactly what you're describing: the resolve loop, backoff, rate limiting, and timer management all lived inside Connection as handle_info clauses. That was the first iteration.
The problem I ran into was that Connection's GenServer was accumulating a lot of state and responsibility that had nothing to do with its core job of managing channels and load balancing. The re-resolution feature added 8 new state fields (resolve_interval, base_resolve_interval, max_resolve_interval, min_resolve_interval, last_resolve_at, resolve_timer_ref, resolve_task, resolve_start_time), 3 new handle_info clauses (for the timer, task completion, and task crashes), and 5 private functions (backoff, reset_backoff, rate_limited?, schedule_re_resolve, handle_resolve_result with telemetry). Connection went from being a clean orchestrator to also being a DNS retry state machine.
You can see that version here: cgreeno@c5f10d1 --> still needs a bit of work but you get the idea of what I was trying.
Moving it to a linked process made Connection simpler: it went back to handling one message type ({:dns_result, result}) and doing what it's good at, reconciling channels and updating the LB. The resolver itself stays stateless. DNSResolver doesn't replace the Resolver behaviour; it just wraps it with the periodic retry loop. The actual DNS lookup still goes through the existing stateless Resolver.resolve/1 contract.
The benefit, at least to me, is separation of concerns: DNSResolver owns "when and how often to call resolve" (timers, backoff, rate limiting, telemetry). Connection owns "what to do with the result" (channel reconciliation, LB re-init, persistent_term). Neither needs to know about the other's internals, and because they're linked, lifecycle management is automatic.
That said, I'm happy to discuss further or adjust the approach if you feel strongly. I can rollback the change to my previous commit.
What do you think about a mixed approach using delegation?
We could add a DNS attribute to the connection state map and handle it similarly to how we already deal with the load balancing module. In other words, within the Connection GenServer loop, we would delegate to a stateless DNS module.
My concern with introducing another process linked to Connection is that it creates an additional point of failure and adds more complexity in terms of supervision and lifecycle management. By using delegation instead, we avoid that overhead, don’t need to handle potential failures of a separate DNS process, and keep the overall design simpler.
Curious to hear your thoughts on this approach.
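For illustration, the delegation idea might look roughly like this: a stateless DNS module called from Connection's loop, mirroring the lb_mod/lb_state pattern. All names here are hypothetical, not from the PR:

```elixir
defmodule DelegationSketch do
  @moduledoc false

  # Hypothetical stateless DNS module contract:
  # resolve(dns_state) -> {addresses, new_dns_state}
  defmodule StubDNS do
    def resolve(dns_state), do: {["10.0.0.1:50051"], Map.put(dns_state, :resolved, true)}
  end

  # Inside Connection's GenServer loop we would delegate, keeping only a
  # dns_mod/dns_state pair in Connection's own state.
  def handle_dns_tick(%{dns_mod: dns_mod, dns_state: dns_state} = state) do
    {addresses, dns_state} = dns_mod.resolve(dns_state)
    %{state | dns_state: dns_state, addresses: addresses}
  end
end
```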
Thanks @sleipnir, I like the delegation pattern and I can see how it mirrors the lb_mod pattern. One thing I'd note is that either way I think we would need a separate process for the DNS I/O to avoid blocking Connection. That was actually one of the original reasons I moved to an async approach. Sorry, I should have mentioned this more clearly in my initial reply above. The very first version had everything inline in Connection, and during testing with slow/unreachable DNS, the GenServer became unresponsive.
NOW whether that's a Task per resolve call or a long-lived linked process is for sure debatable... I went with the linked process because it felt cleaner to have one stable process own the full lifecycle (timer, backoff, rate limiting, I/O) rather than Connection managing short-lived Tasks and tracking their refs alongside the delegation state.
Keeping this in mind as well as DNS re-resolution is just the first piece... the gRPC spec also defines:
- SRV record resolution — _grpc-lb._tcp. lookups for port/priority/weight discovery
- Health checking — periodic grpc.health.v1.Health/Check RPCs to each backend, removing unhealthy ones between DNS cycles
- xDS resolver — a persistent gRPC stream to the control plane (Istio/Envoy/Traffic Director) for real-time endpoint updates
I planned to take a stab at implementing these as well. Sidebar: Fresha is planning to move to a service mesh (possibly) and I just wanted to be ahead of the curve if we decide to go that way. :)
With the delegation pattern, all of these would add their own state maps and handle_info clauses to Connection. Health checking alone needs per-backend timers and RPC calls. xDS needs a persistent streaming connection. Delegation works for DNS but doesn't generalize to the other features in the gRPC spec, so we'd end up with mixed patterns. At least, that is how I am thinking about it. I think once all three others are implemented, you would end up with some annoyingly complex routing logic.
With the linked process pattern, each concern is its own process sending a common {:resolve_result, addresses} message back to Connection. Connection stays focused on channel reconciliation and LB, it doesn't care whether addresses came from DNS, SRV, health checks, or xDS. And if a DNS lookup hangs or an xDS stream crashes, Connection keeps serving traffic from its existing channels.
I haven't given it a HUGE amount of thought but roughly this is what connection could look like after DNS + SRV + health checking + xDS is added with delegation:
defstruct virtual_channel: nil,
real_channels: %{},
lb_mod: nil,
lb_state: nil,
resolver: nil,
adapter: GRPC.Client.Adapters.Gun,
# DNS re-resolution state
dns_state: %{},
dns_task: nil,
dns_timer_ref: nil,
# SRV resolution state
srv_state: %{},
srv_task: nil,
srv_timer_ref: nil,
# Health checking state (per backend)
health_state: %{},
health_tasks: %{}, # one per backend
health_timer_refs: %{}, # one per backend
# xDS control plane state
xds_state: %{},
xds_stream_pid: nil,
xds_subscriptions: %{}
And the handle_info clauses:
# --- Existing upstream ---
def handle_info(:refresh, state)
# --- DNS re-resolution ---
def handle_info(:dns_resolve, state) # timer fires
def handle_info({ref, result}, state) # dns task completed
when ref == state.dns_task.ref
def handle_info({:DOWN, ref, ...}, state) # dns task crashed
when ref == state.dns_task.ref
# --- SRV resolution ---
def handle_info(:srv_resolve, state) # timer fires
def handle_info({ref, result}, state) # srv task completed
when ref == state.srv_task.ref
def handle_info({:DOWN, ref, ...}, state) # srv task crashed
when ref == state.srv_task.ref
# --- Health checking (per backend) ---
def handle_info({:health_check, address}, state) # timer fires for one backend
def handle_info({ref, result}, state) # health task completed
when Map.has_key?(state.health_tasks, ref)
def handle_info({:DOWN, ref, ...}, state) # health task crashed
when Map.has_key?(state.health_tasks, ref)
# --- xDS control plane ---
def handle_info(:xds_connect, state) # initial connect
def handle_info({:xds_update, resource_type, data}, state) # stream update
def handle_info({:xds_error, reason}, state) # stream error
def handle_info(:xds_reconnect, state) # reconnect timer
# --- Catch-all ---
def handle_info({:DOWN, _ref, ...}, state) # now you need to figure out WHICH task/stream this belongs to
That's ~15 handle_info clauses, ~20 state fields, and the {ref, result} / {:DOWN, ...} handlers need to disambiguate which subsystem the task belongs to. Every time you add a feature, Connection gets wider.
As I said, I have not given this a huge amount of thought, so take it with a HUGE grain of salt; I am using this example mainly to make my point. 😄
Compare that to my proposal of a separate process:
defstruct virtual_channel: nil,
real_channels: %{},
lb_mod: nil,
lb_state: nil,
resolver: nil,
adapter: GRPC.Client.Adapters.Gun,
dns_resolver_pid: nil,
srv_resolver_pid: nil,
health_checker_pid: nil,
xds_resolver_pid: nil
# Connection only handles results; it doesn't care who sent them
def handle_info({:resolve_result, addresses}, state)
def handle_info({:health_update, address, :healthy | :unhealthy}, state)
Two handle_info clauses. Four pid fields. Each subsystem manages its own timers, tasks, backoff, and failure recovery internally.
BUT I am not looking at the wider project, only what I feel is best in this instance. AND I am happy to take your lead and what the plan is for the long term architecture. I am open to whatever direction you think is best for the project long-term. So honestly happy with either approach. ❤️
FYI - going on holiday tomorrow for a week so if I don't reply it is because my wife took away my laptop ;)
Summary
Adds periodic DNS re-resolution to the gRPC client, enabling automatic discovery of new backends and removal of stale ones. This is essential for Kubernetes deployments where pods scale up/down behind headless services, matching the behaviour of grpc-go and grpc-java.
What's included
New module:

GRPC.Client.DnsResolver — a dedicated GenServer linked to each Connection that owns the entire DNS resolve lifecycle:
- Periodic re-resolution with exponential backoff on DNS failure
- Rate limiting of resolve_now/1 calls (default 5s floor)
- Telemetry: [:grpc, :client, :resolve, :stop] on success, [:grpc, :client, :resolve, :error] on failure
- Sends {:dns_result, result} to the Connection process for channel reconciliation

Changes to GRPC.Client.Connection:
- resolve_now/1 public API for on-demand re-resolution (e.g. from health checks)
- Hardened :refresh handler to handle {:error, _} channels without crashing

Opt-in via dns:// scheme — only DNS targets trigger re-resolution. Static targets (ipv4:, ipv6:, unix:) are unaffected. No existing behaviour changes for http://, https://, or bare host:port targets.

New options for connect/2:
- :resolve_interval: base re-resolution interval (ms)
- :max_resolve_interval: backoff cap for failed resolutions (ms)
- :min_resolve_interval: floor between resolve_now/1 calls (ms)

Note on persistent_term

The existing codebase uses :persistent_term for zero-copy channel passing to client stubs. This works well for infrequent updates, but persistent_term updates trigger a global garbage collection pass across all BEAM processes (Erlang docs). With periodic re-resolution, channel reconciliation runs every 30s+ per connection.

To mitigate this, we guard persistent_term.put calls to only write when the picked channel actually changes, skipping redundant writes on no-op re-resolution cycles. A future improvement would be migrating to ETS with read_concurrency: true, which has no global GC cost on writes while maintaining the same (or close) read performance for pick_channel.

Test coverage (28 tests)
Test plan