feat: periodic DNS re-resolution for client-side service discovery#513

Open
cgreeno wants to merge 11 commits into elixir-grpc:master from cgreeno:feature/periodic-dns-re-resolution

Conversation

@cgreeno cgreeno commented Mar 30, 2026

Summary

Adds periodic DNS re-resolution to the gRPC client, enabling automatic discovery of new backends and removal of stale ones. This is essential for Kubernetes deployments where pods scale up/down behind headless services, matching the behaviour of grpc-go and grpc-java.

What's included

New module: GRPC.Client.DnsResolver — a dedicated GenServer linked to each Connection that owns the entire DNS resolve lifecycle:

  • Periodic re-resolution on a configurable interval (default 30s)
  • Exponential backoff on DNS failure (doubles per failure, caps at 5min, resets on success)
  • Rate limiting to prevent DNS flooding from resolve_now/1 calls (default 5s floor)
  • Telemetry events: [:grpc, :client, :resolve, :stop] on success, [:grpc, :client, :resolve, :error] on failure
  • Sends {:dns_result, result} to the Connection process for channel reconciliation
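
The backoff and reset rules above are simple to state precisely. A minimal sketch of the arithmetic (module and function names are illustrative, not the PR's actual implementation; the defaults match the connect/2 options described below):

```elixir
defmodule BackoffSketch do
  # Illustrative only: mirrors the described behaviour, not the PR's code.
  @base_interval 30_000   # :resolve_interval default (30s)
  @max_interval 300_000   # :max_resolve_interval default (5min)

  # Each DNS failure doubles the interval, capped at the maximum.
  def next_interval(current), do: min(current * 2, @max_interval)

  # A successful resolution resets the interval to the base.
  def reset_interval, do: @base_interval
end
```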

Changes to GRPC.Client.Connection:

  • Channel reconciliation on re-resolution: new backends get channels created, removed backends get disconnected, unchanged backends are left as-is
  • Load balancer re-initialized with updated address list after reconciliation
  • Fallback to any healthy channel when LB picks a failed one
  • Retry previously failed channels on subsequent re-resolution cycles
  • resolve_now/1 public API for on-demand re-resolution (e.g. from health checks)
  • Fixed :refresh handler to handle {:error, _} channels without crashing
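
The reconciliation step amounts to a set difference between the current channel map and the freshly resolved address list. A hedged sketch (the module and function names here are hypothetical, used only to illustrate the split into connect/disconnect/keep):

```elixir
defmodule ReconcileSketch do
  # Given the current channels (address => channel) and the newly
  # resolved addresses, decide what to connect, disconnect, and keep.
  def diff(current_channels, new_addresses) do
    old = MapSet.new(Map.keys(current_channels))
    new = MapSet.new(new_addresses)

    %{
      to_connect: new |> MapSet.difference(old) |> Enum.sort(),
      to_disconnect: old |> MapSet.difference(new) |> Enum.sort(),
      unchanged: old |> MapSet.intersection(new) |> Enum.sort()
    }
  end
end
```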

Opt-in via dns:// scheme — only DNS targets trigger re-resolution. Static targets (ipv4:, ipv6:, unix:) are unaffected. No existing behaviour changes for http://, https://, or bare host:port targets.

New options for connect/2

| Option | Default | Description |
|--------|---------|-------------|
| `:resolve_interval` | 30000 | Base interval between DNS re-resolutions (ms) |
| `:max_resolve_interval` | 300000 | Backoff cap (ms) |
| `:min_resolve_interval` | 5000 | Rate-limit floor for `resolve_now/1` (ms) |
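
Opting in might look like this (illustrative target and values; the dns:// scheme and option names are the ones introduced by this PR):

```elixir
# Opt-in: only dns:// targets get periodic re-resolution.
{:ok, channel} =
  GRPC.Stub.connect("dns://my-service.default.svc.cluster.local:50051",
    resolve_interval: 10_000,       # re-resolve every 10s
    max_resolve_interval: 120_000,  # backoff cap at 2min
    min_resolve_interval: 2_000     # resolve_now/1 floor of 2s
  )
```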

Note on persistent_term

The existing codebase uses :persistent_term for zero-copy channel passing to client stubs. This works well for infrequent updates, but persistent_term updates trigger a global garbage collection pass across all BEAM processes (Erlang docs). With periodic re-resolution, channel reconciliation runs every 30s+ per connection.

To mitigate this, we guard persistent_term.put calls to only write when the picked channel actually changes, skipping redundant writes on no-op re-resolution cycles. A future improvement would be migrating to ETS with read_concurrency: true, which has no global GC cost on writes while maintaining the same (or close) read performance for pick_channel.
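
The guard is a read-compare-write. A minimal sketch (the key shape and module name are illustrative, not the library's actual storage key):

```elixir
defmodule ChannelStoreSketch do
  # Only write when the value actually changed: every
  # :persistent_term.put/2 triggers a global scan/GC pass across
  # processes, so no-op re-resolution cycles should skip the write.
  def put_if_changed(key, channel) do
    case :persistent_term.get(key, :undefined) do
      ^channel ->
        :unchanged

      _other ->
        :persistent_term.put(key, channel)
        :updated
    end
  end
end
```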

Test coverage (28 tests)

| # | Category | Test |
|---|----------|------|
| 1-4 | Core | Scale-up, scale-down, no-op, complete replacement |
| 5-8 | Errors | DNS failure, recovery, empty addresses, recovery after empty |
| 9-10 | Stability | pick_channel during/after re-resolution |
| 11 | Timer | Repeated cycles accumulate changes |
| 12 | Edge | Non-DNS targets skip re-resolution |
| 13 | Edge | Disconnect kills linked resolver cleanly |
| 14 | Edge | LB error during re-resolution; connection survives |
| 15 | Edge | Port change on same host detected |
| 16 | Backoff | Doubles on failure, resets on success, caps at max |
| 17 | Rate limit | resolve_now calls coalesced within min_resolve_interval |
| 18 | Telemetry | :stop on success, :error on failure/empty |
| 19-20 | Persistence | Fallback to healthy channel, retry failed channels |
| 21 | Async | Connection stays responsive during slow DNS |
| 22 | Refresh | GenServer survives :refresh picking a failed channel |

Test plan

  • All 28 re-resolution tests pass (10 consecutive runs, 0 flakes)
  • Full test suite: 244 tests, 0 failures
  • End-to-end tested with Docker Compose (scale up/down/outage/recovery)
  • End-to-end tested on Kubernetes with minikube + headless service + CoreDNS

cgreeno and others added 10 commits March 27, 2026 16:46
… telemetry

The gRPC client now periodically re-resolves DNS targets to discover
new backends and remove stale ones, matching the behaviour of grpc-go
and grpc-java. This is essential for Kubernetes deployments where pods
scale up/down behind headless services.

Key changes in GRPC.Client.Connection:

- Periodic re-resolution: DNS targets are re-resolved every 30s
  (configurable via :resolve_interval). Static targets (ipv4:, ipv6:,
  unix:) are excluded.

- Channel reconciliation: On re-resolution, new backends get channels
  created, removed backends get disconnected, unchanged backends are
  kept as-is. The load balancer is re-initialized with the updated
  address list.

- Exponential backoff: On DNS failure or empty results, the interval
  doubles (capped at :max_resolve_interval, default 5min). Resets to
  base on success.

- Rate limiting: A :min_resolve_interval (default 5s) prevents rapid
  resolve_now() calls from flooding DNS. Intended for future health
  check / heartbeat integration.

- Telemetry: Emits [:grpc, :client, :resolve, :stop] on success and
  [:grpc, :client, :resolve, :error] on failure, with duration, target,
  and address_count measurements.

- resolve_now/1: Public API to trigger on-demand re-resolution
  (subject to rate limiting).

New options for connect/2:
  :resolve_interval, :max_resolve_interval, :min_resolve_interval

Test suite: 22 test cases modelled after grpc-go's dns_resolver_test.go
and grpc-java's DnsNameResolverTest.java, covering scale-up, scale-down,
complete replacement, DNS failure, empty addresses, recovery, backoff,
rate limiting, telemetry, pick_channel stability, and edge cases.
Addresses code review feedback:

1. Timer leak in resolve_now/1: Each resolve_now call was spawning a new
   concurrent timer loop. Now schedule_re_resolve/1 cancels any existing
   timer via Process.cancel_timer before scheduling a new one.

2. Stale persistent_term after re-resolution: When the LB picked an
   unhealthy channel ({:error, _}), persistent_term was left holding a
   dead channel. Now falls back to any healthy channel in the pool, or
   clears persistent_term if none exist so pick_channel returns
   {:error, :no_connection}.

3. Bare hostname normalization: "my-service:50051" was normalized to
   "ipv4:my-service:50051" which skipped DNS resolution and re-resolution.
   Now normalizes to "dns://my-service:50051" matching the Resolver docs:
   "If no scheme is specified, dns is assumed". Removed dead is_nil check
   from dns_target?/1.

Added FailingClientAdapter test helper and 2 new tests for the stale
persistent_term fix. Replaced counter-based test stubs with phase-based
stubs to eliminate race conditions.
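
The timer-leak fix in item 1 boils down to cancel-before-schedule. A sketch under assumed names (the state key and message atom are illustrative):

```elixir
defmodule TimerSketch do
  # Cancel any in-flight timer before scheduling the next resolve, so
  # repeated resolve_now/1 calls never leave concurrent timer loops.
  def schedule_re_resolve(state, interval) do
    if ref = state[:resolve_timer_ref], do: Process.cancel_timer(ref)
    ref = Process.send_after(self(), :re_resolve, interval)
    Map.put(state, :resolve_timer_ref, ref)
  end
end
```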
Addresses remaining code review feedback:

1. Retry previously failed channels: If an address was in DNS but failed
   to connect initially ({:error, _}), subsequent re-resolutions now
   retry the connection instead of leaving it permanently broken.

2. DNS resolution timeout: resolver.resolve() is now called in a spawned
   process with a configurable timeout (:resolve_timeout, default 10s).
   If DNS hangs, the GenServer recovers after the timeout and triggers
   backoff instead of blocking indefinitely.

New connect/2 option:
  :resolve_timeout — max time for a single DNS resolution (default 10s)

Added 2 new tests: retry-on-reconnect and resolve timeout.
The :refresh handler only matched nil and {:ok, channel} when the LB
picked a channel. If round-robin picked an address with a failed
connection ({:error, reason}), it would crash with CaseClauseError.

Now uses a catch-all that logs a warning and keeps the current
virtual_channel, preventing GenServer crash loops when some backends
are temporarily unreachable.
DNS re-resolution now runs in an async task spawned via
Task.Supervisor.async_nolink, so the GenServer stays fully responsive
to :disconnect, :refresh, and pick_channel calls during slow DNS
queries.

- Added GRPC.Client.ResolveSupervisor (Task.Supervisor) to the
  application supervision tree
- :re_resolve handler spawns a task instead of blocking
- Task result handled via {ref, result} message pattern
- Task crash handled via {:DOWN, ...} message, triggers backoff
- Duplicate resolve prevented: skips if resolve_task already in flight
- Removed resolve_timeout option (no longer needed — task lifecycle
  managed by OTP supervision)
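
The async pattern described here can be sketched as follows (the supervisor is started inline for illustration; in the PR it lives in the application tree as GRPC.Client.ResolveSupervisor, and the anonymous function stands in for a real DNS lookup):

```elixir
# Start a Task.Supervisor inline (the PR registers one in the app tree).
{:ok, sup} = Task.Supervisor.start_link()

# async_nolink: a crash or a slow DNS query never blocks or takes down
# the caller; in a GenServer the result arrives as a {ref, result}
# message and a crash as a {:DOWN, ref, :process, pid, reason} message.
task = Task.Supervisor.async_nolink(sup, fn -> {:ok, ["10.0.0.1"]} end)

# Here we simply await; Connection instead handles those messages.
{:ok, addresses} = Task.await(task)
```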
http://my-service:50051 and https://my-service:50051 were being
normalized to ipv4: which skipped DNS resolution and re-resolution.
Now coerced to dns:// like bare host:port targets, so Kubernetes
users with protocol-prefixed targets get periodic re-resolution.
Reverted the dns:// normalization for http/https/bare targets back to
the original ipv4: behavior. This ensures full backwards compatibility
— no existing behavior changes for any target format.

Re-resolution is now opt-in: users pass dns://my-service:50051 to get
periodic DNS re-resolution. All other target formats (http://, https://,
bare host:port, ipv4:, unix:) work exactly as before.

Also removed verify_on_exit! from re_resolve_test.exs to eliminate a
flaky interaction between async resolve tasks and Mox ownership cleanup.
Tests verify correctness through state assertions instead.
Moved the DNS resolve loop, backoff, rate limiting, and telemetry out
of the Connection GenServer into a new GRPC.Client.DnsResolver process.
The resolver is linked to Connection and owns the entire retry state
machine, sending {:dns_result, result} messages back when it resolves.

This removes 8 state fields, 3 handle_info clauses, and 5 private
functions from Connection, replacing them with a single {:dns_result, _}
handler. The Task.Supervisor (ResolveSupervisor) is no longer needed
since DnsResolver calls the resolver synchronously in its own process.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Avoid redundant persistent_term.put calls during re-resolution when
the picked channel hasn't changed. persistent_term updates trigger a
global GC pass across all BEAM processes, so we skip writes on no-op
cycles. Added a comment noting that a future migration to ETS with
read_concurrency: true would eliminate this concern entirely.

Also fixed the rate limiting test to be deterministic by using a
60s min_resolve_interval (guarantees only 1 resolve per burst) and
waiting for DnsResolver to drain its mailbox via :sys.get_state
instead of relying on sleep timing. Fixed disconnect_and_wait to
ensure the linked DnsResolver process is fully stopped before
test cleanup, preventing Mox ownership leaks between tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- resolve_now uses GenServer.cast instead of call (fire-and-forget)
- Rename dns_resolver → dns_resolver_pid, resolver → resolver_module
  for clarity between pid and module references
- Remove duplicate defaults — Keyword.validate! already sets them,
  DnsResolver.start_link reads from connect_opts directly
- Document resolver contract in DnsResolver moduledoc (must implement
  GRPC.Client.Resolver behaviour)
- Add resolve_interval/max/min to connect/2 options docs
- Add missing @impl annotation on handle_call for disconnect

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@polvalente polvalente left a comment


Gave this a very cursory pass. @sleipnir is probably the best to review this one in more depth, but I'll take another look later!


cgreeno commented Mar 31, 2026

@polvalente did you want me to look at changing the persistent_term to ETS as well, or is that not something you're worried about? Looking at the DockYard blog, it looks like performance would take a hit, but not a massive one. The reason I ask: this will probably become problematic for DNS as well as for service mesh use, since both will be more write-heavy than what is currently happening.

WDYT?

- Rename DnsResolver → DNSResolver (prefer non-downcased acronyms)
- resolve_now uses via(ref) instead of :global.whereis_name (matches
  upstream pattern, cast is fire-and-forget so no error handling needed)
- Rename target → resolver_target (too broad for Connection struct)
- Channel state tuples: {:ok, ch} → {:connected, ch} and
  {:error, reason} → {:failed, reason} to distinguish from Map.fetch
  return values and make the intent clear
- Extract reconcile_channels into disconnect_removed_channels,
  connect_new_channels, and rebalance_after_reconcile defps
@polvalente

@cgreeno I think it would be a good change to move to ETS!

I think the current persistent_term approach is prone to memory leaks, which is a bit more of a concern to me than read/write performance.


cgreeno commented Apr 2, 2026

@polvalente @sleipnir - any update?


sleipnir commented Apr 2, 2026

Hi guys @cgreeno @polvalente, sorry I've been away for a few days.
Let me understand what this does before reviewing, okay? My first question: why are all these changes necessary, when it would be enough to have the LB's pick call the resolver that already exists to update the channel list? I think that would be much simpler, wouldn't it?

@@ -0,0 +1,157 @@
defmodule GRPC.Client.DNSResolver do
Collaborator


I have a small concern about this module, mainly because it seems to introduce some overlap with responsibilities that were previously handled in a stateless way.

From what I understand, we might not need to introduce a new resolver here, and possibly not even a stateful approach. The Connection module already manages state and has access to both resolvers and load balancing, which seems aligned with the original design.

Initially, resolvers were intended to remain stateless, with Connection acting as the orchestrator—handling inputs, outputs, and coordination. It feels like a good portion of the logic being added here might already fit naturally within that existing structure.

Before moving forward with this PR, it might be worth revisiting this design choice to see if we can keep things simpler and more consistent with the original architecture.


@cgreeno cgreeno Apr 2, 2026


Thanks for the feedback @sleipnir. I appreciate you looking at this with the original architecture in mind. I actually started with exactly what you're describing: the resolve loop, backoff, rate limiting, and timer management all lived inside Connection as handle_info clauses. That was the first iteration.

The problem I ran into was that Connection's GenServer was accumulating a lot of state and responsibility that had nothing to do with its core job of managing channels and load balancing. The re-resolution feature added 8 new state fields (resolve_interval, base_resolve_interval, max_resolve_interval, min_resolve_interval, last_resolve_at, resolve_timer_ref, resolve_task, resolve_start_time), 3 new handle_info clauses (for the timer, task completion, and task crashes), and 5 private functions (backoff, reset_backoff, rate_limited?, schedule_re_resolve, handle_resolve_result with telemetry). Connection went from being a clean orchestrator to also being a DNS retry state machine.

You can see that version here: cgreeno@c5f10d1 (it still needs a bit of work, but you get the idea of what I was trying).

Moving it to a linked process made Connection simpler: it went back to handling one message type ({:dns_result, result}) and doing what it's good at, reconciling channels and updating the LB. The resolver itself stays stateless. DNSResolver doesn't replace the Resolver behaviour; it just wraps it with the periodic retry loop. The actual DNS lookup still goes through the existing stateless Resolver.resolve/1 contract.

The benefit, at least to me, is separation of concerns: DNSResolver owns "when and how often to call resolve" (timers, backoff, rate limiting, telemetry). Connection owns "what to do with the result" (channel reconciliation, LB re-init, persistent_term). Neither needs to know about the other's internals, and because they're linked, lifecycle management is automatic.

That said, I'm happy to discuss further or adjust the approach if you feel strongly. I can rollback the change to my previous commit.

Collaborator


What do you think about a mixed approach using delegation?

We could add a DNS attribute to the connection state map and handle it similarly to how we already deal with the load balancing module. In other words, within the Connection GenServer loop, we would delegate to a stateless DNS module.

My concern with introducing another process linked to Connection is that it creates an additional point of failure and adds more complexity in terms of supervision and lifecycle management. By using delegation instead, we avoid that overhead, don’t need to handle potential failures of a separate DNS process, and keep the overall design simpler.

Curious to hear your thoughts on this approach.
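
For concreteness, a rough sketch of the delegation idea (module and field names are hypothetical, not a proposed API):

```elixir
defmodule DnsDelegateSketch do
  # Stateless DNS module: pure functions over a plain state map that
  # Connection would carry in its struct, mirroring the lb_mod pattern.
  def init(opts) do
    %{interval: Keyword.get(opts, :resolve_interval, 30_000), failures: 0}
  end

  # Connection delegates resolve results here and stores the new state.
  def handle_result(dns_state, {:ok, _addresses}), do: %{dns_state | failures: 0}

  def handle_result(dns_state, {:error, _reason}) do
    Map.update!(dns_state, :failures, &(&1 + 1))
  end
end

# Inside the Connection loop this would read from the connection struct,
# e.g. state.dns_mod.handle_result(state.dns_state, result):
dns_state = DnsDelegateSketch.init([])
dns_state = DnsDelegateSketch.handle_result(dns_state, {:error, :nxdomain})
```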


@cgreeno cgreeno Apr 3, 2026


Thanks @sleipnir, I like the delegation pattern and I can see how it mirrors the lb_mod pattern. One thing I'd note: either way, I think we'd need a separate process for the DNS I/O to avoid blocking Connection. That was actually one of the original reasons I moved to an async approach; sorry, I should have mentioned this more clearly in my initial reply above. The very first version had everything inline in Connection, and during testing with slow/unreachable DNS, the GenServer became unresponsive.

Now, whether that's a Task per resolve call or a long-lived linked process is certainly debatable. I went with the linked process because it felt cleaner to have one stable process own the full lifecycle (timer, backoff, rate limiting, I/O) rather than having Connection manage short-lived Tasks and track their refs alongside the delegation state.

Keeping this in mind: DNS re-resolution is just the first piece. The gRPC spec also defines:

  • SRV record resolution — _grpc-lb._tcp. lookups for port/priority/weight discovery
  • Health checking — periodic grpc.health.v1.Health/Check RPCs to each backend, removing unhealthy ones between DNS cycles
  • xDS resolver — a persistent gRPC stream to the control plane (Istio/Envoy/Traffic Director) for real-time endpoint updates

I planned to have a stab at implementing these as well. Sidebar: Fresha is (possibly) planning to move to a service mesh, and I just wanted to be ahead of the curve if we decide to go that way. :)

With the delegation pattern, all of these would add their own state maps and handle_info clauses to Connection. Health checking alone needs per-backend timers and RPC calls; xDS needs a persistent streaming connection. Delegation works for DNS but doesn't generalize to the other features in the gRPC spec, so we'd end up with mixed patterns. At least, that's how I'm thinking about it: once the other three are implemented, you'd end up with some annoyingly complex routing logic.

With the linked process pattern, each concern is its own process sending a common {:resolve_result, addresses} message back to Connection. Connection stays focused on channel reconciliation and LB, it doesn't care whether addresses came from DNS, SRV, health checks, or xDS. And if a DNS lookup hangs or an xDS stream crashes, Connection keeps serving traffic from its existing channels.

I haven't given it a huge amount of thought, but roughly this is what Connection could look like after DNS + SRV + health checking + xDS are added with delegation:

defstruct virtual_channel: nil,
          real_channels: %{},
          lb_mod: nil,
          lb_state: nil,
          resolver: nil,
          adapter: GRPC.Client.Adapters.Gun,
          # DNS re-resolution state
          dns_state: %{},
          dns_task: nil,
          dns_timer_ref: nil,
          # SRV resolution state
          srv_state: %{},
          srv_task: nil,
          srv_timer_ref: nil,
          # Health checking state (per backend)
          health_state: %{},
          health_tasks: %{},      # one per backend
          health_timer_refs: %{}, # one per backend
          # xDS control plane state
          xds_state: %{},
          xds_stream_pid: nil,
          xds_subscriptions: %{}
                                                                                                                                                                                
And the handle_info clauses:

# --- Existing upstream ---
def handle_info(:refresh, state)

# --- DNS re-resolution ---
def handle_info(:dns_resolve, state)           # timer fires
def handle_info({ref, result}, state)          # dns task completed
  when ref == state.dns_task.ref
def handle_info({:DOWN, ref, ...}, state)      # dns task crashed
  when ref == state.dns_task.ref

# --- SRV resolution ---
def handle_info(:srv_resolve, state)           # timer fires
def handle_info({ref, result}, state)          # srv task completed
  when ref == state.srv_task.ref
def handle_info({:DOWN, ref, ...}, state)      # srv task crashed
  when ref == state.srv_task.ref

# --- Health checking (per backend) ---
def handle_info({:health_check, address}, state)  # timer fires for one backend
def handle_info({ref, result}, state)             # health task completed
  when Map.has_key?(state.health_tasks, ref)
def handle_info({:DOWN, ref, ...}, state)         # health task crashed
  when Map.has_key?(state.health_tasks, ref)

# --- xDS control plane ---
def handle_info(:xds_connect, state)                       # initial connect
def handle_info({:xds_update, resource_type, data}, state) # stream update
def handle_info({:xds_error, reason}, state)               # stream error
def handle_info(:xds_reconnect, state)                     # reconnect timer

# --- Catch-all ---
def handle_info({:DOWN, _ref, ...}, state)  # now you need to figure out WHICH task/stream this belongs to

That's ~15 handle_info clauses, ~20 state fields, and the {ref, result} / {:DOWN, ...} handlers need to disambiguate which subsystem the task belongs to. Every time you add a feature, Connection gets wider.

(As I said, I haven't given this a huge amount of thought, so take it with a HUGE grain of salt; I'm using this example mainly to make my point. 😄)

Compare that to my proposal of a separate process:

defstruct virtual_channel: nil,
          real_channels: %{},
          lb_mod: nil,
          lb_state: nil,
          resolver: nil,
          adapter: GRPC.Client.Adapters.Gun,
          dns_resolver_pid: nil,
          srv_resolver_pid: nil,
          health_checker_pid: nil,
          xds_resolver_pid: nil

# Connection only handles results; it doesn't care who sent them
def handle_info({:resolve_result, addresses}, state)
def handle_info({:health_update, address, :healthy | :unhealthy}, state)

Two handle_info clauses. Four pid fields. Each subsystem manages its own timers, tasks, backoff, and failure recovery internally.

But I'm not looking at the wider project, only at what I feel is best in this instance, and I'm happy to take your lead on the long-term architecture. I'm open to whatever direction you think is best for the project, so honestly happy with either approach. ❤️

Author


FYI - going on holiday tomorrow for a week so if I don't reply it is because my wife took away my laptop ;)
