Skip to content

feat(relay): schedule deferred binary release with grace-period reclaim (#20)#22

Merged
ilmoniemi merged 3 commits into
mainfrom
feature/20
May 8, 2026
Merged

feat(relay): schedule deferred binary release with grace-period reclaim (#20)#22
ilmoniemi merged 3 commits into
mainfrom
feature/20

Conversation

@ilmoniemi
Copy link
Copy Markdown
Contributor

What

Adds Registry.ScheduleReleaseServer(serverID, d) — a deferred-release primitive that, on timer expiry, removes the binary entry and closes phones outside the registry lock (snapshot-and-iterate, same shape as PhonesFor).

ClaimServer gains a grace-aware fast path: when a timer is pending for serverID, the claim cancels the timer and atomically replaces the binary Conn, returning nil instead of ErrServerIDConflict. The grace window IS the reclaim path — phones registered during the window inherit the new binary on reclaim, or are Close()'d on expiry.

ReleaseServer (synchronous) is unchanged. No callers migrate in this slice; the /v1/server handler swap lives in #21.

Issue

Closes #20. Split from #8.

Testing

Seven new subtests in internal/relay/registry_test.go, mapped 1:1 to AC bullets:

  • TestScheduleReleaseServer_ReclaimWithinGrace_NoReleaseFires
  • TestScheduleReleaseServer_ExpiryRemovesBinaryAndClosesPhones
  • TestScheduleReleaseServer_PhoneRegisteredDuringGrace_SurvivesReclaim
  • TestScheduleReleaseServer_PhoneRegisteredDuringGrace_ClosedOnExpiry
  • TestScheduleReleaseServer_ReplacesPendingTimer
  • TestScheduleReleaseServer_OnUnheldID_NoOp
  • TestScheduleReleaseServer_RaceFreedomUnderRapidCycles — 16 goroutines × 200 ops hammering claim/reclaim/schedule/cancel; clean under -race -count=20.

fakeConn.closed is now mutex-guarded with an optional closeCh for deterministic synchronisation in expiry tests.

go test -race ./..., go vet ./..., go build ./... all clean.

Architecture compliance

  • Single sync.RWMutex continues to guard all maps (now three) per ADR-0003.
  • Stale-fire race closed by pointer-identity check on a wrapper struct (graceEntry); using *time.Timer directly tripped the race detector because the closure captures the local t variable assigned after time.AfterFunc returns. Wrapping in graceEntry lets the closure capture a stable pointer set before the timer is created.
  • Phones closed outside the lock; matches PhonesFor snapshot pattern.
  • No new sentinels; degenerate durations (d<=0, very large) are safe per security review.

🤖 Generated with Claude Code

ilmoniemi and others added 2 commits May 8, 2026 22:24
…im (#20)

Adds Registry.ScheduleReleaseServer arming a timer that, on expiry,
removes the binary entry and closes phones outside the registry lock.
ClaimServer gains a grace-aware fast path: a reclaim within the window
cancels the pending timer and atomically replaces the binary Conn,
returning nil instead of ErrServerIDConflict. Phones registered during
the grace window inherit the new binary on reclaim or are closed on
expiry.

The expiry handler uses pointer-identity on the timer entry to detect
stale fires (timer already-firing while a Stop+replace runs under the
lock). The wrapper struct (graceEntry) lets the closure capture a
stable pointer set before AfterFunc returns, avoiding the
self-referential local-var race.

Existing ReleaseServer (synchronous) is unchanged. No callers migrate
in this slice; the /v1/server handler swap lives in the sibling ticket.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@ilmoniemi
Copy link
Copy Markdown
Contributor Author

Code Review: #20

Decision: PASS

Findings

  • [SHOULD FIX] internal/relay/registry.go:81-89ClaimServer's doc comment is now stale. It still asserts "First-claim-wins: a second concurrent caller for the same serverID receives ErrServerIDConflict and the conflicting caller's conn is left untouched (the registry does not Close it)." That's no longer strictly true: during a pending grace window, a second caller succeeds (the reclaim path) and returns nil. A reader who only sees ClaimServer's doc has no way to know that its behavior is conditional on r.timers[serverID]. Add a sentence noting the grace-window reclaim semantic and pointing at ScheduleReleaseServer for details, e.g.: "If a grace-period release is pending for serverID (see ScheduleReleaseServer), ClaimServer cancels the pending timer, atomically replaces the binary Conn, inherits the existing phones, and returns nil." Consider rewording the "First-claim-wins" sentence to scope it to the no-timer case.

  • [NIT] internal/relay/registry.go:63-70 — The graceEntry wrapper works, but the doc comment justifies it as avoiding "the race-detector flag on a self-referential local var." The standard var t *time.Timer; t = time.AfterFunc(d, func() { ... captures t ... }) pattern is idiomatic Go (closures capture by reference) and isn't a race-detector violation under any version I'm aware of. The actual benefit of the wrapper is that the closure no longer needs to read the timer pointer at all (the expiry handler compares entry pointers, not timer pointers), so entry.timer only matters on the cancellation paths. If you'd like to keep the wrapper, consider rewording the comment to that effect; otherwise the simpler *time.Timer map works fine.

  • [NIT] internal/relay/registry_test.go:378-399TestScheduleReleaseServer_ReplacesPendingTimer proves the 20ms timer fires (closing p1) but doesn't directly prove the 1-hour timer was stopped — if the second Stop() were a no-op, the test would still pass since p.Close() is idempotent and delete on an absent key is a no-op. The spec acknowledged this trade-off; flagging only so future readers don't over-rely on the test as a cancellation guarantee.

Summary

Faithful implementation of the spec. The pointer-identity stale-fire defense is correct, the lock discipline matches ADR-0003 (single mutex over all three maps), Close() is invoked outside the lock per the established pattern, and goroutine lifetime is bounded by d. Tests are comprehensive and all seven spec'd subtests are present. make test (with -race) and CI's test and security checks are green.

Security goggles applied (label: security-sensitive): the architect's spec contains a Security review section with verdict PASS; the diff matches its findings (no new untrusted-input boundaries, no logging of sensitive data, time.AfterFunc lifetime bounded, timers map cleanup symmetric in ClaimServer grace path and handleGraceExpiry). Phone-side close code 1011 is correctly deferred per the ticket's "Out of scope" section.

Doc-comment polish before merging would be ideal but is not blocking.

Document ScheduleReleaseServer and the grace-window-as-reclaim semantics
shipped in #20. New ADR-0006 captures why a ClaimServer during a pending
grace timer succeeds rather than conflicts, and the pointer-identity
defence against stale time.AfterFunc fires after Stop() returns false.

- ADR-0006: grace window IS the reclaim path
- features/connection-registry.md: ScheduleReleaseServer surface, lock
  graph (now three maps), goroutine lifecycle, expanded testing section
- INDEX.md: ADR-0006 + updated registry one-liner
- PROJECT-MEMORY.md: grace-period entry; new pattern (pointer-identity
  for stale AfterFunc fires)
- lessons.md: Stop() returning false during in-flight fn; why we capture
  a wrapper pointer rather than *time.Timer

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@ilmoniemi ilmoniemi merged commit ba6f01a into main May 8, 2026
2 checks passed
@ilmoniemi ilmoniemi deleted the feature/20 branch May 8, 2026 19:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

relay: registry — schedule deferred binary release with grace-period reclaim

1 participant