Skip to content

fix(ruler): stop per-tenant notifier when rule manager creation fails#7597

Open
sandy2008 wants to merge 3 commits into
cortexproject:masterfrom
sandy2008:fix/ruler-notifier-leak-on-managerfactory-error
Open

fix(ruler): stop per-tenant notifier when rule manager creation fails#7597
sandy2008 wants to merge 3 commits into
cortexproject:masterfrom
sandy2008:fix/ruler-notifier-leak-on-managerfactory-error

Conversation

@sandy2008

Copy link
Copy Markdown
Contributor

What this PR does

Fixes a per-tenant resource leak in the ruler's DefaultMultiTenantManager (issue #7595).

newManager (pkg/ruler/manager.go) starts a per-user notifier before the rule manager exists: getOrCreateNotifier calls n.run() — which spawns the Alertmanager service-discovery and notification goroutines — and registers the notifier in r.notifiers. It also adds a per-user metrics registry via AddUserRegistry. Only after that does it call managerFactory.

If managerFactory returns an error, createRulesManager logs and returns nil, so the user is never added to r.userManagers. The only place a notifier is stopped during normal operation is removeNotifier, called from the user-deletion loop in SyncRuleGroups — which iterates r.userManagers. Because the failed user isn't there, that cleanup never runs: the notifier, its two goroutines, and the per-user metrics registry leak until the process exits.

This PR tears the partially-initialized state down on any early return from newManager, and fixes the same class of leak on the applyConfig error path inside getOrCreateNotifier.

Why this is reachable

DefaultTenantManagerFactory returns real errors in normal operation — e.g. resolveFrontendClient fails, or neither an internal PromQL engine nor a query-frontend client is configured (pkg/ruler/compat.go). A misconfigured or transiently-unresolvable ruler hits this on every sync for the affected tenant; under tenant churn the leaked notifiers, goroutines and registries accumulate and grow memory unbounded.

It was introduced by #6151, which changed ManagerFactory to return an error and made newManager propagate it — without adding cleanup for the notifier it had just started. (The earlier #4718 added removeNotifier only to the user-removal path.)

How the fix resolves it

In newManager, a success flag + defer runs the cleanup (removeNotifier(userID) + RemoveUserRegistry(userID)) on any early return and is disarmed only once a manager is successfully returned, so future early-returns can't reintroduce the leak. This mirrors the existing cleanup in the SyncRuleGroups removal loop. It is deadlock-free: newManager runs under userManagerMtx and removeNotifier takes notifiersMtx; those two locks are always acquired in that order and never the reverse.

In getOrCreateNotifier, if applyConfig fails after n.run(), the notifier was started but not yet inserted into r.notifiers, so it is stopped directly with n.stop() (not removeNotifier, which would re-acquire the already-held notifiersMtx). There is no double-stop: a not-yet-inserted notifier is invisible to the newManager defer's removeNotifier.

Tests

Three regression tests in pkg/ruler/manager_test.go, each verified to fail without the fix and pass with it (under -race, high -count):

  • TestSyncRuleGroupsCleansUpNotifierOnManagerFactoryError — drives the real SyncRuleGroupscreateRulesManagernewManager path with a failing factory; asserts the user is not managed, the notifier is removed from r.notifiers, the per-user registry is removed, and the notifier-run goroutines return to baseline.
  • TestGetOrCreateNotifierStopsNotifierOnApplyConfigError — forces applyConfig to fail (bad Alertmanager TLS CA file) and asserts the notifier and its goroutines are stopped.
  • TestSyncRuleGroupsRecoversAfterManagerFactoryError — a user whose first creation failed is created normally once the factory later succeeds (the cleanup doesn't leave it unrecoverable).

Goroutine leakage is asserted with a pprof-scoped helper that counts only (*rulerNotifier).run goroutines (robust against unrelated goroutines and closure inlining), using a baseline delta.

Why not other approaches

  • Explicit cleanup on each error branch (instead of the disarm-defer): equivalent today, but the defer also covers a panic in managerFactory and any future early-return added before the manager is returned.
  • Stopping the notifier from SyncRuleGroups: the failed user never enters r.userManagers, so the existing removal loop can't see it; the cleanup has to live where the notifier was created.

Scope

Production change is two small edits in pkg/ruler/manager.go; tests in pkg/ruler/manager_test.go; one CHANGELOG.md line. No flags, config, or behavior change beyond the added cleanup.

Two related items are intentionally out of scope (noted for follow-up):

  • The pre-existing stale-notifier reuse branch in getOrCreateNotifier (which re-calls run()) becomes unreachable for the failure scenario after this fix; its now-misleading comment is left untouched to keep this PR tight.
  • syncRulesToManager also allocates mapper rule files / external-label map entries / a lastReloadSuccessful series before manager creation; those likewise persist for a perpetually-failing user. That is a distinct (bounded, non-goroutine) leak and a separate change.

Which issue(s) this PR fixes

Fixes #7595

Checklist

  • CHANGELOG.md updated — [BUGFIX] Ruler entry.
  • Documentation updated — N/A; no flags or config changed (make doc not required).
  • Tests: three regression tests added, each fails without the fix.
  • Commit signed off (DCO).

Test plan

  • gofmt -l / goimports -local github.com/cortexproject/cortex -l — clean
  • go vet -tags "netgo slicelabels" ./pkg/ruler/ — clean
  • go test -tags "netgo slicelabels" -race -count=10 -run 'TestSyncRuleGroupsCleansUpNotifierOnManagerFactoryError|TestGetOrCreateNotifierStopsNotifierOnApplyConfigError|TestSyncRuleGroupsRecoversAfterManagerFactoryError' ./pkg/ruler/ — PASS
  • go test -tags "netgo slicelabels" -count=1 ./pkg/ruler/ — PASS (full package, ~35s)
  • Reverting manager.go to master makes all three new tests fail.

🤖 Generated with Claude Code

@sandy2008 sandy2008 force-pushed the fix/ruler-notifier-leak-on-managerfactory-error branch from 747c0e4 to 612ad2f Compare June 7, 2026 12:15
@sandy2008 sandy2008 marked this pull request as ready for review June 7, 2026 13:09
@dosubot dosubot Bot added component/rules Bits & bobs todo with rules and alerts: the ruler, config service etc. type/bug labels Jun 7, 2026
@sandy2008 sandy2008 closed this Jun 7, 2026
@sandy2008 sandy2008 reopened this Jun 7, 2026
Comment thread pkg/ruler/manager.go Outdated
@@ -374,6 +399,12 @@ func (r *DefaultMultiTenantManager) getOrCreateNotifier(userID string, userManag

// This should never fail, unless there's a programming mistake.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment is ambiguous.. wdyt?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworded it. The original was ambiguous on two counts: "This" had no clear referent, and "should never fail … unless there's a programming mistake" read oddly directly above the new leak-cleanup. The replacement states the actual invariant instead:

// r.notifierCfg is built once at startup (buildNotifierConfig) and applied
// unchanged to every tenant's notifier, so applyConfig does not depend on
// per-tenant or user input and is not expected to fail here. Handle the
// error defensively anyway:

So the error branch is purely defensive — it stops the discovery/notification goroutines n.run() already started, which would otherwise leak. Done in 375cf67. Let me know if you were getting at something different.

sandy2008 and others added 3 commits June 22, 2026 11:00
newManager started the per-user notifier (and its discovery/notification
goroutines) and registered it in r.notifiers before creating the rule
manager. When managerFactory returned an error the user was never added to
r.userManagers, so the removal loop in SyncRuleGroups never stopped the
notifier, leaking it and its goroutines until the process exited. The same
applied to the applyConfig error path in getOrCreateNotifier.

Tear down the partially-initialized notifier and per-user metrics registry
on any early return from newManager (disarmed on success), and stop the
notifier directly when applyConfig fails. Adds regression tests for both
leak paths.

Fixes cortexproject#7595

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
The lint job's check-modernize step failed because the notifier-leak
regression tests used pre-modern idioms that `make modernize`
(golang.org/x/tools modernize@v0.22.0) rewrites:

- func() interface{} -> func() any (3x)
- for _, x := range bytes.Split(...) -> for x := range bytes.SplitSeq(...)

Apply exactly what the pinned modernize tool produces so
`make check-modernize` reports a clean tree. No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
The pre-existing "This should never fail, unless there's a programming
mistake" comment was ambiguous, especially next to the newly added
notifier-leak cleanup on the error path. Replace it with the concrete
invariant: r.notifierCfg is built once at startup and applied unchanged
to every tenant's notifier, so applyConfig does not depend on per-tenant
or user input and is not expected to fail here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
@sandy2008 sandy2008 force-pushed the fix/ruler-notifier-leak-on-managerfactory-error branch from 3f862ca to 375cf67 Compare June 22, 2026 02:03
@dosubot dosubot Bot added the lgtm This PR has been approved by a maintainer label Jun 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

component/rules Bits & bobs todo with rules and alerts: the ruler, config service etc. lgtm This PR has been approved by a maintainer size/L type/bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ruler: per-tenant notifier and its service-discovery goroutines leak when managerFactory returns an error

2 participants