OCPBUGS-86229: apiserver: cache etcd storage monitors to avoid recreating clients on each metrics scrape#2674
OCPBUGS-86229: apiserver: cache etcd storage monitors to avoid recreating clients on each metrics scrape#2674tchap wants to merge 1 commit into
Conversation
…reating clients on each metrics scrape Signed-off-by: xigang <wangxigang2014@gmail.com>
|
@tchap: This pull request references Jira Issue OCPBUGS-86229, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@tchap: the contents of this pull request could be automatically validated. The following commits are valid:
Comment |
WalkthroughThis PR refactors etcd storage monitor lifecycle from per-call creation to a cached, thread-safe initialization. A new ChangesMonitor cache implementation and lifecycle
Sequence DiagramsequenceDiagram
participant Server as Server Init
participant Cache as monitorCache
participant Factory as StorageFactory
participant Caller as Metric Caller
participant Cleanup as Cleanup Goroutine
Server->>Cache: newMonitorCache()
Server->>Cache: register get as StorageMonitorGetter
Caller->>Cache: get() [first call]
Cache->>Cache: acquire WLock
Cache->>Factory: createMonitor loop
Factory-->>Cache: [Monitor1, Monitor2, ...]
Cache->>Cache: store in cache
Cache-->>Caller: [Monitor1, Monitor2, ...]
Caller->>Cache: get() [second call]
Cache->>Cache: acquire RLock
Cache-->>Caller: [Monitor1, Monitor2, ...] (cached)
Server->>Cleanup: stopCh closed
Cleanup->>Cache: close all monitors
Cleanup->>Cache: clear cache
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 13 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (13 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: tchap The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
@tchap: This pull request references Jira Issue OCPBUGS-86229, which is invalid:
Comment DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
@tchap: This pull request references Jira Issue OCPBUGS-86229, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
staging/src/k8s.io/apiserver/pkg/server/options/etcd_test.go (1)
500-645: ⚡ Quick winConsider adding test coverage for empty configs and partial initialization failure.
The test suite covers the happy path well, but consider adding tests for:
factory.Configs()returning an empty slice (this would catch the goroutine leak bug ininitialize())createMonitorreturning an error mid-iteration to verify partial initialization cleanup🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@staging/src/k8s.io/apiserver/pkg/server/options/etcd_test.go` around lines 500 - 645, Add two new subtests inside TestMonitorCache: one to exercise an empty factory.Configs() by using a SimpleStorageFactory variant that returns an empty slice and asserting cache.get() behaves/returns an empty monitor list (and that no goroutines leak by still allowing stopCh cleanup via newTestCache), and a second that simulates createMonitor failing mid-iteration by using setCreateMonitor so the first call returns a valid fakeMonitor and the second returns an error; assert that the error propagates from cache.get(), the first monitor's closeCalls increased (cleanup happened), and cache.monitors does not contain partially initialized entries. Use the existing helpers newTestCache, setCreateMonitor, fakeMonitor, createMonitor, cache.get, cache.close, and monitor.closeCalls to locate and implement these tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@staging/src/k8s.io/apiserver/pkg/server/options/etcd.go`:
- Around line 304-316: The code can leave c.monitors nil when
c.factory.Configs() returns an empty slice, causing get() to repeatedly call
initialize() and leak goroutines; to fix it, ensure monitors is a non-nil empty
slice before assigning to c.monitors (e.g., initialize monitors :=
make([]metrics.Monitor, 0, len(c.factory.Configs())) or ensure c.monitors =
[]metrics.Monitor{} when no monitors were created), so checks like c.monitors !=
nil behave correctly; update the block using createMonitor, c.factory.Configs(),
and the assignment to c.monitors in initialize()/get() accordingly.
---
Nitpick comments:
In `@staging/src/k8s.io/apiserver/pkg/server/options/etcd_test.go`:
- Around line 500-645: Add two new subtests inside TestMonitorCache: one to
exercise an empty factory.Configs() by using a SimpleStorageFactory variant that
returns an empty slice and asserting cache.get() behaves/returns an empty
monitor list (and that no goroutines leak by still allowing stopCh cleanup via
newTestCache), and a second that simulates createMonitor failing mid-iteration
by using setCreateMonitor so the first call returns a valid fakeMonitor and the
second returns an error; assert that the error propagates from cache.get(), the
first monitor's closeCalls increased (cleanup happened), and cache.monitors does
not contain partially initialized entries. Use the existing helpers
newTestCache, setCreateMonitor, fakeMonitor, createMonitor, cache.get,
cache.close, and monitor.closeCalls to locate and implement these tests.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f53702a7-374f-401a-ba1f-a090c1eb2529
📒 Files selected for processing (3)
staging/src/k8s.io/apiserver/pkg/server/options/etcd.gostaging/src/k8s.io/apiserver/pkg/server/options/etcd_test.gostaging/src/k8s.io/apiserver/pkg/storage/etcd3/metrics/metrics.go
💤 Files with no reviewable changes (1)
- staging/src/k8s.io/apiserver/pkg/storage/etcd3/metrics/metrics.go
|
I am actually going to fix the upstream issue found by CR and get back to this. /hold |
|
I created kubernetes#139358 . It's actually not critical as the issue is only theoretical right now since /unhold |
|
/retest |
|
@tchap: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
The apiserver was creating new etcd storage monitor clients on every Prometheus metrics scrape cycle and immediately closing them afterward. This caused repeated gRPC connection churn against etcd, as each scrape round-tripped through client creation and teardown.
There are multiple client cases linked to this, either complaining about warnings or about memory consumption, so we decided on RIT to cherry-pick this and not wait for 1.37 to be incorporated.
Which issue(s) this PR is related to:
n/a
Special notes for your reviewer:
n/a
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
n/a
Summary by CodeRabbit
Refactor
Tests