[CONTP-1569] Add podCollectionMode field on kubeStateMetricsCore for node-side pod collection #3027
zhuminyi wants to merge 2 commits into
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #3027 +/- ##
==========================================
+ Coverage 41.50% 41.87% +0.36%
==========================================
Files 335 336 +1
Lines 28714 29182 +468
==========================================
+ Hits 11919 12221 +302
- Misses 16001 16144 +143
- Partials 794 817 +23
... and 5 files with indirect coverage changes
Introduces features.kubeStateMetricsCore.podCollectionMode on the v2alpha1 DatadogAgent CRD (enum: default | node_kubelet).

When set to node_kubelet, the operator:
- injects pod_collection_mode: cluster_unassigned into the operator-generated cluster-side KSM ConfigMap (skipped when the user supplies their own .Conf override; the operator never mutates user-supplied YAML);
- generates a second ConfigMap with a pods-only check (pod_collection_mode: node_kubelet, collectors: [pods]);
- mounts that ConfigMap into every node agent (multi-container and single-container) at /etc/datadog-agent/conf.d/kubernetes_state_core.d/.

The resolved mode is part of the default-config checksum, so toggling the field changes the cluster-agent pod-template annotation and forces a rollout instead of leaving the in-memory ConfigMap stale.

Version gate: when either the cluster-side component (cluster-checks-runner if enabled, otherwise cluster-agent) OR the node-agent override image is parseable AND below 7.60, the operator skips the feature with a warning log rather than mounting an unsupported file into an older node-agent. Unparseable tags (:dev, :latest, custom registries) are assumed compatible.

When podCollectionMode=node_kubelet is set alongside features.kubeStateMetricsCore.conf, the operator still deploys the node-side check but logs a descriptive warning pointing the user at the two valid ways to avoid double pod collection (omit pods from collectors, or set pod_collection_mode: cluster_unassigned themselves).

Default behavior is byte-identical to today when the field is unset.
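As a rough illustration of the version gate described in this commit message, the sketch below parses an image tag and applies the 7.60 floor, treating anything it cannot parse as compatible. It uses only the Go standard library; the helper names (`parseMajorMinor`, `supportsNodeKubelet`, `gateAllows`) and the example image references are hypothetical and are not taken from the operator's code, which may rely on its own version utilities.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Minimum agent version quoted in the commit message (7.60).
const (
	minMajor = 7
	minMinor = 60
)

// leadingInt parses the run of ASCII digits at the start of s, so "60-rc1"
// yields 60. The bool is false when s does not start with a digit.
func leadingInt(s string) (int, bool) {
	end := 0
	for end < len(s) && s[end] >= '0' && s[end] <= '9' {
		end++
	}
	if end == 0 {
		return 0, false
	}
	n, err := strconv.Atoi(s[:end])
	return n, err == nil
}

// parseMajorMinor extracts major and minor from an image reference such as
// "registry/agent:7.61.0-jmx". The bool is false when there is no tag or the
// tag does not start with a numeric major.minor pair (":dev", ":latest").
func parseMajorMinor(image string) (int, int, bool) {
	// Only look for ':' after the last '/', so a registry port such as
	// "example.local:5000/agent" is not mistaken for a tag.
	name := image[strings.LastIndex(image, "/")+1:]
	colon := strings.LastIndex(name, ":")
	if colon < 0 {
		return 0, 0, false
	}
	parts := strings.SplitN(name[colon+1:], ".", 3)
	if len(parts) < 2 {
		return 0, 0, false
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return 0, 0, false
	}
	minor, ok := leadingInt(parts[1])
	if !ok {
		return 0, 0, false
	}
	return major, minor, true
}

// supportsNodeKubelet reports whether an image passes the gate. Unparseable
// tags are assumed compatible, as the commit message states.
func supportsNodeKubelet(image string) bool {
	major, minor, ok := parseMajorMinor(image)
	if !ok {
		return true
	}
	return major > minMajor || (major == minMajor && minor >= minMinor)
}

// gateAllows requires both the cluster-side image and the node-agent image
// to pass; a single parseable image below the floor skips the feature.
func gateAllows(clusterSideImage, nodeAgentImage string) bool {
	return supportsNodeKubelet(clusterSideImage) && supportsNodeKubelet(nodeAgentImage)
}

func main() {
	fmt.Println(gateAllows("gcr.io/datadoghq/cluster-agent:7.62.0", "gcr.io/datadoghq/agent:7.61.0-jmx"))
	fmt.Println(gateAllows("gcr.io/datadoghq/cluster-agent:7.62.0", "gcr.io/datadoghq/agent:7.58.0"))
	fmt.Println(gateAllows("example.local:5000/cluster-agent:dev", "agent:latest"))
}
```

Failing open on unparseable tags keeps development and custom-registry images working, at the cost of not protecting users who run an old agent under a non-semantic tag.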
dcdead3 to 9c80b42
- Clarify PodCollectionMode godoc to disambiguate snake_case agent check options (pod_collection_mode in the rendered YAML) from the camelCase CRD field name (podCollectionMode).
- Replace manual if/t.Fatalf checks in Test_ksmFeature_buildKSMCorePodsOnNodeConfigMap with require.True / require.NoError for consistency with the rest of the package's test style.
- Use deep-equality on the parsed YAML in that test so future accidental additions to the generated node-side ConfigMap (e.g. an unwanted cluster_check field) fail the test rather than silently passing.
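To show why deep equality catches accidental additions where a field-by-field check does not, here is a minimal standard-library sketch. It compares hand-built maps standing in for a parsed check instance; the helper names are hypothetical, and the actual test in this PR uses the require package on parsed YAML rather than the `reflect` calls shown here.

```go
package main

import (
	"fmt"
	"reflect"
)

// expectedNodeInstance is the shape the node-side check instance should have.
func expectedNodeInstance() map[string]any {
	return map[string]any{
		"collectors":          []any{"pods"},
		"pod_collection_mode": "node_kubelet",
	}
}

// hasExpectedFields is the weaker check: it only verifies that every wanted
// key is present with the wanted value, and ignores any extra keys.
func hasExpectedFields(got, want map[string]any) bool {
	for k, v := range want {
		gv, ok := got[k]
		if !ok || !reflect.DeepEqual(gv, v) {
			return false
		}
	}
	return true
}

// exactMatch is the stricter check: extra keys in got make it fail.
func exactMatch(got, want map[string]any) bool {
	return reflect.DeepEqual(got, want)
}

func main() {
	want := expectedNodeInstance()

	// Simulate a regression that adds an unwanted field.
	regressed := expectedNodeInstance()
	regressed["cluster_check"] = true

	fmt.Println("field check passes:", hasExpectedFields(regressed, want))
	fmt.Println("deep equality passes:", exactMatch(regressed, want))
}
```

The field check accepts the regressed map because all expected keys are still present, while the deep-equality check rejects it.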
What does this PR do?
Adds `features.kubeStateMetricsCore.podCollectionMode` (options are `default` and `node_kubelet`) on the `v2alpha1` `DatadogAgent` CRD.

When set to `node_kubelet`, the operator:
- Injects `pod_collection_mode: cluster_unassigned` into the operator-generated cluster-side KSM ConfigMap (skipped if the user supplied their own `conf`; the operator never modifies user-supplied YAML).
- Generates a second ConfigMap with a pods-only check (`pod_collection_mode: node_kubelet`, `collectors: [pods]`) and mounts it into every node agent.

This is equivalent to the following config:
Result: scheduled-pod metrics are emitted by each node-agent locally from the Kubelet (no API-server
traffic for pods); unscheduled pods plus every other KSM resource continue to be collected by the
cluster-side check. Default behavior is byte-identical to today when the field is unset.
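The split described above can be sketched as two small render functions, one per side. This is a simplified illustration under stated assumptions: the function names, the collector list passed by the caller, and the exact YAML layout are hypothetical, and the operator's real templates contain more settings than shown here.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	modeDefault     = "default"
	modeNodeKubelet = "node_kubelet"
)

// clusterCheckConfig renders a simplified cluster-side check config. The
// pod_collection_mode line is only added for node_kubelet, so the output for
// the default mode does not change.
func clusterCheckConfig(mode string, collectors []string) string {
	var b strings.Builder
	b.WriteString("cluster_check: true\ninit_config:\ninstances:\n  - collectors:\n")
	for _, c := range collectors {
		fmt.Fprintf(&b, "      - %s\n", c)
	}
	if mode == modeNodeKubelet {
		b.WriteString("    pod_collection_mode: cluster_unassigned\n")
	}
	return b.String()
}

// nodeCheckConfig renders the pods-only node-side config. The bool is false
// in default mode, where no node-side config is generated at all. Note that
// the output carries no cluster_check key.
func nodeCheckConfig(mode string) (string, bool) {
	if mode != modeNodeKubelet {
		return "", false
	}
	return "init_config:\ninstances:\n  - collectors:\n      - pods\n    pod_collection_mode: node_kubelet\n", true
}

func main() {
	collectors := []string{"pods", "nodes", "deployments"}

	fmt.Print(clusterCheckConfig(modeNodeKubelet, collectors))
	if node, ok := nodeCheckConfig(modeNodeKubelet); ok {
		fmt.Print("---\n", node)
	}
}
```

In this sketch the cluster side keeps `pods` in its collector list, and `cluster_unassigned` restricts it to pods that have no node yet, which is how unscheduled pods remain covered.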
Motivation
In large clusters, pod metrics dominate the cardinality of the monolithic `kubernetes_state_core` cluster check, making it the scaling bottleneck. The agent has supported offloading pod collection to node-agents via `pod_collection_mode: node_kubelet` since 7.58, but customers had to assemble two coordinated configs manually (`features.kubeStateMetricsCore.conf.configData` for the cluster side, `override.nodeAgent.extraConfd` for the node side) and carefully avoid double-collection or losing unscheduled-pod metrics. This field collapses that into a single typed toggle.
Describe your test plan
Unit tests added in this PR:
- `TestKsmCheckConfigPodCollectionOnNode`: asserts `pod_collection_mode: cluster_unassigned` is emitted in the cluster-side ConfigMap when the flag is set, and absent when it isn't.
- `Test_ksmFeature_buildKSMCorePodsOnNodeConfigMap`: asserts the node-side ConfigMap shape (no `cluster_check`, `collectors: [pods]`, `pod_collection_mode: node_kubelet`).
e2e test
Minimum Agent Versions