[CONTP-1569] Add podCollectionMode field on kubeStateMetricsCore for node-side pod collection #3027

Open
zhuminyi wants to merge 2 commits into main from minyi/ksm-pod-collection-on-node

zhuminyi (Contributor) commented May 18, 2026

What does this PR do?

Adds features.kubeStateMetricsCore.podCollectionMode to the v2alpha1 DatadogAgent CRD. The valid
options are default and node_kubelet.

  apiVersion: datadoghq.com/v2alpha1
  kind: DatadogAgent
  metadata:
    name: datadog
    namespace: datadog
  spec:
    features:
      clusterChecks:
        enabled: true
        useClusterChecksRunners: true  
      kubeStateMetricsCore:
        enabled: true
        podCollectionMode: node_kubelet   # THE SINGLE NEW TOGGLE 

When set to node_kubelet, the operator:

  • Injects pod_collection_mode: cluster_unassigned into the operator-generated cluster-side KSM
    ConfigMap (skipped if the user supplied their own conf — the operator never modifies user-supplied
    YAML).
  • Generates a second ConfigMap with a pods-only check (pod_collection_mode: node_kubelet, collectors: [pods]).

This is equivalent to the following config:

spec:
  features:
    kubeStateMetricsCore:
      enabled: true
      conf:
        configData: |-
          cluster_check: true
          init_config:
          instances:
            - skip_leader_election: true
              pod_collection_mode: cluster_unassigned

  override:
    nodeAgent:
      extraConfd:
        configDataMap:
          kubernetes_state_core.yaml: |-
            init_config:
            instances:
              - collectors:
                  - pods
                pod_collection_mode: node_kubelet

Result: scheduled-pod metrics are emitted by each node-agent locally from the Kubelet (no API-server
traffic for pods); unscheduled pods plus every other KSM resource continue to be collected by the
cluster-side check. Default behavior is byte-identical to today when the field is unset.
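
As a rough illustration of the split described above, the following Go sketch renders the two check configs from a mode value. The function and constant names (renderClusterSide, renderNodeSide, modeNodeKubelet) are hypothetical and are not taken from the operator's code; only the YAML keys and values come from this description, and the real operator-generated config may contain more fields.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical constants mirroring the CRD enum values named in this PR.
const (
	modeDefault     = "default"
	modeNodeKubelet = "node_kubelet"
)

// renderClusterSide returns a simplified cluster-side check config.
// In node_kubelet mode it adds pod_collection_mode: cluster_unassigned so
// the cluster check keeps only pods that are not scheduled to a node.
func renderClusterSide(mode string) string {
	var b strings.Builder
	b.WriteString("cluster_check: true\n")
	b.WriteString("init_config:\n")
	b.WriteString("instances:\n")
	b.WriteString("  - skip_leader_election: true\n")
	if mode == modeNodeKubelet {
		b.WriteString("    pod_collection_mode: cluster_unassigned\n")
	}
	return b.String()
}

// renderNodeSide returns the pods-only node-side check config, or an empty
// string when the mode does not call for a node-side check.
func renderNodeSide(mode string) string {
	if mode != modeNodeKubelet {
		return ""
	}
	return "init_config:\n" +
		"instances:\n" +
		"  - collectors:\n" +
		"      - pods\n" +
		"    pod_collection_mode: node_kubelet\n"
}

func main() {
	fmt.Print(renderClusterSide(modeNodeKubelet))
	fmt.Println("---")
	fmt.Print(renderNodeSide(modeNodeKubelet))
}
```

With the mode left at default, the sketch produces the unchanged cluster-side config and no node-side config, which matches the "unset means no change" guarantee stated above.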

Motivation

In large clusters, pod metrics dominate the cardinality of the monolithic kubernetes_state_core
cluster check, making it the scaling bottleneck. The agent has supported offloading pod collection to
node-agents via pod_collection_mode: node_kubelet since 7.58, but customers had to assemble two
coordinated configs manually (features.kubeStateMetricsCore.conf.configData for the cluster side,
override.nodeAgent.extraConfd for the node side) and carefully avoid double-collection or losing
unscheduled-pod metrics. This field collapses that into a single typed toggle.

Describe your test plan

Unit tests added in this PR:

  • TestKsmCheckConfigPodCollectionOnNode — asserts pod_collection_mode: cluster_unassigned is emitted
    in the cluster-side ConfigMap when the flag is set, and absent when it isn't.
  • Test_ksmFeature_buildKSMCorePodsOnNodeConfigMap — asserts the node-side ConfigMap shape (no
    cluster_check, collectors: [pods], pod_collection_mode: node_kubelet).

e2e test

# Both ConfigMaps generated by the operator:
kubectl get cm -n datadog | grep state-metrics
# Expect:
#   <dda>-kube-state-metrics-core-config                 (cluster-side, with cluster_unassigned)
#   <dda>-kube-state-metrics-core-pods-on-node-config    (node-side, pods-only node_kubelet)

# Node-agent mounts the node-side check:
kubectl get ds -n datadog datadog-agent \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="agent")].volumeMounts[?(@.name=="ksm-core-pods-on-node-config")]}'
# Expect: mountPath /etc/datadog-agent/conf.d/kubernetes_state_core.d, readOnly: true

# Node-agent KSM check is loaded and running:
kubectl exec -n datadog ds/datadog-agent -c agent -- agent configcheck \
  | sed -n '/kubernetes_state_core check/,/===/p'
# Expect: pod_collection_mode: node_kubelet, collectors: [pods]

kubectl exec -n datadog ds/datadog-agent -c agent -- agent status \
  | awk '/kubernetes_state_core/,/Status:/' | grep -E "Total Runs|Metric Samples"
# Expect Total Runs > 0 and Metric Samples > 0 after ~30s.

# Cluster-side check is dispatched with cluster_unassigned:
kubectl exec -n datadog deploy/datadog-cluster-agent -- agent clusterchecks \
  | sed -n '/===== Checks on .*cluster-checks-runner/,/===== /p' \
  | grep -E "kubernetes_state_core|pod_collection_mode"
# Expect: pod_collection_mode: cluster_unassigned

Minimum Agent Versions

  • Agent: v7.60.0
  • Cluster Agent: v7.60.0

datadog-official Bot commented May 18, 2026

Code Coverage

🛑 Gate Violations

🎯 1 Code Coverage issue detected

A Patch coverage percentage gate may be blocking this PR.

Patch coverage: 79.12% (threshold: 80.00%)

ℹ️ Info

🎯 Code Coverage (details)
Patch Coverage: 79.12%
Overall Coverage: 42.16% (+0.09%)

🔗 Commit SHA: 27a9fc0

codecov-commenter commented May 18, 2026

Codecov Report

❌ Patch coverage is 78.57143% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.87%. Comparing base (4e59e1d) to head (27a9fc0).
⚠️ Report is 1 commits behind head on main.

Files with missing lines | Patch % | Lines
pkg/testutils/builder.go | 0.00% | 12 Missing ⚠️
...atadogagent/feature/kubernetesstatecore/feature.go | 85.24% | 4 Missing and 5 partials ⚠️

@@            Coverage Diff             @@
##             main    #3027      +/-   ##
==========================================
+ Coverage   41.50%   41.87%   +0.36%     
==========================================
  Files         335      336       +1     
  Lines       28714    29182     +468     
==========================================
+ Hits        11919    12221     +302     
- Misses      16001    16144     +143     
- Partials      794      817      +23     
Flag | Coverage Δ
unittests | 41.87% <78.57%> (+0.36%) ⬆️

Files with missing lines | Coverage Δ
...adogagent/feature/kubernetesstatecore/configmap.go | 97.84% <100.00%> (+0.62%) ⬆️
.../datadogagent/feature/kubernetesstatecore/const.go | 100.00% <ø> (ø)
...atadogagent/feature/kubernetesstatecore/feature.go | 80.62% <85.24%> (+1.36%) ⬆️
pkg/testutils/builder.go | 0.00% <0.00%> (ø)

... and 5 files with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Introduces features.kubeStateMetricsCore.podCollectionMode on the v2alpha1
DatadogAgent CRD (enum: default | node_kubelet).

When set to node_kubelet, the operator:

- injects pod_collection_mode: cluster_unassigned into the operator-generated
  cluster-side KSM ConfigMap (skipped when the user supplies their own .Conf
  override; the operator never mutates user-supplied YAML);
- generates a second ConfigMap with a pods-only check (pod_collection_mode:
  node_kubelet, collectors: [pods]);
- mounts that ConfigMap into every node agent (multi-container and
  single-container) at /etc/datadog-agent/conf.d/kubernetes_state_core.d/.

The resolved mode is part of the default-config checksum, so toggling the
field changes the cluster-agent pod-template annotation and forces a rollout
instead of leaving the in-memory ConfigMap stale.
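
The checksum behavior can be pictured with a minimal Go sketch. It assumes SHA-256 and a hypothetical configChecksum helper; the hash function, the inputs, and the annotation key the operator actually uses are not shown in this description.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// configChecksum hashes the inputs that determine the generated default
// config. Including the resolved mode means that toggling the field yields
// a different checksum, and therefore a different pod-template annotation.
// The function name and the choice of SHA-256 are assumptions.
func configChecksum(baseConfig, resolvedMode string) string {
	h := sha256.New()
	h.Write([]byte(baseConfig))
	h.Write([]byte{0}) // separator so ("ab", "c") and ("a", "bc") hash differently
	h.Write([]byte(resolvedMode))
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	base := "cluster_check: true\n"
	before := configChecksum(base, "default")
	after := configChecksum(base, "node_kubelet")
	fmt.Println("checksum changed:", before != after)
}
```

Because the annotation lives on the pod template, a changed value is enough for the Deployment controller to roll the pods without any extra restart logic.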

Version gate: when either the cluster-side component (cluster-checks-runner
if enabled, otherwise cluster-agent) OR the node-agent override image is
parseable AND below 7.60, the operator skips the feature with a warning log
rather than mounting an unsupported file into an older node-agent.
Unparseable tags (:dev, :latest, custom registries) are assumed compatible.
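
A minimal Go sketch of such a gate is shown below. The helper names (parseMajorMinor, supportsNodeKubelet, featureAllowed) and the tag-parsing rules are assumptions for illustration; the operator's own version utilities may behave differently on edge cases such as pre-release suffixes.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMajorMinor extracts major and minor from an image reference such as
// "gcr.io/datadoghq/agent:7.60.0". The bool is false when the tag is not a
// recognizable version (":dev", ":latest", or no tag at all).
func parseMajorMinor(image string) (int, int, bool) {
	// The tag follows the last ':' only if that ':' comes after the last '/';
	// otherwise the ':' belongs to a registry port.
	colon := strings.LastIndex(image, ":")
	if colon < 0 || colon < strings.LastIndex(image, "/") {
		return 0, 0, false
	}
	tag := strings.TrimPrefix(image[colon+1:], "v")
	parts := strings.SplitN(tag, ".", 3)
	if len(parts) < 2 {
		return 0, 0, false
	}
	major, errMajor := strconv.Atoi(parts[0])
	minor, errMinor := strconv.Atoi(parts[1])
	if errMajor != nil || errMinor != nil {
		return 0, 0, false
	}
	return major, minor, true
}

// supportsNodeKubelet reports whether one image passes the 7.60 gate.
// Unparseable tags are assumed compatible, as described above.
func supportsNodeKubelet(image string) bool {
	major, minor, ok := parseMajorMinor(image)
	if !ok {
		return true
	}
	return major > 7 || (major == 7 && minor >= 60)
}

// featureAllowed applies the gate to both sides: the feature is skipped if
// either image is parseable and below 7.60.
func featureAllowed(clusterSideImage, nodeAgentImage string) bool {
	return supportsNodeKubelet(clusterSideImage) && supportsNodeKubelet(nodeAgentImage)
}

func main() {
	fmt.Println(featureAllowed("gcr.io/datadoghq/agent:7.60.0", "gcr.io/datadoghq/agent:7.61.0"))
	fmt.Println(featureAllowed("gcr.io/datadoghq/agent:7.60.0", "gcr.io/datadoghq/agent:7.59.1"))
	fmt.Println(featureAllowed("gcr.io/datadoghq/agent:latest", "registry.local:5000/agent:dev"))
}
```

Treating unparseable tags as compatible favors users on custom builds: a false "incompatible" verdict would silently disable the feature, whereas a false "compatible" verdict is visible in the agent's check status.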

When podCollectionMode=node_kubelet is set alongside
features.kubeStateMetricsCore.conf, the operator still deploys the node-side
check but logs a descriptive warning pointing the user at the two valid ways
to avoid double pod collection (omit pods from collectors, or set
pod_collection_mode: cluster_unassigned themselves).
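
The interaction between the mode and a user-supplied conf can be summarized as a small decision table. The Go sketch below encodes it with a hypothetical plan struct and resolvePlan function; the names are illustrative, and only the three behaviors come from this description.

```go
package main

import "fmt"

// plan lists the actions taken for one DatadogAgent. All names here are
// hypothetical; only the behavior is taken from the description above.
type plan struct {
	injectClusterUnassigned bool // add pod_collection_mode: cluster_unassigned to the generated cluster-side config
	deployNodeSideCheck     bool // generate and mount the pods-only node-side check
	warnDoubleCollection    bool // log a warning about possible double pod collection
}

// resolvePlan maps the CRD field and the presence of a user-supplied conf
// to the actions the operator is described as taking.
func resolvePlan(mode string, userSuppliedConf bool) plan {
	if mode != "node_kubelet" {
		return plan{} // unset or default: nothing changes
	}
	return plan{
		injectClusterUnassigned: !userSuppliedConf, // user YAML is never mutated
		deployNodeSideCheck:     true,
		warnDoubleCollection:    userSuppliedConf,
	}
}

func main() {
	fmt.Printf("%+v\n", resolvePlan("node_kubelet", false))
	fmt.Printf("%+v\n", resolvePlan("node_kubelet", true))
	fmt.Printf("%+v\n", resolvePlan("default", true))
}
```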

Default behavior is byte-identical to today when the field is unset.
@zhuminyi zhuminyi force-pushed the minyi/ksm-pod-collection-on-node branch from dcdead3 to 9c80b42 Compare May 19, 2026 02:31
@zhuminyi zhuminyi changed the title Minyi/ksm pod collection on node [CONTP-1569] Add podCollectionMode field on kubeStateMetricsCore for node-side pod collection May 19, 2026
@zhuminyi zhuminyi marked this pull request as ready for review May 19, 2026 02:51
@zhuminyi zhuminyi requested a review from a team May 19, 2026 02:51
@zhuminyi zhuminyi requested review from a team as code owners May 19, 2026 02:51
Comment thread api/datadoghq/v2alpha1/datadogagent_types.go Outdated
Comment thread internal/controller/datadogagent/feature/kubernetesstatecore/configmap_test.go Outdated
Comment thread internal/controller/datadogagent/feature/kubernetesstatecore/configmap_test.go Outdated
- Clarify PodCollectionMode godoc to disambiguate snake_case agent check
  options (pod_collection_mode in the rendered YAML) from the camelCase
  CRD field name (podCollectionMode).
- Replace manual if/t.Fatalf checks in Test_ksmFeature_buildKSMCorePodsOnNodeConfigMap
  with require.True / require.NoError for consistency with the rest of the
  package's test style.
- Use deep-equality on the parsed YAML in that test so future accidental
  additions to the generated node-side ConfigMap (e.g. an unwanted
  cluster_check field) fail the test rather than silently passing.
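
To show why deep equality is the stricter assertion, here is a small Go sketch. It assumes the generated YAML has already been parsed into generic maps, and the helper names (expectedNodeSide, hasMode, matchesExactly) are hypothetical rather than taken from the test file.

```go
package main

import (
	"fmt"
	"reflect"
)

// expectedNodeSide is the parsed shape the node-side check is expected to
// have, written as the generic maps a YAML parser would produce.
func expectedNodeSide() map[string]any {
	return map[string]any{
		"init_config": nil,
		"instances": []any{
			map[string]any{
				"collectors":          []any{"pods"},
				"pod_collection_mode": "node_kubelet",
			},
		},
	}
}

// hasMode is a field-level check: it looks only at pod_collection_mode and
// ignores everything else in the config.
func hasMode(cfg map[string]any) bool {
	instances, ok := cfg["instances"].([]any)
	if !ok || len(instances) == 0 {
		return false
	}
	inst, ok := instances[0].(map[string]any)
	return ok && inst["pod_collection_mode"] == "node_kubelet"
}

// matchesExactly is the deep-equality check: any extra or missing key fails.
func matchesExactly(cfg map[string]any) bool {
	return reflect.DeepEqual(cfg, expectedNodeSide())
}

func main() {
	got := expectedNodeSide()
	got["cluster_check"] = true // an accidental addition

	fmt.Println("field check passes:", hasMode(got))
	fmt.Println("deep equality passes:", matchesExactly(got))
}
```

The field-level check still passes after the stray cluster_check key is added, while the deep-equality check fails, which is the regression the commit message is guarding against.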