
Commit 2fd183a

Introduce a k8s based lading example
This commit demonstrates how one might use lading in a k8s environment to determine the memory bounds for a target container, in this case Datadog Agent. The example used here is harsh and comes from Datadog Agent's own Regression Detector experiment `uds_dogstatsd_to_api`. Run like so:

```
./k8s/experiment.sh --total-limit 1200 --agent-memory 700 --trace-memory 100 --sysprobe-memory 300 --process-memory 100 --duration 600 --tags "purpose:smp-experiment,agent-limit:2048"
```

This invocation demonstrates a memory allocation that works for the Agent under these conditions; results:

```
========================================
RESULT: SUCCESS
========================================
No restarts detected
Test duration: 600 seconds
Tags: purpose:smp-experiment,agent-limit:2048
Container memory usage:
 agent: 640.67 MB / 700 MB (91.5%)
 trace-agent: 31.36 MB / 100 MB (31.4%)
 system-probe: 266.26 MB / 300 MB (88.8%)
 process-agent: 48.00 MB / 100 MB (48.0%)
 TOTAL: 986.29 MB / 1200 MB (82.2%)
```

Instructions are present in `k8s/README.md` for changing lading's configuration and Datadog Agent's own configuration.

Signed-off-by: Brian L. Troutwine <brian.troutwine@datadoghq.com>
1 parent 268a16a commit 2fd183a

8 files changed

Lines changed: 797 additions & 0 deletions


k8s/README.md

Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
# Lading in k8s Demonstration

Testing setup to demonstrate memory limits for Datadog Agent under lading load.

The experiment is rigged up through `experiment.sh`. That script takes a memory
parameter for each configured Agent pod container, setting them as limits in
`manifests/datadog-agent.yaml`. The experiment runs for a given duration --
suggested, 300 seconds at a minimum -- and does two things:

* watches for container restarts during the experiment, signaling failure if one
  is detected, or
* runs to the full experiment duration and queries Prometheus to calculate the
  peak memory consumed by each Agent container, relative to the configured
  limits (see the query sketch after this list).
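
The peak calculation is driven by the query issued in `k8s/analyze_memory.py`
(shown later in this commit). A rough hand-run equivalent looks like the sketch
below; the `localhost:9090` port-forward, the 300 s window, and the
`<agent-pod>` name are illustrative assumptions, not something the script sets
up for you.

```bash
# Sketch only: query peak working-set memory for the `agent` container.
# Assumes Prometheus is port-forwarded to localhost:9090 and <agent-pod>
# is replaced with the actual Agent pod name.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=max_over_time(container_memory_working_set_bytes{namespace="default",pod="<agent-pod>",container="agent"}[300s])' \
  | jq '.data.result[0].value[1]'
```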

Experiments are **isolated from the internet** to avoid sending metrics et al. to
actual Datadog intake. See `manifests/deny-egress.yaml` for details.

## Prerequisites

- kind: `brew install kind`
- kubectl: `brew install kubectl`
- helm: `brew install helm`
- jq: `brew install jq`
- python3: system Python 3
- Docker running

## Usage

### Test a specific memory limit

```bash
# Test 2000 MB total for 5 minutes with explicit per-container limits
./k8s/experiment.sh --total-limit 2000 --agent-memory 1200 --trace-memory 400 --sysprobe-memory 300 --process-memory 100 --tags "purpose:test,limit:2000mb"
```

All memory flags are mandatory and must sum to `--total-limit`, which acts as a
consistency check.
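
A minimal sketch of that arithmetic, roughly what `experiment.sh` is expected
to verify (the flag values here are examples only):

```bash
# Per-container limits must add up to --total-limit.
# Example values; substitute whatever you pass on the command line.
agent=1200; trace=400; sysprobe=300; process=100; total=2000
if [ $((agent + trace + sysprobe + process)) -ne "${total}" ]; then
  echo "per-container limits do not sum to --total-limit" >&2
  exit 1
fi
```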

### To find a minimum memory limit

Run the script multiple times with different limits; a sketch of one such sweep
follows the list below. Possible results are:

- **OOMKilled** (FAILURE): the Agent needs more memory; the script exits
- **Stable** (SUCCESS): the Agent survived the test duration; the cluster is kept running for examination
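
For example, a simple downward sweep over candidate `agent` limits. The step
values, the fixed limits for the other containers, and the assumption that
`experiment.sh` exits non-zero on failure are all illustrative:

```bash
# Hypothetical sweep: step the agent limit down until the experiment fails.
# Assumes experiment.sh exits non-zero when a container is OOMKilled.
for agent_mb in 1200 1000 800 700 600; do
  total=$((agent_mb + 400 + 300 + 100))
  ./k8s/experiment.sh --total-limit "${total}" --agent-memory "${agent_mb}" \
    --trace-memory 400 --sysprobe-memory 300 --process-memory 100 \
    --tags "purpose:limit-search,agent-limit:${agent_mb}" || break
done
```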

## Manifests

All manifests are in the `manifests/` directory. The script uses template
substitution (sketched after this list) for:

- **manifests/datadog-agent.yaml**: DatadogAgent CRD for the Datadog Operator
  - Uses `{{ AGENT_MEMORY_MB }}`, `{{ TRACE_MEMORY_MB }}`, `{{ SYSPROBE_MEMORY_MB }}`,
    `{{ PROCESS_MEMORY_MB }}`, and `{{ DD_TAGS }}` placeholders
  - Configured for DogStatsD via Unix domain socket at `/var/run/datadog/dsd.socket`
  - Shares `/var/run/datadog` via hostPath with the lading pod

- **manifests/lading.yaml**: Lading load generator (lading 0.29.2)
  - ConfigMap with the exact config from the `uds_dogstatsd_to_api` test
  - Sends 100 MiB/s of DogStatsD metrics
  - High cardinality: 1k-10k contexts, many tags
  - Service with Prometheus scrape annotations for lading metrics

- **manifests/lading-intake.yaml**: Lading intake (blackhole) mimicking the Datadog
  API (lading 0.29.2)
  - Receives and discards agent output for self-contained testing

- **manifests/datadog-secret.yaml**: Placeholder secret (fake API key, not validated)
- **manifests/deny-egress.yaml**: NetworkPolicy blocking internet egress (security isolation)
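
The substitution itself is plain placeholder replacement. The actual mechanism
lives in `experiment.sh`, so the rendering step below is illustrative only (the
environment variable names are assumptions):

```bash
# Illustrative only: render the DatadogAgent manifest by replacing the
# {{ ... }} placeholders, then apply it. experiment.sh may do this differently.
sed -e "s/{{ AGENT_MEMORY_MB }}/${AGENT_MEMORY_MB}/g" \
    -e "s/{{ TRACE_MEMORY_MB }}/${TRACE_MEMORY_MB}/g" \
    -e "s/{{ SYSPROBE_MEMORY_MB }}/${SYSPROBE_MEMORY_MB}/g" \
    -e "s/{{ PROCESS_MEMORY_MB }}/${PROCESS_MEMORY_MB}/g" \
    -e "s/{{ DD_TAGS }}/${DD_TAGS}/g" \
    manifests/datadog-agent.yaml | kubectl apply -f -
```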

## Test configuration

Taken from
[`datadog-agent/test/regression/cases/uds_dogstatsd_to_api`](https://github.com/DataDog/datadog-agent/blob/main/test/regression/cases/uds_dogstatsd_to_api/lading/lading.yaml). This
experiment is **high stress** for metrics intake, and high memory use from the
`agent` container is expected.

Adjust lading's load generation configuration in the ConfigMap named
`lading-config` (see the example below). Adjust Agent configuration in
`manifests/datadog-agent.yaml`.
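
For quick iteration, one option is to edit the ConfigMap in place and recreate
the lading pod so it picks up the change. The `app=lading` label selector is an
assumption about `manifests/lading.yaml`; confirm it against the manifest:

```bash
# Edit the generator config, then restart lading so it reloads the ConfigMap.
# The app=lading selector is assumed; check manifests/lading.yaml for the real one.
kubectl edit configmap lading-config
kubectl delete pod -l app=lading
```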

## Cleanup

The cluster is left online after the script exits. A re-run of `experiment.sh`
will destroy the cluster. To clean it up manually:

```bash
kind delete cluster --name lading-test
```

## Notes

- **Agent version**: 7.72.1
- **Lading version**: 0.29.2
- **Agent features enabled**: APM (trace-agent), Log Collection, NPM/system-probe, DogStatsD, Prometheus scrape

k8s/analyze_memory.py

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
#!/usr/bin/env python3
import sys
import json
import urllib.request
import urllib.parse

def query_container(prom_url, pod, container, duration):
    query = f'max_over_time(container_memory_working_set_bytes{{namespace="default",pod="{pod}",container="{container}"}}[{duration}s])'
    params = {'query': query}
    url = f"{prom_url}?{urllib.parse.urlencode(params)}"

    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            data = json.loads(response.read().decode())

        if data['status'] == 'success' and data['data']['result']:
            value_bytes = float(data['data']['result'][0]['value'][1])
            return data, value_bytes
        return data, None
    except Exception as e:
        print(f"Error querying {container}: {e}", file=sys.stderr)
        return None, None

def main():
    if len(sys.argv) != 8:
        print("Usage: analyze_memory.py <prom_url> <pod> <duration> <agent_limit_mb> <trace_limit_mb> <sysprobe_limit_mb> <process_limit_mb>", file=sys.stderr)
        sys.exit(1)

    prom_url = sys.argv[1]
    pod = sys.argv[2]
    duration = sys.argv[3]
    agent_limit = int(sys.argv[4])
    trace_limit = int(sys.argv[5])
    sysprobe_limit = int(sys.argv[6])
    process_limit = int(sys.argv[7])
    total_limit = agent_limit + trace_limit + sysprobe_limit + process_limit

    containers = {
        'agent': agent_limit,
        'trace-agent': trace_limit,
        'system-probe': sysprobe_limit,
        'process-agent': process_limit
    }

    results = {}

    for container, limit_mb in containers.items():
        data, value_bytes = query_container(prom_url, pod, container, duration)

        if value_bytes is not None:
            value_mb = value_bytes / 1024 / 1024
            percent = (value_mb / limit_mb) * 100
            results[container] = (value_mb, limit_mb, percent)
            print(f" {container}: {value_mb:.2f} MB / {limit_mb} MB ({percent:.1f}%)")
        else:
            print(f" {container}: Could not retrieve metrics")
            results[container] = (0, limit_mb, 0)

    # Calculate total
    total_mb = sum(r[0] for r in results.values())
    total_percent = (total_mb / total_limit) * 100
    print(f" TOTAL: {total_mb:.2f} MB / {total_limit} MB ({total_percent:.1f}%)")

if __name__ == '__main__':
    main()
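
A hypothetical invocation, matching the usage string above; the Prometheus
query URL assumes a local port-forward and the pod name is a placeholder:

```bash
./k8s/analyze_memory.py http://localhost:9090/api/v1/query datadog-agent-xxxxx 600 700 100 300 100
```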
