228 changes: 228 additions & 0 deletions .claude/commands/tnf-power.md
@@ -0,0 +1,228 @@
---
description: Show TNF cluster power consumption from Kepler metrics
---

You are generating a power consumption report for a TNF (Two Nodes with Fencing) cluster using Kepler metrics.

## Step 0: Setup Cluster Access

**IMPORTANT**: Before running any `oc` commands, you MUST ensure KUBECONFIG is set.

First, check if cluster access works:
```bash
oc get nodes 2>&1 | head -3
```

If you get an error about missing config, look for and source the proxy.env file:

```bash
# Find proxy.env in the repository
PROXY_ENV=$(find . -name "proxy.env" -type f 2>/dev/null | head -1)
if [ -n "$PROXY_ENV" ]; then
echo "Found: $PROXY_ENV"
source "$PROXY_ENV"
echo "KUBECONFIG=$KUBECONFIG"
fi
```

If proxy.env doesn't exist, check common locations:
- `deploy/openshift-clusters/proxy.env`
- Look for KUBECONFIG in the dev-scripts directory

Only proceed to Step 1 once `oc get nodes` succeeds.

## Prerequisites

Before running queries, verify:
1. Kepler is deployed: `oc get pods -n kepler`
2. User workload monitoring is running: `oc get pods -n openshift-user-workload-monitoring`
3. KUBECONFIG is set (handled in Step 0)

## Query Steps

### Step 1: Check Kepler Status

First, verify Kepler is running and metrics are being scraped:

```bash
# Check Kepler pods
oc get pods -n kepler -l app.kubernetes.io/name=kepler-exporter

# Check if metrics are being scraped (should return 2 targets for TNF)
oc exec -n openshift-user-workload-monitoring prometheus-user-workload-0 -c prometheus -- \
curl -s http://localhost:9090/api/v1/targets 2>/dev/null | \
jq '[.data.activeTargets[] | select(.labels.job == "kepler-exporter")] | length'
```

### Step 2: Check Power Measurement Mode

Determine if we're getting real or estimated power:

```bash
# Check if RAPL is available (real power) or not (estimated)
POD=$(oc get pods -n kepler -l app.kubernetes.io/name=kepler-exporter -o jsonpath='{.items[0].metadata.name}')
RAPL_CHECK=$(oc exec -n kepler "$POD" -- ls /sys/class/powercap/intel-rapl 2>/dev/null || echo "NOT_FOUND")
if [ -z "$RAPL_CHECK" ] || [ "$RAPL_CHECK" = "NOT_FOUND" ]; then
  echo "Mode: ESTIMATED (VMs - no RAPL hardware access)"
else
  echo "Mode: REAL (Bare metal - RAPL available)"
fi
```

### Step 3: Query Power Metrics

Run these queries against the user workload Prometheus.

**Important label notes for Kepler v0.11.x:**
- Container metrics (`kepler_container_cpu_watts`) have `namespace: kepler` (the exporter's own namespace), NOT the workload namespace
- The `container_name` label holds the pod/process name discovered by Kepler
- The `instance` label identifies which node reported the metric (e.g., `192.168.111.20:9188`)
- To find control plane components, use `container_name` regex matching against known pod prefixes

```bash
# Total cluster power (watts) - sum of node CPU power
oc exec -n openshift-user-workload-monitoring prometheus-user-workload-0 -c prometheus -- \
curl -s 'http://localhost:9090/api/v1/query?query=sum(kepler_node_cpu_watts)' 2>/dev/null | \
jq -r '.data.result[0].value[1] // "0"'

# Power by node (using node CPU watts)
oc exec -n openshift-user-workload-monitoring prometheus-user-workload-0 -c prometheus -- \
curl -s 'http://localhost:9090/api/v1/query?query=sum%20by%20(instance)(kepler_node_cpu_watts)' 2>/dev/null | \
jq -r '.data.result[] | "\(.metric.instance): \(.value[1])W"'

# Top 10 containers by power (container_name = pod/process name, instance = node)
oc exec -n openshift-user-workload-monitoring prometheus-user-workload-0 -c prometheus -- \
curl -s 'http://localhost:9090/api/v1/query?query=topk(10,sum%20by%20(container_name,%20instance)(kepler_container_cpu_watts))' 2>/dev/null | \
jq -r '.data.result[] | "\(.metric.container_name) [\(.metric.instance)]: \(.value[1])W"'

# Control plane component power (matched by known pod name prefixes)
# etcd
oc exec -n openshift-user-workload-monitoring prometheus-user-workload-0 -c prometheus -- \
curl -s 'http://localhost:9090/api/v1/query?query=sum(kepler_container_cpu_watts%7Bcontainer_name%3D~%22etcd.*%22%7D)' 2>/dev/null | \
jq -r '"etcd: \(.data.result[0].value[1] // "0")W"'

# kube-apiserver
oc exec -n openshift-user-workload-monitoring prometheus-user-workload-0 -c prometheus -- \
curl -s 'http://localhost:9090/api/v1/query?query=sum(kepler_container_cpu_watts%7Bcontainer_name%3D~%22kube-apiserver.*%22%7D)' 2>/dev/null | \
jq -r '"kube-apiserver: \(.data.result[0].value[1] // "0")W"'

# kube-controller-manager
oc exec -n openshift-user-workload-monitoring prometheus-user-workload-0 -c prometheus -- \
curl -s 'http://localhost:9090/api/v1/query?query=sum(kepler_container_cpu_watts%7Bcontainer_name%3D~%22kube-controller-manager.*%22%7D)' 2>/dev/null | \
jq -r '"kube-controller-manager: \(.data.result[0].value[1] // "0")W"'

# kube-scheduler
oc exec -n openshift-user-workload-monitoring prometheus-user-workload-0 -c prometheus -- \
curl -s 'http://localhost:9090/api/v1/query?query=sum(kepler_container_cpu_watts%7Bcontainer_name%3D~%22kube-scheduler.*%22%7D)' 2>/dev/null | \
jq -r '"kube-scheduler: \(.data.result[0].value[1] // "0")W"'

# Power over time using joules (rate gives watts)
oc exec -n openshift-user-workload-monitoring prometheus-user-workload-0 -c prometheus -- \
curl -s 'http://localhost:9090/api/v1/query?query=sum(rate(kepler_node_cpu_joules_total%5B5m%5D))' 2>/dev/null | \
jq -r '.data.result[0].value[1] // "0"'
```
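If you want to sanity-check the `jq` extraction before touching a cluster, the label shape described in the notes above can be exercised against a canned response. The pod names, IPs, and values below are fabricated for illustration:

```shell
# Fabricated Prometheus response illustrating Kepler v0.11.x label shape:
# container metrics carry container_name + instance; namespace is always "kepler".
cat > /tmp/kepler-sample.json <<'EOF'
{"data":{"result":[
  {"metric":{"container_name":"etcd-master-0","instance":"192.168.111.20:9188","namespace":"kepler"},"value":[1700000000,"3.2"]},
  {"metric":{"container_name":"kube-apiserver-master-1","instance":"192.168.111.21:9188","namespace":"kepler"},"value":[1700000000,"5.7"]}
]}}
EOF

# Same jq expression as the "top containers" query above
jq -r '.data.result[] | "\(.metric.container_name) [\(.metric.instance)]: \(.value[1])W"' \
  /tmp/kepler-sample.json
# prints:
#   etcd-master-0 [192.168.111.20:9188]: 3.2W
#   kube-apiserver-master-1 [192.168.111.21:9188]: 5.7W
```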

**Note on metrics**:
- `kepler_node_cpu_watts` - Instantaneous CPU power per node (labels: `instance`, `zone`)
- `kepler_container_cpu_watts` - Instantaneous CPU power per container (labels: `container_name`, `instance`)
- `kepler_node_cpu_joules_total` - Cumulative energy (use `rate()` for watts)
- All container metrics report under `namespace: kepler` — use `container_name` regex to identify workloads

In **estimation mode** (VMs without RAPL), values will be very small (microwatts to milliwatts) because they're based on CPU activity models, not real power measurements. On **bare metal with RAPL**, expect realistic values (tens to hundreds of watts).
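The joules-to-watts conversion that `rate()` performs can be reproduced by hand, which is a useful sanity check when comparing the `_watts` gauges against the `_joules_total` counter. The two sample values below are invented:

```shell
# Two cumulative energy samples (joules) taken 300 s apart, as
# kepler_node_cpu_joules_total would report them (values invented).
j_start=12500
j_end=18500
interval=300  # seconds, matching the [5m] rate window

# rate() over the window = (j_end - j_start) / interval, i.e. joules/second = watts
watts=$(awk -v a="$j_start" -v b="$j_end" -v t="$interval" \
  'BEGIN { printf "%.1f", (b - a) / t }')
echo "average power: ${watts} W"
# prints: average power: 20.0 W
```

Note that no extra scaling factor is needed: `rate()` is already per-second, so its output is directly in watts.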

### Step 4: Get Kepler Build Info

```bash
# Kepler version and configuration
oc exec -n openshift-user-workload-monitoring prometheus-user-workload-0 -c prometheus -- \
curl -s 'http://localhost:9090/api/v1/query?query=kepler_build_info' | \
jq -r '.data.result[0].metric | "Version: \(.version), Branch: \(.branch)"'
```

## Output Format

Present the results in this format:

```
## TNF Cluster Power Report

**Measurement Mode**: [REAL/ESTIMATED]
**Kepler Version**: [version]
**Report Time**: [current time]

### Cluster Summary
| Metric | Value |
|--------|-------|
| Total Power | XX.X W |
| Nodes Monitored | 2 |

### Power by Node
| Node | Power (W) |
|------|-----------|
| master-0 (192.168.111.20) | XX.X |
| master-1 (192.168.111.21) | XX.X |

### TNF Control Plane Overhead
| Component | Power (W) |
|-----------|-----------|
| etcd | XX.X |
| kube-apiserver | XX.X |
| kube-controller-manager | XX.X |
| kube-scheduler | XX.X |

### Top Containers by Power
| Container (Pod) | Node | Power (W) |
|-----------------|------|-----------|
| container-1 | 192.168.111.20 | XX.X |
| container-2 | 192.168.111.21 | XX.X |
| ... | ... | ... |

---
*Note: [If ESTIMATED mode] Power values are ML-based estimates, not hardware measurements.
On production bare-metal TNF clusters, real RAPL measurements are used.*
```

## Error Handling

If Kepler is not deployed:
```
Kepler is not deployed on this cluster.

To deploy Kepler power monitoring:
cd deploy/openshift-clusters
ansible-playbook kepler.yml -i inventory.ini

Or using make:
cd deploy && make deploy-kepler
```

If metrics are not available:
```
Kepler pods are running but no metrics found in Prometheus.

Check:
1. ServiceMonitor exists: oc get servicemonitor -n kepler
2. User workload monitoring is enabled
3. Wait a few minutes for metrics to be scraped

Troubleshooting:
oc logs -n kepler -l app.kubernetes.io/name=kepler-exporter --tail=50
```

## Grafana Dashboard Link

After showing the report, remind the user about the Grafana dashboard:

```
For detailed visualizations, access the Grafana dashboard:

Via port-forward (recommended for dev environments):
oc port-forward -n grafana svc/grafana 3000:3000
Open: http://localhost:3000

Via route (if exposed):
URL: https://grafana-grafana.apps.<cluster-domain>

Credentials: admin / admin
Dashboard: TNF Power Monitoring
```
12 changes: 11 additions & 1 deletion deploy/Makefile
@@ -81,6 +81,12 @@ patch-nodes:
get-tnf-logs:
@./openshift-clusters/scripts/get-tnf-logs.sh

deploy-kepler:
@./openshift-clusters/scripts/deploy-kepler.sh

remove-kepler:
@./openshift-clusters/scripts/remove-kepler.sh

help:
@echo "Available commands:"
@echo ""
@@ -98,7 +104,7 @@ help:
@echo "Instance Utils:"
@echo " ssh - SSH into the EC2 instance"
@echo " info - Display instance information"
@echo " inventory - Update inventory.ini with current instance IP"
@echo ""
@echo "OpenShift Cluster Deployment:"
@echo " fencing-ipi - Deploy fencing IPI cluster (non-interactive)"
@@ -118,4 +124,8 @@
@echo ""
@echo "Cluster Utilities:"
@echo " get-tnf-logs - Collect pacemaker and etcd logs from cluster nodes"
@echo ""
@echo "Power Monitoring:"
@echo " deploy-kepler - Deploy Kepler power monitoring (v0.11.3)"
@echo " remove-kepler - Remove Kepler power monitoring from cluster"

105 changes: 105 additions & 0 deletions deploy/openshift-clusters/kepler.yml
@@ -0,0 +1,105 @@
---
# Kepler Power Monitoring Deployment for TNF Clusters
#
# This playbook deploys Kepler power monitoring with Grafana dashboards
# on TNF (Two Nodes with Fencing) clusters.
#
# Usage:
# ansible-playbook kepler.yml -i inventory.ini
#
# To remove Kepler:
# ansible-playbook kepler.yml -i inventory.ini -e kepler_state=absent
#
# Options:
# -e grafana_enabled=false Skip Grafana deployment
# -e kepler_state=absent Remove Kepler from cluster

- name: Deploy Kepler Power Monitoring on TNF Cluster
  hosts: localhost
  connection: local
  gather_facts: no

  vars:
    # Load proxy configuration if proxy.env exists
    proxy_env_file: "./proxy.env"

  pre_tasks:
    - name: Check if proxy.env file exists
      ansible.builtin.stat:
        path: "{{ proxy_env_file }}"
      register: proxy_env_stat

    - name: Source proxy.env and extract environment variables
      # Use POSIX '.' rather than 'source' so this works when the
      # shell module runs under /bin/sh as well as bash.
      ansible.builtin.shell: |
        . {{ proxy_env_file }} && env | grep -E '^(KUBECONFIG|HTTP_PROXY|HTTPS_PROXY|NO_PROXY)='
      register: proxy_env_vars
      when: proxy_env_stat.stat.exists
      failed_when: false
      changed_when: false

    - name: Parse environment variables from proxy.env
      # Split each KEY=VALUE line on the first '=' only, so values
      # containing '=' (e.g. proxy URLs with query strings) survive.
      ansible.builtin.set_fact:
        proxy_vars: "{{ proxy_vars | default({}) | combine({item.split('=')[0]: item.split('=')[1:] | join('=')}) }}"
      loop: "{{ proxy_env_vars.stdout_lines | default([]) }}"
      when:
        - proxy_env_stat.stat.exists
        - proxy_env_vars.stdout_lines is defined

    - name: Set proxy variables for role
      ansible.builtin.set_fact:
        proxy_kubeconfig: "{{ proxy_vars.KUBECONFIG | default(lookup('env', 'KUBECONFIG')) }}"
        proxy_http_proxy: "{{ proxy_vars.HTTP_PROXY | default('') }}"
        proxy_https_proxy: "{{ proxy_vars.HTTPS_PROXY | default('') }}"
        proxy_no_proxy: "{{ proxy_vars.NO_PROXY | default('') }}"
        proxy_k8s_auth_proxy: "{{ proxy_vars.HTTP_PROXY | default('') }}"
      when: proxy_env_stat.stat.exists

    - name: Use environment KUBECONFIG if no proxy.env
      ansible.builtin.set_fact:
        proxy_kubeconfig: "{{ lookup('env', 'KUBECONFIG') }}"
      when: not proxy_env_stat.stat.exists

    - name: Verify cluster access
      ansible.builtin.shell: |
        oc get namespace default -o name
      environment:
        KUBECONFIG: "{{ proxy_kubeconfig }}"
        HTTP_PROXY: "{{ proxy_http_proxy | default('') }}"
        HTTPS_PROXY: "{{ proxy_https_proxy | default('') }}"
        NO_PROXY: "{{ proxy_no_proxy | default('') }}"
      register: cluster_check
      changed_when: false
      failed_when: cluster_check.rc != 0

    - name: Display deployment information
      ansible.builtin.debug:
        msg: |
          Kepler Power Monitoring Deployment
          -----------------------------------
          State: {{ kepler_state | default('present') }}
          Grafana: {{ grafana_enabled | default(true) }}
          KUBECONFIG: {{ proxy_kubeconfig | default('not set') }}

  roles:
    - role: kepler

  post_tasks:
    - name: Deployment summary
      ansible.builtin.debug:
        msg: |
          {% if kepler_state | default('present') == 'present' %}
          Kepler Power Monitoring deployed successfully!

          Next steps:
            1. Wait a few minutes for metrics to be collected
            2. Access Grafana dashboard (if enabled):
               oc get route grafana-route -n grafana -o jsonpath='{.spec.host}'
            3. Query metrics in OpenShift Console:
               Observe -> Metrics -> kepler_node_cpu_joules_total

          Useful PromQL queries:
            - Total power: sum(rate(kepler_node_cpu_joules_total[5m]))
            - Power by node: sum by (instance) (rate(kepler_node_cpu_joules_total[5m]))
          {% else %}
          Kepler Power Monitoring has been removed from the cluster.
          {% endif %}
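The env-parsing trick used in the playbook's pre_tasks (split each line on the first `=` only, so proxy URLs that themselves contain `=` survive intact) can be checked in isolation with plain POSIX parameter expansion. The sample line below is invented:

```shell
# Invented KEY=VALUE line whose value contains additional '=' characters
line='HTTPS_PROXY=http://user:pass@proxy:3128/?a=b'

key=${line%%=*}    # everything before the FIRST '='
value=${line#*=}   # everything after the FIRST '='

echo "$key -> $value"
# prints: HTTPS_PROXY -> http://user:pass@proxy:3128/?a=b
```

A naive split on every `=` would have truncated the value at `?a`, which is exactly what the `item.split('=')[1:] | join('=')` expression in the playbook guards against.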
26 changes: 26 additions & 0 deletions deploy/openshift-clusters/roles/kepler/defaults/main.yml
@@ -0,0 +1,26 @@
---
# Default variables for the kepler role
# These variables can be overridden when calling the role

# Kepler configuration
kepler_namespace: "kepler"
kepler_image: "quay.io/sustainable_computing_io/kepler:v0.11.3"
kepler_port: 9188

# Grafana configuration
grafana_enabled: true
grafana_namespace: "grafana"
grafana_image: "docker.io/grafana/grafana:10.4.1"

# User workload monitoring (required for ServiceMonitor)
enable_user_workload_monitoring: true

# Scrape interval for Kepler metrics
kepler_scrape_interval: "30s"

# State for idempotent operations (present/absent)
kepler_state: "present"

# Wait timeouts (in retries, each retry is 10 seconds)
operator_ready_retries: 30
daemonset_ready_retries: 30
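
The retry defaults above translate into a maximum wait per resource of retries × 10 s, which is worth knowing when tuning them for slow clusters:

```shell
# Defaults from the role: 30 retries, 10 s between attempts
retries=30
delay=10

total=$((retries * delay))
echo "max wait: ${total}s ($((total / 60)) min)"
# prints: max wait: 300s (5 min)
```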