---
kind:
  - Troubleshooting
products:
  - Alauda Container Platform
ProductsVersion:
  - 4.1.0,4.2.x
---

# Prometheus TSDB dump fails to collect under DiskPressure (>85% utilisation)

## Issue

Attempting to capture a Prometheus TSDB dump for a support case or RCA fails. Symptoms during collection:

- The dump request times out, or `tar` fails with `No space left on device`.
- The resulting `prometheus-db.tar.gz` is incomplete or unreadable.
- The Prometheus pod itself reports high I/O wait and may transition to `CrashLoopBackOff` if the underlying disk reaches 100%.
- The hosting node carries the kubelet-set `DiskPressure` condition and taint, and pods are being evicted from it.

## Root Cause

Prometheus' TSDB needs working disk headroom to produce a dump: index rebuilds, block compaction, and the `tar` of the working directory all consume ephemeral space. When the partition that holds `/prometheus` (or, more commonly, the node ephemeral filesystem the pod runs on) crosses 85–90% utilisation:

- The kubelet sets the `node.kubernetes.io/disk-pressure` taint and begins evicting heavy-I/O workloads to reclaim space, potentially including the very pod you are trying to read from.
- Copying historical TSDB blocks into `/tmp` for tarring can push the partition the rest of the way to 100%, leaving the runtime unable to return the data and Prometheus itself unable to write incoming samples. This is the operational-deadlock symptom.
- The kubelet's image garbage collector may also be losing the race against the workload to free space, particularly on high-density nodes.

Three consumers compete for the same finite headroom: the live TSDB, the diagnostic copy, and the kubelet's reclaim path. Until headroom is restored, none of them can make progress.

## Resolution

Work in two tracks: first free space so the existing TSDB has room to operate (steps 1–2), then collect the dump while the headroom lasts (step 3) and resize the underlying volume so the issue does not recur (step 4).

### 1. Free space: remove failed-dump leftovers

Earlier attempts may have left a partial archive in `/tmp` on the Prometheus pod or on the host. Remove it before retrying:

```bash
MON_NS=    # namespace that hosts the platform Prometheus (prometheus-k8s-0)
kubectl -n "$MON_NS" exec prometheus-k8s-0 -c prometheus -- \
  rm -f /tmp/prometheus-db.tar.gz
```

If the host's ephemeral filesystem is the bottleneck, free node-side scratch as well. The node debug pod mounts the host filesystem at `/host`, so run the cleanup through `chroot`:

```bash
kubectl debug node/ -it --profile=sysadmin --image= \
  -- chroot /host bash -c '
    df -h /var/lib/containers /var/log /tmp
    journalctl --vacuum-size=200M
    rm -rf /var/log/journal/*/system@*.journal~ || true
  '
```

### 2. Reduce Prometheus retention

A retention drop scales the on-disk TSDB footprint roughly linearly: halving retention eventually halves disk use once old blocks are pruned. Edit the monitoring config:

```bash
kubectl -n "$MON_NS" edit configmap cluster-monitoring-config
```

```yaml
data:
  config.yaml: |
    prometheusK8s:
      retention: 7d   # was 15d; keep it reduced until headroom recovers
```

The Prometheus pods roll and start pruning blocks older than the new horizon. The on-disk drop is gradual (one block-compaction window at a time), but disk I/O eases quickly because compaction now touches a smaller working set.
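Before moving on, it is worth confirming that the new horizon actually propagated and that headroom is recovering. The sketch below is not part of the documented procedure: it assumes an operator-managed Prometheus where retention is surfaced as the `--storage.tsdb.retention.time` argument on the `prometheus` container, and it reuses the pod and namespace names from the examples above.

```bash
# Sketch (assumes prometheus-operator-style container args): confirm the new retention flag is live.
kubectl -n "$MON_NS" get pod prometheus-k8s-0 \
  -o jsonpath='{.spec.containers[?(@.name=="prometheus")].args}' \
  | tr ', ' '\n' | grep 'storage.tsdb.retention'

# Re-check utilisation every few minutes; move to the dump step once Use% is back under ~80.
kubectl -n "$MON_NS" exec prometheus-k8s-0 -c prometheus -- df -h /prometheus
```

Both commands are read-only, so they are safe to repeat while waiting for compaction to catch up.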
### 3. Collect the dump immediately after headroom is reclaimed

After the partition returns below ~80% utilisation, take the dump promptly: once block pruning runs against the new retention, the data the support case needs may already be gone. The standard collection path is the platform's documented Prometheus-dump procedure; any mechanism that produces a `prometheus-db.tar.gz` from `/prometheus` works:

```bash
kubectl -n "$MON_NS" exec prometheus-k8s-0 -c prometheus -- \
  tar -C /prometheus -czf /tmp/prometheus-db.tar.gz .
kubectl -n "$MON_NS" cp prometheus-k8s-0:/tmp/prometheus-db.tar.gz \
  ./prometheus-db.tar.gz -c prometheus
```

### 4. Long term: expand the partition

If the cluster repeatedly hits this state, the underlying ephemeral filesystem (or the PVC backing `/prometheus`) is undersized. Expand the volume so the TSDB and the image cache both have headroom:

- For a PVC-backed Prometheus, grow the PVC's `spec.resources.requests.storage` (requires a StorageClass with `allowVolumeExpansion: true`); `volumeClaimTemplates` on an existing StatefulSet cannot be edited in place, so apply the new size through the component that manages the StatefulSet and let the resize take effect on rollout.
- For ephemeral node storage, grow the node's root partition (cloud-side disk resize, then host-level `growpart` plus `xfs_growfs` or `resize2fs`).

## Diagnostic Steps

1. Confirm the Prometheus pod's view of its own disk:

   ```bash
   kubectl -n "$MON_NS" exec prometheus-k8s-0 -c prometheus -- df -h /prometheus
   ```

   `Use%` above roughly 85 is where the kubelet's default thresholds flip `DiskPressure` on the hosting node when `/prometheus` sits on node-local storage.

2. Locate the node and confirm the taint:

   ```bash
   kubectl get pod prometheus-k8s-0 -n "$MON_NS" -o wide
   kubectl describe node | grep -E 'DiskPressure|Allocatable|Capacity' -A 5
   ```

3. Verify the TSDB's on-disk integrity; a block directory without `meta.json` cannot be loaded, and a dump that contains one may fail to restore even if the copy itself succeeds:

   ```bash
   kubectl -n "$MON_NS" exec prometheus-k8s-0 -c prometheus -- \
     bash -c 'cd /prometheus && for d in */; do
       case "$d" in wal/|chunks_head/) continue;; esac   # not blocks, never carry meta.json
       test -f "$d/meta.json" || echo "MISSING $d"; done'
   ```

4. Check the kubelet's image-GC log on the host to confirm it is not stuck. The debug pod sees the host journal only through the `/host` mount:

   ```bash
   kubectl debug node/ -it --profile=sysadmin --image= \
     -- chroot /host journalctl -u kubelet --since "30 minutes ago" \
     | grep -E 'Image garbage collection|wanted to free'
   ```

   `Image garbage collection failed` paired with `wanted to free X bytes` indicates the kubelet cannot reclaim; that is the operational-deadlock signature.
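If that signature shows up, one pragmatic way to break the deadlock is to reclaim image space by hand from a node debug session. The following is a sketch only: it assumes the node runs a CRI runtime with `crictl` installed on the host, and the node name and debug image are placeholders exactly as in the earlier examples.

```bash
# Sketch: manually prune unused images when kubelet image GC cannot keep up.
# Assumes crictl exists on the host and is configured for the local CRI runtime.
kubectl debug node/ -it --profile=sysadmin --image= \
  -- chroot /host sh -c '
    crictl images          # what is currently cached on the node
    crictl rmi --prune     # remove images not referenced by any running container
    df -h /var/lib/containers /var/log /tmp
  '
```

Afterwards, re-run the `df` check from step 1 of the Resolution; if utilisation is still above ~85%, the retention and volume-size tracks above remain the only durable fix.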