Pods can get stuck in creation or termination, and cancel can fail mid-step.
I propose we make cancel more reliable by adding a mechanism to the cancel script that force-deletes resources after some timeout (or always forces deletion). We also need to treat errors about missing resources as success, since partial installations and partial deletions can happen.
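
For the "missing resource counts as success" part, here is a minimal sketch of what the delete path could do, assuming the deletion helpers use client-go (the `deleteRole` helper and its signature are hypothetical, not the actual gke.go code):

```go
// Sketch only, not the actual gke.go code: treat "not found" during
// deletion as success so a re-run of cancel after a partial delete passes.
package cancel

import (
	"context"
	"log"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteRole is a hypothetical helper; the real delete path would apply the
// same IsNotFound check to every resource kind it removes.
func deleteRole(ctx context.Context, client kubernetes.Interface, ns, name string) error {
	err := client.RbacV1().Roles(ns).Delete(ctx, name, metav1.DeleteOptions{})
	if apierrors.IsNotFound(err) {
		log.Printf("role %s/%s already gone, treating as deleted", ns, name)
		return nil
	}
	return err
}
```

With that check in place everywhere, a failure like the missing `loadgen-scaler` Role in step (8) below would become a no-op instead of an error.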
This happened recently:

1. I started prombench on prometheus#15731 (textparse: Optimized protobuf parser with custom streaming unmarshal).
2. The job fails:

   ```
   21:52:25 provider.go:71: Request for 'applying deployment:prometheus-test-pr-15731' is in progress. Checking in 10s
   21:52:35 provider.go:71: Request for 'applying deployment:prometheus-test-pr-15731' is in progress. Checking in 10s
   21:52:35 gke.go:582: error while applying a resource err:error applying '/tmp/tmp.NOMpbH/test-infra/prombench/manifests/prombench/benchmark/3_prometheus-test-pr_deployment.yaml' err: Request for 'applying deployment:prometheus-test-pr-15731' hasn't completed after retrying 50 times
   make: *** [Makefile:88: resource_apply] Error 1
   ```
3. When I logged into GKE, the pod looked Pending (waiting for Prometheus to start) and the init container was green, yet there were no logs other than some npm warnings from the build.
4. Triggered prombench cancel.
5. Cancel fails:

   ```
   22:19:05 provider.go:71: Request for 'deleting namespace:prombench-15731' is in progress. Checking in 10s
   22:19:15 provider.go:71: Request for 'deleting namespace:prombench-15731' is in progress. Checking in 10s
   22:19:15 gke.go:590: error while deleting objects from a manifest file err:error deleting './manifests/prombench/benchmark/1a_namespace.yaml' err: Request for 'deleting namespace:prombench-15731' hasn't completed after retrying 100 times
   make: *** [Makefile:97: resource_delete] Error 1
   ```
6. Logged into GKE again and saw that everything for the benchmark was deleted except the namespace and the previously stuck Prometheus pod, which was now waiting for termination forever (at least 8h).
7. I force-deleted the pod from the GKE shell:

   ```
   gcloud container clusters get-credentials test-infra --zone europe-west3-a --project macro-mile-203600 && kubectl delete pod prometheus-test-pr-15731-78bdf7bf67-xf6sm --namespace prombench-15731 --grace-period=0 --force
   ```
8. Ran prombench cancel again and it failed again because the loadgen-scaler Role was already missing (expected after the partial delete):

   ```
   -f ./manifests/prombench/benchmark/1a_namespace.yaml
   08:38:22 gke.go:590: error while deleting objects from a manifest file err:error deleting './manifests/prombench/benchmark/1c_cluster-role-binding.yaml' err: resource delete failed - kind: Role, name: loadgen-scaler: roles.rbac.authorization.k8s.io "loadgen-scaler" not found
   make: *** [Makefile:97: resource_delete] Error 1
   ```
9. I had to manually delete the node pools (manually doing what cancel does).
Ideally, steps (5) and (8) would not fail, so that resetting the benchmark is easier and perhaps faster.
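
For the force-delete-after-timeout part, here is a rough sketch of a fallback that does programmatically what the manual kubectl command in step (7) does, again assuming client-go (the helper name, the 10s poll interval, and the timeout handling are illustrative, not the existing provider.go retry logic):

```go
// Sketch only: try a graceful delete, and if the pod is still around after
// the timeout, retry with grace period 0 (the programmatic equivalent of
// `kubectl delete pod ... --grace-period=0 --force`).
package cancel

import (
	"context"
	"log"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func deletePodWithForceFallback(ctx context.Context, c kubernetes.Interface, ns, name string, timeout time.Duration) error {
	if err := c.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{}); err != nil && !apierrors.IsNotFound(err) {
		return err
	}

	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if _, err := c.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{}); apierrors.IsNotFound(err) {
			return nil // graceful delete finished
		}
		time.Sleep(10 * time.Second)
	}

	// Pod is stuck terminating (as in step (6)); force it.
	zero := int64(0)
	log.Printf("pod %s/%s stuck terminating, force deleting", ns, name)
	err := c.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{GracePeriodSeconds: &zero})
	if apierrors.IsNotFound(err) {
		return nil
	}
	return err
}
```

Wiring something like this into the pod/namespace deletion path should let cancel recover from pods stuck in Terminating without the manual GKE shell steps.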