Pods can get stuck in creation or termination, and cancel can fail mid-step.
I propose we make cancel more reliable by adding a mechanism to the cancel script that force-deletes resources after some timeout (or always forces deletion). We also need to treat errors about missing resources as success, since partial installations and partial deletions can happen.
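
For the "missing resource counts as success" part, here is a minimal sketch of what the delete path could do, assuming the deletion helpers use client-go (the `deleteRole` helper and its signature are hypothetical, not the actual gke.go code):

```go
// Sketch only, not the actual gke.go code: treat "not found" during
// deletion as success so a re-run of cancel after a partial delete passes.
package cancel

import (
	"context"
	"log"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteRole is a hypothetical helper; the real delete path would apply the
// same IsNotFound check to every resource kind it removes.
func deleteRole(ctx context.Context, client kubernetes.Interface, ns, name string) error {
	err := client.RbacV1().Roles(ns).Delete(ctx, name, metav1.DeleteOptions{})
	if apierrors.IsNotFound(err) {
		log.Printf("role %s/%s already gone, treating as deleted", ns, name)
		return nil
	}
	return err
}
```

With that check in place everywhere, a failure like the missing `loadgen-scaler` Role in step (8) below would become a no-op instead of an error.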
This happened recently:

1. I started prombench on prometheus#15731 (textparse: Optimized protobuf parser with custom streaming unmarshal).
2. The job fails:

   ```
   21:52:25 provider.go:71: Request for 'applying deployment:prometheus-test-pr-15731' is in progress. Checking in 10s
   21:52:35 provider.go:71: Request for 'applying deployment:prometheus-test-pr-15731' is in progress. Checking in 10s
   21:52:35 gke.go:582: error while applying a resource err:error applying '/tmp/tmp.NOMpbH/test-infra/prombench/manifests/prombench/benchmark/3_prometheus-test-pr_deployment.yaml' err: Request for 'applying deployment:prometheus-test-pr-15731' hasn't completed after retrying 50 times
   make: *** [Makefile:88: resource_apply] Error 1
   ```
3. When I logged into GKE, the pod looked Pending (waiting for Prometheus to start) and the init container was green, yet there were no logs other than some npm warnings from the build.
4. Triggered prombench cancel.
5. Cancel fails:

   ```
   22:19:05 provider.go:71: Request for 'deleting namespace:prombench-15731' is in progress. Checking in 10s
   22:19:15 provider.go:71: Request for 'deleting namespace:prombench-15731' is in progress. Checking in 10s
   22:19:15 gke.go:590: error while deleting objects from a manifest file err:error deleting './manifests/prombench/benchmark/1a_namespace.yaml' err: Request for 'deleting namespace:prombench-15731' hasn't completed after retrying 100 times
   make: *** [Makefile:97: resource_delete] Error 1
   ```
6. Logged into GKE again and saw that everything for the benchmark was deleted except the namespace and the previously stuck Prometheus pod, which was now waiting for termination forever (at least 8h).
7. I force-deleted the pod from the GKE shell:

   ```
   gcloud container clusters get-credentials test-infra --zone europe-west3-a --project macro-mile-203600 && kubectl delete pod prometheus-test-pr-15731-78bdf7bf67-xf6sm --namespace prombench-15731 --grace-period=0 --force
   ```
8. Ran prombench cancel again and it failed again because the loadgen-scaler Role was already missing (expected after the partial delete):

   ```
   -f ./manifests/prombench/benchmark/1a_namespace.yaml
   08:38:22 gke.go:590: error while deleting objects from a manifest file err:error deleting './manifests/prombench/benchmark/1c_cluster-role-binding.yaml' err: resource delete failed - kind: Role, name: loadgen-scaler: roles.rbac.authorization.k8s.io "loadgen-scaler" not found
   make: *** [Makefile:97: resource_delete] Error 1
   ```
9. I had to manually delete the node pools (manually doing what cancel does).
Ideally, steps (5) and (8) would not fail, so that resetting the benchmark is easier and perhaps faster.
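
For the force-delete-after-timeout part, here is a rough sketch of a fallback that does programmatically what the manual kubectl command in step (7) does, again assuming client-go (the helper name, the 10s poll interval, and the timeout handling are illustrative, not the existing provider.go retry logic):

```go
// Sketch only: try a graceful delete, and if the pod is still around after
// the timeout, retry with grace period 0 (the programmatic equivalent of
// `kubectl delete pod ... --grace-period=0 --force`).
package cancel

import (
	"context"
	"log"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func deletePodWithForceFallback(ctx context.Context, c kubernetes.Interface, ns, name string, timeout time.Duration) error {
	if err := c.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{}); err != nil && !apierrors.IsNotFound(err) {
		return err
	}

	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if _, err := c.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{}); apierrors.IsNotFound(err) {
			return nil // graceful delete finished
		}
		time.Sleep(10 * time.Second)
	}

	// Pod is stuck terminating (as in step (6)); force it.
	zero := int64(0)
	log.Printf("pod %s/%s stuck terminating, force deleting", ns, name)
	err := c.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{GracePeriodSeconds: &zero})
	if apierrors.IsNotFound(err) {
		return nil
	}
	return err
}
```

Wiring something like this into the pod/namespace deletion path should let cancel recover from pods stuck in Terminating without the manual GKE shell steps.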