[multiple] Add OOMKill log #3776

Open
danpawlik wants to merge 1 commit into openstack-k8s-operators:main from danpawlik:add-oom-kill-info

Conversation

@danpawlik (Contributor) commented Mar 18, 2026

It might happen that a container gets OOMKilled.
We would like to be aware of that issue.

@danpawlik requested review from a team, arxcruz, averdagu and fmount on March 18, 2026 14:29
@danpawlik force-pushed the add-oom-kill-info branch 2 times, most recently from 2c71749 to fdad102 on March 18, 2026 14:31
@averdagu (Contributor) commented:

Looks good to me. Did we find any occurrence of this?

@danpawlik (Contributor, Author) commented:

@averdagu In some cases we see that tempest tests cannot pass, and we wonder what the root cause might be. Maybe it is an OOMKill, maybe a bad health check? Something to discover. We just want to be more aware of what's going on.

@averdagu (Contributor) commented:

/lgtm

evallesp previously approved these changes Mar 18, 2026
openshift-ci bot (Contributor) commented Mar 18, 2026

New changes are detected. LGTM label has been removed.

@fmount (Contributor) left a comment:

I'm not against the idea of being aware of any OOMKill that might happen in the cluster, but I think just getting the events is not enough to have a clear idea of the nodes' status.
Instead, consider storing the output of oc describe nodes, which contains more data about Conditions and Capacity and can drive the resolution of the problem.
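
A minimal sketch of that suggestion in shell; the OUT_DIR variable and the log file name are assumptions for illustration, not part of the PR:

# Capture node Conditions and Capacity next to the other gathered logs
# (OUT_DIR and the file name are hypothetical)
oc describe nodes > "${OUT_DIR:-.}/describe_nodes.log"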

- name: Check if there were some OOMKill
  ansible.builtin.shell: >
    oc get events
    --all-namespaces
@fmount (Contributor) commented on the snippet above:

--all-namespaces in a large environment might take a bit, and I'm not sure you need to check which Pod was OOMKilled across the whole cluster.
Did you try in a DC or DZ environment?
Perhaps you can consider running the regular OpenShift must-gather along with the OpenStack one if you're looking for this kind of data. [1]

[1] e.g.

oc adm must-gather \
   --image-stream=openshift/must-gather \
   --image=quay.io/openstack-k8s-operators/openstack-must-gather:latest

If you get OOMKills all over the cluster, I expect must-gather will not work and you've probably lost access to the OCP API at that point.

@danpawlik (Contributor, Author) replied:

How would I be able to quickly get the information that there was an OOMKill, without downloading 1 GB of logs and grepping through them? Does must-gather provide a dedicated file with the OOMKill log?
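
For illustration, one rough way to answer that on an already-downloaded archive; the ./must-gather/ extraction path is an assumption, and the exact file layout depends on the must-gather image used:

# List gathered files that mention an OOM kill, case-insensitively (path is hypothetical)
grep -ril "oomkill" ./must-gather/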

@danpawlik (Contributor, Author) commented Mar 20, 2026

@fmount I can add the --field-selector type=Warning - why not. I see it is suggested by AI, whereas Stack Overflow has just a simpler way :)

> You can get the JSON and use jq to filter the "Reason" you're interested in, but to me that kind of post-processing should happen later, as an additional task, on the gathered file.

Still, I need to download a file and use jq or some other tool to grep, instead of taking a look at a dedicated file. That way we can check whether OOMKills happen often - something to decide in the future.

> Instead, consider storing the output of oc describe nodes, which contains more data about Conditions and Capacity and can drive the resolution of the problem.

So far we just want to see OOMKills. We can add other info in another PR if needed.
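
A minimal sketch of the --field-selector variant discussed above, restricted to Warning events; this is an illustration, not the exact task from the PR:

# Only Warning-type events across all namespaces
oc get events --all-namespaces --field-selector type=Warning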

@fmount (Contributor) commented Mar 20, 2026

> @fmount I can add the --field-selector type=Warning - why not. I see it is suggested by AI, whereas Stack Overflow has just a simpler way :)

What? I don't think the point is AI vs Stack Overflow vs anything else. The point is that sorting by time doesn't make sense to me. So please stop with that.

> You can get the JSON and use jq to filter the "Reason" you're interested in, but to me that kind of post-processing should happen later, as an additional task, on the gathered file.

> Still, I need to download a file and use jq or some other tool to grep, instead of taking a look at a dedicated file. That way we can check whether OOMKills happen often - something to decide in the future.

-o json | jq is what I meant.

> Instead, consider storing the output of oc describe nodes, which contains more data about Conditions and Capacity and can drive the resolution of the problem.

> So far we just want to see OOMKills. We can add other info in another PR if needed.

Still, I don't think that a query to all namespaces is the right way. If your etcd, or apiserver, or any other Kubernetes infra pod gets OOMKilled, these tasks are useless.

Feel free to move forward w/ this, but I'm -1 on the idea, whether you believe I'm an AI or not.
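
A minimal sketch of the -o json | jq filtering described above; matching the event reason against "OOM" (e.g. OOMKilling, SystemOOM) is an assumption, not something confirmed in this thread:

# Print timestamp, object and message for events whose reason mentions OOM
oc get events --all-namespaces -o json \
  | jq -r '.items[]
           | select((.reason // "") | test("OOM"))
           | "\(.lastTimestamp) \(.involvedObject.namespace // "-")/\(.involvedObject.name): \(.message)"'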

@danpawlik (Contributor, Author) replied:

So you just want to collect OOMKills from the openstack namespace instead of all namespaces, so I need to create a new task in another place where I will collect this info. Makes sense.
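
A sketch of what the namespace-scoped variant could look like, combining the earlier snippets; the output file name and the "OOM" reason match are assumptions about the eventual change, not the PR's final content:

# Warning events from the openstack namespace only, filtered to OOM-related reasons,
# stored in a dedicated file (file name and reason match are assumptions)
oc get events -n openstack --field-selector type=Warning -o json \
  | jq -r '.items[] | select((.reason // "") | test("OOM")) | .message' \
  > oom_kill_events.log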

@danpawlik changed the title from "[os_must_gather] Add OOMKill log" to "[multiple] Add new role for collecting OpenShift cluster info" on Mar 20, 2026
@danpawlik (Contributor, Author) commented Mar 20, 2026

We have an agreement with @fmount that collecting cluster logs/state info should live somewhere other than the os_must_gather role, or at least in a separate task file.

@danpawlik force-pushed the add-oom-kill-info branch 2 times, most recently from 49639e1 to 6c1e8e2 on March 20, 2026 11:35
openshift-ci bot (Contributor) commented Mar 20, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign evallesp for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@danpawlik changed the title from "[multiple] Add new role for collecting OpenShift cluster info" to "[os_must_gather] Add OOMKill log" on Mar 23, 2026
It might happen that a container gets OOMKilled.
We would like to be aware of that issue.

Signed-off-by: Daniel Pawlik <dpawlik@redhat.com>
@danpawlik changed the title from "[os_must_gather] Add OOMKill log" to "[multiple] Add OOMKill log" on Mar 23, 2026