[multiple] Add OOMKill log #3776
danpawlik wants to merge 1 commit into openstack-k8s-operators:main
2c71749 to fdad102
Looks good to me. Did we find any occurrence of this?
@averdagu in some cases we see that tempest tests cannot pass, and we are wondering what the root cause could be. Maybe it is an OOMKill, maybe a bad health check? Something to discover. We just want to be more aware of what's going on.
/lgtm
fdad102 to 5436872
New changes are detected. LGTM label has been removed.
5436872 to afa5997
fmount left a comment
I'm not against the idea of being aware of any OOMKill that might happen in the cluster, but I think just getting the events is not enough to have a clear idea of the nodes' status.
Instead, consider storing the output of oc describe nodes, which contains more data about Conditions and Capacity and can help drive toward a resolution of the problem.
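A minimal shell sketch of that suggestion, assuming oc is already logged in to the cluster. The function name save_node_status and the default file name are hypothetical and are not part of this PR:

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): store the output of
# `oc describe nodes` in a file so it can be inspected later.
save_node_status() {
    out_file="${1:-node-status.txt}"   # assumed default file name
    oc describe nodes > "$out_file"
}
```

Calling save_node_status /tmp/nodes.txt would leave the node description in /tmp/nodes.txt for later review.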
roles/os_must_gather/tasks/main.yml
Outdated
- name: Check if there were some OOMKill
  ansible.builtin.shell: >
    oc get events
    --all-namespaces
--all-namespaces in a large environment might take a while, and I'm not sure you need to check which Pod is OOMKilled all over the cluster.
Did you try in a DC or DZ environment?
Perhaps you can consider running the regular openshift must gather along with the openstack one if you're looking for this kind of data. [1]
[1] e.g.
oc adm must-gather \
  --image-stream=openshift/must-gather \
  --image=quay.io/openstack-k8s-operators/openstack-must-gather:latest
If you get OOMKill all over the cluster, I expect must-gather will not work, and you've probably lost access to the OCP API at that point.
How would I quickly find out that there was an OOMKill without downloading 1GB of logs and grepping through them? Does must-gather provide a dedicated file with the OOMKill log?
@fmount I can add the still, I need to download a file, use So far, we just want to see OOMKill. We can add other info in another PR if needed.
afa5997 to 99a1fce
What? I don't think the point is AI vs stackoverflow vs #anythingelse. The point is that sorting by time doesn't make sense to me. So please stop with that.
Still, I don't think that a query to all-namespaces is the right way. If your etcd, or apiserver, or any other kubernetes infra pod gets oomkilled, these tasks are useless. Feel free to move forward w/ this, but I'm -1 with the idea, believe I'm an AI or not.
So you just want to collect OOMKills from the openstack namespace instead of all namespaces, so I need to create a new task in another place where I will collect that info. Makes sense.
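One way to scope the check to a single namespace is to read the container statuses rather than cluster-wide events. This is only a sketch under stated assumptions: the function name list_oomkilled and the default namespace openstack are hypothetical, and it relies on the OOMKilled termination reason that Kubernetes records in a container's last state.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): print the pods in one namespace
# that have a container whose last termination reason was OOMKilled.
list_oomkilled() {
    ns="${1:-openstack}"   # assumed default namespace
    # The jsonpath template emits one line per pod:
    #   <pod name> TAB <container>=<last termination reason> ...
    # Containers that never terminated print an empty reason.
    oc get pods -n "$ns" -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .status.containerStatuses[*]}{.name}{"="}{.lastState.terminated.reason}{" "}{end}{"\n"}{end}' \
        | grep OOMKilled
}
```

Note that grep exits non-zero when nothing matches, so a caller such as an Ansible task would need to treat that exit status as "no OOMKill found" rather than as a failure.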
99a1fce to 9010da0
We have an agreement with @fmount that collecting cluster logs/state info should be in another place than
49639e1 to 6c1e8e2
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
It might happen that there would be an OOMKill of the container.
We would like to be aware about that issue.
Signed-off-by: Daniel Pawlik <dpawlik@redhat.com>
6c1e8e2 to 4a7c887