crio: add retry to ContainerInfo call#3899
Open
Chandan9112 wants to merge 1 commit into
Open
Conversation
|
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
f9b18f2 to
5364844
Compare
Author
|
@googlebot I signed it. |
5364844 to
359f8a1
Compare
| } | ||
| } | ||
|
|
||
| func TestErrContainerNotFoundIsDistinguishable(t *testing.T) { |
Contributor
There was a problem hiding this comment.
I don't know if we need a unit test for this, it's just testing behavior golang should test with fmt.Errorf
Author
There was a problem hiding this comment.
Yes correct Peter, removed in the latest commit. Will make it ready for review.
When cadvisor detects a new container cgroup via inotify, the container may not yet be fully registered in CRI-O. This race condition causes ContainerInfo to return HTTP 404, which cadvisor treats as a fatal error and logs "Failed to process watch event". This is the same race condition that the containerd handler already handles with a retry+backoff loop around TaskPid (see lib/container/containerd/handler.go). Apply the same pattern for CRI-O: - Add an ErrContainerNotFound sentinel error in client.go - Wrap HTTP 404 responses from CRI-O with the sentinel - Retry ContainerInfo up to 5 times with 100ms exponential backoff in handler.go This is particularly impactful in environments using GPG-signed container images, where image signature verification delays container registration in CRI-O, widening the race window. Supersedes google#3842
359f8a1 to
3574934
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
When cadvisor detects a new container cgroup via inotify, the container may not yet be fully registered in CRI-O. This race condition causes
ContainerInfoto return HTTP 404, which cadvisor treats as a fatal error and logs:This is the same race condition that the containerd handler already handles with a retry+backoff loop around
TaskPid(lib/container/containerd/handler.go, lines 94-118).Root Cause
/containers/<id>APIThe race window is widened in environments using GPG-signed container images, where signature verification delays CRI-O's container registration.
Fix
ErrContainerNotFoundsentinel error inclient.goContainerInfowith the sentinel (using%w)handler.go'snewCrioContainerHandler, mirroring the existing containerd handler patternklog.V(4)logging for observability during retriesPrior Work
Supersedes #3842 by @haircommander, which addressed the same issue but was closed as stale. This PR reimplements the same approach with the following improvements:
ErrContainerNotFoundsentinel (vs unexportederrNotFound)klog.V(4)retry logging for observabilitycc @haircommander