crio: add retry to ContainerInfo call #3899

Open
Chandan9112 wants to merge 1 commit into google:master from Chandan9112:fix-crio-container-info-retry

Chandan9112 commented Jun 23, 2026

Summary

When cadvisor detects a new container cgroup via inotify, the container may not yet be fully registered in CRI-O. This race condition causes ContainerInfo to return HTTP 404, which cadvisor treats as a fatal error and logs:

Failed to process watch event: Error finding container <id>: Status 404 returned error can't find the container with id <id>

This is the same race condition that the containerd handler already handles with a retry+backoff loop around TaskPid (lib/container/containerd/handler.go, lines 94-118).

Root Cause

  1. cadvisor watches for new cgroups via inotify
  2. CRI-O creates the container cgroup early during container setup
  3. cadvisor sees the cgroup and immediately calls CRI-O's /containers/<id> API
  4. CRI-O has not yet finished registering the container internally → returns HTTP 404
  5. cadvisor logs this as a warning and fails to track the container

The race window is widened in environments using GPG-signed container images, where signature verification delays CRI-O's container registration.

Fix

  • Add an ErrContainerNotFound sentinel error in client.go
  • Wrap HTTP 404 responses from CRI-O's ContainerInfo with the sentinel (using %w)
  • Add a retry loop with exponential backoff (5 retries, 100ms initial, doubling) in handler.go's newCrioContainerHandler, mirroring the existing containerd handler pattern
  • Non-404 errors fail immediately without retry
  • Add klog.V(4) logging for observability during retries

Prior Work

Supersedes #3842 by @haircommander, which addressed the same issue but was closed as stale. This PR reimplements the same approach with the following improvements:

  • Exported ErrContainerNotFound sentinel (vs unexported errNotFound)
  • Fixed format string bug in error wrapping
  • Added klog.V(4) retry logging for observability

cc @haircommander

google-cla Bot commented Jun 23, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

Chandan9112 force-pushed the fix-crio-container-info-retry branch from f9b18f2 to 5364844 on June 23, 2026 06:34

Chandan9112 (Author) commented Jun 23, 2026

@googlebot I signed it.

Chandan9112 force-pushed the fix-crio-container-info-retry branch from 5364844 to 359f8a1 on June 23, 2026 07:03
Comment thread lib/container/crio/client_test.go Outdated
}
}

func TestErrContainerNotFoundIsDistinguishable(t *testing.T) {

Contributor

I don't know if we need a unit test for this; it only tests behavior that Go itself should already cover for fmt.Errorf.

Author

Yes, correct, Peter. Removed in the latest commit; I will mark the PR ready for review.

When cadvisor detects a new container cgroup via inotify, the container
may not yet be fully registered in CRI-O. This race condition causes
ContainerInfo to return HTTP 404, which cadvisor treats as a fatal error
and logs "Failed to process watch event".

This is the same race condition that the containerd handler already
handles with a retry+backoff loop around TaskPid (see
lib/container/containerd/handler.go).

Apply the same pattern for CRI-O:
- Add an ErrContainerNotFound sentinel error in client.go
- Wrap HTTP 404 responses from CRI-O with the sentinel
- Retry ContainerInfo up to 5 times with exponential backoff starting at
  100ms in handler.go

This is particularly impactful in environments using GPG-signed
container images, where image signature verification delays container
registration in CRI-O, widening the race window.

Supersedes google#3842

Chandan9112 force-pushed the fix-crio-container-info-retry branch from 359f8a1 to 3574934 on June 24, 2026 03:04
Chandan9112 marked this pull request as ready for review on June 24, 2026 03:24