metrics: Add debug metric for non-critical errors #4416

Merged
mtardy merged 1 commit into cilium:main from AritraDey-Dev:metrics-warnings
Jan 16, 2026

@AritraDey-Dev
Member

@AritraDey-Dev AritraDey-Dev commented Dec 9, 2025

Part of #2785

Stop reporting non-errors (as in: no action needed) as errors. Define separate metrics for "casual fails" if needed.

Description

Adds tetragon_debug_events_total metric for non-critical errors to reduce noise in tetragon_errors_total. (see commit)

Example Output:

# HELP tetragon_debug_events_total The total number of Tetragon debug events. For internal use only.
# TYPE tetragon_debug_events_total counter
tetragon_debug_events_total{type="process_metadata_username_ignored_not_in_host_namespaces"} 1

Changelog

Added tetragon_debug_events_total metric to separate non-critical issues from actual errors.

@AritraDey-Dev AritraDey-Dev requested a review from a team as a code owner December 9, 2025 10:04
@netlify

netlify Bot commented Dec 9, 2025

Deploy Preview for tetragon ready!

Name Link
🔨 Latest commit 32fcc3f
🔍 Latest deploy log https://app.netlify.com/projects/tetragon/deploys/6968e6af5fcf3e000860315c
😎 Deploy Preview https://deploy-preview-4416--tetragon.netlify.app

@kevsecurity kevsecurity added the release-note/minor This PR introduces a minor user-visible change label Jan 8, 2026
Contributor

@kevsecurity kevsecurity left a comment

LGTM

Member

@mtardy mtardy left a comment

I agree this is an improvement, thanks! But taking a step back: do we care about this warning at all? Should we just drop it altogether instead?

It looks like it was added in 28ce5ef, and it seems it was considered useful during that review. @kkourt, any memory?

@mtardy mtardy requested a review from kkourt January 8, 2026 16:02
Member

@mtardy mtardy left a comment

maybe we can move this from error to debug/note or an independent metric (with debug in its name or something)

const (
	// The username resolution was skipped since the process is not in host
	// namespaces.
	ProcessMetadataUsernameIgnoredNotInHost WarningType = iota

Suggested change
- ProcessMetadataUsernameIgnoredNotInHost WarningType = iota
+ ProcessMetadataUsernameIgnoredNotInHost DebugType = iota

@AritraDey-Dev
Member Author

maybe we can move this from error to debug/note or an independent metric (with debug in its name or something)

How does tetragon_debug_events_total sound?

"There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors." --- Leon Bambrick😅

@AritraDey-Dev AritraDey-Dev requested a review from mtardy January 15, 2026 13:08
@AritraDey-Dev AritraDey-Dev changed the title from "metrics: Add warnings metric for non-critical errors" to "metrics: Add debug metric for non-critical errors" Jan 15, 2026
Member

@mtardy mtardy left a comment

Thanks, I think it looks good like that. The only nit is that the name of the pkg and the file are now a bit of a mismatch, but whatever; hopefully someone will reuse this debug thing instead of recreating one next time.

Member

@mtardy mtardy left a comment

yeah just two nits that should be quick to fix if you're okay:

  • small typos in comment
  • could you put this stuff into a new debugmetrics.go file in the same errormetrics pkg?

I'm also ok to merge like that if you feel that's better

// Get a new handle on the HandlerErrors metric
func GetHandlerErrors(opcode ops.OpCode, er EventHandlerError) prometheus.Counter {
return HandlerErrors.WithLabelValues(strconv.Itoa(int(int32(opcode))), er.String())
// Get a new handle on an DebugTotal metric for an DebugType

Suggested change
- // Get a new handle on an DebugTotal metric for an DebugType
+ // Get a new handle on a DebugTotal metric for a DebugType

// Increment the HandlerErrors metric
func HandlerErrorsInc(opcode ops.OpCode, er EventHandlerError) {
GetHandlerErrors(opcode, er).Inc()
// Increment an DebugTotal for an DebugType

Suggested change
- // Increment an DebugTotal for an DebugType
+ // Increment a DebugTotal for a DebugType

@AritraDey-Dev
Member Author

could you put this stuff into a new debugmetrics.go file in the same errormetrics pkg?

Yeah, that would be much more structured.

Comment thread pkg/metrics/errormetrics/errormetrics.go Outdated
Right now, tetragon_errors_total counts things that aren't really errors, like ProcessMetadataUsernameIgnoredNotInHost. This happens a lot in containers and isn't something we usually need to fix, but it's spamming the error metrics.

This change adds a new tetragon_debug_events_total metric for these kinds of 'casual fails'. I moved ProcessMetadataUsernameIgnoredNotInHost over to this new debug metric so the main error count is actually useful for spotting real problems.

Signed-off-by: Aritra Dey <adey01027@gmail.com>
Member

@mtardy mtardy left a comment

awesome let's merge like this once green, thanks again

@mtardy mtardy merged commit efe8719 into cilium:main Jan 16, 2026
52 checks passed
@AritraDey-Dev AritraDey-Dev deleted the metrics-warnings branch January 16, 2026 13:56
Labels

release-note/minor This PR introduces a minor user-visible change

3 participants