metrics: Add debug metric for non-critical errors #4416
Force-pushed from 3502235 to cba34da.
mtardy left a comment
maybe we can move this from error to debug/note or an independent metric (with debug in its name or something)
    const (
        // The username resolution was skipped since the process is not in host
        // namespaces.
        ProcessMetadataUsernameIgnoredNotInHost WarningType = iota
Suggested change:
-   ProcessMetadataUsernameIgnoredNotInHost WarningType = iota
+   ProcessMetadataUsernameIgnoredNotInHost DebugType = iota
"There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors." --- Leon Bambrick 😅
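The rename suggested in this thread could look roughly like the sketch below. Only the `DebugType` and `ProcessMetadataUsernameIgnoredNotInHost` names come from the diff; the `debugTypeLabels` map, the `String` method, and the label text are assumptions added for illustration, not the code from this PR.

```go
package main

import "fmt"

// DebugType identifies a non-critical condition worth counting.
// The type name follows the review suggestion (DebugType instead of WarningType).
type DebugType int

const (
	// The username resolution was skipped since the process is not in host
	// namespaces.
	ProcessMetadataUsernameIgnoredNotInHost DebugType = iota
)

// debugTypeLabels maps each DebugType to a metric label value.
// The label text is hypothetical, not taken from the PR.
var debugTypeLabels = map[DebugType]string{
	ProcessMetadataUsernameIgnoredNotInHost: "process_metadata_username_ignored_not_in_host_namespace",
}

// String returns the label for a DebugType, or "unknown" for unmapped values.
func (d DebugType) String() string {
	if label, ok := debugTypeLabels[d]; ok {
		return label
	}
	return "unknown"
}

func main() {
	// Println uses the String method because DebugType implements fmt.Stringer.
	fmt.Println(ProcessMetadataUsernameIgnoredNotInHost)
}
```

Keeping the label lookup in one map means a new debug type only needs a constant and a map entry, and an unmapped value still produces a usable label.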
Force-pushed from cba34da to 32fcc3f.
mtardy left a comment
Thanks, I think it looks good like that. The only nit is that the names of the pkg and the file are now a bit of a mismatch, but whatever; hopefully someone will reuse this debug thing instead of recreating one next time.
mtardy left a comment
yeah just two nits that should be quick to fix if you're okay:
- small typos in comment
- could you put this stuff into a new debugmetrics.go file in the same errormetrics pkg?
I'm also ok to merge like that if you feel that's better
    // Get a new handle on the HandlerErrors metric
    func GetHandlerErrors(opcode ops.OpCode, er EventHandlerError) prometheus.Counter {
        return HandlerErrors.WithLabelValues(strconv.Itoa(int(int32(opcode))), er.String())
    // Get a new handle on an DebugTotal metric for an DebugType
Suggested change:
-   // Get a new handle on an DebugTotal metric for an DebugType
+   // Get a new handle on a DebugTotal metric for a DebugType
    // Increment the HandlerErrors metric
    func HandlerErrorsInc(opcode ops.OpCode, er EventHandlerError) {
        GetHandlerErrors(opcode, er).Inc()
    // Increment an DebugTotal for an DebugType
Suggested change:
-   // Increment an DebugTotal for an DebugType
+   // Increment a DebugTotal for a DebugType
Yeah, that would be much more structured.
Force-pushed from 32fcc3f to 7330c93.
Right now, tetragon_errors_total counts things that aren't really errors, like ProcessMetadataUsernameIgnoredNotInHost. This happens a lot in containers and isn't something we usually need to fix, but it's spamming the error metrics.

This change adds a new tetragon_debug_events_total metric for these kinds of 'casual fails'. I moved ProcessMetadataUsernameIgnoredNotInHost over to this new debug metric so the main error count is actually useful for spotting real problems.

Signed-off-by: Aritra Dey <adey01027@gmail.com>
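As a rough illustration of the split described in the commit message, the sketch below keeps two independent counter families so that debug events never inflate the error total. It deliberately uses a small stdlib-only stand-in instead of the Prometheus client that the real code uses; `counterVec`, `ErrorTotalInc`, `DebugTotalInc`, and the label strings are hypothetical names chosen for this example, not the PR's actual API.

```go
package main

import (
	"fmt"
	"sync"
)

// counterVec is a stand-in for a labelled counter family
// (hypothetical; the real code uses Prometheus counters).
type counterVec struct {
	mu     sync.Mutex
	counts map[string]uint64
}

func newCounterVec() *counterVec {
	return &counterVec{counts: make(map[string]uint64)}
}

// Inc adds one to the counter for the given label value.
func (c *counterVec) Inc(label string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[label]++
}

// Get returns the current count for one label value.
func (c *counterVec) Get(label string) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[label]
}

// Total returns the sum over all label values.
func (c *counterVec) Total() uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	var sum uint64
	for _, n := range c.counts {
		sum += n
	}
	return sum
}

var (
	errorsTotal      = newCounterVec() // stands in for tetragon_errors_total
	debugEventsTotal = newCounterVec() // stands in for tetragon_debug_events_total
)

// ErrorTotalInc records a condition that should be investigated.
func ErrorTotalInc(kind string) { errorsTotal.Inc(kind) }

// DebugTotalInc records an expected, non-critical condition.
func DebugTotalInc(kind string) { debugEventsTotal.Inc(kind) }

func main() {
	// A containerized workload hits the not-in-host case repeatedly.
	for i := 0; i < 1000; i++ {
		DebugTotalInc("process_metadata_username_ignored_not_in_host")
	}
	// One real problem (hypothetical label) still stands out in the error count.
	ErrorTotalInc("example_real_error")

	fmt.Printf("errors=%d debug=%d\n", errorsTotal.Total(), debugEventsTotal.Total())
}
```

With the two families kept apart, an alert on the error total reacts only to the conditions routed through `ErrorTotalInc`, while the frequent container case remains visible for debugging.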
Force-pushed from 7330c93 to 4ae70a8.
mtardy left a comment
awesome let's merge like this once green, thanks again
Part of #2785
Description
Adds tetragon_debug_events_total metric for non-critical errors to reduce noise in tetragon_errors_total. (see commit)
Example Output:
Changelog