tracer: avoid panic for unknown origins#1046
Conversation
If a trace with an unknown origin is reported, eBPF profiler currently panics: ``` level=WARN msg="Skip handling trace from unexpected 4 origin" panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0xb76c39] goroutine 97 [running]: go.opentelemetry.io/ebpf-profiler/tracer.(*Tracer).startTraceEventMonitor.func1() /agent/tracer/events.go:218 +0x359 created by go.opentelemetry.io/ebpf-profiler/tracer.(*Tracer).startTraceEventMonitor in goroutine 1 /agent/tracer/events.go:153 +0x248 ``` Signed-off-by: Florian Lehner <florian.lehner@elastic.co>
|
|
||
| // Keep track of min KTime seen in this batch processing loop | ||
| trace := t.loadBpfTrace(data.RawSample, data.CPU) | ||
| if trace == nil { |
There was a problem hiding this comment.
Given that loadBpfTrace may panic in multiple spots (a behavior which we want to remove), maybe we can start by having it return an error to the caller in this PR. In a future PR, we could remove the panics by returning additional errors that trigger receiver shutdown.
There was a problem hiding this comment.
Updated loadBpfTrace() to return an error instead of panicing.
In a future PR, we could remove the panics by returning additional errors that trigger receiver shutdown.
This is something for the log.Fatal() cases in other places, I think.
Signed-off-by: Florian Lehner <florian.lehner@elastic.co>
| frameListOffs := int(unsafe.Offsetof(support.Trace{}.Frame_data)) | ||
|
|
||
| if len(raw) < frameListOffs { | ||
| panic("trace record too small") |
There was a problem hiding this comment.
We should keep these panics for now until we have a functional equivalent (cleanly shutting down the receiver).
If we remove them here, we're changing the behavior as we're effectively ignoring these errors that were previously deemed serious enough to panic (we're also hiding this behavior from a future refactoring).
There was a problem hiding this comment.
Mixing panic() with error returns in loadBpfTrace() is something I would like to avoid.
The panic() here was replaced with a dedicated error. This allows in a future step a graceful shutdown on this dedicated error. As this case should not :Tm: happen in regular setups, as far as I know there is not a single report of such a panic in the past ~5 years, this error is logged as error at the moment.
There was a problem hiding this comment.
The main issue is that we're hiding information with this change, the current code encodes certain semantics: that this error is serious enough to panic. So we at least need to keep the "serious enough" part of these semantics intact, otherwise we're forcing ourselves to remember that these conditions should trigger shutdown. See my other suggestions.
There was a problem hiding this comment.
To mimic the current semantics I added dedicated error handling for these panic() cases with a10309c and stop receiving and handling traces in user space.
The next step, a graceful complete shutdown, then should only be the next step with the handling of log.Fatal() cases in package tracer.
Signed-off-by: Florian Lehner <florian.lehner@elastic.co>
| return nil, fmt.Errorf("%d < %d: %w", len(raw), frameListOffs+frameDataLen, | ||
| errRecordUnexpectedSize) |
There was a problem hiding this comment.
How about log the error here, and return err... directly to caller to determine what to do? Same for other similar places.
There was a problem hiding this comment.
I used the fmt.Errorf() here, as I think it is important to have these additional information with the error to act on them later on. As control/program flows can change, I didn't want to put a log here and keep the log, that indicates the termination, close to the place, where the termination actual happens.
Signed-off-by: Florian Lehner <florian.lehner@elastic.co>
Co-authored-by: Timo Teräs <timo.teras@iki.fi>
Signed-off-by: Florian Lehner <florian.lehner@elastic.co>
Co-authored-by: Timo Teräs <timo.teras@iki.fi>
If a trace with an unknown origin is reported, eBPF profiler currently panics: