OEV-779 Send custom event on atlas error#339
Conversation
|
No problem, NewMetaMetrics is only used in chainlink-evm, see here |
There was a problem hiding this comment.
Pull request overview
This PR adds custom event emission for Atlas errors in the dual broadcast transaction manager. The changes enable tracking of FastLane Atlas errors through OpenTelemetry events sent to Beholder, providing better observability for Atlas-related failures.
Changes:
- Added new
emitAtlasErrorandemitAtlasErrorWithHttpStatusCodemethods to emit structured error events - Updated
MetaMetricsconstructor to accept a logger parameter for logging error events - Integrated error event emission at transaction send request and operation failure points
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/txm/clientwrappers/dualbroadcast/meta_metrics.go | Added error emission methods and logger to MetaMetrics struct |
| pkg/txm/clientwrappers/dualbroadcast/meta_metrics_test.go | Added comprehensive tests for the new error emission functionality |
| pkg/txm/clientwrappers/dualbroadcast/meta_client.go | Integrated error emission calls at failure points |
| go.mod | Updated chainlink-protos dependency to include FastLaneAtlasError message |
| Makefile | Added help text comments to existing targets |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
CI failures are due to smartcontractkit/chainlink-protos#264 not being merged, will fix that when it's merged. |
| } | ||
|
|
||
| // emitAtlasErrorWithHttpStatusCode emits an OTel event with an HTTP status code to track FastLane Atlas errors | ||
| func (m *MetaMetrics) emitAtlasErrorWithHttpStatusCode(ctx context.Context, errType string, customURL *url.URL, cause error, httpStatusCode int, tx *types.Transaction) { |
There was a problem hiding this comment.
If the main code doesn't ever utilize the httpStatusCode param, is there a point for tests to test for it? I feel we can avoid the passthrough method on line 115 and instead have one function.
There was a problem hiding this comment.
It's used in the Atlas error struct, see here
There was a problem hiding this comment.
It's in the Atlas error struct but besides tests, is there another value passed other than -1?
There was a problem hiding this comment.
Ah I see what you mean, good call 👍
I've removed the method and set the status code in the event to -1 for now, will remove the field from the schema in a separate PR.
| UseProtoNames: true, | ||
| EmitUnpopulated: true, | ||
| }.Format(msg) | ||
| m.lggr.Infow("[Beholder.emit]", "message", mStr, "attributes", attrKVs) |
There was a problem hiding this comment.
I'm on the fence as to whether this needs to be info or debug.
There was a problem hiding this comment.
We should only get a lot of errors in case of erroneous circumstances, then I think it'd be good to have these logs
There was a problem hiding this comment.
But since this is a node log, we get this information from other sources as well, so this just confirms that it sends the Beholder message on an info level. Anyway, as I mentioned, I don't have a strong preference so we can keep it like that.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ErrorType: errType, | ||
| ErrorMessage: cause.Error(), | ||
| HttpStatusCode: int32(httpStatusCode), | ||
| TransactionId: int64(tx.ID), //nolint:gosec // overflow is acceptable for telemetry |
There was a problem hiding this comment.
The comment states 'overflow is acceptable for telemetry' but doesn't explain the implications. Consider documenting what happens if overflow occurs and why it's acceptable, or what the expected range of tx.ID values is.
| TransactionId: int64(tx.ID), //nolint:gosec // overflow is acceptable for telemetry | |
| // We deliberately cast tx.ID to int64 for the telemetry-only TransactionId field. | |
| // The underlying tx.ID type may be wider than int64, so in theory this cast can | |
| // overflow and wrap according to Go's integer conversion rules. If that happens, | |
| // only the TransactionId reported in telemetry is affected; it does not influence | |
| // on-chain behaviour, database primary keys, or any safety-critical logic. | |
| // In practice, tx.ID values are expected to remain within the int64 range for the | |
| // lifetime of the service, so overflow is not expected in normal operation. | |
| TransactionId: int64(tx.ID), //nolint:gosec // see comment above for justification |
Depends on