feat(metrics): Migrate from OpenCensus to OpenTelemetry #1550
tekton-robot merged 1 commit into tektoncd:main
Conversation
/kind cleanup

@anithapriyanatarajan can we fix this in the same PR: https://github.com/tektoncd/chains/blob/main/pkg/taskrunmetrics/injection.go#L47
```go
if !r.initialized {
	return
}
r.errCount.Add(ctx, 1, otelmetric.WithAttributes(attribute.String("error_type", string(errType))))
```

Maybe don't hard-code "error_type" here; use a constant such as ErrorTypeAttrKey.
```go
}
errRegistering = viewRegister()
r = &Recorder{}
meter := otel.Meter("taskrunmetrics")
```

Shall we use the entire package path? https://opentelemetry.io/docs/languages/go/instrumentation/#acquiring-a-meter
```go
}
errRegistering = viewRegister()
r = &Recorder{}
meter := otel.Meter("pipelinerunmetrics")
```

Shall we use the entire package path? https://opentelemetry.io/docs/languages/go/instrumentation/#acquiring-a-meter
khrm
left a comment
We will also need to update https://github.com/tektoncd/chains/blob/main/config/config-observability.yaml and set a new default there, as in Pipelines.
@waveywaves @vdemeester - please share your reviews. Thank you
Code Review by Claude
5 inline comments. The OTel migration architecture is correct -- otel.Meter() with the global provider and knative/pkg handling MeterProvider setup is the right pattern. Issues below are about failure recovery, log spam, and test rigor.
```diff
@@ -99,89 +67,91 @@ func NewRecorder(ctx context.Context) (*Recorder, error) {
 	var errRegistering error
 	logger := logging.FromContext(ctx)
 	once.Do(func() {
```
sync.Once has no recovery from partial init failure. If any Int64Counter creation fails (lines 77-116), once.Do is consumed forever. Every subsequent NewRecorder() returns the broken recorder permanently, with no way to recover short of a controller restart.

Consider sync.Mutex with a boolean guard to allow retry:

```go
var mu sync.Mutex

func NewRecorder(ctx context.Context) (*Recorder, error) {
	mu.Lock()
	defer mu.Unlock()
	if r != nil && r.initialized {
		return r, nil
	}
	// ... create counters; allow retry on the next call if creation fails
}
```

Same issue in pkg/taskrunmetrics/metrics.go:69.
Addressed -- replaced with sync.Mutex in latest commit. Looks good.
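The difference between the two init patterns can be seen in a minimal, runnable sketch; onceRecorder and mutexRecorder are illustrative names, not types from this PR:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// onceRecorder mimics the original pattern: a failed first initialization
// consumes sync.Once, so later calls can never retry.
type onceRecorder struct {
	once        sync.Once
	initialized bool
}

func (r *onceRecorder) init(fail bool) {
	r.once.Do(func() {
		if fail {
			return // partial init: initialized stays false forever
		}
		r.initialized = true
	})
}

// mutexRecorder is the suggested fix: a mutex plus a boolean guard,
// so a failed attempt is retried on the next call.
type mutexRecorder struct {
	mu          sync.Mutex
	initialized bool
}

func (r *mutexRecorder) init(fail bool) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.initialized {
		return nil
	}
	if fail {
		return errors.New("counter creation failed")
	}
	r.initialized = true
	return nil
}

func main() {
	or := &onceRecorder{}
	or.init(true)  // transient failure
	or.init(false) // no-op: once.Do was already consumed
	fmt.Println("sync.Once recovered:", or.initialized) // prints false

	mr := &mutexRecorder{}
	_ = mr.init(true) // transient failure
	_ = mr.init(false)
	fmt.Println("mutex guard recovered:", mr.initialized) // prints true
}
```

With sync.Once the recorder stays broken after one transient error; with the mutex-and-flag guard the second call succeeds.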
```go
}
logger := logging.FromContext(ctx)
if !r.initialized {
	logger.Errorf("Ignoring the metrics recording as recorder not initialized ")
```
This Errorf fires on every reconciliation when the recorder is uninitialized. When NewRecorder fails, the partially-initialized recorder (non-nil, initialized == false) gets stored in context via WithClient. Get() finds it, returns it, and every RecordCountMetrics call hits this log line.

Result: continuous error-log spam for the lifetime of the controller.

Two fixes needed:
- In injection.go:40: store (*Recorder)(nil) on error instead of the broken recorder
- Here: downgrade to Debugf, or log only once via a separate sync.Once

Same pattern in pkg/taskrunmetrics/metrics.go and pkg/taskrunmetrics/injection.go.
Addressed -- downgraded to Debugf in latest commit.
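The log-once alternative mentioned above can be sketched with a per-recorder sync.Once; guardedRecorder and its method names are hypothetical, not the PR's API:

```go
package main

import (
	"fmt"
	"sync"
)

// guardedRecorder sketches a recorder whose "not initialized" warning is
// emitted once instead of on every reconciliation.
type guardedRecorder struct {
	initialized bool
	warnOnce    sync.Once
}

// recordCount stands in for RecordCountMetrics; logf stands in for the
// injected logger.
func (r *guardedRecorder) recordCount(logf func(string)) {
	if !r.initialized {
		r.warnOnce.Do(func() {
			logf("metrics recorder not initialized; dropping measurements")
		})
		return
	}
	// ... add to the counter here
}

func main() {
	var logs []string
	r := &guardedRecorder{} // initialized == false, as after a failed init
	for i := 0; i < 5; i++ { // five reconciliations
		r.recordCount(func(s string) { logs = append(logs, s) })
	}
	fmt.Println("log lines after 5 reconciliations:", len(logs)) // prints 1
}
```

Measurements are still dropped, but the operator sees one warning rather than one per reconcile loop.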
```go
ctx := context.Background()

// Should not panic or crash when recorder is uninitialized
recorder.RecordCountMetrics(ctx, metrics.SignedMessagesCount)
```
This test regressed from the OpenCensus version. The old test used metricstest.CheckStatsNotReported() to explicitly assert metrics were NOT emitted when the recorder is uninitialized. The new test only checks "doesn't panic" -- it doesn't verify that no metrics were actually recorded. Consider collecting from the ManualReader and asserting the metric data is empty.
Addressed -- test now sets up ManualReader, collects metrics, and asserts no metrics were recorded from uninitialized recorder. Good.
```go
}

// Record an error metric with a sample error type "signing"
rec.RecordErrorMetric(ctx, "signing")
```
"signing" should use the appropriate constant from pkg/metrics/common.go (e.g., a typed MetricErrorType constant). If the constant value changes, this test silently passes with the wrong value.
Same issue in pkg/taskrunmetrics/metrics_test.go:95.
Addressed -- now uses metrics.SigningError constant.
```sh
echo "Setting up port forwards..."

# Kill any existing port-forwards
pkill -f "kubectl.*port-forward" || true
```
pkill -f "kubectl.*port-forward" matches ALL kubectl port-forwards on the machine, not just ones from this script. This will kill port-forwards from other projects or debugging sessions running on the same machine.
Scope the pattern to specific namespaces:

```sh
pkill -f "kubectl.*port-forward.*-n monitoring" || true
pkill -f "kubectl.*port-forward.*-n observability-system" || true
pkill -f "kubectl.*port-forward.*-n tekton-chains" || true
```

Addressed -- pkill now scoped to specific namespaces (monitoring, observability-system, tekton-chains).
waveywaves
left a comment
Follow-up review
2 additional findings from deeper analysis.
Note: jkhelil's comments (ErrorTypeAttrKey constant, full meter package path) and khrm's comment (config-observability update) appear to be addressed in the current code.
```go
untyped := ctx.Value(RecorderKey{})
if untyped == nil {
	logging.FromContext(ctx).Errorf("Unable to fetch *pipelinerunmetrics.Recorder from context.")
	return nil
```
nil-in-interface footgun in WithClient (line 35-41, not in diff). When NewRecorder returns (nil, err), rec is nil. But context.WithValue stores a non-nil interface{} wrapping a nil *Recorder.
Here in Get(), untyped == nil is false because it is an interface holding a nil pointer -- so it falls through to return untyped.(*Recorder) which returns (*Recorder)(nil). This only works because RecordCountMetrics and RecordErrorMetric have nil-receiver guards. One missed guard in a future method = production panic.
Fix WithClient: don't store the recorder on error, just return ctx unchanged and let Get() handle the missing key via the nil check above.
Same issue in pkg/taskrunmetrics/injection.go.
Addressed -- WithClient now returns ctx unchanged on error instead of storing the broken recorder. Both pipelinerunmetrics and taskrunmetrics fixed.
```diff
 if err != nil {
 	logger.Error(err)
-	o.recordError(ctx, signableType.Type(), metrics.PayloadCreationError)
+	o.recordError(ctx, signableType, metrics.PayloadCreationError)
```
Good fix -- the old kind == TaskRunArtifact comparison was dead code because TaskRunArtifact.Type() returns tekton, not TaskRunArtifact. The type switch is correct.
Heads up: signing_failures_total has always been zero in production because of the old bug. After this merges, operators will see non-zero error counts for the first time. If anyone has alert-on-any-non-zero rules, they will get paged. Worth mentioning this behavioral change in the release notes.
Addressed -- release notes now explicitly state that signing_failures_total counters were previously always zero due to a bug and now report correct non-zero values. Good.
waveywaves
left a comment
One minor suggestion -- the rest looks good to ship.
```go
assertMetricsPresent(ctx, t, []string{
	"watcher_taskrun_sign_created_total",
})
```
The test already creates a TaskRun and waits for signing to complete. The success path in ObjectSigner.Sign() fires 3 counters (sign_created, payload_stored, marked_signed), but only 1 is checked here. Adding the other two is zero-effort extra coverage:
```go
})
assertMetricsPresent(ctx, t, []string{
	"watcher_taskrun_sign_created_total",
	"watcher_taskrun_payload_stored_total",
	"watcher_taskrun_marked_signed_total",
})
```
Not blocking -- can be a follow-up.
@waveywaves - Thank you for the detailed review. I had difficulty getting all 3 metrics in the e2e tests; it made them flaky in the runner setup. So I reduced the test to check only the sign-created metric. To mitigate this shortcoming, I have included the hack script that helps evaluate the metrics in a local kind cluster anytime. Hope this helps.
```diff
 // NewRecorder creates a new metrics recorder instance
-// to log the PipelineRun related metrics.
+// to log  PipelineRun related metrics.
```
nit: double space: "to log  PipelineRun" should be "to log PipelineRun"
Thank you. It is resolved 😄
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vdemeester

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details: Needs approval from an approver in each of these files.

Approvers can indicate their approval by writing `/approve` in a comment.
Signed-off-by: Anitha Natarajan <anataraj@redhat.com>
/lgtm
Fixes issue: #1355
Changes
This PR migrates Chains' metrics implementation from the deprecated OpenCensus library to OpenTelemetry, aligning with Tekton Pipelines' and Knative's observability stack modernization.
This migration is necessary because:
Changes include:
The metrics API exposed by Chains remains unchanged - this is an internal implementation migration. All existing metrics (watcher_pipelinerun_sign_created_total, watcher_taskrun_payload_uploaded_total, etc.) continue to be exported under the same names.
Relates to tektoncd/pipeline#9043
Relates to knative/serving observability migration: https://knative.dev/docs/serving/observability/metrics/serving-metrics/
Submitter Checklist
As the author of this PR, please check off the items in this checklist:
Release Notes
/kind feature