feat(metrics): Migrate from OpenCensus to OpenTelemetry #1550
tekton-robot merged 1 commit into tektoncd:main
Conversation
/kind cleanup

@anithapriyanatarajan can we fix this in the same PR: https://github.com/tektoncd/chains/blob/main/pkg/taskrunmetrics/injection.go#L47
```go
if !r.initialized {
	return
}
r.errCount.Add(ctx, 1, otelmetric.WithAttributes(attribute.String("error_type", string(errType))))
```

Maybe don't hard-code "error_type" here; use a constant such as ErrorTypeAttrKey.
```go
}
errRegistering = viewRegister()
r = &Recorder{}
meter := otel.Meter("taskrunmetrics")
```

Shall we use the entire package path? https://opentelemetry.io/docs/languages/go/instrumentation/#acquiring-a-meter
```go
}
errRegistering = viewRegister()
r = &Recorder{}
meter := otel.Meter("pipelinerunmetrics")
```

Shall we use the entire package path? https://opentelemetry.io/docs/languages/go/instrumentation/#acquiring-a-meter
khrm
left a comment
We will also need to update https://github.com/tektoncd/chains/blob/main/config/config-observability.yaml and set a new default there, as in Pipelines.
@waveywaves @vdemeester - please share your reviews. Thank you
Code Review by Claude
5 inline comments. The OTel migration architecture is correct -- otel.Meter() with the global provider and knative/pkg handling MeterProvider setup is the right pattern. Issues below are about failure recovery, log spam, and test rigor.
```diff
@@ -99,89 +67,91 @@ func NewRecorder(ctx context.Context) (*Recorder, error) {
 	var errRegistering error
 	logger := logging.FromContext(ctx)
 	once.Do(func() {
```
sync.Once has no recovery from partial init failure. If any Int64Counter creation fails (lines 77-116), once.Do is consumed forever. Every subsequent NewRecorder() returns the broken recorder permanently, with no way to recover short of a controller restart.

Consider sync.Mutex with a boolean guard to allow retry:

```go
var mu sync.Mutex

func NewRecorder(ctx context.Context) (*Recorder, error) {
	mu.Lock()
	defer mu.Unlock()
	if r != nil && r.initialized {
		return r, nil
	}
	// ... create counters; allow retry on the next call if creation fails
}
```

Same issue in pkg/taskrunmetrics/metrics.go:69.
Addressed -- replaced with sync.Mutex in latest commit. Looks good.
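The difference between the two init patterns can be seen in a minimal, runnable sketch; onceRecorder and mutexRecorder are illustrative names, not types from this PR:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// onceRecorder mimics the original pattern: a failed first initialization
// consumes sync.Once, so later calls can never retry.
type onceRecorder struct {
	once        sync.Once
	initialized bool
}

func (r *onceRecorder) init(fail bool) {
	r.once.Do(func() {
		if fail {
			return // partial init: initialized stays false forever
		}
		r.initialized = true
	})
}

// mutexRecorder is the suggested fix: a mutex plus a boolean guard,
// so a failed attempt is retried on the next call.
type mutexRecorder struct {
	mu          sync.Mutex
	initialized bool
}

func (r *mutexRecorder) init(fail bool) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.initialized {
		return nil
	}
	if fail {
		return errors.New("counter creation failed")
	}
	r.initialized = true
	return nil
}

func main() {
	or := &onceRecorder{}
	or.init(true)  // transient failure
	or.init(false) // no-op: once.Do was already consumed
	fmt.Println("sync.Once recovered:", or.initialized) // prints false

	mr := &mutexRecorder{}
	_ = mr.init(true) // transient failure
	_ = mr.init(false)
	fmt.Println("mutex guard recovered:", mr.initialized) // prints true
}
```

With sync.Once the recorder stays broken after one transient error; with the mutex-and-flag guard the second call succeeds.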
```go
}
logger := logging.FromContext(ctx)
if !r.initialized {
	logger.Errorf("Ignoring the metrics recording as recorder not initialized ")
```
This Errorf fires on every reconciliation when the recorder is uninitialized. When NewRecorder fails, the partially-initialized recorder (non-nil, initialized == false) gets stored in context via WithClient. Get() finds it, returns it, and every RecordCountMetrics call hits this log line.

Result: continuous error-log spam for the lifetime of the controller.

Two fixes needed:
- In injection.go:40: store (*Recorder)(nil) on error instead of the broken recorder
- Here: downgrade to Debugf, or log only once via a separate sync.Once

Same pattern in pkg/taskrunmetrics/metrics.go and pkg/taskrunmetrics/injection.go.
Addressed -- downgraded to Debugf in latest commit.
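The log-once alternative mentioned above can be sketched with a per-recorder sync.Once; guardedRecorder and its method names are hypothetical, not the PR's API:

```go
package main

import (
	"fmt"
	"sync"
)

// guardedRecorder sketches a recorder whose "not initialized" warning is
// emitted once instead of on every reconciliation.
type guardedRecorder struct {
	initialized bool
	warnOnce    sync.Once
}

// recordCount stands in for RecordCountMetrics; logf stands in for the
// injected logger.
func (r *guardedRecorder) recordCount(logf func(string)) {
	if !r.initialized {
		r.warnOnce.Do(func() {
			logf("metrics recorder not initialized; dropping measurements")
		})
		return
	}
	// ... add to the counter here
}

func main() {
	var logs []string
	r := &guardedRecorder{} // initialized == false, as after a failed init
	for i := 0; i < 5; i++ { // five reconciliations
		r.recordCount(func(s string) { logs = append(logs, s) })
	}
	fmt.Println("log lines after 5 reconciliations:", len(logs)) // prints 1
}
```

Measurements are still dropped, but the operator sees one warning rather than one per reconcile loop.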
```go
ctx := context.Background()

// Should not panic or crash when recorder is uninitialized
recorder.RecordCountMetrics(ctx, metrics.SignedMessagesCount)
```
This test regressed from the OpenCensus version. The old test used metricstest.CheckStatsNotReported() to explicitly assert metrics were NOT emitted when the recorder is uninitialized. The new test only checks "doesn't panic" -- it doesn't verify that no metrics were actually recorded. Consider collecting from the ManualReader and asserting the metric data is empty.
Addressed -- test now sets up ManualReader, collects metrics, and asserts no metrics were recorded from uninitialized recorder. Good.
```go
}

// Record an error metric with a sample error type "signing"
rec.RecordErrorMetric(ctx, "signing")
```
"signing" should use the appropriate constant from pkg/metrics/common.go (e.g., a typed MetricErrorType constant). If the constant value changes, this test silently passes with the wrong value.
Same issue in pkg/taskrunmetrics/metrics_test.go:95.
Addressed -- now uses metrics.SigningError constant.
```sh
echo "Setting up port forwards..."

# Kill any existing port-forwards
pkill -f "kubectl.*port-forward" || true
```
pkill -f "kubectl.*port-forward" matches ALL kubectl port-forwards on the machine, not just ones from this script. This will kill port-forwards from other projects or debugging sessions running on the same machine.
Scope the pattern to specific namespaces:

```sh
pkill -f "kubectl.*port-forward.*-n monitoring" || true
pkill -f "kubectl.*port-forward.*-n observability-system" || true
pkill -f "kubectl.*port-forward.*-n tekton-chains" || true
```

Addressed -- pkill now scoped to specific namespaces (monitoring, observability-system, tekton-chains).
waveywaves
left a comment
Follow-up review
2 additional findings from deeper analysis.
Note: jkhelil's comments (ErrorTypeAttrKey constant, full meter package path) and khrm's comment (config-observability update) appear to be addressed in the current code.
```go
untyped := ctx.Value(RecorderKey{})
if untyped == nil {
	logging.FromContext(ctx).Errorf("Unable to fetch *pipelinerunmetrics.Recorder from context.")
	return nil
```
nil-in-interface footgun in WithClient (line 35-41, not in diff). When NewRecorder returns (nil, err), rec is nil. But context.WithValue stores a non-nil interface{} wrapping a nil *Recorder.
Here in Get(), untyped == nil is false because it is an interface holding a nil pointer -- so it falls through to return untyped.(*Recorder) which returns (*Recorder)(nil). This only works because RecordCountMetrics and RecordErrorMetric have nil-receiver guards. One missed guard in a future method = production panic.
Fix WithClient: don't store the recorder on error, just return ctx unchanged and let Get() handle the missing key via the nil check above.
Same issue in pkg/taskrunmetrics/injection.go.
Addressed -- WithClient now returns ctx unchanged on error instead of storing the broken recorder. Both pipelinerunmetrics and taskrunmetrics fixed.
```diff
 if err != nil {
 	logger.Error(err)
-	o.recordError(ctx, signableType.Type(), metrics.PayloadCreationError)
+	o.recordError(ctx, signableType, metrics.PayloadCreationError)
```
Good fix -- the old kind == TaskRunArtifact comparison was dead code because TaskRunArtifact.Type() returns tekton, not TaskRunArtifact. The type switch is correct.
Heads up: signing_failures_total has always been zero in production because of the old bug. After this merges, operators will see non-zero error counts for the first time. If anyone has alert-on-any-non-zero rules, they will get paged. Worth mentioning this behavioral change in the release notes.
Addressed -- release notes now explicitly state that signing_failures_total counters were previously always zero due to a bug and now report correct non-zero values. Good.
waveywaves
left a comment
One minor suggestion -- the rest looks good to ship.
```go
assertMetricsPresent(ctx, t, []string{
	"watcher_taskrun_sign_created_total",
})
```
The test already creates a TaskRun and waits for signing to complete. The success path in ObjectSigner.Sign() fires 3 counters (sign_created, payload_stored, marked_signed), but only 1 is checked here. Adding the other two is zero-effort extra coverage:
```go
})
assertMetricsPresent(ctx, t, []string{
	"watcher_taskrun_sign_created_total",
	"watcher_taskrun_payload_stored_total",
	"watcher_taskrun_marked_signed_total",
})
```
Not blocking -- can be a follow-up.
@waveywaves - Thank you for the detailed review. I had difficulty getting all 3 metrics in the e2e tests; it made them flaky in the runner setup. So I reduced the test to check only the sign-created metric. To mitigate this shortcoming, I have included the hack script that helps evaluate the metrics in a local kind cluster anytime. Hope this helps.
```diff
 // NewRecorder creates a new metrics recorder instance
-// to log the PipelineRun related metrics.
+// to log  PipelineRun related metrics.
```
nit: double space: "to log  PipelineRun" should be "to log PipelineRun"
Thank you. It is resolved 😄
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vdemeester

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details: Needs approval from an approver in each of these files.

Approvers can indicate their approval by writing `/approve` in a comment.
Signed-off-by: Anitha Natarajan <anataraj@redhat.com>
/lgtm
Fixes issue: #1355
Changes
This PR migrates Chains' metrics implementation from the deprecated OpenCensus library to OpenTelemetry, aligning with Tekton Pipelines' and Knative's observability stack modernization.
This migration is necessary because:
Changes include:
The metrics API exposed by Chains remains unchanged - this is an internal implementation migration. All existing metrics (watcher_pipelinerun_sign_created_total, watcher_taskrun_payload_uploaded_total, etc.) continue to be exported under the same names.
Relates to tektoncd/pipeline#9043
Relates to knative/serving observability migration: https://knative.dev/docs/serving/observability/metrics/serving-metrics/
Submitter Checklist
As the author of this PR, please check off the items in this checklist:
Release Notes
/kind feature