feat(metrics): Migrate from OpenCensus to OpenTelemetry#1550

Merged
tekton-robot merged 1 commit into tektoncd:main from anithapriyanatarajan:upgrade-otel
Mar 27, 2026

Conversation

Contributor

@anithapriyanatarajan anithapriyanatarajan commented Feb 9, 2026

Fixes issue: #1355

Changes

This PR migrates the Chains metrics implementation from the deprecated OpenCensus library to OpenTelemetry, aligning with Tekton Pipeline's and Knative's observability stack modernization.

This migration is necessary because:

  • Knative deprecated knative.dev/pkg/metrics (OpenCensus-based) and removed it from the latest versions
  • Knative introduced knative.dev/pkg/observability with OpenTelemetry support as the replacement metrics framework
  • Tekton Pipeline is migrating to OpenTelemetry (PR #9043)
  • OpenCensus is no longer maintained and the ecosystem has standardized on OpenTelemetry

Changes include:

  • Update knative.dev/pkg to v0.0.0-20260225113719-b239e967f175 (includes new observability package with OpenTelemetry support)
  • Update tektoncd/pipeline dependency to version 1.10.2 for OpenTelemetry compatibility
  • Replace deprecated knative.dev/pkg/metrics imports with direct OpenTelemetry (go.opentelemetry.io/otel) in pkg/pipelinerunmetrics and pkg/taskrunmetrics
  • Convert OpenCensus stats.Float64Measure and view.View to OpenTelemetry Int64Counter instruments
  • Update metric recording from metrics.Record() with tags to OTel's Counter.Add() with attributes
  • Simplify tests to work without deprecated knative.dev/pkg/metrics/metricstest

The metrics API exposed by Chains remains unchanged; this is an internal implementation migration. All existing metrics (watcher_pipelinerun_sign_created_total, watcher_taskrun_payload_uploaded_total, etc.) continue to be exported under the same names.
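For illustration, the shape of the OpenCensus-to-OTel conversion is roughly the following. This is a hedged sketch, not the exact PR code: the function name recordSigningError is invented for the example, and the metric/attribute names follow the conventions described in this PR.

```go
import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	otelmetric "go.opentelemetry.io/otel/metric"
)

// Before (OpenCensus): a stats measure recorded through a registered view,
//   metrics.Record(ctx, measure.M(1)), with labels attached via tags.
// After (OpenTelemetry): an Int64Counter obtained from the global meter,
// recorded with Counter.Add and attributes.
func recordSigningError(ctx context.Context, errType string) error {
	meter := otel.Meter("taskrunmetrics")
	errCount, err := meter.Int64Counter("watcher_taskrun_signing_failures_total")
	if err != nil {
		return err
	}
	errCount.Add(ctx, 1,
		otelmetric.WithAttributes(attribute.String("error_type", errType)))
	return nil
}
```

Because knative.dev/pkg/observability installs the global MeterProvider, the package-level otel.Meter() call picks up the configured exporter without any wiring in Chains itself.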

Relates to tektoncd/pipeline#9043
Relates to knative/serving observability migration: https://knative.dev/docs/serving/observability/metrics/serving-metrics/

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Has Docs included if any changes are user facing
  • Has Tests included if any functionality added or changed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including
    functionality, content, code)
  • Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings)
  • Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

Migrate Chains metrics from OpenCensus to OpenTelemetry
-----------------------------------------------------------------------------------------
⚠️ Breaking Changes 

1. ConfigMap key renamed

The observability config key in `tekton-chains-config-observability` has been renamed:
from `metrics.backend-destination` to `metrics-protocol`

`prometheus` is now the active default. If you were already using Prometheus and have not customized this ConfigMap, no action is needed.

2. Knative infrastructure metric names changed

Metrics emitted by the Knative controller runtime have been renamed. Update any dashboards, recording rules, or alerts that reference the old names:

| Category   | Old name                                    | New name                                        |
|------------|---------------------------------------------|-------------------------------------------------|
| Workqueue  | `watcher_workqueue_adds_total`              | `kn_workqueue_adds_total`                       |
| Workqueue  | `watcher_workqueue_depth`                   | `kn_workqueue_depth`                            |
| Workqueue  | `watcher_workqueue_queue_latency_seconds`   | `kn_workqueue_queue_duration_seconds`           |
| Workqueue  | `watcher_workqueue_work_duration_seconds`   | `kn_workqueue_process_duration_seconds`         |
| Workqueue  | `watcher_workqueue_retries_total`           | `kn_workqueue_retries_total`                    |
| Workqueue  | `watcher_workqueue_unfinished_work_seconds` | `kn_workqueue_unfinished_work_seconds`          |
| K8s client | `watcher_client_latency`                    | `http_client_request_duration_seconds`          |
| K8s client | `watcher_client_results`                    | `kn_k8s_client_http_response_status_code_total` |
| Go runtime | `watcher_go_*`                              | `go_*` (standard Prometheus Go collector)       |

3. Removed metrics

`watcher_reconcile_count` and `watcher_reconcile_latency` are no longer emitted.
Replacements:
- `watcher_reconcile_count` → `kn_workqueue_adds_total`
- `watcher_reconcile_latency` → `kn_workqueue_process_duration_seconds`
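For dashboards or alerts that referenced the removed reconcile metrics, the rename is a direct substitution. An illustrative PromQL before/after (the 5m window is just an example):

```promql
# Before
sum(rate(watcher_reconcile_count[5m]))

# After
sum(rate(kn_workqueue_adds_total[5m]))
```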


✅ What is unchanged

All 10 Chains-specific signing metrics keep their existing names — no changes needed to Prometheus rules, alerts, or dashboards that reference these:

watcher_taskrun_sign_created_total
watcher_taskrun_payload_uploaded_total
watcher_taskrun_payload_stored_total
watcher_taskrun_marked_signed_total
watcher_taskrun_signing_failures_total

watcher_pipelinerun_sign_created_total
watcher_pipelinerun_payload_uploaded_total
watcher_pipelinerun_payload_stored_total
watcher_pipelinerun_marked_signed_total
watcher_pipelinerun_signing_failures_total

The `error_type` label on the `_signing_failures_total` counters is also preserved.
NOTE: `watcher_taskrun_signing_failures_total` and `watcher_pipelinerun_signing_failures_total` were previously always zero due to a bug in error-type matching. These counters now report correct non-zero values.

🆕 New: OTLP push export

Metrics can now be pushed directly to any OTLP-compatible backend (e.g. an OpenTelemetry Collector or Grafana Alloy) without a Prometheus scrape step:
data:
  metrics-protocol: otlp/grpc          # or otlp/http
  metrics-endpoint: otel-collector.monitoring.svc.cluster.local:4317
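Assuming the standard Tekton ConfigMap layout, a complete example might look like the following. The namespace and collector endpoint are illustrative; substitute your own:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tekton-chains-config-observability
  namespace: tekton-chains
data:
  metrics-protocol: otlp/grpc   # "prometheus" (default), "otlp/grpc", or "otlp/http"
  metrics-endpoint: otel-collector.monitoring.svc.cluster.local:4317
```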

/kind feature

@tekton-robot tekton-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 9, 2026
@tekton-robot tekton-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Feb 9, 2026
@anithapriyanatarajan anithapriyanatarajan added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 9, 2026
@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 14, 2026
@tekton-robot tekton-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 9, 2026
@tekton-robot tekton-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Mar 9, 2026
@anithapriyanatarajan force-pushed the upgrade-otel branch 2 times, most recently from 1157620 to 4e1b733 on March 11, 2026 11:18
@anithapriyanatarajan anithapriyanatarajan removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 11, 2026
@anithapriyanatarajan
Contributor Author

/kind cleanup

@tekton-robot tekton-robot added the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label Mar 11, 2026
@anithapriyanatarajan anithapriyanatarajan removed the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label Mar 11, 2026
@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 14, 2026
@jkhelil
Member

jkhelil commented Mar 16, 2026

@anithapriyanatarajan can we also fix https://github.com/tektoncd/chains/blob/main/pkg/taskrunmetrics/injection.go#L47 in the same PR? We should not panic there.

Comment thread pkg/taskrunmetrics/metrics.go Outdated
if !r.initialized {
return
}
r.errCount.Add(ctx, 1, otelmetric.WithAttributes(attribute.String("error_type", string(errType))))
Member


Maybe don't hard-code "error_type" here; use a constant such as ErrorTypeAttrKey.

Comment thread pkg/taskrunmetrics/metrics.go Outdated
}
errRegistering = viewRegister()
r = &Recorder{}
meter := otel.Meter("taskrunmetrics")
Member


Comment thread pkg/pipelinerunmetrics/metrics.go Outdated
}
errRegistering = viewRegister()
r = &Recorder{}
meter := otel.Meter("pipelinerunmetrics")
Member


Contributor

@khrm khrm left a comment


We will also need to update https://github.com/tektoncd/chains/blob/main/config/config-observability.yaml and set a new default here like in pipelines.

@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 16, 2026
@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 20, 2026
@anithapriyanatarajan force-pushed the upgrade-otel branch 2 times, most recently from b8ff203 to 4a12441 on March 23, 2026 12:34
@tekton-robot tekton-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 23, 2026
@anithapriyanatarajan force-pushed the upgrade-otel branch 3 times, most recently from 4586806 to cabc14c on March 23, 2026 15:17
@anithapriyanatarajan
Contributor Author

@waveywaves @vdemeester - please share your reviews. Thank you

Member

@waveywaves waveywaves left a comment


Code Review by Claude

5 inline comments. The OTel migration architecture is correct -- otel.Meter() with the global provider and knative/pkg handling MeterProvider setup is the right pattern. Issues below are about failure recovery, log spam, and test rigor.

Comment thread pkg/pipelinerunmetrics/metrics.go Outdated
@@ -99,89 +67,91 @@ func NewRecorder(ctx context.Context) (*Recorder, error) {
var errRegistering error
logger := logging.FromContext(ctx)
once.Do(func() {
Member


sync.Once has no recovery from partial init failure. If any Int64Counter creation fails (lines 77-116), once.Do is consumed forever. Every subsequent NewRecorder() returns the broken recorder permanently with no way to recover short of controller restart.

Consider sync.Mutex with a boolean guard to allow retry:

var mu sync.Mutex
func NewRecorder(ctx context.Context) (*Recorder, error) {
    mu.Lock()
    defer mu.Unlock()
    if r != nil && r.initialized {
        return r, nil
    }
    // ... create counters, allow retry on next call if it fails
}

Same issue in pkg/taskrunmetrics/metrics.go:69.

Member


Addressed -- replaced with sync.Mutex in latest commit. Looks good.

Comment thread pkg/pipelinerunmetrics/metrics.go Outdated
}
logger := logging.FromContext(ctx)
if !r.initialized {
logger.Errorf("Ignoring the metrics recording as recorder not initialized ")
Member


This Errorf fires on every single reconciliation when the recorder is uninitialized. When NewRecorder fails, the partially-initialized recorder (non-nil, initialized == false) gets stored in context via WithClient. Get() finds it, returns it, and every RecordCountMetrics call hits this log line.

Result: continuous error log spam for the lifetime of the controller.

Two fixes needed:

  1. In injection.go:40: store (*Recorder)(nil) on error instead of the broken recorder
  2. Here: downgrade to Debugf or log only once via a separate sync.Once

Same pattern in pkg/taskrunmetrics/metrics.go and pkg/taskrunmetrics/injection.go.

Member


Addressed -- downgraded to Debugf in latest commit.

ctx := context.Background()

// Should not panic or crash when recorder is uninitialized
recorder.RecordCountMetrics(ctx, metrics.SignedMessagesCount)
Member


This test regressed from the OpenCensus version. The old test used metricstest.CheckStatsNotReported() to explicitly assert metrics were NOT emitted when the recorder is uninitialized. The new test only checks "doesn't panic" -- it doesn't verify that no metrics were actually recorded. Consider collecting from the ManualReader and asserting the metric data is empty.

Member


Addressed -- test now sets up ManualReader, collects metrics, and asserts no metrics were recorded from uninitialized recorder. Good.

Comment thread pkg/pipelinerunmetrics/metrics_test.go Outdated
}

// Record an error metric with a sample error type "signing"
rec.RecordErrorMetric(ctx, "signing")
Member


"signing" should use the appropriate constant from pkg/metrics/common.go (e.g., a typed MetricErrorType constant). If the constant value changes, this test silently passes with the wrong value.

Same issue in pkg/taskrunmetrics/metrics_test.go:95.

Member


Addressed -- now uses metrics.SigningError constant.

Comment thread hack/setup-observability-dev.sh Outdated
echo "Setting up port forwards..."

# Kill any existing port-forwards
pkill -f "kubectl.*port-forward" || true
Member


pkill -f "kubectl.*port-forward" matches ALL kubectl port-forwards on the machine, not just ones from this script. This will kill port-forwards from other projects or debugging sessions running on the same machine.

Scope the pattern to specific namespaces:

pkill -f "kubectl.*port-forward.*-n monitoring" || true
pkill -f "kubectl.*port-forward.*-n observability-system" || true
pkill -f "kubectl.*port-forward.*-n tekton-chains" || true

Member


Addressed -- pkill now scoped to specific namespaces (monitoring, observability-system, tekton-chains).

@anithapriyanatarajan force-pushed the upgrade-otel branch 3 times, most recently from a5f62d3 to 257ffb0 on March 24, 2026 12:39
Member

@waveywaves waveywaves left a comment


Follow-up review

2 additional findings from deeper analysis.

Note: jkhelil's comments (ErrorTypeAttrKey constant, full meter package path) and khrm's comment (config-observability update) appear to be addressed in the current code.

untyped := ctx.Value(RecorderKey{})
if untyped == nil {
logging.FromContext(ctx).Errorf("Unable to fetch *pipelinerunmetrics.Recorder from context.")
return nil
Member


nil-in-interface footgun in WithClient (line 35-41, not in diff). When NewRecorder returns (nil, err), rec is nil. But context.WithValue stores a non-nil interface{} wrapping a nil *Recorder.

Here in Get(), untyped == nil is false because it is an interface holding a nil pointer -- so it falls through to return untyped.(*Recorder) which returns (*Recorder)(nil). This only works because RecordCountMetrics and RecordErrorMetric have nil-receiver guards. One missed guard in a future method = production panic.

Fix WithClient: don't store the recorder on error, just return ctx unchanged and let Get() handle the missing key via the nil check above.

Same issue in pkg/taskrunmetrics/injection.go.

Member


Addressed -- WithClient now returns ctx unchanged on error instead of storing the broken recorder. Both pipelinerunmetrics and taskrunmetrics fixed.

Comment thread pkg/chains/signing.go
if err != nil {
logger.Error(err)
o.recordError(ctx, signableType.Type(), metrics.PayloadCreationError)
o.recordError(ctx, signableType, metrics.PayloadCreationError)
Member


Good fix -- the old kind == TaskRunArtifact comparison was dead code because TaskRunArtifact.Type() returns tekton, not TaskRunArtifact. The type switch is correct.

Heads up: signing_failures_total has been always zero in production because of the old bug. After this merges, operators will see non-zero error counts for the first time. If anyone has alert-on-any-non-zero rules, they will get paged. Worth mentioning this behavioral change in the release notes.

Member


Addressed -- release notes now explicitly state that signing_failures_total counters were previously always zero due to a bug and now report correct non-zero values. Good.

Member

@waveywaves waveywaves left a comment


One minor suggestion -- the rest looks good to ship.

Comment thread test/metrics_test.go

assertMetricsPresent(ctx, t, []string{
"watcher_taskrun_sign_created_total",
})
Member


The test already creates a TaskRun and waits for signing to complete. The success path in ObjectSigner.Sign() fires 3 counters (sign_created, payload_stored, marked_signed), but only 1 is checked here. Adding the other two is zero-effort extra coverage:

Suggested change
})
assertMetricsPresent(ctx, t, []string{
"watcher_taskrun_sign_created_total",
"watcher_taskrun_payload_stored_total",
"watcher_taskrun_marked_signed_total",
})

Not blocking -- can be a follow-up.

Contributor Author

@anithapriyanatarajan anithapriyanatarajan Mar 25, 2026


@waveywaves - Thank you for the detailed review. I had difficulty getting all 3 metrics to show up reliably in the e2e tests; asserting on all of them made the suite flaky in the runner setup, so I reduced the test to check only the sign-created metric. To mitigate this shortcoming, I have included a hack script that makes it easy to evaluate the metrics in a local kind cluster at any time. Hope this helps.

Member

@vdemeester vdemeester left a comment


Nice 🎉

Comment thread pkg/pipelinerunmetrics/metrics.go Outdated

// NewRecorder creates a new metrics recorder instance
// to log the PipelineRun related metrics.
// to log PipelineRun related metrics.
Member


nit: Double space — "to log  PipelineRun" → "to log PipelineRun"

Contributor Author


Thank you. It is resolved 😄

@tekton-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 27, 2026
Signed-off-by: Anitha Natarajan <anataraj@redhat.com>
@vdemeester
Member

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 27, 2026
@tekton-robot tekton-robot merged commit 53834cb into tektoncd:main Mar 27, 2026
18 checks passed

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/feature Categorizes issue or PR as related to a new feature. lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
