feat: add opentelemetry observability by ccf · Pull Request #168 · ccf/primer

ccf · 2026-04-04T03:10:38Z

Summary

- add an optional OpenTelemetry observability layer with request spans, log correlation, and OTLP/console export support
- instrument analytics cache requests, background job execution, and the heaviest composite analytics services
- truth up the roadmap to mark OpenTelemetry integration shipped

Testing

- PYTHONPATH=/Users/ccf/git/primer/src:/Users/ccf/git/primer pytest --import-mode=importlib tests/test_observability_service.py tests/test_compare.py tests/test_project_workspace.py tests/test_engineer_profile.py tests/test_background_job_service.py -q
- ruff check src/primer/common/config.py src/primer/server/app.py src/primer/server/services/observability_service.py src/primer/server/services/analytics_cache_service.py src/primer/server/services/background_job_service.py src/primer/server/services/compare_service.py src/primer/server/services/project_workspace_service.py src/primer/server/services/engineer_profile_service.py tests/test_observability_service.py tests/test_background_job_service.py tests/test_compare.py tests/test_project_workspace.py tests/test_engineer_profile.py
- full pre-push backend/frontend hooks

Note

Medium Risk
Touches request handling middleware and background job execution paths to emit OpenTelemetry metrics/spans; misconfiguration or exporter issues could add overhead or noisy telemetry, but the whole layer is gated behind otel_enabled.

Overview
Adds an optional OpenTelemetry observability layer (new observability_service) with request-level spans, duration histograms/counters, and log correlation via injected otel_trace_id/otel_span_id, supporting OTLP HTTP export or console export.

Instruments key backend hotspots: analytics cache now records hit/miss/error/decode/write counters, background job cycles and per-job executions are wrapped in spans and emit result/duration metrics, and heavy analytics aggregations (compare, engineer_profile, project_workspace) are wrapped in top-level spans. Also updates config (otel_* settings), adds OpenTelemetry dependencies, and marks the roadmap item as shipped.

Reviewed by Cursor Bugbot for commit 4a87c0e. Bugbot is set up for automated code reviews on this repo.

cursor

Cursor Bugbot has reviewed your changes and found 3 potential issues.

Autofix Details

Bugbot Autofix prepared fixes for all 3 issues found in the latest run.

✅ Fixed: Cache hit counter incremented before JSON parse succeeds
- Moved the hit counter to after json.loads succeeds, so decode errors are no longer double-counted as both hits and decode_errors.
✅ Fixed: Background job counter double-counts processed total
- Removed the redundant result=processed counter since processed = succeeded + failed, so summing across all result attribute values now yields the correct total.
✅ Fixed: Error path uses different attribute key than success path
- Changed the error path to use http.route (derived from request.scope route with fallback) instead of http.target, matching the success path attribute key.
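
To make the double-counting fix concrete, here is a small sketch of what a sum over the result attribute looks like before and after removing the result=processed data point. The record_counter defined here is a hypothetical in-memory stand-in that only mirrors the (name, value, attributes) call shape visible in the diff, and the result=failed data point is assumed from the statement that processed = succeeded + failed; neither is taken from the project's code.

```python
from collections import Counter

# Hypothetical in-memory stand-in for the project's record_counter helper.
_points = Counter()


def record_counter(name, value, attributes):
    key = (name, tuple(sorted(attributes.items())))
    _points[key] += value


def total(name):
    """Sum a metric across all attribute sets, as a dashboard query might."""
    return sum(v for (n, _), v in _points.items() if n == name)


def emit(result, include_processed):
    metric = "primer.background_jobs.processed"
    if include_processed:  # behaviour before the fix
        record_counter(metric, result["processed"], {"result": "processed"})
    record_counter(metric, result["succeeded"], {"result": "succeeded"})
    # Assumed companion data point for failed jobs.
    record_counter(metric, result["failed"], {"result": "failed"})


cycle = {"processed": 5, "succeeded": 3, "failed": 2}

emit(cycle, include_processed=True)
before_fix = total("primer.background_jobs.processed")  # every job counted twice

_points.clear()
emit(cycle, include_processed=False)
after_fix = total("primer.background_jobs.processed")  # matches cycle["processed"]
```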

Or push these changes by commenting:

@cursor push 04c7e26e03

Preview (04c7e26e03)

diff --git a/src/primer/server/app.py b/src/primer/server/app.py
--- a/src/primer/server/app.py
+++ b/src/primer/server/app.py
@@ -102,11 +102,6 @@
                 if result["processed"] > 0:
                     record_counter(
                         "primer.background_jobs.processed",
-                        result["processed"],
-                        {"result": "processed"},
-                    )
-                    record_counter(
-                        "primer.background_jobs.processed",
                         result["succeeded"],
                         {"result": "succeeded"},
                     )

diff --git a/src/primer/server/services/analytics_cache_service.py b/src/primer/server/services/analytics_cache_service.py
--- a/src/primer/server/services/analytics_cache_service.py
+++ b/src/primer/server/services/analytics_cache_service.py
@@ -98,12 +98,7 @@
         )
         return None
     try:
-        record_counter(
-            "primer.analytics_cache.requests",
-            1,
-            {"namespace": namespace, "result": "hit"},
-        )
-        return json.loads(payload)
+        parsed = json.loads(payload)
     except json.JSONDecodeError:
         record_counter(
             "primer.analytics_cache.requests",
@@ -111,6 +106,12 @@
             {"namespace": namespace, "result": "decode_error"},
         )
         return None
+    record_counter(
+        "primer.analytics_cache.requests",
+        1,
+        {"namespace": namespace, "result": "hit"},
+    )
+    return parsed
 
 
 def set_cached_json(

diff --git a/src/primer/server/services/observability_service.py b/src/primer/server/services/observability_service.py
--- a/src/primer/server/services/observability_service.py
+++ b/src/primer/server/services/observability_service.py
@@ -122,12 +122,14 @@
             except Exception as exc:
                 if span is not None:
                     span.set_attribute("error.type", exc.__class__.__name__)
+                route = request.scope.get("route")
+                route_path = getattr(route, "path", request.url.path)
                 record_counter(
                     "primer.http.requests",
                     1,
                     {
                         "http.method": request.method,
-                        "http.target": request.url.path,
+                        "http.route": route_path,
                         "http.status_code": "500",
                     },
                 )


src/primer/server/services/analytics_cache_service.py

src/primer/server/app.py

src/primer/server/services/observability_service.py

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: Route path always resolves to raw URL, not template
- Moved request.scope.get('route') after call_next() returns so the route template is available from the scope (populated during routing inside call_next), preventing unbounded cardinality on http.route metric labels.
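
The ordering issue can be reproduced without a web framework. In the sketch below, the body of _route_path is inferred from the two lines shown in the diff (the helper itself appears only by name in a later commit), while fake_request, fake_call_next, and the example paths are hypothetical stand-ins for the request object and for the routing that happens inside call_next.

```python
from types import SimpleNamespace


def _route_path(request):
    """Prefer the matched route template; fall back to the raw URL path."""
    route = request.scope.get("route")
    return getattr(route, "path", request.url.path)


def fake_request(raw_path):
    # Minimal request-like object: a scope dict plus url.path.
    return SimpleNamespace(scope={}, url=SimpleNamespace(path=raw_path))


def fake_call_next(request, template):
    # Stand-in for routing: the router stores the matched route in the scope.
    request.scope["route"] = SimpleNamespace(path=template)


request = fake_request("/engineers/42/profile")
before_routing = _route_path(request)  # raw path: one label value per id
fake_call_next(request, "/engineers/{engineer_id}/profile")
after_routing = _route_path(request)  # template: one label value per route
```

Reading the scope before routing always takes the fallback branch, which is why every distinct URL would have become its own http.route label value.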

Or push these changes by commenting:

@cursor push 2ab3c5eb7f

Preview (2ab3c5eb7f)

diff --git a/src/primer/server/services/observability_service.py b/src/primer/server/services/observability_service.py
--- a/src/primer/server/services/observability_service.py
+++ b/src/primer/server/services/observability_service.py
@@ -109,20 +109,20 @@
 
     @app.middleware("http")
     async def otel_request_middleware(request, call_next):
-        route = request.scope.get("route")
-        route_path = getattr(route, "path", request.url.path)
         with start_span(
             "http.request",
             {
                 "http.method": request.method,
-                "http.route": route_path,
             },
         ) as span:
             started = perf_counter()
             try:
                 response = await call_next(request)
             except Exception as exc:
+                route = request.scope.get("route")
+                route_path = getattr(route, "path", request.url.path)
                 if span is not None:
+                    span.set_attribute("http.route", route_path)
                     span.set_attribute("error.type", exc.__class__.__name__)
                 record_counter(
                     "primer.http.requests",
@@ -134,6 +134,8 @@
                     },
                 )
                 raise
+            route = request.scope.get("route")
+            route_path = getattr(route, "path", request.url.path)
             duration_ms = (perf_counter() - started) * 1000
             attributes = {
                 "http.method": request.method,


src/primer/server/services/observability_service.py

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: Wall clock used for duration instead of monotonic timer
- Replaced _utcnow_naive() wall clock with time.perf_counter() monotonic timer for job execution duration measurement, consistent with the rest of the codebase.
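
A minimal sketch of the same timing pattern, assuming a hypothetical timed_ms context manager and a plain list as the sink instead of the project's record_histogram:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_ms(sink):
    """Append the elapsed milliseconds of the with-block to sink."""
    started = time.perf_counter()
    try:
        yield
    finally:
        # perf_counter is monotonic, so the difference is never negative,
        # unlike a wall-clock difference taken across a clock adjustment.
        sink.append((time.perf_counter() - started) * 1000)


durations = []
with timed_ms(durations):
    time.sleep(0.01)  # placeholder for the job body
```

The finally clause mirrors the diff: the duration is recorded whether the job body returns or raises.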

Or push these changes by commenting:

@cursor push ffc711f342

Preview (ffc711f342)

diff --git a/src/primer/server/services/background_job_service.py b/src/primer/server/services/background_job_service.py
--- a/src/primer/server/services/background_job_service.py
+++ b/src/primer/server/services/background_job_service.py
@@ -1,5 +1,6 @@
 from __future__ import annotations
 
+import time
 from datetime import UTC, datetime, timedelta
 from typing import TYPE_CHECKING, Any
 
@@ -172,7 +173,7 @@
 
         processed += 1
         job_id, job_type, payload, attempts, max_attempts = claimed
-        job_started = _utcnow_naive()
+        job_started = time.perf_counter()
         with start_span(
             "background_job.execute",
             {
@@ -209,7 +210,7 @@
             finally:
                 record_histogram(
                     "primer.background_jobs.execution.duration_ms",
-                    (_utcnow_naive() - job_started).total_seconds() * 1000,
+                    (time.perf_counter() - job_started) * 1000,
                     {"job_type": job_type},
                 )


src/primer/server/services/background_job_service.py

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix prepared fixes for both issues found in the latest run.

✅ Fixed: HTTP error path missing duration histogram recording
- Added duration_ms calculation and record_histogram call to the exception handler so error requests now contribute to latency metrics alongside the counter.
✅ Fixed: Noisy "disabled" counters fire on every cache call
- Removed the per-request 'disabled' counter emissions from both get_cached_json and set_cached_json when Redis client is unavailable, since this is a static configuration state not an actionable per-request signal.
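
The first fix amounts to making sure both metrics are emitted on every exit path. The sketch below shows one way to get that property; it uses a single finally block rather than the separate except and success branches in the diff, and the recorders, the observe function, and the two handlers are hypothetical stand-ins rather than the project's middleware.

```python
import asyncio
from time import perf_counter

# Hypothetical in-memory recorders standing in for the project's helpers.
counters, histograms = [], []


def record_counter(name, value, attributes):
    counters.append((name, value, dict(attributes)))


def record_histogram(name, value, attributes):
    histograms.append((name, value, dict(attributes)))


async def observe(method, route, handler):
    """Record one counter and one duration sample for success and failure."""
    started = perf_counter()
    status = "500"  # assumed until the handler returns a status
    try:
        status = str(await handler())
    finally:
        attributes = {
            "http.method": method,
            "http.route": route,
            "http.status_code": status,
        }
        duration_ms = (perf_counter() - started) * 1000
        record_counter("primer.http.requests", 1, attributes)
        record_histogram("primer.http.request.duration_ms", duration_ms, attributes)
    return status


async def ok():
    return 200


async def boom():
    raise RuntimeError("handler failed")
```

Because the exception still propagates after the finally block runs, callers see the same failure as before while the failed request contributes to both the counter and the latency histogram.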

Or push these changes by commenting:

@cursor push da972ef0a7

Preview (da972ef0a7)

diff --git a/src/primer/server/services/analytics_cache_service.py b/src/primer/server/services/analytics_cache_service.py
--- a/src/primer/server/services/analytics_cache_service.py
+++ b/src/primer/server/services/analytics_cache_service.py
@@ -74,11 +74,6 @@
 def get_cached_json(namespace: str, params: dict[str, Any]) -> Any | None:
     client = _get_redis_client()
     if client is None:
-        record_counter(
-            "primer.analytics_cache.requests",
-            1,
-            {"namespace": namespace, "result": "disabled"},
-        )
         return None
     try:
         payload = client.get(_build_cache_key(namespace, params))
@@ -123,11 +118,6 @@
 ) -> None:
     client = _get_redis_client()
     if client is None:
-        record_counter(
-            "primer.analytics_cache.writes",
-            1,
-            {"namespace": namespace, "result": "disabled"},
-        )
         return
     try:
         client.setex(

diff --git a/src/primer/server/services/observability_service.py b/src/primer/server/services/observability_service.py
--- a/src/primer/server/services/observability_service.py
+++ b/src/primer/server/services/observability_service.py
@@ -123,19 +123,19 @@
             try:
                 response = await call_next(request)
             except Exception as exc:
+                duration_ms = (perf_counter() - started) * 1000
                 route_path = _route_path(request)
+                attributes = {
+                    "http.method": request.method,
+                    "http.route": route_path,
+                    "http.status_code": "500",
+                }
                 if span is not None:
                     span.set_attribute("error.type", exc.__class__.__name__)
-                    span.set_attribute("http.route", route_path)
-                record_counter(
-                    "primer.http.requests",
-                    1,
-                    {
-                        "http.method": request.method,
-                        "http.route": route_path,
-                        "http.status_code": "500",
-                    },
-                )
+                    for key, value in attributes.items():
+                        span.set_attribute(key, value)
+                record_counter("primer.http.requests", 1, attributes)
+                record_histogram("primer.http.request.duration_ms", duration_ms, attributes)
                 raise
             duration_ms = (perf_counter() - started) * 1000
             route_path = _route_path(request)


Reviewed by Cursor Bugbot for commit 487f855.

src/primer/server/services/observability_service.py

src/primer/server/services/analytics_cache_service.py

feat: add opentelemetry observability

8ff49c4

cursor bot reviewed Apr 4, 2026

src/primer/server/services/analytics_cache_service.py (resolved)

src/primer/server/app.py (resolved)

src/primer/server/services/observability_service.py (outdated, resolved)

fix: tighten observability metrics

8e3d218

cursor bot reviewed Apr 4, 2026

src/primer/server/services/observability_service.py (outdated, resolved)

fix: tighten telemetry route and cache metrics

92a63e9

cursor bot reviewed Apr 4, 2026

src/primer/server/services/background_job_service.py (outdated, resolved)

fix: use monotonic job timing

487f855

cursor bot reviewed Apr 4, 2026

src/primer/server/services/observability_service.py (resolved)

src/primer/server/services/analytics_cache_service.py (outdated, resolved)

ccf added 2 commits April 4, 2026 10:46

Fix OTel error-path request duration metrics

2b7ae19

Reduce noisy disabled analytics cache metrics

4a87c0e

ccf merged commit b6a0f75 into main Apr 4, 2026
6 checks passed

ccf merged 6 commits into main from feat/opentelemetry-observability
