Skip to content

feat: add opentelemetry observability#168

Merged
ccf merged 6 commits intomainfrom
feat/opentelemetry-observability
Apr 4, 2026
Merged

feat: add opentelemetry observability#168
ccf merged 6 commits intomainfrom
feat/opentelemetry-observability

Conversation

@ccf
Copy link
Copy Markdown
Owner

@ccf ccf commented Apr 4, 2026

Summary

  • add an optional OpenTelemetry observability layer with request spans, log correlation, and OTLP/console export support
  • instrument analytics cache requests, background job execution, and the heaviest composite analytics services
  • truth up the roadmap to mark OpenTelemetry integration shipped

Testing

  • PYTHONPATH=/Users/ccf/git/primer/src:/Users/ccf/git/primer pytest --import-mode=importlib tests/test_observability_service.py tests/test_compare.py tests/test_project_workspace.py tests/test_engineer_profile.py tests/test_background_job_service.py -q
  • ruff check src/primer/common/config.py src/primer/server/app.py src/primer/server/services/observability_service.py src/primer/server/services/analytics_cache_service.py src/primer/server/services/background_job_service.py src/primer/server/services/compare_service.py src/primer/server/services/project_workspace_service.py src/primer/server/services/engineer_profile_service.py tests/test_observability_service.py tests/test_background_job_service.py tests/test_compare.py tests/test_project_workspace.py tests/test_engineer_profile.py
  • full pre-push backend/frontend hooks

Note

Medium Risk
Touches request handling middleware and background job execution paths to emit OpenTelemetry metrics/spans; misconfiguration or exporter issues could add overhead or noisy telemetry but is gated behind otel_enabled.

Overview
Adds an optional OpenTelemetry observability layer (new observability_service) with request-level spans, duration histograms/counters, and log correlation via injected otel_trace_id/otel_span_id, supporting OTLP HTTP export or console export.

Instruments key backend hotspots: analytics cache now records hit/miss/error/decode/write counters, background job cycles and per-job executions are wrapped in spans and emit result/duration metrics, and heavy analytics aggregations (compare, engineer_profile, project_workspace) are wrapped in top-level spans. Also updates config (otel_* settings), adds OpenTelemetry dependencies, and marks the roadmap item as shipped.

Reviewed by Cursor Bugbot for commit 4a87c0e. Bugbot is set up for automated code reviews on this repo. Configure here.

Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 3 potential issues.

Autofix Details

Bugbot Autofix prepared fixes for all 3 issues found in the latest run.

  • ✅ Fixed: Cache hit counter incremented before JSON parse succeeds
    • Moved the hit counter to after json.loads succeeds, so decode errors are no longer double-counted as both hits and decode_errors.
  • ✅ Fixed: Background job counter double-counts processed total
    • Removed the redundant result=processed counter since processed = succeeded + failed, so summing across all result attribute values now yields the correct total.
  • ✅ Fixed: Error path uses different attribute key than success path
    • Changed the error path to use http.route (derived from request.scope route with fallback) instead of http.target, matching the success path attribute key.

Create PR

Or push these changes by commenting:

@cursor push 04c7e26e03
Preview (04c7e26e03)
diff --git a/src/primer/server/app.py b/src/primer/server/app.py
--- a/src/primer/server/app.py
+++ b/src/primer/server/app.py
@@ -102,11 +102,6 @@
                 if result["processed"] > 0:
                     record_counter(
                         "primer.background_jobs.processed",
-                        result["processed"],
-                        {"result": "processed"},
-                    )
-                    record_counter(
-                        "primer.background_jobs.processed",
                         result["succeeded"],
                         {"result": "succeeded"},
                     )

diff --git a/src/primer/server/services/analytics_cache_service.py b/src/primer/server/services/analytics_cache_service.py
--- a/src/primer/server/services/analytics_cache_service.py
+++ b/src/primer/server/services/analytics_cache_service.py
@@ -98,12 +98,7 @@
         )
         return None
     try:
-        record_counter(
-            "primer.analytics_cache.requests",
-            1,
-            {"namespace": namespace, "result": "hit"},
-        )
-        return json.loads(payload)
+        parsed = json.loads(payload)
     except json.JSONDecodeError:
         record_counter(
             "primer.analytics_cache.requests",
@@ -111,6 +106,12 @@
             {"namespace": namespace, "result": "decode_error"},
         )
         return None
+    record_counter(
+        "primer.analytics_cache.requests",
+        1,
+        {"namespace": namespace, "result": "hit"},
+    )
+    return parsed
 
 
 def set_cached_json(

diff --git a/src/primer/server/services/observability_service.py b/src/primer/server/services/observability_service.py
--- a/src/primer/server/services/observability_service.py
+++ b/src/primer/server/services/observability_service.py
@@ -122,12 +122,14 @@
             except Exception as exc:
                 if span is not None:
                     span.set_attribute("error.type", exc.__class__.__name__)
+                route = request.scope.get("route")
+                route_path = getattr(route, "path", request.url.path)
                 record_counter(
                     "primer.http.requests",
                     1,
                     {
                         "http.method": request.method,
-                        "http.target": request.url.path,
+                        "http.route": route_path,
                         "http.status_code": "500",
                     },
                 )

You can send follow-ups to the cloud agent here.

Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Route path always resolves to raw URL, not template
    • Moved request.scope.get('route') after call_next() returns so the route template is available from the scope (populated during routing inside call_next), preventing unbounded cardinality on http.route metric labels.

Create PR

Or push these changes by commenting:

@cursor push 2ab3c5eb7f
Preview (2ab3c5eb7f)
diff --git a/src/primer/server/services/observability_service.py b/src/primer/server/services/observability_service.py
--- a/src/primer/server/services/observability_service.py
+++ b/src/primer/server/services/observability_service.py
@@ -109,20 +109,20 @@
 
     @app.middleware("http")
     async def otel_request_middleware(request, call_next):
-        route = request.scope.get("route")
-        route_path = getattr(route, "path", request.url.path)
         with start_span(
             "http.request",
             {
                 "http.method": request.method,
-                "http.route": route_path,
             },
         ) as span:
             started = perf_counter()
             try:
                 response = await call_next(request)
             except Exception as exc:
+                route = request.scope.get("route")
+                route_path = getattr(route, "path", request.url.path)
                 if span is not None:
+                    span.set_attribute("http.route", route_path)
                     span.set_attribute("error.type", exc.__class__.__name__)
                 record_counter(
                     "primer.http.requests",
@@ -134,6 +134,8 @@
                     },
                 )
                 raise
+            route = request.scope.get("route")
+            route_path = getattr(route, "path", request.url.path)
             duration_ms = (perf_counter() - started) * 1000
             attributes = {
                 "http.method": request.method,

You can send follow-ups to the cloud agent here.

Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Wall clock used for duration instead of monotonic timer
    • Replaced _utcnow_naive() wall clock with time.perf_counter() monotonic timer for job execution duration measurement, consistent with the rest of the codebase.

Create PR

Or push these changes by commenting:

@cursor push ffc711f342
Preview (ffc711f342)
diff --git a/src/primer/server/services/background_job_service.py b/src/primer/server/services/background_job_service.py
--- a/src/primer/server/services/background_job_service.py
+++ b/src/primer/server/services/background_job_service.py
@@ -1,5 +1,6 @@
 from __future__ import annotations
 
+import time
 from datetime import UTC, datetime, timedelta
 from typing import TYPE_CHECKING, Any
 
@@ -172,7 +173,7 @@
 
         processed += 1
         job_id, job_type, payload, attempts, max_attempts = claimed
-        job_started = _utcnow_naive()
+        job_started = time.perf_counter()
         with start_span(
             "background_job.execute",
             {
@@ -209,7 +210,7 @@
             finally:
                 record_histogram(
                     "primer.background_jobs.execution.duration_ms",
-                    (_utcnow_naive() - job_started).total_seconds() * 1000,
+                    (time.perf_counter() - job_started) * 1000,
                     {"job_type": job_type},
                 )

You can send follow-ups to the cloud agent here.

Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

Bugbot Autofix prepared fixes for both issues found in the latest run.

  • ✅ Fixed: HTTP error path missing duration histogram recording
    • Added duration_ms calculation and record_histogram call to the exception handler so error requests now contribute to latency metrics alongside the counter.
  • ✅ Fixed: Noisy "disabled" counters fire on every cache call
    • Removed the per-request 'disabled' counter emissions from both get_cached_json and set_cached_json when Redis client is unavailable, since this is a static configuration state not an actionable per-request signal.

Create PR

Or push these changes by commenting:

@cursor push da972ef0a7
Preview (da972ef0a7)
diff --git a/src/primer/server/services/analytics_cache_service.py b/src/primer/server/services/analytics_cache_service.py
--- a/src/primer/server/services/analytics_cache_service.py
+++ b/src/primer/server/services/analytics_cache_service.py
@@ -74,11 +74,6 @@
 def get_cached_json(namespace: str, params: dict[str, Any]) -> Any | None:
     client = _get_redis_client()
     if client is None:
-        record_counter(
-            "primer.analytics_cache.requests",
-            1,
-            {"namespace": namespace, "result": "disabled"},
-        )
         return None
     try:
         payload = client.get(_build_cache_key(namespace, params))
@@ -123,11 +118,6 @@
 ) -> None:
     client = _get_redis_client()
     if client is None:
-        record_counter(
-            "primer.analytics_cache.writes",
-            1,
-            {"namespace": namespace, "result": "disabled"},
-        )
         return
     try:
         client.setex(

diff --git a/src/primer/server/services/observability_service.py b/src/primer/server/services/observability_service.py
--- a/src/primer/server/services/observability_service.py
+++ b/src/primer/server/services/observability_service.py
@@ -123,19 +123,19 @@
             try:
                 response = await call_next(request)
             except Exception as exc:
+                duration_ms = (perf_counter() - started) * 1000
                 route_path = _route_path(request)
+                attributes = {
+                    "http.method": request.method,
+                    "http.route": route_path,
+                    "http.status_code": "500",
+                }
                 if span is not None:
                     span.set_attribute("error.type", exc.__class__.__name__)
-                    span.set_attribute("http.route", route_path)
-                record_counter(
-                    "primer.http.requests",
-                    1,
-                    {
-                        "http.method": request.method,
-                        "http.route": route_path,
-                        "http.status_code": "500",
-                    },
-                )
+                    for key, value in attributes.items():
+                        span.set_attribute(key, value)
+                record_counter("primer.http.requests", 1, attributes)
+                record_histogram("primer.http.request.duration_ms", duration_ms, attributes)
                 raise
             duration_ms = (perf_counter() - started) * 1000
             route_path = _route_path(request)

You can send follow-ups to the cloud agent here.

Reviewed by Cursor Bugbot for commit 487f855. Configure here.

@ccf ccf merged commit b6a0f75 into main Apr 4, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant