feat(telemetry): export per-status serve_shape request counter #4500
erik-the-implementer wants to merge 5 commits
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
❌ 107 Tests Failed; 16 ❄️ flaky test(s) reported.
Claude Code Review

Summary
Adds an unsampled, per-status serve_shape request counter.

Issues Found
Critical (Must Fix): None.
Important (Should Fix): None.

Issue Conformance
No linked issue, which is a minor warning per project convention; consider linking the dashboard/alerting work this metric enables. The PR description is thorough and now accurately matches the merged code. The mid-stream re-raise gap ([…]

Previous Review Status
Nothing actionable remains. The PR is ready to merge.

Review iteration: 3 | 2026-06-04
What
Adds a new unsampled, per-request metric from the shape-serving path: `electric.plug.serve_shape.requests.count`, tagged by `status`, `known_error`, and `live`.

It hangs off the existing `[:electric, :plug, :serve_shape]` telemetry event (no new emission point) and threads a `known_error` flag into that event's metadata, read straight off the `electric-internal-known-error` response header.

Why
We sample shape-request spans aggressively (head sampling + tail overrides) to stay within our tracing event budget. That makes the span dataset great for drill-down but unreliable as a health signal: requests flagged `known_error` are intentionally excluded from span export, so a request plot built from spans can look perfectly healthy while the system is actually shedding load under overload.

Admission rejection counts already exist as a metric (`electric.admission_control.reject.count`), but there was no general, status-dimensioned request counter. The existing `serve_shape` metrics also (a) drop live/long-poll requests and (b) aren't tagged by response status, so they can't express request mix or error rate.

This counter fills that gap:
- It includes live requests, unlike the existing `serve_shape.*` metrics, which `keep: live != true`. This is cheap to do as an aggregated metric (the reason we couldn't do it for spans doesn't apply).
- Admission rejections are counted as `status=503, known_error=true`. `check_admission` halts the conn, but a halt isn't a raise, so the halted conn still flows through `emit_shape_telemetry/1` and gets counted.
- `known_error` matches the wire signal: it's derived from the `electric-internal-known-error` response header, the same byte downstream consumers (e.g. the edge worker's tracing) key on, so the classification is consistent end to end.

Intended use: this becomes the authoritative "requests by status / error rate" dashboard panel and alert source, leaving the sampled span dataset for exemplar drill-down.
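To make the tagging concrete, here is a minimal sketch of deriving `known_error` from the response headers and counting requests per tag combination. It is illustrative only, not the code in this PR: the module name, the function names, and the assumption that the header carries the string `"true"` are all hypothetical, and `Enum.frequencies_by/2` merely stands in for the real counter.

```elixir
defmodule ServeShapeTagsSketch do
  # Header name taken from the PR description.
  @known_error_header "electric-internal-known-error"

  # ASSUMPTION: the header carries the string "true". The real expected
  # values live next to the code that sets the header.
  @known_error_values ["true"]

  # `resp_headers` has the same shape as the `resp_headers` field of
  # Plug.Conn: a list of {lowercased_name, value} tuples.
  def known_error?(resp_headers) do
    case List.keyfind(resp_headers, @known_error_header, 0) do
      {_name, value} -> value in @known_error_values
      nil -> false
    end
  end

  # Builds the three tag values the counter is keyed on.
  def tags(status, resp_headers, live?) do
    %{status: status, known_error: known_error?(resp_headers), live: live?}
  end

  # Stand-in for the counter: one tally per distinct tag combination.
  def count_by_tags(requests) do
    Enum.frequencies_by(requests, fn {status, headers, live?} ->
      tags(status, headers, live?)
    end)
  end
end
```

Under these assumptions, a rejected request such as `{503, [{"electric-internal-known-error", "true"}], false}` lands in the `status: 503, known_error: true, live: false` bucket, while live requests get their own `live: true` buckets instead of being dropped.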
Changes
- `electric-telemetry`: define `counter("electric.plug.serve_shape.requests.count", tags: [:status, :known_error, :live])` against the existing event (explicit `event_name`/`measurement` to avoid colliding with the existing `serve_shape.count`).
- `sync-service`: add `known_error` to the `serve_shape` event metadata, derived from the `electric-internal-known-error` response header. The header check lives next to the code that sets it so the expected values stay single-sourced.

Cardinality
Bounded: `status` (~6 codes) × `known_error` (2) × `live` (2), i.e. roughly 24 series per stack.

Notes
- A mid-stream re-raise (`Plug.Conn.chunk/2` raising after the response is committed) doesn't emit the `serve_shape` event, so those requests aren't counted here. This is the same limitation that already affects every `serve_shape.*` metric.
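
For reference, the counter definition described under Changes could look roughly like the sketch below, written against the `Telemetry.Metrics` library. The module name is hypothetical and the `measurement` key is an assumption, since the description does not state which measurement the PR actually uses.

```elixir
# Fetches the dependency when run as a script (needs network access).
Mix.install([{:telemetry_metrics, "~> 0.6 or ~> 1.0"}])

defmodule ServeShapeMetricsSketch do
  import Telemetry.Metrics, only: [counter: 2]

  def metrics do
    [
      counter("electric.plug.serve_shape.requests.count",
        # Without this option the event name would be derived from the
        # metric name as [:electric, :plug, :serve_shape, :requests],
        # which is not the event the plug emits.
        event_name: [:electric, :plug, :serve_shape],
        # ASSUMPTION: illustrative measurement key; the PR description
        # does not say which one is used.
        measurement: :count,
        # One series per distinct combination of these metadata keys.
        tags: [:status, :known_error, :live]
      )
    ]
  end
end
```

Because the metric name and the event name are set independently here, the new counter can share the existing event with the other `serve_shape.*` metrics without taking over their names.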