
75183 Monitoring Dashboard Design #7

Open
chihaiaalex wants to merge 1 commit into main from 75183-monitoring-dashboard-design

Conversation

@chihaiaalex

No description provided.


**Implementation:** Record the time between HTTP request start and response received in `GtfsRtMetricsExporter`, using a Micrometer `Timer` or a manual `DistributionSummary` over the elapsed time.
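A minimal sketch of the elapsed-time measurement, using only the JDK. The class name and the simulated fetch are placeholders; in `GtfsRtMetricsExporter` the recorded value would feed a Micrometer `Timer` rather than stdout:

```java
import java.time.Duration;
import java.time.Instant;

public class RequestTimingSketch {

    /** Run the given request and return its wall-clock duration in milliseconds. */
    static long timeRequestMs(Runnable request) {
        Instant start = Instant.now();
        request.run();
        return Duration.between(start, Instant.now()).toMillis();
    }

    public static void main(String[] args) {
        // Placeholder for the real GTFS-RT HTTP fetch.
        Runnable fakeFetch = () -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        long elapsedMs = timeRequestMs(fakeFetch);
        // With Micrometer this would be: timer.record(elapsedMs, TimeUnit.MILLISECONDS);
        System.out.println("gtfsrt_http_request_duration_ms=" + elapsedMs);
    }
}
```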

### 6.2Database Polling Metrics

Missing space in heading.


**Decision needed:** Instrument each microservice individually, or create a shared library in `transitdata-common`? Recommendation: add a metrics utility to `transitdata-common` and use it from each service. This keeps the Prometheus endpoint setup and metric definitions consistent.
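A hypothetical shape for such a shared utility, to make the recommendation concrete. The class and metric names are assumptions; a real `transitdata-common` implementation would more likely wrap Micrometer and expose a `PrometheusMeterRegistry`, while this JDK-only sketch just shows the shared-API idea:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class MetricsUtil {
    private static final Map<String, LongAdder> COUNTERS = new ConcurrentHashMap<>();

    /** Increment a named counter; each microservice calls this instead of rolling its own. */
    public static void increment(String name) {
        COUNTERS.computeIfAbsent(name, k -> new LongAdder()).increment();
    }

    /** Render all counters in the Prometheus text exposition format. */
    public static String scrape() {
        StringBuilder sb = new StringBuilder();
        COUNTERS.forEach((name, value) ->
                sb.append(name).append(' ').append(value.sum()).append('\n'));
        return sb.toString();
    }

    public static void main(String[] args) {
        increment("db_poll_rows_total");
        increment("db_poll_rows_total");
        System.out.print(scrape()); // prints: db_poll_rows_total 2
    }
}
```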

### 6.3End-to-End Latency Metrics

Missing space in heading.


**Implementation:** The Pulsar message publish timestamp or the embedded event timestamp from the protobuf message can serve as the start time. Compute `System.currentTimeMillis() - eventTimestamp` at the output processor.
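The computation itself is a one-liner; the only subtlety worth a sketch is handling negative deltas caused by clock skew between producer and consumer hosts. The clamp-at-zero behavior below is my assumption, not part of the proposal:

```java
public class LatencySketch {

    /**
     * End-to-end latency: now minus the event timestamp embedded in the message.
     * Clamped at zero because clock skew between hosts can make the delta negative.
     */
    static long latencyMs(long eventTimestampMs, long nowMs) {
        return Math.max(nowMs - eventTimestampMs, 0);
    }

    public static void main(String[] args) {
        long eventTimestampMs = System.currentTimeMillis() - 1500; // pretend the event is 1.5 s old
        System.out.println("e2e_latency_ms=" + latencyMs(eventTimestampMs, System.currentTimeMillis()));
    }
}
```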

### 6.4Logged Error Counts

Missing space in heading.

## 8. Open Questions

1. **Backlog thresholds:** What Pulsar backlog counts should trigger warnings? We need to observe the baseline during normal operations and set thresholds at, e.g., 2x-5x of normal.
2. **GTFS-RT feed age thresholds:** What's an acceptable timestamp age? 60s? 120s? This varies by feed type (service alerts update less frequently than vehicle positions).

For vehicle positions and trip updates I might alert if the age exceeds five or ten seconds. We also need to take the monitoring polling cycle and request delay into account.
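A feed-age check with per-feed-type thresholds might look like the sketch below. The threshold values follow the comment above and are assumptions, not decisions:

```java
import java.util.Map;

public class FeedAgeCheck {
    // Threshold assumptions: tight for vehicle positions and trip updates,
    // loose for service alerts, per the discussion; none of these is final.
    private static final Map<String, Long> MAX_AGE_SECONDS = Map.of(
            "vehicle-positions", 10L,
            "trip-updates", 10L,
            "service-alerts", 120L);

    /** True if the feed's header timestamp is older than its allowed age. */
    static boolean isStale(String feedType, long feedTimestampSec, long nowSec) {
        long maxAge = MAX_AGE_SECONDS.getOrDefault(feedType, 60L);
        return nowSec - feedTimestampSec > maxAge;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis() / 1000;
        System.out.println("stale=" + isStale("vehicle-positions", now - 30, now));
    }
}
```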


3. **Notification channels:** Where should alerts be sent? Slack channel? PagerDuty? Email?

Good point. Separate proposal.

4. **Pulsar topic completeness:** Several topic names need to be confirmed from the cluster -- EKE raw/deduplicated topics, cancellation internal topics, and service alert topics are referenced in the architecture but not present in the current Pulsar Overview dashboard.
5. **Log-based error metrics approach:** Option A (sidecar/promtail) vs Option B (Logback counter appender) -- which is preferred? Option A is less invasive but requires sidecar containers. Option B is simpler but requires touching every microservice.

I am quite uncertain about this, but if the sidecars are copy-pastable and can live in just the one deployment repo, that might affect the decision.
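For reference, the core of Option B is small. This sketch uses `java.util.logging` so it stays self-contained; the real version would extend Logback's `AppenderBase<ILoggingEvent>` and increment a Micrometer counter rather than a static field, and the metric name in the comment is an assumption:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class ErrorCountingHandler extends Handler {
    /** Would be exported to Prometheus as e.g. logged_errors_total. */
    public static final AtomicLong errorCount = new AtomicLong();

    @Override
    public void publish(LogRecord record) {
        // SEVERE is java.util.logging's counterpart of Logback's ERROR level.
        if (record.getLevel().intValue() >= Level.SEVERE.intValue()) {
            errorCount.incrementAndGet();
        }
    }

    @Override public void flush() {}
    @Override public void close() {}

    public static void main(String[] args) {
        Logger log = Logger.getLogger("demo");
        log.setUseParentHandlers(false); // keep console output quiet
        log.addHandler(new ErrorCountingHandler());
        log.severe("database connection lost");
        log.info("routine message");
        System.out.println("logged_errors_total=" + errorCount.get()); // prints logged_errors_total=1
    }
}
```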

6. **Recording rules:** Should we create Prometheus recording rules for expensive queries (e.g., baseline calculations for anomaly detection)?
7. **Dashboard access:** Should dashboards be public (no login) or restricted? Who needs edit access vs. view access?

Restricted, at the level described. Edit access for the team, view access for HSL employees and developers, I would say.

The product owner would also like to see a simplified public dashboard, but that can be a separate proposal and could be easy to distill from this more complex setup once it is underway.

Regarding recording rules (question 6): maybe this can be decided when implementing, after seeing how it works without them.

@haphut

haphut commented Mar 24, 2026

Should we have one copy of this dashboard design per environment, or a shared view of all of our environments, for example for the golden signals? I think I would very slightly prefer having one URL to access all environments, but I'm fine with the separation as well.

@haphut

haphut commented Mar 24, 2026

Either way, I would like the background color of the dashboards to reflect the environment. For example, the Colorbrewer 3-class RdYlBu palette, which is colorblind-safe, could work: #fc8d59 for production, #ffffbf for staging, and #91bfdb for development. That would help developers always know which data they are looking at.
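The proposed palette is concrete enough to pin down in code; how it would be applied (dashboard variable, background panel, theme override) is still open, so treat this as a sketch of the mapping only:

```java
import java.util.Map;

public class EnvironmentPalette {
    // Colorbrewer 3-class RdYlBu, as proposed in the review.
    private static final Map<String, String> ENV_COLORS = Map.of(
            "production", "#fc8d59",
            "staging", "#ffffbf",
            "development", "#91bfdb");

    /** Background color for a given environment; white for anything unknown. */
    static String colorFor(String environment) {
        return ENV_COLORS.getOrDefault(environment, "#ffffff");
    }

    public static void main(String[] args) {
        System.out.println("production -> " + colorFor("production")); // prints production -> #fc8d59
    }
}
```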

@haphut left a comment


Fantastic work! I do not have substantial changes to suggest.
