Conversation
**Implementation:** Record the time between HTTP request start and response received in `GtfsRtMetricsExporter`. Use a Micrometer `Timer`, or a manual `DistributionSummary` over the elapsed time.
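As a sketch of the timing described above, here is a minimal stdlib-only stand-in for Micrometer's `Timer`/`DistributionSummary`. The class name `FetchTimer` and the exposed values are illustrative, not from the actual `GtfsRtMetricsExporter`:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

// Manual elapsed-time summary: a hand-rolled stand-in for Micrometer's Timer.
public class FetchTimer {
    private final AtomicLong count = new AtomicLong();
    private final AtomicLong totalNanos = new AtomicLong();
    private final AtomicLong maxNanos = new AtomicLong();

    // Wraps the fetch and records wall-clock time even if it throws.
    public <T> T record(Supplier<T> fetch) {
        long start = System.nanoTime();
        try {
            return fetch.get();
        } finally {
            long elapsed = System.nanoTime() - start;
            count.incrementAndGet();
            totalNanos.addAndGet(elapsed);
            maxNanos.accumulateAndGet(elapsed, Math::max);
        }
    }

    // Values a Prometheus endpoint would expose as _count / _sum / _max.
    public long count() { return count.get(); }
    public double sumSeconds() { return totalNanos.get() / 1e9; }
    public double maxSeconds() { return maxNanos.get() / 1e9; }

    public static void main(String[] args) {
        FetchTimer timer = new FetchTimer();
        // The supplier stands in for the actual HTTP GET of the GTFS-RT feed.
        byte[] body = timer.record(() -> new byte[]{1, 2, 3});
        System.out.println(timer.count());  // 1
        System.out.println(body.length);    // 3
    }
}
```

With Micrometer, the equivalent is `Timer.builder(...).publishPercentileHistogram().register(registry)` and `timer.record(supplier)`, which additionally gives Prometheus histogram buckets for server-side quantiles.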
### 6.2 Database Polling Metrics
**Decision needed:** Instrument each microservice individually, or create a shared library in `transitdata-common`? Recommendation: add a metrics utility to `transitdata-common` and use it from each service. This keeps the Prometheus endpoint setup and metric definitions consistent.
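A rough sketch of what such a shared utility in `transitdata-common` could look like, using only the JDK's built-in HTTP server; the class name, metric names, and API are hypothetical:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical shared utility: one place that owns counter registration and
// the /metrics endpoint, so every service exposes identically named metrics.
public final class CommonMetrics {
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    public void increment(String name) {
        counters.computeIfAbsent(name, n -> new LongAdder()).increment();
    }

    // Render counters in the Prometheus text exposition format.
    public String scrape() {
        StringBuilder sb = new StringBuilder();
        counters.forEach((name, value) ->
                sb.append(name).append(' ').append(value.sum()).append('\n'));
        return sb.toString();
    }

    // Each service calls this once at startup to expose /metrics.
    public HttpServer serve(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/metrics", exchange -> {
            byte[] body = scrape().getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) { os.write(body); }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) {
        CommonMetrics metrics = new CommonMetrics();
        metrics.increment("db_poll_total");
        metrics.increment("db_poll_total");
        System.out.print(metrics.scrape()); // db_poll_total 2
    }
}
```

In practice this would wrap Micrometer or the Prometheus Java client rather than hand-rolling the exposition format; the point is only that endpoint setup and metric names live in one shared place.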
### 6.3 End-to-End Latency Metrics
**Implementation:** Use the Pulsar message publish timestamp or the embedded event timestamp from the protobuf message as the start time, and compute `System.currentTimeMillis() - eventTimestamp` at the output processor.
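The computation itself is a one-liner; the only subtlety worth encoding is clamping at zero, since clock skew between producer hosts can put the event timestamp slightly in the future. Names below are illustrative:

```java
public final class EndToEndLatency {

    // eventTimestampMillis comes from either the Pulsar publish timestamp or
    // the embedded protobuf event timestamp; nowMillis is taken at the
    // output processor.
    public static long latencyMillis(long eventTimestampMillis, long nowMillis) {
        // Clamp at zero: skewed producer clocks can make the event timestamp
        // appear to be in the future, which would otherwise record negative
        // latency.
        return Math.max(0L, nowMillis - eventTimestampMillis);
    }

    public static void main(String[] args) {
        System.out.println(latencyMillis(1_000L, 2_500L)); // 1500
        System.out.println(latencyMillis(3_000L, 2_500L)); // 0 (clock skew)
    }
}
```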
### 6.4 Logged Error Counts

## 8. Open Questions
1. **Backlog thresholds:** What Pulsar backlog counts should trigger warnings? We need to observe the baseline during normal operations and set thresholds at, e.g., 2x-5x the normal level.
2. **GTFS-RT feed age thresholds:** What is an acceptable timestamp age? 60 s? 120 s? This varies by feed type (service alerts update less frequently than vehicle positions).
> For vehicle positions and trip updates I might alert if the age exceeds five or ten seconds. We also need to account for the monitoring polling cycle and request delay.
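To make the threshold discussion concrete: GTFS-RT's `FeedHeader.timestamp` is POSIX seconds, so the age check reduces to a subtraction compared against a per-feed-type threshold. The thresholds below are the hypothetical values from this thread, not decided numbers:

```java
// Feed staleness check. FeedHeader.timestamp in GTFS-RT is POSIX seconds.
public final class FeedAgeCheck {

    public static long ageSeconds(long headerTimestampSeconds, long nowSeconds) {
        // Clamp at zero in case the producer's clock runs slightly ahead.
        return Math.max(0L, nowSeconds - headerTimestampSeconds);
    }

    public static boolean isStale(long ageSeconds, long thresholdSeconds) {
        return ageSeconds > thresholdSeconds;
    }

    public static void main(String[] args) {
        long now = 1_700_000_120L;
        long headerTs = 1_700_000_000L;        // feed generated 120 s earlier
        long age = ageSeconds(headerTs, now);  // 120

        // Hypothetical thresholds: tight for vehicle positions, loose for
        // service alerts, per the discussion above.
        System.out.println(isStale(age, 10));  // true  (vehicle positions, 10 s)
        System.out.println(isStale(age, 300)); // false (service alerts, 300 s)
    }
}
```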
3. **Notification channels:** Where should alerts be sent? A Slack channel? PagerDuty? Email?
> Good point. Let's handle that in a separate proposal.
4. **Pulsar topic completeness:** Several topic names need to be confirmed from the cluster: the EKE raw/deduplicated topics, cancellation internal topics, and service alert topics are referenced in the architecture but not present in the current Pulsar Overview dashboard.
5. **Log-based error metrics approach:** Option A (sidecar/promtail) vs. Option B (Logback counter appender): which is preferred? Option A is less invasive but requires sidecar containers; Option B is simpler but requires touching every microservice.
> I am quite uncertain about this, but if the sidecars are copy-pastable and can be defined in just the one deployment repo, that might affect the decision.
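For a sense of Option B's scale: it amounts to a counting log appender plus exposing the counter on the service's metrics endpoint. A Logback `AppenderBase` subclass would follow the same shape as this stdlib `java.util.logging` sketch (class name and wiring are illustrative):

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Counts error-level log events; the counter would be exposed via the
// service's Prometheus endpoint. With Logback, the equivalent is a small
// AppenderBase<ILoggingEvent> subclass checking event.getLevel().
public final class ErrorCountingHandler extends Handler {
    private final AtomicLong errorCount = new AtomicLong();

    @Override
    public void publish(LogRecord record) {
        if (record.getLevel().intValue() >= Level.SEVERE.intValue()) {
            errorCount.incrementAndGet();
        }
    }

    @Override public void flush() {}
    @Override public void close() {}

    public long errorCount() { return errorCount.get(); }

    public static void main(String[] args) {
        Logger logger = Logger.getLogger("demo");
        logger.setUseParentHandlers(false);
        ErrorCountingHandler handler = new ErrorCountingHandler();
        logger.addHandler(handler);

        logger.info("normal operation");   // not counted
        logger.severe("something failed"); // counted
        System.out.println(handler.errorCount()); // 1
    }
}
```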
6. **Recording rules:** Should we create Prometheus recording rules for expensive queries (e.g., the baseline calculations for anomaly detection)?
7. **Dashboard access:** Should dashboards be public (no login) or restricted? Who needs edit access vs. view access?
> Restricted, at the level described: edit access for the team, view access for HSL employees and developers, I would say.
> The product owner would also like a simplified public dashboard, but that can be a separate proposal; it should be easy to distill from this more complex setup once it is underway.
> On the recording rules: maybe this can be decided during implementation, after seeing how things work without them.
> Should we have one copy of this dashboard design per environment, or a shared view of all of our environments, for example in the golden signals? I would very slightly prefer having one URL to access all environments, but I'm fine with the separation as well.
> Either way, I would like the background color of the dashboards to reflect the environment. For example, the ColorBrewer 3-class RdYlBu palette, which is colorblind safe, could work so that
**haphut** left a comment:

> Fantastic work! I do not have substantial changes to suggest.