
75183 Monitoring Dashboard Design #7

Open
chihaiaalex wants to merge 1 commit into main from 75183-monitoring-dashboard-design

Conversation

@chihaiaalex

No description provided.


**Implementation:** Record the time between HTTP request start and response received in `GtfsRtMetricsExporter`, using a Micrometer `Timer` or a manual `DistributionSummary` over the elapsed time.
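A minimal sketch of the elapsed-time measurement, using only the JDK. The class name and the simulated fetch are placeholders; in `GtfsRtMetricsExporter` the recorded value would feed a Micrometer `Timer` rather than stdout:

```java
import java.time.Duration;
import java.time.Instant;

public class RequestTimingSketch {

    /** Run the given request and return its wall-clock duration in milliseconds. */
    static long timeRequestMs(Runnable request) {
        Instant start = Instant.now();
        request.run();
        return Duration.between(start, Instant.now()).toMillis();
    }

    public static void main(String[] args) {
        // Placeholder for the real GTFS-RT HTTP fetch.
        Runnable fakeFetch = () -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        long elapsedMs = timeRequestMs(fakeFetch);
        // With Micrometer this would be: timer.record(elapsedMs, TimeUnit.MILLISECONDS);
        System.out.println("gtfsrt_http_request_duration_ms=" + elapsedMs);
    }
}
```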

### 6.2Database Polling Metrics

Missing space in heading.


**Decision needed:** Instrument each microservice individually, or create a shared library in `transitdata-common`? Recommendation: add a metrics utility to `transitdata-common` and use it from each service. This keeps the Prometheus endpoint setup and metric definitions consistent.
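A hypothetical shape for such a shared utility, to make the recommendation concrete. The class and metric names are assumptions; a real `transitdata-common` implementation would more likely wrap Micrometer and expose a `PrometheusMeterRegistry`, while this JDK-only sketch just shows the shared-API idea:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class MetricsUtil {
    private static final Map<String, LongAdder> COUNTERS = new ConcurrentHashMap<>();

    /** Increment a named counter; each microservice calls this instead of rolling its own. */
    public static void increment(String name) {
        COUNTERS.computeIfAbsent(name, k -> new LongAdder()).increment();
    }

    /** Render all counters in the Prometheus text exposition format. */
    public static String scrape() {
        StringBuilder sb = new StringBuilder();
        COUNTERS.forEach((name, value) ->
                sb.append(name).append(' ').append(value.sum()).append('\n'));
        return sb.toString();
    }

    public static void main(String[] args) {
        increment("db_poll_rows_total");
        increment("db_poll_rows_total");
        System.out.print(scrape()); // prints: db_poll_rows_total 2
    }
}
```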

### 6.3End-to-End Latency Metrics

Missing space in heading.


**Implementation:** The Pulsar message publish timestamp or the embedded event timestamp from the protobuf message can serve as the start time. Compute `System.currentTimeMillis() - eventTimestamp` at the output processor.
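The computation itself is a one-liner; the only subtlety worth a sketch is handling negative deltas caused by clock skew between producer and consumer hosts. The clamp-at-zero behavior below is my assumption, not part of the proposal:

```java
public class LatencySketch {

    /**
     * End-to-end latency: now minus the event timestamp embedded in the message.
     * Clamped at zero because clock skew between hosts can make the delta negative.
     */
    static long latencyMs(long eventTimestampMs, long nowMs) {
        return Math.max(nowMs - eventTimestampMs, 0);
    }

    public static void main(String[] args) {
        long eventTimestampMs = System.currentTimeMillis() - 1500; // pretend the event is 1.5 s old
        System.out.println("e2e_latency_ms=" + latencyMs(eventTimestampMs, System.currentTimeMillis()));
    }
}
```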

### 6.4Logged Error Counts

Missing space in heading.

## 8. Open Questions

1. **Backlog thresholds:** What Pulsar backlog counts should trigger warnings? We need to observe the baseline during normal operations and set thresholds at, e.g., 2x-5x of normal.
2. **GTFS-RT feed age thresholds:** What's an acceptable timestamp age? 60s? 120s? This varies by feed type (service alerts update less frequently than vehicle positions).

For vehicle positions and trip updates I might alert if the age exceeds five or ten seconds. We also need to take the monitoring polling cycle and request delay into account.
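A feed-age check with per-feed-type thresholds might look like the sketch below. The threshold values follow the comment above and are assumptions, not decisions:

```java
import java.util.Map;

public class FeedAgeCheck {
    // Threshold assumptions: tight for vehicle positions and trip updates,
    // loose for service alerts, per the discussion; none of these is final.
    private static final Map<String, Long> MAX_AGE_SECONDS = Map.of(
            "vehicle-positions", 10L,
            "trip-updates", 10L,
            "service-alerts", 120L);

    /** True if the feed's header timestamp is older than its allowed age. */
    static boolean isStale(String feedType, long feedTimestampSec, long nowSec) {
        long maxAge = MAX_AGE_SECONDS.getOrDefault(feedType, 60L);
        return nowSec - feedTimestampSec > maxAge;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis() / 1000;
        System.out.println("stale=" + isStale("vehicle-positions", now - 30, now));
    }
}
```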


3. **Notification channels:** Where should alerts be sent? Slack channel? PagerDuty? Email?

Good point. Separate proposal.

4. **Pulsar topic completeness:** Several topic names need to be confirmed from the cluster -- EKE raw/deduplicated topics, cancellation internal topics, and service alert topics are referenced in the architecture but not present in the current Pulsar Overview dashboard.
5. **Log-based error metrics approach:** Option A (sidecar/promtail) vs Option B (Logback counter appender) -- which is preferred? Option A is less invasive but requires sidecar containers. Option B is simpler but requires touching every microservice.

I am quite uncertain about this, but if the sidecars are copy-pastable and can live in just the one deployment repo, that might affect the decision.
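For reference, the core of Option B is small. This sketch uses `java.util.logging` so it stays self-contained; the real version would extend Logback's `AppenderBase<ILoggingEvent>` and increment a Micrometer counter rather than a static field, and the metric name in the comment is an assumption:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class ErrorCountingHandler extends Handler {
    /** Would be exported to Prometheus as e.g. logged_errors_total. */
    public static final AtomicLong errorCount = new AtomicLong();

    @Override
    public void publish(LogRecord record) {
        // SEVERE is java.util.logging's counterpart of Logback's ERROR level.
        if (record.getLevel().intValue() >= Level.SEVERE.intValue()) {
            errorCount.incrementAndGet();
        }
    }

    @Override public void flush() {}
    @Override public void close() {}

    public static void main(String[] args) {
        Logger log = Logger.getLogger("demo");
        log.setUseParentHandlers(false); // keep console output quiet
        log.addHandler(new ErrorCountingHandler());
        log.severe("database connection lost");
        log.info("routine message");
        System.out.println("logged_errors_total=" + errorCount.get()); // prints logged_errors_total=1
    }
}
```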

6. **Recording rules:** Should we create Prometheus recording rules for expensive queries (e.g., baseline calculations for anomaly detection)?
7. **Dashboard access:** Should dashboards be public (no login) or restricted? Who needs edit access vs. view access?

Restricted, at the level described. Edit access for the team, view access for HSL employees and developers, I would say.

The product owner would also like to see a simplified public dashboard, but that can be a separate proposal and could be easy to distill from this more complex setup once it is underway.

Regarding recording rules (question 6): maybe this can be decided when implementing, after seeing how it works without them.

@haphut

haphut commented Mar 24, 2026

Should we have one copy of this dashboard design per environment, or a shared view of all of our environments, for example for the golden signals? I think I would very slightly prefer having one URL to access all environments, but I'm fine with the separation as well.

@haphut

haphut commented Mar 24, 2026

Either way, I would like the background color of the dashboards to reflect the environment. For example, the Colorbrewer 3-class RdYlBu palette, which is colorblind-safe, could work: #fc8d59 for production, #ffffbf for staging, and #91bfdb for development. That would help developers always know which data they are looking at.
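The proposed palette is concrete enough to pin down in code; how it would be applied (dashboard variable, background panel, theme override) is still open, so treat this as a sketch of the mapping only:

```java
import java.util.Map;

public class EnvironmentPalette {
    // Colorbrewer 3-class RdYlBu, as proposed in the review.
    private static final Map<String, String> ENV_COLORS = Map.of(
            "production", "#fc8d59",
            "staging", "#ffffbf",
            "development", "#91bfdb");

    /** Background color for a given environment; white for anything unknown. */
    static String colorFor(String environment) {
        return ENV_COLORS.getOrDefault(environment, "#ffffff");
    }

    public static void main(String[] args) {
        System.out.println("production -> " + colorFor("production")); // prints production -> #fc8d59
    }
}
```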

@haphut left a comment


Fantastic work! I do not have substantial changes to suggest.
