Instances can be configured to emit telemetry to aid in performance testing or troubleshooting performance-related issues.
Setting ServiceControl/PrintMetrics to true will print metrics to the logs at INFO level.
Set ServiceControl.Audit/OtlpEndpointUrl to a valid OTLP endpoint URL. Only gRPC endpoints are currently supported.
It's recommended to run a local OpenTelemetry (OTEL) Collector to collect, batch and export the metrics to the observability backend in use.
Example configuration: https://github.com/andreasohlund/Docker/tree/main/otel-monitoring
The following ingestion metrics with their corresponding dimensions are available:
- sc.audit.ingestion.batch_duration_seconds - Message batch processing duration in seconds
  - result - Indicates if the full batch size was used (batch size == max concurrency of the transport): full, partial or failed
- sc.audit.ingestion.message_duration_seconds - Audit message processing duration in seconds
  - message.category - Indicates the category of the message ingested: audit-message, saga-update or control-message
  - result - Indicates the outcome of the operation: success, failed or skipped (if the message was filtered out and skipped)
- sc.audit.ingestion.failures_total - Failure counter
  - message.category - Indicates the category of the message ingested: audit-message, saga-update or control-message
  - result - Indicates how the failure was resolved: retry or stored-poision
- sc.audit.ingestion.consecutive_batch_failure_total - Consecutive batch failures
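The PromQL examples in this document refer to these metrics with underscores and, for the duration metrics, with `_count` and `_bucket` suffixes. The sketch below shows the usual mapping, assuming a Prometheus-compatible backend that replaces dots with underscores and exposes histograms as `_bucket`, `_sum` and `_count` series. The helper name `prometheus_series_names` is hypothetical, and the exact series names can differ depending on how the Collector or exporter is configured.

```python
def prometheus_series_names(otel_name: str, histogram: bool) -> list[str]:
    """Return the series names a Prometheus-compatible backend would
    typically expose for an OpenTelemetry metric name (assumed mapping)."""
    # Prometheus metric names cannot contain dots, so they become underscores.
    base = otel_name.replace(".", "_")
    if histogram:
        # Histograms are split into bucket, sum and count series.
        return [f"{base}_bucket", f"{base}_sum", f"{base}_count"]
    # Counters and gauges map to a single series.
    return [base]


if __name__ == "__main__":
    for name in prometheus_series_names(
        "sc.audit.ingestion.message_duration_seconds", histogram=True
    ):
        print(name)
    print(prometheus_series_names("sc.audit.ingestion.failures_total", histogram=False)[0])
```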
Example queries in PromQL for use in Grafana:
- Ingestion rate:
  sum (rate(sc_audit_ingestion_message_duration_seconds_count[$__rate_interval])) by (exported_job)
- Failure rate:
  sum(rate(sc_audit_ingestion_failures_total[$__rate_interval])) by (exported_job,result)
- Message duration:
  histogram_quantile(0.9,sum(rate(sc_audit_ingestion_message_duration_seconds_bucket[$__rate_interval])) by (le,exported_job))
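`$__rate_interval` is a Grafana variable, so these queries need a concrete range when run outside Grafana. The following is a minimal sketch of building a request URL for the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The base URL `http://localhost:9090`, the `5m` range and the helper name `build_query_url` are assumptions for illustration, not part of the ServiceControl configuration.

```python
from urllib.parse import urlencode

# Ingestion rate query from this document, still containing the Grafana variable.
INGESTION_RATE = (
    "sum (rate(sc_audit_ingestion_message_duration_seconds_count"
    "[$__rate_interval])) by (exported_job)"
)


def build_query_url(base_url: str, promql: str, rate_interval: str = "5m") -> str:
    """Build a Prometheus instant-query URL for a Grafana-style PromQL expression."""
    # Replace the Grafana-only variable with a fixed range (assumed value).
    expr = promql.replace("$__rate_interval", rate_interval)
    # urlencode percent-encodes brackets, parentheses and spaces in the expression.
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


if __name__ == "__main__":
    # Assumed local Prometheus address; adjust to the backend in use.
    print(build_query_url("http://localhost:9090", INGESTION_RATE))
```

The resulting URL can be fetched with any HTTP client. The other queries in the list work the same way.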
No telemetry is currently available.