Update histogram best practices and metric types documentation for native histograms #2868
Conversation
85fcce3 to
6ec3723
Compare
> histograms (currently this is the case for Go and Java), you should probably
> prefer native histograms over classic histograms.
I think instead of saying "probably" we should point to the spec where it discusses the pros and cons, although I can't seem to find that bit :( What do you think about adding something like https://grafana.com/docs/mimir/latest/send/native-histograms/_exponential_buckets/#advantages-and-disadvantages to the spec?
I can take this as a follow-up if you'd like?
> exception of native histograms, these are currently only differentiated in the
> instrumentation libraries (to enable APIs tailored to the usage of the specific
> types) and in the exposition protocols. The Prometheus server does not yet make
I feel that the parenthesis makes this sentence too long and a bit hard to understand. Let's simplify to something like:
Original:
> exception of native histograms, these are currently only differentiated in the
> instrumentation libraries (to enable APIs tailored to the usage of the specific
> types) and in the exposition protocols. The Prometheus server does not yet make

Suggested:
> exception of native histograms, these are currently only differentiated in the
> API of instrumentation libraries and in the exposition protocols.
> The Prometheus server does not yet make
> @@ -51,37 +57,78 @@ Client library usage documentation for gauges:
> A _histogram_ samples observations (usually things like request durations or
Hmm, "samples observations" sounds like we don't take all observations into account – too similar to sampling traces. So maybe say "measures".
> `<basename>_count` (identical to `<basename>_bucket{le="+Inf"}` above)
> Native histograms are generally much more efficient than classic histograms,
> allow much higher resolution, and do not require explicit configuration of
Maybe mention atomic transfer over network? Here and for NHCBs as well?
> boundary provided as a label. With native histograms, use the
> [`histogram_fraction()`
> function](/docs/prometheus/latest/querying/functions/#histogram_fraction) to
> calculate fractions of observations within given boundaries.
Mention trim as a new (experimental?) way of doing the same.
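For readers of this thread, a sketch of the difference the quoted passage describes, using the `http_request_duration_seconds` metric name from the page (the 0.3s boundary is illustrative):

```promql
# Classic histogram: the fraction of requests served within 300ms over the
# last 5m only works if a bucket with boundary le="0.3" was configured.
  sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
/
  sum(rate(http_request_duration_seconds_count[5m]))

# Native histogram: histogram_fraction() estimates the fraction of
# observations between any two boundaries, no pre-configured buckets needed.
histogram_fraction(0, 0.3, sum(rate(http_request_duration_seconds[5m])))
```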
> To calculate the average request duration during the last 5 minutes
> from a histogram or summary called `http_request_duration_seconds`,
> use the following expression:

> Histograms and summaries both sample observations, typically request durations
I'd prefer "measure" over "sample".
>     histogram_sum(rate(http_request_duration_seconds[5m]))
>   /
>     histogram_count(rate(http_request_duration_seconds[5m]))
Mention the shorthand for this use case: `histogram_avg`.
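The shorthand referred to here, next to the spelled-out form from the quoted diff (metric name as used throughout the page; whether `histogram_avg()` is available depends on the Prometheus version):

```promql
# Average request duration over the last 5m, spelled out:
  histogram_sum(rate(http_request_duration_seconds[5m]))
/
  histogram_count(rate(http_request_duration_seconds[5m]))

# The same, using the histogram_avg() shorthand:
histogram_avg(rate(http_request_duration_seconds[5m]))
```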
> | Required configuration during instrumentation | Pick a desired resolution and maybe a strategy to limit the bucket count. | Pick buckets suitable for the expected range of observed values and the desired queries. | Pick desired φ-quantiles and sliding window. Other φ-quantiles and sliding windows cannot be calculated later. |
> | Instrumentation cost | Observations are cheap as they only need to increment counters. | Observations are cheap as they only need to increment counters. | Observations are relatively expensive due to the streaming quantile calculation. |
> | Query performance | The server has to calculate quantiles from complex histogram samples. You can use [recording rules](/docs/prometheus/latest/configuration/recording_rules/#recording-rules) should the ad-hoc calculation take too long (e.g. in a large dashboard). | The server has to calculate quantiles from a large number of bucket series. You can use [recording rules](/docs/prometheus/latest/configuration/recording_rules/#recording-rules) should the ad-hoc calculation take too long (e.g. in a large dashboard). | Fast (no quantile calculations on the server, and aggregations are impossible anyway, see below). |
> | Number of time series per histogram/summary | One (with a composite sample type). | `_sum`, `_count`, one per configured bucket. | `_sum`, `_count`, one per configured quantile. |
Original:
> | Number of time series per histogram/summary | One (with a composite sample type). | `_sum`, `_count`, one per configured bucket. | `_sum`, `_count`, one per configured quantile. |

Suggested:
> | Number of time series per histogram/summary | One (with a composite sample type). | `_sum`, `_count`, and one for each configured bucket. | `_sum`, `_count`, and one for each configured quantile. |
Fixes #2803.
With this update, the best practices page about histograms and summaries and the concepts page about metric types finally take native histograms into account.
While working on this, it occurred to me that the best practices page about histograms and summaries would probably benefit from a more fundamental rewrite, based on the user experience of the last decade. Some of the focal points of the document seem outdated from today's perspective, while other topics might be missing. (I find the focus on the Apdex score a bit weird by now, and the detailed error analysis might not appeal to a broad audience either…) However, a complete rewrite would have taken a lot of time, and I did not want to let our users wait even longer. So I went for this incremental update – which shouldn't prevent anybody from doing a thorough rewrite in the future.
Note that I used this opportunity to replace the term "client library" with "instrumentation library". I always thought that "client library" is confusing as it is not implementing a client in any way. (Technically, it implements a server, of which the Prometheus "server" is the client… 🤯) Even if we accept that "Prometheus client library" just means "a library to do something that has to do with Prometheus", the title "client library" still doesn't tell us what the library is actually for. (Note that the client_golang repository not only contains an instrumentation library, but also includes an actual client library that helps you to implement clients that talk to the Prometheus HTTP API.)