Update histogram best practices and metric types documentation for native histograms #2868

Open
beorn7 wants to merge 2 commits into main from beorn7/histogram2

@beorn7 beorn7 commented Mar 5, 2026

Fixes #2803.

With this update, the best practices page about histograms and summaries and the concepts page about metric types finally take native histograms into account.

While working on this, it occurred to me that the best practices page about histograms and summaries would probably benefit from a more fundamental rewrite, based on the user experience of the last decade. Some of the focal points of the document seem outdated from today's perspective, while other topics might be missing. (I find the focus on Apdex score a bit weird by now, and also the detailed error analysis might not appeal to a broad audience…) However, a complete rewrite would have taken a lot of time, and I did not want to make our users wait even longer. So I went for this incremental update – which shouldn't prevent anybody from doing a thorough rewrite in the future.

Note that I used this opportunity to replace the term "client library" with "instrumentation library". I always thought that "client library" is confusing as it is not implementing a client in any way. (Technically, it implements a server, of which the Prometheus "server" is the client… 🤯) Even if we accept that "Prometheus client library" just means "a library to do something that has to do with Prometheus", the title "client library" still doesn't tell us what the library is actually for. (Note that the client_golang repository not only contains an instrumentation library, but also includes an actual client library that helps you to implement clients that talk to the Prometheus HTTP API.)

@beorn7 beorn7 force-pushed the beorn7/histogram2 branch 3 times, most recently from 8ab93fc to 89b7d47 Compare March 10, 2026 18:10
@beorn7 beorn7 changed the title Revamp histogram and summaries best practices Update histogram best practices and metric types documentation for native histograms Mar 10, 2026
@beorn7 beorn7 force-pushed the beorn7/histogram2 branch 2 times, most recently from 84ebcab to 85fcce3 Compare March 10, 2026 18:24
beorn7 added 2 commits March 10, 2026 20:34
With this update, the best practices document about histogram and
summaries finally takes native histograms into account.

Signed-off-by: beorn7 <beorn@grafana.com>
Note that I used this opportunity to replace the term "client library"
with "instrumentation library". I always thought that "client library"
is confusing as it is not implementing a client in any way.
(Technically, it implements a _server_, of which the Prometheus
"server" is the client… 🤯) Even if we accept that
"Prometheus client library" just means "a library to do something that
has to do with Prometheus", the title "client library" still doesn't
tell us what the library is actually for. (Note that the client_golang
repository not only contains an instrumentation library, but also
includes an _actual_ client library that helps you to implement
clients that talk to the Prometheus HTTP API.)

Signed-off-by: beorn7 <beorn@grafana.com>
@beorn7 beorn7 force-pushed the beorn7/histogram2 branch from 85fcce3 to 6ec3723 Compare March 10, 2026 19:35
@beorn7 beorn7 marked this pull request as ready for review March 10, 2026 19:35
@beorn7 beorn7 requested a review from krajorama March 10, 2026 19:35

@krajorama krajorama left a comment

First pass. Looking good.

Comment on lines +94 to +95
histograms (currently this is the case for Go and Java), you should probably
prefer native histograms over classic histograms.

I think instead of saying "probably" we should point to the spec where it talks about pros and cons, although I don't seem to find that bit :( What do you think about adding something like https://grafana.com/docs/mimir/latest/send/native-histograms/_exponential_buckets/#advantages-and-disadvantages to the spec?

I can take this as a follow up if you'd like?

Comment on lines +7 to +9
exception of native histograms, these are currently only differentiated in the
instrumentation libraries (to enable APIs tailored to the usage of the specific
types) and in the exposition protocols. The Prometheus server does not yet make

I feel that the parenthetical makes this sentence too long and a bit hard to understand. Let's simplify it to something like:

Suggested change
exception of native histograms, these are currently only differentiated in the
instrumentation libraries (to enable APIs tailored to the usage of the specific
types) and in the exposition protocols. The Prometheus server does not yet make
exception of native histograms, these are currently only differentiated in the
API of instrumentation libraries and in the exposition protocols.
The Prometheus server does not yet make

@@ -51,37 +57,78 @@ Client library usage documentation for gauges:

A _histogram_ samples observations (usually things like request durations or

Hmm, "samples observations" sounds like we don't take all observations into account. It is too similar to sampling traces. So maybe say "measures".

`<basename>_count` (identical to `<basename>_bucket{le="+Inf"}` above)

Native histograms are generally much more efficient than classic histograms,
allow much higher resolution, and do not require explicit configuration of

Maybe mention atomic transfer over the network? Here and for NHCBs as well?

boundary provided as a label. With native histograms, use the
[`histogram_fraction()`
function](/docs/prometheus/latest/querying/functions/#histogram_fraction) to
calculate fractions of observations within given boundaries.

Mention trim as a new (experimental?) way of doing the same.
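
For reference in this thread, a possible sketch of the `histogram_fraction()` usage that the quoted lines describe. The metric name is borrowed from the averaging example quoted further down in this review; the 0.3 s upper boundary and the 5m range are illustrative assumptions, not values taken from the document:

```promql
# Fraction of requests served within 300ms over the last 5 minutes,
# computed from a native histogram (boundary and window are assumed).
histogram_fraction(0, 0.3, rate(http_request_duration_seconds[5m]))
```

Unlike the classic-histogram approach, the boundary here does not have to coincide with a configured bucket boundary; values that fall between bucket boundaries are estimated by interpolation.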

To calculate the average request duration during the last 5 minutes
from a histogram or summary called `http_request_duration_seconds`,
use the following expression:
Histograms and summaries both sample observations, typically request durations

I'd prefer "measure" over "sample".


histogram_sum(rate(http_request_duration_seconds[5m]))
/
histogram_count(rate(http_request_duration_seconds[5m]))

Mention the shorthand for this use case: `histogram_avg`.
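
For reference, the shorthand would reduce the quoted expression to a single call. This is a sketch that assumes `http_request_duration_seconds` is a native histogram and that `histogram_avg()` is available in the Prometheus version in use; the classic variant is shown only for comparison:

```promql
# Average request duration over the last 5 minutes (native histogram).
histogram_avg(rate(http_request_duration_seconds[5m]))

# Classic histogram or summary: divide the rates of the _sum and _count series.
  rate(http_request_duration_seconds_sum[5m])
/
  rate(http_request_duration_seconds_count[5m])
```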

| Required configuration during instrumentation | Pick a desired resolution and maybe a strategy to limit the bucket count. | Pick buckets suitable for the expected range of observed values and the desired queries. | Pick desired φ-quantiles and sliding window. Other φ-quantiles and sliding windows cannot be calculated later.
| Instrumentation cost | Observations are cheap as they only need to increment counters. | Observations are cheap as they only need to increment counters. | Observations are relatively expensive due to the streaming quantile calculation.
| Query performance | The server has to calculate quantiles from complex histogram samples. You can use [recording rules](/docs/prometheus/latest/configuration/recording_rules/#recording-rules) should the ad-hoc calculation take too long (e.g. in a large dashboard). | The server has to calculate quantiles from a large number of bucket series. You can use [recording rules](/docs/prometheus/latest/configuration/recording_rules/#recording-rules) should the ad-hoc calculation take too long (e.g. in a large dashboard). | Fast (no quantile calculations on the server, and aggregations are impossible anyway, see below).
| Number of time series per histogram/summary | One (with a composite sample type). | `_sum`, `_count`, one per configured bucket. | `_sum`, `_count`, one per configured quantile.

Suggested change
| Number of time series per histogram/summary | One (with a composite sample type). | `_sum`, `_count`, one per configured bucket. | `_sum`, `_count`, one per configured quantile.
| Number of time series per histogram/summary | One (with a composite sample type). | `_sum`, `_count`, and one for each configured bucket. | `_sum`, `_count`, and one for each configured quantile.
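
To illustrate what the "Query performance" and "Number of time series" rows mean for queries, here is a sketch of the same quantile query against both histogram flavors. The 0.95 quantile and the 5m range are illustrative assumptions:

```promql
# Native histogram: one series with a composite sample, no le label involved.
histogram_quantile(0.95, sum(rate(http_request_duration_seconds[5m])))

# Classic histogram: one series per bucket, so the aggregation must keep le.
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```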

Development

Successfully merging this pull request may close these issues.

Metric types doc section needs native histograms update