1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -36,6 +36,7 @@
* [ENHANCEMENT] Distributor: Add `WrappedHistogram` with configurable size limit (`-validation.max-native-histogram-size-bytes`) to cap native histogram protobuf size before unmarshalling. #7570
* [ENHANCEMENT] Ingester: Add lazy regex evaluation on head postings cache miss. Defers expensive regex matchers on high-cardinality labels to per-series filtering when a selective equality matcher already narrows the result set. Configured via `-blocks-storage.expanded_postings_cache.head.lazy-matcher-max-cardinality` (disabled by default). #7553
* [ENHANCEMENT] Ring: Add ring metric to count number of duplicate tokens. #7626
* [ENHANCEMENT] Metrics: Add native histogram support to all remaining production histograms, enabling dual-format (classic + native) exposition across all Cortex components.

Member

Thanks, can you add a PR number?

* [BUGFIX] Querier: Fix queryWithRetry and labelsWithRetry returning (nil, nil) on cancelled context by propagating ctx.Err(). #7370
* [BUGFIX] Metrics Helper: Fix non-deterministic bucket order in merged histograms by sorting buckets after map iteration, matching Prometheus client library behavior. #7380
* [BUGFIX] Distributor: Return HTTP 401 Unauthorized when tenant ID resolution fails in the Prometheus Remote Write 2.0 path. #7389
93 changes: 93 additions & 0 deletions docs/proposals/native-histograms.md
@@ -0,0 +1,93 @@
# Native Histogram Support for All Production Metrics

## Problem

Cortex exposes ~56 histogram metrics across its components. Native histograms (introduced as an experimental feature in Prometheus 2.40) offer significant advantages over classic histograms:

1. **Higher resolution** — Exponential bucket boundaries adapt to observed values, eliminating the need to pre-define bucket ranges.
2. **Lower storage cost** — Native histograms typically use fewer time series than classic histograms with many buckets.
3. **Better aggregation** — Native histograms can be merged across instances without information loss from mismatched bucket boundaries.
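
To make points 1 and 3 concrete, here is a minimal sketch of how exponential buckets can be indexed and merged. It is an illustration only, not the client library's implementation: the helper names (`bucketIndex`, `upperBound`, `observe`, `merge`) are hypothetical, the logarithm-based index can misplace values that sit exactly on a non-power-of-two boundary, and the merge assumes both histograms share one schema.

```go
package main

import (
	"fmt"
	"math"
)

// bucketIndex returns the index of the exponential bucket holding a positive
// value v. Bucket i covers (base^(i-1), base^i] with base = 2^(2^-schema).
func bucketIndex(v float64, schema int) int {
	return int(math.Ceil(math.Log2(v) * math.Exp2(float64(schema))))
}

// upperBound returns the inclusive upper boundary of bucket idx.
func upperBound(idx, schema int) float64 {
	return math.Exp2(float64(idx) * math.Exp2(float64(-schema)))
}

// observe records values into a sparse histogram keyed by bucket index.
func observe(h map[int]uint64, schema int, values ...float64) {
	for _, v := range values {
		h[bucketIndex(v, schema)]++
	}
}

// merge adds the bucket counts of a and b. Because both use the same schema,
// the boundaries line up and no interpolation is needed.
func merge(a, b map[int]uint64) map[int]uint64 {
	out := make(map[int]uint64, len(a)+len(b))
	for i, c := range a {
		out[i] += c
	}
	for i, c := range b {
		out[i] += c
	}
	return out
}

func main() {
	const schema = 3 // assumed schema for this illustration
	instanceA := map[int]uint64{}
	instanceB := map[int]uint64{}
	observe(instanceA, schema, 0.012, 0.25, 0.3)
	observe(instanceB, schema, 0.25, 4.2)

	merged := merge(instanceA, instanceB)
	idx := bucketIndex(0.25, schema)
	fmt.Printf("bucket (%.4g, %.4g] holds %d observations\n",
		upperBound(idx-1, schema), upperBound(idx, schema), merged[idx])
}
```

No bucket range is declared up front: a bucket exists only once a value lands in it, and two instances observing the same value always agree on its bucket.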

Previously, ~11 histograms were converted to dual-format (classic + native). The remaining ~45 production histograms were still classic-only, providing an inconsistent experience for operators who want to adopt native histograms.

## Design

### Dual-Format Histograms

All production histograms are configured as dual-format by adding three fields to each `prometheus.HistogramOpts`:

```go
NativeHistogramBucketFactor: 1.1,
NativeHistogramMaxBucketNumber: 100,
NativeHistogramMinResetDuration: time.Hour,
```

This means:
- **Classic scrapes** continue to work unchanged — the histogram exposes classic buckets as before.
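- **Native-aware scrapes** (protobuf exposition, negotiated with `Accept: application/vnd.google.protobuf`) additionally receive the native histogram representation, which a native-histogram-enabled Prometheus ingests in place of the classic buckets by default.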
- No behavioral change for existing deployments until they opt in to native histogram scraping.
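
The negotiation described above can be sketched with the standard library alone. This is a simplified illustration, not the code path Cortex or the Prometheus client uses: the function name `acceptsProtobuf` is hypothetical, and it ignores q-values, which a real negotiator takes into account.

```go
package main

import (
	"fmt"
	"strings"
)

// protobufMediaType is the media type a scraper lists when it can consume
// the protobuf exposition format, which carries native histograms.
const protobufMediaType = "application/vnd.google.protobuf"

// acceptsProtobuf reports whether an Accept header lists the protobuf media
// type. Parameters such as proto=, encoding= and q= are ignored here.
func acceptsProtobuf(accept string) bool {
	for _, part := range strings.Split(accept, ",") {
		base, _, _ := strings.Cut(part, ";")
		if strings.EqualFold(strings.TrimSpace(base), protobufMediaType) {
			return true
		}
	}
	return false
}

func main() {
	classic := "text/plain;version=0.0.4;q=0.5,*/*;q=0.1"
	native := "application/vnd.google.protobuf;proto=io.prometheus.client.MetricFamily;encoding=delimited;q=0.7," + classic

	fmt.Println(acceptsProtobuf(classic)) // false: text exposition only
	fmt.Println(acceptsProtobuf(native))  // true: protobuf is acceptable
}
```

A scraper that never sends the protobuf media type keeps receiving the text format, which is why existing deployments see no change.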

### Configuration Values

| Field | Value | Rationale |
|-------|-------|-----------|
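| `NativeHistogramBucketFactor` | 1.1 | Each bucket boundary is at most 10% larger than the previous one. The client library picks the largest factor of the form 2^(2^-n) that does not exceed 1.1, which is 2^(1/8) ≈ 1.09. This gives good precision without excessive bucket count and matches the pattern already established in the codebase. |
| `NativeHistogramMaxBucketNumber` | 100 | Upper bound on the number of populated buckets per histogram, limiting memory and exposition size for pathological distributions. |
| `NativeHistogramMinResetDuration` | `time.Hour` | When the bucket limit is exceeded, the histogram is reset at most once per hour; between resets the client reduces resolution instead, which avoids reset churn during transient spikes. |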

These values match the existing pattern used by the ~11 histograms already converted (e.g., in `pkg/distributor`, `pkg/ingester`).

### Affected Components

| Component | Histograms |
|-----------|-----------|
| Ingester | 8 (WAL replay, appender add/commit, queried samples/exemplars/series/chunks) |
| Querier | 7 (store gateway client, instances hit, refetches, blocks scan, tenant federation) |
| Cache | 4 (instrumented cache value size/duration, memcached, redis) |
| API | 3 (request duration, message sizes) |
| Distributor | 2 (query duration, labels per sample) |
| Compactor | 2 (meta sync duration, garbage collection duration) |
| Storage/TSDB | 5 (bucket index load, multilevel cache fetch/backfill) |
| Store Gateway | 1 (blocks sync) |
| Frontend | 2 (queue duration, retries) |
| Scheduler | 1 (queue duration) |
| Alertmanager | 2 (client request duration, initial sync duration) |
| Ring/KV | 4 (KV request duration, consul, dynamodb, lifecycler shutdown) |
| Ruler | 2 (client pool request duration, frontend client pool) |
| HA Tracker | 1 (change propagation time) |
| Configs | 2 (client request duration, database request duration) |
| Parquet Converter | 1 (block conversion delay) |

### Backward Compatibility

This change is fully backward compatible:

- **No metric name changes** — all existing metric names, labels, and bucket boundaries are preserved.
- **No scrape format change** — classic format is served unless the scraper explicitly requests native histograms via content negotiation.
- **No configuration required** — dual-format is enabled automatically; operators opt in to native scraping at the Prometheus/agent level.

## Migration Path

1. **Deploy updated Cortex** — histograms begin emitting dual-format data. No observable change for classic scrapers.
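2. **Enable native histogram ingestion** in Prometheus (or compatible agents) by configuring `scrape_protocols` to include `PrometheusProto`. Depending on the Prometheus version, this may also require starting Prometheus with `--enable-feature=native-histograms`.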
3. **Update dashboards/alerts** — native histograms use the base metric name (without `_bucket`/`_sum`/`_count` suffixes). For example:

```promql
# Classic histogram query:
histogram_quantile(0.99, rate(cortex_ingester_tsdb_appender_commit_duration_seconds_bucket[5m]))

# Native histogram query (uses base name directly):
histogram_quantile(0.99, rate(cortex_ingester_tsdb_appender_commit_duration_seconds[5m]))
```

Once native histograms are ingested, PromQL functions that operate on native histogram samples become usable, including `histogram_avg()`, `histogram_fraction()`, `histogram_stddev()`, and `histogram_stdvar()`.

## Alternatives Considered

### Native-only (no classic buckets)

Rejected because it would break existing dashboards and alerts that rely on classic bucket boundaries. Dual-format ensures zero disruption.

### Per-histogram configuration

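Rejected because uniform settings simplify operations and the chosen values (1.1 factor, 100 max buckets, 1h min reset) are broadly suitable for all Cortex histograms. Operators who need coarser histograms can still limit resolution at scrape time, for example with the `native_histogram_bucket_limit` scrape setting in Prometheus; finer resolution than the exposed one requires a code change.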
9 changes: 6 additions & 3 deletions pkg/alertmanager/alertmanager_client.go
@@ -73,9 +73,12 @@ func newAlertmanagerClientsPool(discovery client.PoolServiceDiscovery, amClientC
}

requestDuration := promauto.With(reg).NewHistogramVec(prometheus.HistogramOpts{
Name: "cortex_alertmanager_distributor_client_request_duration_seconds",
Help: "Time spent executing requests from an alertmanager to another alertmanager.",
Buckets: prometheus.ExponentialBuckets(0.008, 4, 7),
Name: "cortex_alertmanager_distributor_client_request_duration_seconds",
Help: "Time spent executing requests from an alertmanager to another alertmanager.",
Buckets: prometheus.ExponentialBuckets(0.008, 4, 7),
NativeHistogramBucketFactor: 1.1,
NativeHistogramMaxBucketNumber: 100,
NativeHistogramMinResetDuration: time.Hour,
}, []string{"operation", "status_code"})

factory := func(addr string) (client.PoolClient, error) {
9 changes: 6 additions & 3 deletions pkg/alertmanager/state_replication.go
@@ -111,9 +111,12 @@ func newReplicatedStates(userID string, rf int, re Replicator, st alertstore.Ale
Help: "Number of times we have completed syncing initial state for each possible outcome.",
}, []string{"outcome"}),
initialSyncDuration: promauto.With(r).NewHistogram(prometheus.HistogramOpts{
Name: "alertmanager_state_initial_sync_duration_seconds",
Help: "Time spent syncing initial state from peers or remote storage.",
Buckets: prometheus.ExponentialBuckets(0.008, 4, 7),
Name: "alertmanager_state_initial_sync_duration_seconds",
Help: "Time spent syncing initial state from peers or remote storage.",
Buckets: prometheus.ExponentialBuckets(0.008, 4, 7),
NativeHistogramBucketFactor: 1.1,
NativeHistogramMaxBucketNumber: 100,
NativeHistogramMinResetDuration: time.Hour,
}),
}
s.initialSyncCompleted.WithLabelValues(syncFromReplica)
34 changes: 22 additions & 12 deletions pkg/api/handlers.go
@@ -8,6 +8,7 @@ import (
"net/http"
"path"
"sync"
"time"

"github.com/go-kit/log"
"github.com/go-kit/log/level"
@@ -171,24 +172,33 @@ func NewQuerierHandler(
) http.Handler {
// Prometheus histograms for requests to the querier.
querierRequestDuration := promauto.With(reg).NewHistogramVec(prometheus.HistogramOpts{
Namespace: "cortex",
Name: "querier_request_duration_seconds",
Help: "Time (in seconds) spent serving HTTP requests to the querier.",
Buckets: instrument.DefBuckets,
Namespace: "cortex",
Name: "querier_request_duration_seconds",
Help: "Time (in seconds) spent serving HTTP requests to the querier.",
Buckets: instrument.DefBuckets,
NativeHistogramBucketFactor: 1.1,
NativeHistogramMaxBucketNumber: 100,
NativeHistogramMinResetDuration: time.Hour,
}, []string{"method", "route", "status_code", "ws"})

receivedMessageSize := promauto.With(reg).NewHistogramVec(prometheus.HistogramOpts{
Namespace: "cortex",
Name: "querier_request_message_bytes",
Help: "Size (in bytes) of messages received in the request to the querier.",
Buckets: middleware.BodySizeBuckets,
Namespace: "cortex",
Name: "querier_request_message_bytes",
Help: "Size (in bytes) of messages received in the request to the querier.",
Buckets: middleware.BodySizeBuckets,
NativeHistogramBucketFactor: 1.1,
NativeHistogramMaxBucketNumber: 100,
NativeHistogramMinResetDuration: time.Hour,
}, []string{"method", "route"})

sentMessageSize := promauto.With(reg).NewHistogramVec(prometheus.HistogramOpts{
Namespace: "cortex",
Name: "querier_response_message_bytes",
Help: "Size (in bytes) of messages sent in response by the querier.",
Buckets: middleware.BodySizeBuckets,
Namespace: "cortex",
Name: "querier_response_message_bytes",
Help: "Size (in bytes) of messages sent in response by the querier.",
Buckets: middleware.BodySizeBuckets,
NativeHistogramBucketFactor: 1.1,
NativeHistogramMaxBucketNumber: 100,
NativeHistogramMinResetDuration: time.Hour,
}, []string{"method", "route"})

inflightRequests := promauto.With(reg).NewGaugeVec(prometheus.GaugeOpts{
14 changes: 10 additions & 4 deletions pkg/chunk/cache/instrumented.go
@@ -20,8 +20,11 @@ func Instrument(name string, cache Cache, reg prometheus.Registerer) Cache {
// Cached chunks are generally in the KBs, but cached index can
// get big. Histogram goes from 1KB to 4MB.
// 1024 * 4^(7-1) = 4MB
Buckets: prometheus.ExponentialBuckets(1024, 4, 7),
ConstLabels: prometheus.Labels{"name": name},
Buckets: prometheus.ExponentialBuckets(1024, 4, 7),
ConstLabels: prometheus.Labels{"name": name},
NativeHistogramBucketFactor: 1.1,
NativeHistogramMaxBucketNumber: 100,
NativeHistogramMinResetDuration: time.Hour,
}, []string{"method"})

return &instrumentedCache{
@@ -33,8 +36,11 @@ func Instrument(name string, cache Cache, reg prometheus.Registerer) Cache {
Name: "cache_request_duration_seconds",
Help: "Total time spent in seconds doing cache requests.",
// Cache requests are very quick: smallest bucket is 16us, biggest is 1s.
Buckets: prometheus.ExponentialBuckets(0.000016, 4, 8),
ConstLabels: prometheus.Labels{"name": name},
Buckets: prometheus.ExponentialBuckets(0.000016, 4, 8),
ConstLabels: prometheus.Labels{"name": name},
NativeHistogramBucketFactor: 1.1,
NativeHistogramMaxBucketNumber: 100,
NativeHistogramMinResetDuration: time.Hour,
}, []string{"method", "status_code"})),

fetchedKeys: promauto.With(reg).NewCounter(prometheus.CounterOpts{
7 changes: 5 additions & 2 deletions pkg/chunk/cache/memcached.go
@@ -62,8 +62,11 @@ func NewMemcached(cfg MemcachedConfig, client MemcachedClient, name string, reg
Name: "memcache_request_duration_seconds",
Help: "Total time spent in seconds doing memcache requests.",
// Memcached requests are very quick: smallest bucket is 16us, biggest is 1s
Buckets: prometheus.ExponentialBuckets(0.000016, 4, 8),
ConstLabels: prometheus.Labels{"name": name},
Buckets: prometheus.ExponentialBuckets(0.000016, 4, 8),
ConstLabels: prometheus.Labels{"name": name},
NativeHistogramBucketFactor: 1.1,
NativeHistogramMaxBucketNumber: 100,
NativeHistogramMinResetDuration: time.Hour,
}, []string{"method", "status_code"}),
),
}
13 changes: 8 additions & 5 deletions pkg/chunk/cache/redis_cache.go
@@ -31,11 +31,14 @@ func NewRedisCache(name string, redisClient *RedisClient, reg prometheus.Registe
logger: logger,
requestDuration: instr.NewHistogramCollector(
promauto.With(reg).NewHistogramVec(prometheus.HistogramOpts{
Namespace: "cortex",
Name: "rediscache_request_duration_seconds",
Help: "Total time spent in seconds doing Redis requests.",
Buckets: prometheus.ExponentialBuckets(0.000016, 4, 8),
ConstLabels: prometheus.Labels{"name": name},
Namespace: "cortex",
Name: "rediscache_request_duration_seconds",
Help: "Total time spent in seconds doing Redis requests.",
Buckets: prometheus.ExponentialBuckets(0.000016, 4, 8),
ConstLabels: prometheus.Labels{"name": name},
NativeHistogramBucketFactor: 1.1,
NativeHistogramMaxBucketNumber: 100,
NativeHistogramMinResetDuration: time.Hour,
}, []string{"method", "status_code"}),
),
}
19 changes: 14 additions & 5 deletions pkg/compactor/compactor_metrics.go
@@ -1,6 +1,8 @@
package compactor

import (
"time"

"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
"github.com/thanos-io/thanos/pkg/block"
@@ -89,9 +91,12 @@ func newCompactorMetricsWithLabels(reg prometheus.Registerer, commonLabels []str
Help: "Total blocks metadata synchronization failures.",
}, nil)
m.metaFetcherSyncDuration = promauto.With(reg).NewHistogramVec(prometheus.HistogramOpts{
Name: "cortex_compactor_meta_sync_duration_seconds",
Help: "Duration of the blocks metadata synchronization in seconds.",
Buckets: []float64{0.01, 1, 10, 100, 300, 600, 1000},
Name: "cortex_compactor_meta_sync_duration_seconds",
Help: "Duration of the blocks metadata synchronization in seconds.",
Buckets: []float64{0.01, 1, 10, 100, 300, 600, 1000},
NativeHistogramBucketFactor: 1.1,
NativeHistogramMaxBucketNumber: 100,
NativeHistogramMinResetDuration: time.Hour,
}, nil)
m.metaFetcherSynced = extprom.NewTxGaugeVec(
reg,
@@ -126,8 +131,12 @@ func newCompactorMetricsWithLabels(reg prometheus.Registerer, commonLabels []str
Help: "Total number of failed garbage collection operations.",
}, nil)
m.syncerGarbageCollectionDuration = promauto.With(reg).NewHistogramVec(prometheus.HistogramOpts{
Name: "cortex_compactor_garbage_collection_duration_seconds",
Help: "Time it took to perform garbage collection iteration.",
Name: "cortex_compactor_garbage_collection_duration_seconds",
Help: "Time it took to perform garbage collection iteration.",
Buckets: prometheus.DefBuckets,
NativeHistogramBucketFactor: 1.1,
NativeHistogramMaxBucketNumber: 100,
NativeHistogramMinResetDuration: time.Hour,
}, nil)
m.syncerBlocksMarkedForDeletion = promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
Name: blocksMarkedForDeletionName,
11 changes: 7 additions & 4 deletions pkg/configs/client/client.go
@@ -42,10 +42,13 @@ func (cfg *Config) RegisterFlagsWithPrefix(prefix string, f *flag.FlagSet) {
}

var configsRequestDuration = instrument.NewHistogramCollector(promauto.NewHistogramVec(prometheus.HistogramOpts{
Namespace: "cortex",
Name: "configs_request_duration_seconds",
Help: "Time spent requesting userconfig.",
Buckets: prometheus.DefBuckets,
Namespace: "cortex",
Name: "configs_request_duration_seconds",
Help: "Time spent requesting userconfig.",
Buckets: prometheus.DefBuckets,
NativeHistogramBucketFactor: 1.1,
NativeHistogramMaxBucketNumber: 100,
NativeHistogramMinResetDuration: time.Hour,
}, []string{"operation", "status_code"}))

// Client is what the ruler and altermanger needs from a config store to process rules.
12 changes: 8 additions & 4 deletions pkg/configs/db/timed.go
@@ -2,6 +2,7 @@ package db

import (
"context"
"time"

"github.com/prometheus/client_golang/prometheus"
"github.com/weaveworks/common/instrument"
@@ -11,10 +12,13 @@ import (

var (
databaseRequestDuration = instrument.NewHistogramCollector(prometheus.NewHistogramVec(prometheus.HistogramOpts{
Namespace: "cortex",
Name: "database_request_duration_seconds",
Help: "Time spent (in seconds) doing database requests.",
Buckets: prometheus.DefBuckets,
Namespace: "cortex",
Name: "database_request_duration_seconds",
Help: "Time spent (in seconds) doing database requests.",
Buckets: prometheus.DefBuckets,
NativeHistogramBucketFactor: 1.1,
NativeHistogramMaxBucketNumber: 100,
NativeHistogramMinResetDuration: time.Hour,
}, []string{"method", "status_code"}))
)

22 changes: 14 additions & 8 deletions pkg/distributor/distributor.go
@@ -336,10 +336,13 @@ func New(cfg Config, clientConfig ingester_client.Config, limits *validation.Ove
ingestionRate: util_math.NewEWMARate(0.2, instanceIngestionRateTickInterval),

queryDuration: instrument.NewHistogramCollector(promauto.With(reg).NewHistogramVec(prometheus.HistogramOpts{
Namespace: "cortex",
Name: "distributor_query_duration_seconds",
Help: "Time spent executing expression and exemplar queries.",
Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10, 20, 30},
Namespace: "cortex",
Name: "distributor_query_duration_seconds",
Help: "Time spent executing expression and exemplar queries.",
Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10, 20, 30},
NativeHistogramBucketFactor: 1.1,
NativeHistogramMaxBucketNumber: 100,
NativeHistogramMinResetDuration: time.Hour,
}, []string{"method", "status_code"})),
receivedSamples: promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
Namespace: "cortex",
@@ -396,10 +399,13 @@ func New(cfg Config, clientConfig ingester_client.Config, limits *validation.Ove
Help: "The total number of deduplicated samples.",
}, []string{"user", "cluster"}),
labelsHistogram: promauto.With(reg).NewHistogram(prometheus.HistogramOpts{
Namespace: "cortex",
Name: "labels_per_sample",
Help: "Number of labels per sample.",
Buckets: []float64{5, 10, 15, 20, 25},
Namespace: "cortex",
Name: "labels_per_sample",
Help: "Number of labels per sample.",
Buckets: []float64{5, 10, 15, 20, 25},
NativeHistogramBucketFactor: 1.1,
NativeHistogramMaxBucketNumber: 100,
NativeHistogramMinResetDuration: time.Hour,
}),
ingesterAppends: promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
Namespace: "cortex",