177 changes: 177 additions & 0 deletions docs/proposals-accepted/202511-tenant-based-config-overrides.md
# Proposal: Tenant-Based Configuration Overrides for Thanos Components

* **Author(s):** ricket-son
* **Created:** 2025-11-01
* **Related Issue:** [#8544](https://github.com/thanos-io/thanos/issues/8544)

> TL;DR: This design doc proposes per-tenant configuration overrides for several Thanos components, enabling more flexible multi-tenant configuration.

---

## Context

Thanos is increasingly deployed in **multi-tenant environments**, where each tenant may have distinct requirements for data retention, compaction scheduling, and performance tuning.
Currently, most Thanos components—such as the **Compactor**, **Receive**, or **Store Gateway**—share a single configuration across all tenants.

This limitation means that:
- Operators cannot easily enforce **different retention periods** per tenant.
- All tenants are affected by the same compaction or concurrency settings.
- To achieve separation, operators must deploy **one component per tenant**, which adds operational overhead.

A native, configuration-based mechanism for **tenant-specific overrides** would reduce complexity, improve flexibility, and better support multi-tenant Thanos deployments.

---

## Goals

- Enable **tenant-based configuration overrides** for Thanos components, starting with the Compactor.
**Member:**

Thanks for the proposal!

Just trying to understand, what is currently blocking you from doing this already with multiple sharded Compactor instances?

You can already use `--selector.relabel-config` to select blocks for a particular tenant, thanks to external labels, and then further split by time or anything else.

This makes the process really scalable.

**Member:**

Orchestrating that many Compactors might be hard though, which is also why we've created https://github.com/thanos-community/thanos-operator/ cc: @philipgough

**Author:**

That's a great hint. But even with that, you'd have to manage multiple Compactor instances, which, like you said, is harder to maintain.

Thanks for mentioning the operator, I didn't know about it. It already looks great.
It seems like there you also intend to deploy separate components for each tenant if different configs are required?

**@clarifai-fmarceau, Nov 26, 2025:**

+1 for less complex Compactor sharding until it's needed! There's also an overhead resource cost to moving to that model.

I'm personally happy with how a single Compactor does the job (until you reach node limits). We don't have a need for sharding it at this point, but it'd be very convenient for a single Compactor, as @ricket-son suggested, to be able to handle multiple retention rules per tenant (preferably via regex).

@saswatamcode On the topic of the thanos-operator, it's looking good, good job! It definitely seems like a solution that's more flexible than using a good old Helm chart (or whatever other templating tool) to manage Thanos.

**@saswatamcode (Member), Nov 26, 2025:**

Right, so something that seems more reasonable to me is not basing this off of tenancy or just a tenant label, as there can be several flexible use cases for such a feature.

How about instead some selector config that will allow us to set separate retention configs for a particular set of blocks?

Basically it would be a map of selector relabel config to retention config. We could call it a Retention Policy then.

So the config here would look like:

```yaml
retention_policies:
  - action: drop
    regex: "A"
    source_labels:
      - tenant
    retention:
      resolution_raw: 5d
      resolution_5m: 10d
    delete_delay: 12h
```

This to me makes the system a lot more flexible. This way you could also easily group tenants with similar retention needs or from the same subset.

And you are also free to use this in any way needed; for example, business metrics might have longer retention than SRE metrics, and so on.

Wdyt? 🙂

**Member:**

Receive already has write-level tenancy via headers, which can be set from some auth proxy in front.
Store Gateway has selector configs if you want to run separate instances, but you don't really need to IMO, as Querier has flags to enforce the presence of labels, which you can use for tenancy as well 🙂 (or some proxy in front to do this for you, like prom-label-proxy).

Ruler is a bit tricky.
You have to enforce tenant labels in a couple of different places for rule config: once in the rule expression, and then in the rule labels as well, so that the generated series retains the tenant. We actually implemented this in Thanos Operator already; you need only add config like so:

```yaml
ruleTenancyConfig:
  tenantLabel: tenant_id
  tenantValueLabel: operator.thanos.io/tenant
```

and it will pick up PrometheusRule objects with those tenant labels and enforce it there.

The block generated from Ruler would be mixed, but since you'll be enforcing Querier-level tenancy, it would only ever select rules for a particular tenant.
The next step I want to implement in the operator is stateless Ruler: you can create remote-write targets so that the rules of a particular tenant get remote-written to your Receive with a certain tenant label; that way you maintain separate per-tenant rule blocks too!

In our experience, having a centralized config means we have to encode a lot of operational logic into Thanos, which takes away from the core scope of this project (scalable metrics storage and querying).

This is why we started Thanos Operator as a way to do all this as declaratively as possible, whilst keeping Thanos scoped to storage and still being gradually adoptable over existing Prometheus setups.

Also, a TODO for me to actually write a nice blog post for the operator, plus some docs on Thanos for the tenancy stuff, as it isn't immediately clear!

**Member:**

But yeah, having this selector relabel config on the Compactor would be cool @ricket-son; it makes the tenancy picture easier to manage 🙂

If you update the proposal with this, can do one more round of review and merge!

**Author:**

Well, this shouldn't aim to configure tenancy in general, @saswatamcode, but to configure per-tenant configuration overrides.

I mean, I don't care if it's in a unified section or not; whatever is simpler to implement. As long as it's documented, I guess it doesn't matter.

Just wanted to know what you think about possible future tenant config overrides on other components, like Receive for ingestion limits / rate limits / etc., and how this would be implemented / declared.

**Author:**

I'll update the proposal soon.

@GiedriusS what's your opinion on that?

**Contributor:**

Hello, jumping into the discussion to share one of our use cases that seems to fall under this proposal.

We are running a Receive cluster with multiple tenants. As of today, the `--tsdb.retention` flag is global and applies to all tenants' TSDBs, as per: https://thanos.io/tip/components/receive.md/#tenant-lifecycle-management

We would like to be able to configure tenant-specific retention the same way we are overriding ingestion limits:
https://thanos.io/tip/components/receive.md/#understanding-the-configuration-file

One of our use cases is to allow bigger retention for a single tenant only, while all the others use the global config.

(Please note, we don't have object storage enabled here and use local disks only.)

- Support common parameters such as:
  - Retention periods per resolution level.
  - Compaction concurrency and deletion delay.
  - Optional component-specific tunables (e.g., query or ingestion limits).
- Maintain backward compatibility (no overrides = current behavior).
- Support reloadable configuration files where possible.
- Standardize how Thanos identifies tenants for configuration purposes.

---

## Non-Goals

- Implementing full multi-tenancy isolation (e.g., per-tenant storage credentials or ACLs).
- Providing a web UI or API for managing overrides.
- Changing the existing data model or metadata schema beyond identifying tenants.

---

## Proposed Design

### Configuration Format

A new top-level configuration section, `config_overrides`, is added to the component configuration file.

Example configuration structure:

```yaml
# global compactor configs
compactor:
  retention:
    resolution_raw: 365d   # --retention.resolution-raw=365d
    resolution_5m: 180d    # --retention.resolution-5m=180d
    resolution_1h: 0d      # --retention.resolution-1h=0d
  concurrency: 4           # --compact.concurrency=4
  delete_delay: 48h        # --delete-delay=48h

# data-specific configs
config_overrides:
  - action: keep
    regex: "tenant-a"
    source_labels:
      - "tenant_id"
    compactor:
      retention:
        resolution_raw: 90d

  - action: keep
    regex: "tenant-b"
    source_labels:
      - "tenant_id"
    compactor:
      retention:
        resolution_raw: 5d
        resolution_5m: 10d
      delete_delay: 12h

  - action: keep
    regex: "dev-*"
    source_labels:
      - "cluster"
    compactor:
      retention:
        resolution_raw: 1d
        resolution_5m: 5d
      delete_delay: 12h

# future:
# receive: # (obsolete?)
#   max_series_per_tenant:
#   max_ingest_samples_per_second:
```

**Author**, on lines +70 to +72:

> the `compactor` parent key could be removed while this is only Compactor-focused

**@ricket-son (Author), Jan 13, 2026**, on lines +94 to +97:

> ```yaml
> # storegateway:
> #   query_timeout:
> ```

**Key principles:**
- The component's `default` configuration block defines the baseline configuration for all tenants.
- Inside `config_overrides`, overrides can be specified on a per-label basis using regex matching.
- Unspecified fields inherit from `default` or the component's global config.
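The per-label regex matching above can be sketched as a small matcher. This is an illustrative sketch only, not Thanos code; the `matches_override` helper and the rule dict shape are hypothetical stand-ins for the relabel-style entries in the example.

```python
import re

def matches_override(rule: dict, labels: dict) -> bool:
    """Return True if a relabel-style override rule selects the given
    label set (illustrative sketch, not the actual Thanos implementation)."""
    # Join the source_labels values with ';', as Prometheus relabelling does,
    # then match the regex against the joined string.
    value = ";".join(labels.get(name, "") for name in rule["source_labels"])
    matched = re.fullmatch(rule["regex"], value) is not None
    # Assumed semantics: 'keep' applies the override on a match,
    # 'drop' applies it on a non-match.
    return matched if rule["action"] == "keep" else not matched

rule = {"action": "keep", "regex": "tenant-a", "source_labels": ["tenant_id"]}
print(matches_override(rule, {"tenant_id": "tenant-a"}))  # True
print(matches_override(rule, {"tenant_id": "tenant-b"}))  # False
```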

### Parameters Potentially Overridable

| Parameter | Component | Description |
|------------|------------|-------------|
| `retention.resolution_raw` | Compactor | Duration to keep raw metrics blocks |
| `retention.resolution_5m` | Compactor | Retention for 5-minute resolution |
| `retention.resolution_1h` | Compactor | Retention for 1-hour resolution |
| `delete_delay` | Compactor | Grace period before deleting old blocks |
| `concurrency` | Compactor | Number of concurrent compactions |
| `max_series_per_tenant` | Receive | (Future) Max allowed series per tenant |
| `max_ingest_samples_per_second` | Receive | (Future) Max allowed samples per second |
| `query_timeout` | Store Gateway | (Future) Query timeout per tenant |

### Runtime Behavior

1. On startup, the component loads the base configuration.
2. It parses `config_overrides` and stores them in memory.
3. For each operation (e.g., block compaction), the component:
- Determines the relevant labels from metadata.
- Merges the `default` config with the labels' overrides.
- Applies the effective configuration.
4. If no override exists, defaults apply unchanged.
5. Optional: expose applied overrides as metrics/log entries.
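The merge in step 3 can be sketched as a recursive dictionary merge, where fields set in the matched override win and everything else inherits from the defaults. The function name and dict shapes here are illustrative assumptions, not the actual Thanos config structs.

```python
def merge_config(defaults: dict, override: dict) -> dict:
    """Recursively merge `override` onto `defaults`; fields absent from
    the override inherit the default value (illustrative sketch only)."""
    merged = dict(defaults)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)  # deep-merge nested blocks
        else:
            merged[key] = value  # override wins for scalar fields
    return merged

defaults = {
    "retention": {"resolution_raw": "365d", "resolution_5m": "180d", "resolution_1h": "0d"},
    "concurrency": 4,
    "delete_delay": "48h",
}
override = {"retention": {"resolution_raw": "90d"}}  # e.g. the tenant-a override
effective = merge_config(defaults, override)
print(effective["retention"]["resolution_raw"])  # 90d
print(effective["retention"]["resolution_5m"])   # 180d, inherited from defaults
```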

### Example Log Line

```
level=info component=compact msg="Applied config override" source_labels="cluster" regex="dev-*" resolution_raw="1d" concurrency="1"
```

---

## Alternatives

| Approach | Pros | Cons |
|-----------|------|------|
| Deploy separate components per tenant | Isolation | High operational overhead |
| Use object storage lifecycle policies | Simplicity | Limited scope; not integrated with Thanos |
| Extend block labels for retention hints | Simple | Not centrally controlled; hard to audit |
| Proposed override config | Integrated, flexible | Adds config and implementation complexity |

---

## Action Plan

1. **Design & Discussion**
- Validate config format and naming.
- Confirm tenant identification mechanism.
2. **Implementation (Phase 1 – Compactor)**
- Extend compactor config struct.
- Implement `ConfigOverrides` parsing and merging logic.
- Integrate into retention and compaction scheduling.
3. **Testing**
- Unit tests for override resolution.
- Integration tests simulating multiple tenants.
4. **Documentation**
- Update component docs with examples.
- Add migration notes.
5. **Future Phases**
- Extend support to Store Gateway, Receive, Ruler.
- Introduce optional API-based overrides.

---

## Future Work

- Support dynamic override reloads via API or file watcher.
- Extend to query-level or ingestion-level limits.
- Provide observability via `/metrics` and `/config` endpoints.
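The file-watcher reload above could start as simple mtime polling. A minimal sketch under that assumption; `config_changed` is a hypothetical helper, and a production component would more likely use inotify or a SIGHUP handler.

```python
import os
import tempfile

def config_changed(path: str, last_mtime: float):
    """Return (changed, new_mtime), comparing the file's current mtime
    with the last one seen (illustrative polling sketch only)."""
    mtime = os.stat(path).st_mtime
    return mtime != last_mtime, mtime

# Usage sketch: poll periodically and re-parse overrides on change.
with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as f:
    f.write("config_overrides: []\n")
    path = f.name

_, seen = config_changed(path, 0.0)
os.utime(path, (seen, seen + 10))  # simulate a later edit by bumping mtime
changed, _ = config_changed(path, seen)
print(changed)  # True
os.unlink(path)
```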