feat: add blog post for --enable-feature=use-uncached-io by machine424 · Pull Request #2869 · prometheus/docs

machine424 · 2026-03-05T20:31:36Z

No description provided.

Signed-off-by: machine424 <ayoubmrini424@gmail.com>

nwanduka

Thanks for sharing this @machine424. LGTM.

Signed-off-by: machine424 <ayoubmrini424@gmail.com>

bwplotka

Nice! Added some suggestions, but great explanation of this feature!

bwplotka · 2026-03-06T10:44:04Z

+
+<!-- more -->
+
+Do you find yourself constantly looking up the difference between `container_memory_usage_bytes`, `container_memory_working_set_bytes`, and `container_memory_rss`? It gets worse when you pick the wrong one to set a memory limit, interpret benchmark results, or debug an OOMKilled container.


"It gets worse" is bit fuzzy on what do you mean

Suggested change

Do you find yourself constantly looking up the difference between `container_memory_usage_bytes`, `container_memory_working_set_bytes`, and `container_memory_rss`? It gets worse when you pick the wrong one to set a memory limit, interpret benchmark results, or debug an OOMKilled container.

Do you find yourself constantly looking up the difference between `container_memory_usage_bytes`, `container_memory_working_set_bytes`, and `container_memory_rss`? Do you know which one to use for memory limits, benchmark result intepretation, OOMKilled debugging?

bwplotka · 2026-03-06T10:44:51Z

+
+Do you find yourself constantly looking up the difference between `container_memory_usage_bytes`, `container_memory_working_set_bytes`, and `container_memory_rss`? It gets worse when you pick the wrong one to set a memory limit, interpret benchmark results, or debug an OOMKilled container.
+
+You're not alone. There is even a [9-year-old Kubernetes issue](https://github.com/kubernetes/kubernetes/issues/43916) that captures the frustration of many others.


Suggested change

You're not alone. There is even a [9-year-old Kubernetes issue](https://github.com/kubernetes/kubernetes/issues/43916) that captures the frustration of many others.

You're not alone. There is even a [9-year-old Kubernetes issue](https://github.com/kubernetes/kubernetes/issues/43916) that captures the frustration of users.

bwplotka · 2026-03-09T07:24:38Z

+
+You're not alone. There is even a [9-year-old Kubernetes issue](https://github.com/kubernetes/kubernetes/issues/43916) that captures the frustration of many others.
+
+The explanation is simple: RAM is not used in just one way. One of the easiest things to miss is the page cache, and for some containers it can make up most of the memory usage, creating large gaps between those metrics.


Should we narrow down OS? I'd add this blog post applies to Linux only (both AMD and ARM)

added a "NOTE"

bwplotka · 2026-03-09T07:26:10Z

+
+<!-- more -->
+
+The [use-uncached-io](https://prometheus.io/docs/prometheus/latest/feature_flags/#use-uncached-io) feature flag was built for exactly this. Prometheus is a database and it does a lot of disk writes, but not every write benefits from the page cache. Compaction writes are a good example, because once written, that data is unlikely to be read again soon.


tiny nit, but for the future: this flag probably could be called "uncached-io" - we don't call our flags "enable-"

it's use-uncached-io, but I agree, we can always do better with naming :)

bwplotka · 2026-03-09T07:27:14Z

+
+<!-- more -->
+
+The [use-uncached-io](https://prometheus.io/docs/prometheus/latest/feature_flags/#use-uncached-io) feature flag was built for exactly this. Prometheus is a database and it does a lot of disk writes, but not every write benefits from the page cache. Compaction writes are a good example, because once written, that data is unlikely to be read again soon.


This section calls page cache without actually defining what it is? Is it worth to educate reader what page cache is? (or at least link to wikipedia etc?)

Added a link. As you know, it’s a balance between clarity and conciseness; if I start explaining “feature flag” “compaction” or “disk writes"...

Even though the post isn't highly technical, it’s intended for readers who are already somewhat familiar with the concepts/limitations mentioned.

I also expect that many people will read this through an LLM, which can supply any missing references or additional details...

bwplotka · 2026-03-09T07:29:28Z

+
+To deal with that, a [`bufio.Writer`](https://pkg.go.dev/bufio#Writer)-like writer, [`directIOWriter`](https://github.com/prometheus/prometheus/blob/ac12e30f99df9d2f68025f0238c0aef95146e94b/tsdb/fileutil/direct_io_writer.go#L46), was implemented. On kernels `v6.1` or newer, Prometheus gets the exact alignment values from [statx](https://man7.org/linux/man-pages/man2/statx.2.html); otherwise, conservative defaults are used.
+
+The `directIOWriter` is currently limited to chunk writes, but that is already a substantial amount of I/O. Benchmarks show a 20-50% reduction in page cache usage, as measured by `container_memory_cache`.


My first question is .. does it have impact on other metrics? Performance of other stuff? Is it useful to mention in this blog post?

Based on the benchmarks I ran, there are no notable performance improvements to report. Of course, I would have mentioned any regressions if I had encountered them.

Maybe users running large, long-lived instances will share whether they saw any improvements.

bwplotka · 2026-03-09T07:31:12Z

+
+<!-- more -->
+
+The [use-uncached-io](https://prometheus.io/docs/prometheus/latest/feature_flags/#use-uncached-io) feature flag was built for exactly this. Prometheus is a database and it does a lot of disk writes, but not every write benefits from the page cache. Compaction writes are a good example, because once written, that data is unlikely to be read again soon.


Compaction writes are a good example, because once written, that data is unlikely to be read again soon.

Can we explain why unlikely? This data is used for long term storage queries. It's worth to mention that in practice, majoritiy of queries hit only 24h or even 1h

bwplotka · 2026-03-09T07:31:47Z

+
+### Experimenting with `RWF_DONTCACHE`
+
+Introduced in Linux kernel `v6.14`, `RWF_DONTCACHE` enables uncached buffered I/O, where data still goes through the page cache, but the corresponding pages are dropped afterwards. It would be worth benchmarking whether this can deliver similar benefits without direct I/O's alignment constraints.


bwplotka · 2026-03-09T07:33:25Z

+
+You're not alone. There is even a [9-year-old Kubernetes issue](https://github.com/kubernetes/kubernetes/issues/43916) that captures the frustration of many others.
+
+The explanation is simple: RAM is not used in just one way. One of the easiest things to miss is the page cache, and for some containers it can make up most of the memory usage, creating large gaps between those metrics.


Can we mention clearly that this page cache is meant for best-effort data - the moment kernel needs memory for other processes it should be able to clean this cache. But for large box with unused memory, memory can be marked as "used" to the limit of that box which can be scary and confusing - despite this memory can be cleaned on demand.

added a mention and a link

Signed-off-by: machine424 <ayoubmrini424@gmail.com>

bwplotka

Amazing, I have only wording suggestion, otherwise LGTM! 💪🏽 Great post.

bwplotka · 2026-03-20T10:17:32Z

+
+You're not alone. There is even a [9-year-old Kubernetes issue](https://github.com/kubernetes/kubernetes/issues/43916) that captures the frustration of users.
+
+The explanation is simple: RAM is not used in just one way. One of the easiest things to miss is the [page cache](https://en.wikipedia.org/wiki/Page_cache), and for some containers it can make up most of the reported memory usage, even though that memory is largely reclaimable, creating large gaps between those metrics.


Suggested change

The explanation is simple: RAM is not used in just one way. One of the easiest things to miss is the [page cache](https://en.wikipedia.org/wiki/Page_cache), and for some containers it can make up most of the reported memory usage, even though that memory is largely reclaimable, creating large gaps between those metrics.

The explanation is simple: RAM is not used in just one way. One of the easiest things to miss is the [page cache](https://en.wikipedia.org/wiki/Page_cache) semantics. For some containers memory taken by page caching can make up most of the reported memory usage, even though that memory is largely reclaimable, creating surprising differences between those metrics.

Signed-off-by: machine424 <ayoubmrini424@gmail.com>

feat: add blog post for --enable-feature=use-uncached-io

57cc029

Signed-off-by: machine424 <ayoubmrini424@gmail.com>

nwanduka reviewed Mar 6, 2026

View reviewed changes

Comment thread blog/posts/2026-03-05-uncached-io.md Outdated

Comment thread blog/posts/2026-03-05-uncached-io.md

Comment thread blog/posts/2026-03-05-uncached-io.md Outdated

machine424 added 2 commits March 6, 2026 22:06

review 1

d8b9ab2

Signed-off-by: machine424 <ayoubmrini424@gmail.com>

review 2

81dcb92

Signed-off-by: machine424 <ayoubmrini424@gmail.com>

bwplotka reviewed Mar 9, 2026

View reviewed changes

review 3 + some rewording

165edc8

Signed-off-by: machine424 <ayoubmrini424@gmail.com>

machine424 requested a review from bwplotka March 12, 2026 14:48

bwplotka previously approved these changes Mar 20, 2026

View reviewed changes

machine424 dismissed bwplotka’s stale review via c6525da March 20, 2026 11:00

review 4

47ee069

Signed-off-by: machine424 <ayoubmrini424@gmail.com>

machine424 force-pushed the tttr branch from c6525da to 47ee069 Compare March 20, 2026 11:36

Merge branch 'main' into tttr

e46b73e

bwplotka approved these changes Mar 20, 2026

View reviewed changes

machine424 merged commit d7d71c5 into prometheus:main Mar 20, 2026
6 checks passed


		<!-- more -->

		Do you find yourself constantly looking up the difference between `container_memory_usage_bytes`, `container_memory_working_set_bytes`, and `container_memory_rss`? It gets worse when you pick the wrong one to set a memory limit, interpret benchmark results, or debug an OOMKilled container.


		Do you find yourself constantly looking up the difference between `container_memory_usage_bytes`, `container_memory_working_set_bytes`, and `container_memory_rss`? It gets worse when you pick the wrong one to set a memory limit, interpret benchmark results, or debug an OOMKilled container.

		You're not alone. There is even a [9-year-old Kubernetes issue](https://github.com/kubernetes/kubernetes/issues/43916) that captures the frustration of many others.


		You're not alone. There is even a [9-year-old Kubernetes issue](https://github.com/kubernetes/kubernetes/issues/43916) that captures the frustration of many others.

		The explanation is simple: RAM is not used in just one way. One of the easiest things to miss is the page cache, and for some containers it can make up most of the memory usage, creating large gaps between those metrics.


		<!-- more -->

		The [use-uncached-io](https://prometheus.io/docs/prometheus/latest/feature_flags/#use-uncached-io) feature flag was built for exactly this. Prometheus is a database and it does a lot of disk writes, but not every write benefits from the page cache. Compaction writes are a good example, because once written, that data is unlikely to be read again soon.


		To deal with that, a [`bufio.Writer`](https://pkg.go.dev/bufio#Writer)-like writer, [`directIOWriter`](https://github.com/prometheus/prometheus/blob/ac12e30f99df9d2f68025f0238c0aef95146e94b/tsdb/fileutil/direct_io_writer.go#L46), was implemented. On kernels `v6.1` or newer, Prometheus gets the exact alignment values from [statx](https://man7.org/linux/man-pages/man2/statx.2.html); otherwise, conservative defaults are used.

		The `directIOWriter` is currently limited to chunk writes, but that is already a substantial amount of I/O. Benchmarks show a 20-50% reduction in page cache usage, as measured by `container_memory_cache`.


		### Experimenting with `RWF_DONTCACHE`

		Introduced in Linux kernel `v6.14`, `RWF_DONTCACHE` enables uncached buffered I/O, where data still goes through the page cache, but the corresponding pages are dropped afterwards. It would be worth benchmarking whether this can deliver similar benefits without direct I/O's alignment constraints.

Conversation

machine424 commented Mar 5, 2026

Uh oh!

nwanduka left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

bwplotka left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bwplotka left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants