Coalesce consecutive page cache misses into single S3 requests #104230
Conversation
`CachedInMemoryReadBufferFromFile::populateBlockRange` previously issued one `in->readBigAt` per missing 1 MiB block. On object storage, each call is a separate HTTP request, so a cold scan of a 14 GB Parquet file through the userspace page cache made ~15k requests, each paying the TCP/TLS round-trip — measurably slower than the filesystem cache, which fetches in larger segments.

Coalescing was previously implemented in commit 682b070 and reverted in c178d2a to avoid transient memory spikes from huge temporary buffers under parallel cold reads.

Re-introduce coalescing with a hard cap on the temporary buffer (`max_coalesced_bytes` = 16 MiB). Long miss runs are split into multiple fetches, bounding peak transient memory per call. Single-block misses still read directly into the cache cell, avoiding the buffer and the extra `memcpy`.

Measured locally on c8g.24xlarge against the ClickBench `clickhouse-datalake` queries (43 queries, single 14.7 GB Parquet on S3, totals over all queries):

- cold runs: filesystem cache 62.28s -> page cache (default) 56.58s
- hot runs: filesystem cache 18.57s -> page cache (default) 13.59s

The page cache is now strictly faster than the filesystem cache on both cold and hot, with no benchmark-script tuning required.

Context: ClickHouse/ClickBench#818

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Workflow [PR], commit [26f5279]. AI Review Summary: ✅ This PR re-introduces coalesced page-cache miss reads for object storage, now bounded by a new setting.
```cpp
/// The coalesced read uses a temporary buffer, capped at `max_coalesced_bytes` to bound
/// transient memory under parallel cold reads. A run longer than the cap is split.
/// Single-block misses bypass the buffer and read directly into the cache cell.
constexpr size_t max_coalesced_bytes = 16 * 1024 * 1024;
```
Can we make it a setting?
max_coalesced_bytes = 16 MiB is an important behavior threshold (memory vs request coalescing), but it's hardcoded. This makes tuning impossible for clusters with very different object-store RTT or memory pressure, and it violates the usual ClickHouse pattern of exposing such trade-offs as a setting.
Please make this cap configurable (e.g. a read/page-cache setting with a conservative default), then use that value here.
Address review feedback on PR ClickHouse#104230: the 16 MiB cap on coalesced page-cache reads was hardcoded, which prevents tuning for clusters with different object-store RTT or memory pressure.

Expose it as the session setting `page_cache_max_coalesced_bytes`, with the same 16 MiB default. The setting flows through `ReadSettings` to `CachedInMemoryReadBufferFromFile::populateBlockRange`, which uses it to compute the per-fetch block cap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
LLVM Coverage Report
Changed lines: 30.77% (20/65)
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Cold reads of object storage through the userspace page cache (`use_page_cache_for_object_storage = 1`) are now significantly faster, because consecutive cache misses are coalesced into a single HTTP request instead of one request per `page_cache_block_size` block.

Description
`CachedInMemoryReadBufferFromFile::populateBlockRange` previously issued one `in->readBigAt` per missing 1 MiB block. On object storage, each call is a separate HTTP request, so a cold scan of a 14 GB Parquet file through the userspace page cache made ~15k requests, each paying the TCP/TLS round-trip — measurably slower than the filesystem cache, which fetches in larger segments.

Coalescing was previously implemented in commit 682b070 and reverted in c178d2a to avoid transient memory spikes from huge temporary buffers under parallel cold reads.
This change re-introduces coalescing with a hard cap on the temporary buffer (`max_coalesced_bytes` = 16 MiB). Long miss runs are split into multiple fetches, bounding peak transient memory per call. Single-block misses still read directly into the cache cell, avoiding the buffer and the extra `memcpy`.

Test results
Measured on c8g.24xlarge against the ClickBench `clickhouse-datalake` queries (43 queries, single 14.7 GB Parquet on S3, totals over all queries):

- cold runs: filesystem cache 62.28s -> page cache (default) 56.58s
- hot runs: filesystem cache 18.57s -> page cache (default) 13.59s

Per-query, the worst regressed queries are back to baseline:

- (`SELECT *`): 15.23s -> 34.61s broken -> 15.27s fixed

Context: ClickHouse/ClickBench#818
Documentation entry for user-facing changes