[core][flink] Manifest cache benchmarks + expose more manifest cache options #8186

Merged
JingsongLi merged 3 commits into apache:master from mao-liu:feat/manifest-cache-options on Jun 11, 2026

@mao-liu mao-liu commented Jun 10, 2026

Contributor

Purpose

In Paimon v1.3 (prior to 960dce1), the manifest cache incurred a significant heap memory spike during cold-fill. The problem was raised and discussed in #7030 and #7031, and it is particularly evident for highly partitioned tables in jobs with high parallelism.

While the heap spike issue is mostly resolved via 960dce1, some additional manifest cache options are proposed here to help tune the manifest cache for highly partitioned tables in jobs with high parallelism.

When many high-parallelism writers restore at the same time, the Job Manager's manifest cache can become a memory bottleneck. The cache holds entries with soft references, so under sustained heap pressure the JVM reclaims entries that are then immediately re-read and decompressed, driving heap back up and triggering further reclamation — a cache-thrash spiral. There was previously no way to tune this behavior.
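The reload half of that spiral can be illustrated with a minimal sketch using only the JDK. This is not the SegmentsCache implementation: the class name, the loader, and the `simulateHeapPressure` helper are hypothetical, and clearing the references by hand stands in for what the garbage collector does under real heap pressure.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Illustrative only: a tiny soft-value cache that counts how often it has to reload. */
class SoftValueCacheSketch {
    private final Map<String, SoftReference<byte[]>> entries = new HashMap<>();
    private final Function<String, byte[]> loader;
    private int loads = 0;

    SoftValueCacheSketch(Function<String, byte[]> loader) {
        this.loader = loader;
    }

    byte[] get(String key) {
        SoftReference<byte[]> ref = entries.get(key);
        byte[] value = (ref == null) ? null : ref.get();
        if (value == null) {
            // Never cached, or the referent was reclaimed: pay the read cost again.
            value = loader.apply(key);
            loads++;
            entries.put(key, new SoftReference<>(value));
        }
        return value;
    }

    /** Hypothetical helper: stands in for the GC clearing soft references. */
    void simulateHeapPressure() {
        for (SoftReference<byte[]> ref : entries.values()) {
            ref.clear();
        }
    }

    int loads() {
        return loads;
    }

    public static void main(String[] args) {
        SoftValueCacheSketch cache = new SoftValueCacheSketch(key -> new byte[1024]);
        cache.get("manifest-0");
        cache.get("manifest-0"); // served from the cache
        cache.simulateHeapPressure();
        cache.get("manifest-0"); // reloaded, which allocates again
        System.out.println("loads = " + cache.loads());
    }
}
```

Each reload allocates a fresh buffer, which is why reclamation under pressure can feed further pressure when many readers request the same entries at once.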

This PR exposes additional manifest-cache controls and a prefetch option to make this tunable:

  • Added WriteRestoreScanBenchmark, a micro-benchmark that reproduces the manifest-cache cold-fill memory spike and reports heap/cache footprint across cache-disabled vs. cache-enabled (strong-ref) arms. On Paimon v1.3, this benchmark reveals a significant heap memory spike during cold-fill on the cache-enabled path. The spike is no longer present after 960dce1, but the benchmark remains useful for measuring performance and detecting future regressions.

  • SegmentsCache now supports a configurable idle TTL (expire-after-access) and a soft-values toggle. Setting soft-values=false pins the working set with strong references so the thrash spiral cannot start; the cache then stays bounded by weight (up to its configured memory). The defaults preserve the existing behavior (soft references on).

  • New catalog option:

    • cache.manifest.soft-values (default true) — toggle soft/strong references for the catalog manifest cache. The catalog manifest cache continues to inherit the catalog-wide cache.expire-after-access TTL.
  • New writer-coordinator options:

    • sink.writer-coordinator.cache-soft-values (default true) — same soft/strong reference toggle for the coordinator manifest cache.
    • sink.writer-coordinator.cache-expire-after-access (default disabled) — optional idle TTL for coordinator cache entries; the cache stays bounded by sink.writer-coordinator.cache-memory regardless.
    • sink.writer-coordinator.prefetch-manifests (default false) — eagerly read all data manifests of the latest snapshot during refresh to warm the in-Job-Manager manifest cache once, avoiding many concurrent cold manifest reads when writers restore simultaneously.
  • Docs: documented the new options and added a "Write Initialize" section in write-performance.md explaining when these settings help, the failure mechanism, and how they resolve it.
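As a rough sketch, the options listed above could be collected into plain string maps before being passed to a catalog or a Flink sink. The option keys come from this description; the chosen values, the `"10 min"` duration format, and the class and method names are assumptions for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: option keys are from the PR description, values are assumptions. */
class ManifestCacheOptionsSketch {

    /** Catalog-level option: pin the catalog manifest cache with strong references. */
    static Map<String, String> catalogOptions() {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("cache.manifest.soft-values", "false");
        return options;
    }

    /** Writer-coordinator options for the in-Job-Manager manifest cache. */
    static Map<String, String> writerCoordinatorOptions() {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("sink.writer-coordinator.cache-soft-values", "false");
        // Assumed duration format; the option is disabled by default.
        options.put("sink.writer-coordinator.cache-expire-after-access", "10 min");
        options.put("sink.writer-coordinator.prefetch-manifests", "true");
        return options;
    }

    public static void main(String[] args) {
        catalogOptions().forEach((k, v) -> System.out.println(k + " = " + v));
        writerCoordinatorOptions().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

With strong references the cache no longer shrinks under heap pressure, so sink.writer-coordinator.cache-memory should be sized with the Job Manager heap in mind.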

Tests

  • SegmentsCacheTest: covers defaults (soft refs on, no TTL), getter pass-through, create returning null on zero memory, and that strong references stay bounded by weight-based eviction.
  • CachingCatalogTest#testManifestCacheOptions: asserts the catalog manifest cache picks up soft-values and inherits the catalog idle TTL.
  • TableWriteCoordinatorTest: testBuildManifestCacheOptions verifies the coordinator options map to the cache (default soft refs + no TTL, explicit TTL honored, soft-values=false switches to strong refs, zero memory disables the cache); testPrefetchManifestsWarmsCache verifies that constructing the coordinator with prefetch enabled warms the cache and that scan results remain correct.
  • Regenerated config docs verified by ConfigOptionsDocsCompletenessITCase.
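The property checked by the strong-reference test, that the cache stays bounded by weight without help from the garbage collector, can be sketched with a small LRU cache built on the JDK's LinkedHashMap. This is a simplified stand-in under stated assumptions, not the eviction policy SegmentsCache actually uses; the class name and the use of byte-array length as the weight are hypothetical.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: strong references, bounded by total weight with LRU eviction. */
class WeightBoundedCacheSketch {
    private final long maxWeightBytes;
    private long currentWeightBytes = 0;
    // accessOrder=true makes iteration start from the least recently used entry.
    private final LinkedHashMap<String, byte[]> entries = new LinkedHashMap<>(16, 0.75f, true);

    WeightBoundedCacheSketch(long maxWeightBytes) {
        this.maxWeightBytes = maxWeightBytes;
    }

    byte[] get(String key) {
        return entries.get(key);
    }

    void put(String key, byte[] value) {
        byte[] previous = entries.put(key, value);
        if (previous != null) {
            currentWeightBytes -= previous.length;
        }
        currentWeightBytes += value.length;
        // Evict least recently used entries until the total weight fits the budget.
        Iterator<Map.Entry<String, byte[]>> it = entries.entrySet().iterator();
        while (currentWeightBytes > maxWeightBytes && it.hasNext()) {
            Map.Entry<String, byte[]> eldest = it.next();
            currentWeightBytes -= eldest.getValue().length;
            it.remove();
        }
    }

    long weightBytes() {
        return currentWeightBytes;
    }

    int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        WeightBoundedCacheSketch cache = new WeightBoundedCacheSketch(100);
        cache.put("a", new byte[40]);
        cache.put("b", new byte[40]);
        cache.get("a");               // "b" is now the least recently used entry
        cache.put("c", new byte[40]); // exceeds the budget, so "b" is evicted
        System.out.println("entries = " + cache.size() + ", weight = " + cache.weightBytes());
    }
}
```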

Closes #7030

Remove cache page size changes - not needed

Tidying up
@mao-liu mao-liu changed the title [core] [flink] Manifest cache benchmarks + expose more manifest cache options [core][flink] Manifest cache benchmarks + expose more manifest cache options Jun 10, 2026
// that per-task `scan` requests use, so subsequent concurrent requests hit
// warm bytes instead of each performing a cold manifest read.
scan.withPartitionFilter(PartitionPredicate.ALWAYS_TRUE)
.withBucketFilter(Filter.alwaysTrue())

@JingsongLi JingsongLi Jun 10, 2026

Contributor

This prefetch reuses the mutable coordinator scan. A normal request calls scan.withPartitionBucket(partition, bucket), which leaves specifiedBucket set in AbstractFileStoreScan/ManifestsReader; withBucketFilter(Filter.alwaysTrue()) only adds a permissive filter and does not clear that field. After the next checkpoint refresh, this plan can therefore skip manifests outside the last requested bucket range instead of warming all data manifests. Please use a fresh table.store().newScan().withSnapshot(snapshot) for the prefetch, or add an explicit way to clear the bucket state, and cover the scan-then-checkpoint case in the test.
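The failure mode described in this comment can be reproduced with a deliberately simplified, hypothetical builder. The class below is not AbstractFileStoreScan or ManifestsReader; it only mirrors the described behavior, where a bucket set by an earlier call stays in effect because setting a permissive filter never clears it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical stand-in for a mutable scan builder; not Paimon's actual class. */
class MutableScanSketch {
    private Integer specifiedBucket = null; // stays set once assigned
    private IntPredicate bucketFilter = b -> true;

    MutableScanSketch withPartitionBucket(int bucket) {
        this.specifiedBucket = bucket;
        return this;
    }

    MutableScanSketch withBucketFilter(IntPredicate filter) {
        // Only sets the filter; specifiedBucket is left untouched.
        this.bucketFilter = filter;
        return this;
    }

    /** Returns the buckets this scan would read out of totalBuckets. */
    List<Integer> plan(int totalBuckets) {
        List<Integer> selected = new ArrayList<>();
        for (int b = 0; b < totalBuckets; b++) {
            boolean matchesSpecified = specifiedBucket == null || specifiedBucket == b;
            if (matchesSpecified && bucketFilter.test(b)) {
                selected.add(b);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        MutableScanSketch shared = new MutableScanSketch();
        shared.withPartitionBucket(3).plan(8); // a normal per-task request

        // Reusing the shared instance for prefetch still only covers bucket 3.
        System.out.println("reused: " + shared.withBucketFilter(b -> true).plan(8));

        // A fresh instance carries no leftover state and covers every bucket.
        System.out.println("fresh:  " + new MutableScanSketch().withBucketFilter(b -> true).plan(8));
    }
}
```

Creating a new instance per prefetch avoids having to reason about which fields earlier requests may have set.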

Contributor Author

Thanks for the review @JingsongLi - and thanks for catching the bug 🙏

I have updated prefetch to use a fresh scan instance and added a test covering this scenario (709c79b).

@mao-liu mao-liu force-pushed the feat/manifest-cache-options branch from 5ddd680 to 709c79b Compare June 11, 2026 00:39

@JingsongLi JingsongLi left a comment

Contributor

LGTM. The previous prefetch issue is fixed with a fresh scan instance and covered by the new test.

@JingsongLi JingsongLi merged commit 5d26928 into apache:master Jun 11, 2026
13 checks passed
@mao-liu mao-liu deleted the feat/manifest-cache-options branch June 11, 2026 03:42

Development

Successfully merging this pull request may close these issues.

[Bug] Write and compaction performance significantly degraded when writing to highly partitioned tables
