[core][flink] Manifest cache benchmarks + expose more manifest cache options #8186
Remove cache page size changes - not needed Tidying up
// that per-task `scan` requests use, so subsequent concurrent requests hit
// warm bytes instead of each performing a cold manifest read.
scan.withPartitionFilter(PartitionPredicate.ALWAYS_TRUE)
        .withBucketFilter(Filter.alwaysTrue())
This prefetch reuses the mutable coordinator scan. A normal request calls scan.withPartitionBucket(partition, bucket), which leaves specifiedBucket set in AbstractFileStoreScan/ManifestsReader; withBucketFilter(Filter.alwaysTrue()) only adds a permissive filter and does not clear that field. After the next checkpoint refresh, this plan can therefore skip manifests outside the last requested bucket range instead of warming all data manifests. Please use a fresh table.store().newScan().withSnapshot(snapshot) for the prefetch, or add an explicit way to clear the bucket state, and cover the scan-then-checkpoint case in the test.
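The failure mode described in this comment can be illustrated with a toy model. `ToyScan` below is a hypothetical stand-in and not Paimon's `AbstractFileStoreScan`; it only mimics the pattern the review describes, where one method pins a bucket and another adds a permissive filter without clearing the pinned field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Toy model of a mutable scan builder. All names are hypothetical, not Paimon APIs. */
public class StaleScanStateDemo {

    static final class ToyScan {
        private Integer specifiedBucket;              // sticky state set by a per-task request
        private IntPredicate bucketFilter = b -> true;

        ToyScan withBucket(int bucket) {
            this.specifiedBucket = bucket;
            return this;
        }

        ToyScan withBucketFilter(IntPredicate filter) {
            // Replaces the filter only; specifiedBucket is left untouched.
            this.bucketFilter = filter;
            return this;
        }

        /** Returns the buckets whose manifests this scan would visit. */
        List<Integer> plan(int totalBuckets) {
            List<Integer> visited = new ArrayList<>();
            for (int b = 0; b < totalBuckets; b++) {
                if (specifiedBucket != null && specifiedBucket != b) {
                    continue;
                }
                if (bucketFilter.test(b)) {
                    visited.add(b);
                }
            }
            return visited;
        }
    }

    /** Prefetch on the shared scan: inherits whatever bucket the last request pinned. */
    static List<Integer> prefetchWithReusedScan(ToyScan shared, int totalBuckets) {
        return shared.withBucketFilter(b -> true).plan(totalBuckets);
    }

    /** Prefetch on a fresh scan: no leftover state, so every bucket is visited. */
    static List<Integer> prefetchWithFreshScan(int totalBuckets) {
        return new ToyScan().withBucketFilter(b -> true).plan(totalBuckets);
    }

    public static void main(String[] args) {
        ToyScan shared = new ToyScan();
        shared.withBucket(2).plan(4);                          // a normal per-task request
        System.out.println(prefetchWithReusedScan(shared, 4)); // [2]
        System.out.println(prefetchWithFreshScan(4));          // [0, 1, 2, 3]
    }
}
```

In this sketch the reused scan warms only the last requested bucket, which is why building a new scan for the prefetch (or explicitly resetting the bucket state) avoids the problem.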
Thanks for the review @JingsongLi - and thanks for catching the bug 🙏
I have updated prefetch to use a fresh scan instance, and also added a test to cover this scenario. - 709c79b
5ddd680 to 709c79b
JingsongLi left a comment
LGTM. The previous prefetch issue is fixed with a fresh scan instance and covered by the new test.
Purpose
In Paimon v1.3 (prior to 960dce1), the manifest cache incurred a significant heap memory spike during cold-filling. This problem was raised and discussed in #7030 and #7031, and it is particularly evident for highly partitioned tables in jobs with high parallelism.
While the heap spike issue is mostly resolved by 960dce1, this PR proposes some additional manifest cache options to help tune the cache for highly partitioned tables in jobs with high parallelism.
When many high-parallelism writers restore at the same time, the Job Manager's manifest cache can become a memory bottleneck. The cache holds entries with soft references, so under sustained heap pressure the JVM reclaims entries that are then immediately re-read and decompressed, driving heap back up and triggering further reclamation — a cache-thrash spiral. There was previously no way to tune this behavior.
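The difference between soft and strong values can be sketched with a small weight-bounded LRU cache built on the JDK alone. `ToySegmentCache` is a hypothetical illustration and not Paimon's `SegmentsCache`; it only shows that strong values stay bounded by weight-based eviction, while soft values can additionally vanish whenever the JVM decides to clear them.

```java
import java.lang.ref.SoftReference;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy weight-bounded LRU cache with a soft/strong value toggle. Not Paimon's SegmentsCache. */
public class ToySegmentCache {

    /** Holds a value either strongly or through a SoftReference, never both. */
    private static final class Entry {
        final byte[] strong;
        final SoftReference<byte[]> soft;
        final int weight;

        Entry(byte[] value, boolean softValues) {
            this.strong = softValues ? null : value;
            this.soft = softValues ? new SoftReference<>(value) : null;
            this.weight = value.length;
        }

        byte[] value() {
            return soft != null ? soft.get() : strong;
        }
    }

    // accessOrder=true makes iteration start at the least recently used entry.
    private final LinkedHashMap<String, Entry> entries = new LinkedHashMap<>(16, 0.75f, true);
    private final long maxWeight;
    private final boolean softValues;
    private long currentWeight;

    public ToySegmentCache(long maxWeight, boolean softValues) {
        this.maxWeight = maxWeight;
        this.softValues = softValues;
    }

    public byte[] get(String key) {
        Entry e = entries.get(key);
        if (e == null) {
            return null;
        }
        byte[] value = e.value();
        if (value == null) {
            // The GC cleared the soft reference: treat it as a miss and drop the entry.
            // The caller then re-reads the data, which is the thrash described above.
            entries.remove(key);
            currentWeight -= e.weight;
        }
        return value;
    }

    public void put(String key, byte[] value) {
        Entry previous = entries.remove(key);
        if (previous != null) {
            currentWeight -= previous.weight;
        }
        entries.put(key, new Entry(value, softValues));
        currentWeight += value.length;
        // Weight-based eviction: drop least recently used entries until within bounds.
        Iterator<Map.Entry<String, Entry>> it = entries.entrySet().iterator();
        while (currentWeight > maxWeight && it.hasNext()) {
            currentWeight -= it.next().getValue().weight;
            it.remove();
        }
    }

    public long weight() {
        return currentWeight;
    }
}
```

With `softValues = false`, entries leave the cache only through the explicit weight bound, so memory use is predictable at the cost of never yielding to heap pressure.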
This PR exposes additional manifest-cache controls and a prefetch option to make this tunable:
- Added `WriteRestoreScanBenchmark`, a micro-benchmark that reproduces the manifest-cache cold-fill memory spike and reports heap/cache footprint across cache-disabled vs. cache-enabled (strong-ref) arms. On Paimon v1.3, this benchmark would reveal a significant heap memory spike during cold-filling on the cache-enabled path. This problem is no longer present after 960dce1; however, the benchmark can still be useful for measuring performance and detecting regressions in the future.
- `SegmentsCache` now supports a configurable idle TTL (expire-after-access) and a `soft-values` toggle. Setting `soft-values=false` pins the working set with strong references so the thrash spiral cannot start; the cache then stays bounded by weight (up to its configured memory). The defaults preserve the existing behavior (soft references on).
- New catalog option:
  - `cache.manifest.soft-values` (default `true`): toggle soft/strong references for the catalog manifest cache. The catalog manifest cache continues to inherit the catalog-wide `cache.expire-after-access` TTL.
- New writer-coordinator options:
  - `sink.writer-coordinator.cache-soft-values` (default `true`): same soft/strong reference toggle for the coordinator manifest cache.
  - `sink.writer-coordinator.cache-expire-after-access` (default disabled): optional idle TTL for coordinator cache entries; the cache stays bounded by `sink.writer-coordinator.cache-memory` regardless.
  - `sink.writer-coordinator.prefetch-manifests` (default `false`): eagerly read all data manifests of the latest snapshot during refresh to warm the in-Job-Manager manifest cache once, avoiding many concurrent cold manifest reads when writers restore simultaneously.
- Docs: documented the new options and added a "Write Initialize" section in `write-performance.md` explaining when these settings help, the failure mechanism, and how they resolve it.

Tests
- `SegmentsCacheTest`: covers defaults (soft refs on, no TTL), getter pass-through, `create` returning null on zero memory, and that strong references stay bounded by weight-based eviction.
- `CachingCatalogTest#testManifestCacheOptions`: asserts the catalog manifest cache picks up `soft-values` and inherits the catalog idle TTL.
- `TableWriteCoordinatorTest`: `testBuildManifestCacheOptions` verifies the coordinator options map to the cache (default soft refs + no TTL, explicit TTL honored, `soft-values=false` switches to strong refs, zero memory disables the cache); `testPrefetchManifestsWarmsCache` verifies that constructing the coordinator with prefetch enabled warms the cache and that scan results remain correct.
- `ConfigOptionsDocsCompletenessITCase`.

Closes #7030
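As a rough illustration of how the options listed in this description might be combined for a highly partitioned, high-parallelism job, the sketch below collects them into plain string maps. The option keys come from this PR; the values, the `10 min` duration format, the class and method names, and the split into catalog and table options are assumptions for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that groups the new option keys; the values are illustrative assumptions. */
public class ManifestCacheOptionsExample {

    /** Catalog-level setting: pin the catalog manifest cache with strong references. */
    static Map<String, String> catalogOptions() {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("cache.manifest.soft-values", "false");
        return options;
    }

    /** Writer-coordinator settings: strong references, an idle TTL, and prefetch on refresh. */
    static Map<String, String> writerCoordinatorOptions() {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("sink.writer-coordinator.cache-soft-values", "false");
        // Assumed duration syntax; check the generated option docs for the accepted format.
        options.put("sink.writer-coordinator.cache-expire-after-access", "10 min");
        options.put("sink.writer-coordinator.prefetch-manifests", "true");
        return options;
    }

    public static void main(String[] args) {
        catalogOptions().forEach((k, v) -> System.out.println(k + " = " + v));
        writerCoordinatorOptions().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Whether other settings are needed alongside these (for example, enabling the writer coordinator itself or sizing `sink.writer-coordinator.cache-memory`) is not covered by this sketch.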