HBASE-30134: Improve CacheAwareLoadBalancer to consider low cache ratio when calculating imbalance (#8197) #8218
Conversation
…io when calculating imbalance (apache#8197) Signed-off-by: Tak Lon (Stephen) Wu <taklwu@apache.org> Change-Id: Id388f670f32fafb2ac28037746cfedae0f22ab78
VladRodionov
left a comment
The patch (or approach) has a scoped potential issue which should be addressed, I think: risk of balancing churn. If the cache ratio changes slowly and free cache fluctuates, the balancer may repeatedly identify “better” placements. The patch adds a cost signal, not necessarily a stable state transition.
This is a common problem for all balancers. I am not very familiar with StochasticBalancer internals, but if stabilization is missing there, some algorithm should be added to prevent frequent rebalancing; the simplest is a cooling-off period. That should probably be a separate ticket, though.
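A cooling-off period of the kind suggested here could be as simple as a timestamp guard around balancer runs. The following is a hypothetical sketch, not existing HBase balancer code; the class and method names are illustrative:

```java
// Illustrative cooling-off guard: after a rebalance run, suppress further
// runs for a fixed period so small fluctuations in free cache do not cause
// churn. Assumes non-negative clock values (e.g. System.currentTimeMillis()).
public class RebalanceCooldown {
  private final long cooldownMillis;
  private long lastBalanceTime;

  public RebalanceCooldown(long cooldownMillis) {
    this.cooldownMillis = cooldownMillis;
    // Initialize so the very first call to mayBalance() is allowed.
    this.lastBalanceTime = -cooldownMillis;
  }

  /** Returns true if enough time has passed since the last recorded rebalance. */
  public boolean mayBalance(long nowMillis) {
    return nowMillis - lastBalanceTime >= cooldownMillis;
  }

  /** Record that a rebalance was performed at the given time. */
  public void recordBalance(long nowMillis) {
    lastBalanceTime = nowMillis;
  }
}
```

A balancer loop would call `mayBalance()` before computing plans and `recordBalance()` after applying any moves.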
potentialBestWeightedFromFreeCache(...) seems to consider any server with enough free cache as evidence that the region has a better possible cache score. Should this be limited to servers that are valid/reasonable candidate destinations for this specific region, considering locality/rack/table skew/other constraints?
The CacheAwareLoadBalancer was originally intended for use cases where optimal cache allocation is pivotal to overall performance, so it combines only skewness and cache allocation costs when weighing an imbalance. For use cases where locality/rack/load are more critical, the StochasticLoadBalancer should be used.
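As an illustration of the two-term weighting this reply describes, a normalized weighted sum of the two costs might look like the sketch below. The weights and normalization here are assumptions for illustration, not the actual CacheAwareLoadBalancer implementation:

```java
// Hypothetical two-term cost combination: skewness cost plus cache
// allocation cost, each scaled by a configurable weight and normalized
// so the result stays in [0, 1] when both inputs are in [0, 1].
public class CombinedCost {
  public static double cost(double skewCost, double cacheCost,
      double skewWeight, double cacheWeight) {
    return (skewWeight * skewCost + cacheWeight * cacheCost)
        / (skewWeight + cacheWeight);
  }
}
```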
    updateRegionLoad();
  }

  protected Map<ServerName, Long> getServerBlockCacheFreeBytes() {
Free cache capacity does not necessarily mean the moved region will become warm. What mechanism is expected to actually populate the cache after the move — prefetch, time-based priority, HBASE-30135, or normal client reads?
Yeah, these changes only calculate whether there is an imbalance, considering cache space and uncached regions. HBASE-30135 will deal with generating the new plans. Once regions are moved, prefetch runs at region opening and is the means for caching the regions. Maybe we should add extra validation for the prefetch property.
 * #potentialCacheRatioAfterMove} * region MB) so placement is not considered optimal solely
 * from low ratios when capacity exists somewhere in the cluster.
 */
private double potentialBestWeightedFromFreeCache(BalancerClusterState cluster, int region,
The required free cache appears to be based on total region HFile size. Should this instead use cacheable/hot data size?
 */
public static final String POTENTIAL_CACHE_RATIO_AFTER_MOVE_KEY =
  "hbase.master.balancer.cacheaware.potentialCacheRatioAfterMove";
public static final float POTENTIAL_CACHE_RATIO_AFTER_MOVE_DEFAULT = 0.95f;
Are the defaults for potentialCacheRatioAfterMove (0.95) and minFreeCacheSpaceFactor (1.0) based on observed data? They seem important enough to justify in comments or tests.
For potentialCacheRatioAfterMove, I haven't formally measured it. For root dirs on cloud storage, I've seen performance degradation start to be noticeable when the cache ratio falls below 90%.
For minFreeCacheSpaceFactor, I originally thought of only allowing a move if the free cache space can accommodate the whole region, but decided to make this configurable for flexibility.
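The gate this reply describes reduces to a single comparison. The config name below matches the factor discussed in the thread, while the class and method names are illustrative, not HBase code:

```java
// Sketch of the free-space gate: a destination is considered only when its
// free cache can hold at least minFreeCacheSpaceFactor times the region's
// size. With the discussed default of 1.0, the whole region must fit.
public class FreeCacheGate {
  public static boolean canAccommodate(long freeCacheBytes, long regionSizeBytes,
      float minFreeCacheSpaceFactor) {
    return (double) freeCacheBytes >= (double) regionSizeBytes * minFreeCacheSpaceFactor;
  }
}
```

Lowering the factor below 1.0 trades cache completeness after the move for more candidate destinations.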
/**
 * The available cache space on this region server (bytes), if reported in the server load.
 */
default long getCacheFreeSize() {
ServerMetrics#getCacheFreeSize() defaults to 0L, and getRegionColdDataSize() defaults to empty map. That is safe for compatibility, but it means missing metrics are indistinguishable from “no free cache” or “no cold data.” Should missing cache-free-size be represented differently from actual zero? Would 0L make old / partially upgraded servers look like they have no free cache and bias balancing?
Yeah, since there would be no means to determine the free cache space on those servers, they would be excluded as potential candidates, reducing the odds of finding a better assignment plan, so the balancer may perform fewer moves or, in the worst case, no moves at all.
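One way to make "metric not reported" distinguishable from "genuinely zero free cache" would be a sentinel value instead of the 0L default. This is a sketch under that assumption, not what the patch does; the names are illustrative:

```java
import java.util.OptionalLong;

// Sketch: if a -1 sentinel were adopted for a missing cache-free-size
// metric, callers could tell an old server that never reported the metric
// apart from a server whose cache is actually full.
public class CacheFreeSize {
  public static final long UNREPORTED = -1L;

  /** Wrap the raw metric: empty when the server did not report it. */
  public static OptionalLong reported(long raw) {
    return raw == UNREPORTED ? OptionalLong.empty() : OptionalLong.of(raw);
  }

  /** A balancer could then skip unreported servers instead of treating them as full. */
  public static boolean isCandidate(long raw, long requiredFreeBytes) {
    OptionalLong free = reported(raw);
    return free.isPresent() && free.getAsLong() >= requiredFreeBytes;
  }
}
```

With the 0L default in the patch, unreported and full servers behave identically, which is the compatibility trade-off the comment raises.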
// don't want to move regions with low cache ratio due to data classified as cold.
float regionCacheRatioOnOldServer =
  regionSizeMB == 0 ? 0.0f : (float) regionSizeInCache / regionSizeMB;
I think adding cold data size to the numerator makes the value less clear. It turns the metric into an “effective satisfaction ratio” rather than an actual cache ratio.
Would it be better to exclude cold data from the denominator instead?
For example:
cacheRatio = cachedDataSize / (totalRegionSize - coldDataSize)
That would measure how much of the cache-relevant/hot data is actually cached. Cold data is intentionally not expected to be in block cache, so it should not make the region look poorly cached, but it also should not be counted as cached.
We may also need to clamp the result to [0, 1] and handle the case where totalRegionSize <= coldDataSize.
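A minimal sketch of the proposed ratio, with the clamping and the all-cold edge case handled as suggested; the class and method names are illustrative, not HBase code:

```java
// Proposed hot-data cache ratio:
//   cacheRatio = cachedDataSize / (totalRegionSize - coldDataSize)
// clamped to [0, 1], and defined as fully satisfied when the region has
// no hot data at all (totalRegionSize <= coldDataSize).
public class HotCacheRatio {
  public static float cacheRatio(long cachedDataSize, long totalRegionSize,
      long coldDataSize) {
    long hotSize = totalRegionSize - coldDataSize;
    if (hotSize <= 0) {
      // Entirely cold (or inconsistent sizes): nothing is expected in the
      // block cache, so do not make the region look poorly cached.
      return 1.0f;
    }
    float ratio = (float) cachedDataSize / hotSize;
    return Math.max(0.0f, Math.min(1.0f, ratio));
  }
}
```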