
feat(cubestore): add stale-while-revalidate timeout for SQL query cache #10822

Open
paveltiunov wants to merge 5 commits into master from cursor/stale-while-revalidate-cache-e465

Conversation


@paveltiunov paveltiunov commented May 5, 2026

Check List

  • Tests have been run in packages where changes have been made if available
  • Linter has been run for changed code
  • Tests for the changes have been added if not covered yet
  • Docs have been added / updated if required

Description of Changes Made

Adds a new CUBESTORE_QUERY_CACHE_STALE_WHILE_REVALIDATE environment variable to CubeStore that enables stale-while-revalidate behavior for the SQL result cache.

Problem

CubeStore's SQL result cache key includes the current snapshot of partition and chunk IDs. When compaction or partition changes occur, the cache key changes even though the underlying query data may not have materially changed. This causes cache misses and forces queries to wait for full re-execution.

Solution

When enabled (set to a number of seconds, e.g. 30), the cache maintains a secondary lookup keyed only by SQL query text + inline tables (ignoring partition/chunk IDs). The behavior is:

  1. Exact cache hit (SQL + partitions + chunks match): Serve immediately (existing behavior, unchanged)
  2. Exact miss, stale hit within timeout: Return the stale result immediately and spawn a background task to re-execute the query and update both caches
  3. No hit at all: Execute the query normally (existing behavior)

This is disabled by default (0 = disabled). Setting it to e.g. 30 allows serving stale results for up to 30 seconds while refreshing in the background.
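For illustration, here is a minimal sketch of that three-step lookup. The function name, field names, and the execute_and_cache helper are assumptions made for the sketch, not the actual cache.rs API:

```rust
// Illustrative sketch only; the real implementation lives in
// rust/cubestore/cubestore/src/sql/cache.rs and its names may differ.
async fn cached_or_execute(
    cache: Arc<SqlResultCache>,
    result_key: SqlResultCacheKey, // SQL + partition/chunk snapshot
    queue_key: SqlQueueCacheKey,   // SQL + inline tables only
) -> Result<Arc<DataFrame>, CubeError> {
    // 1. Exact hit: snapshot matches, serve immediately.
    if let Some(result) = cache.result_cache.get(&result_key) {
        return Ok(result);
    }
    // 2. Exact miss, but a stale entry exists within the timeout: serve it
    //    and refresh in the background.
    if let Some(stale) = cache.stale_cache.as_ref().and_then(|c| c.get(&queue_key)) {
        if stale.created.elapsed() < cache.stale_timeout {
            let cache = cache.clone();
            tokio::spawn(async move {
                // Re-execute and update both result_cache and stale_cache.
            });
            return Ok(stale.result);
        }
    }
    // 3. No usable entry: normal (deduplicated) execution.
    execute_and_cache(cache, result_key, queue_key).await
}
```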

Implementation approach

The implementation uses a single get_inner() method with an is_background_refresh flag:

  • get() is a thin public wrapper that calls get_inner(false)
  • On stale hit, the background refresh spawns get_inner(true) — which skips the stale cache check but goes through the entire queue cache dedup flow naturally
  • This means if two concurrent stale hits try to refresh the same query, the second one joins the queue as a waiter instead of starting a duplicate execution
  • get() takes self: &Arc<Self> so the Arc can be cloned into the spawned task
  • get_inner() returns Pin<Box<dyn Future + Send>> to satisfy tokio::spawn bounds
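A hedged sketch of that shape (signatures are simplified; per the PR, the real get() also receives the query plan and an exec callback):

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;

impl SqlResultCache {
    // Thin public wrapper.
    pub async fn get(self: &Arc<Self>, query: String) -> Result<Arc<DataFrame>, CubeError> {
        self.clone().get_inner(query, false).await
    }

    // Boxed so the future is Send + 'static and can be handed to tokio::spawn,
    // and so the background path can recurse into get_inner(true).
    fn get_inner(
        self: Arc<Self>,
        query: String,
        is_background_refresh: bool,
    ) -> Pin<Box<dyn Future<Output = Result<Arc<DataFrame>, CubeError>> + Send>> {
        Box::pin(async move {
            if !is_background_refresh {
                // Stale-cache check; on a stale hit, spawn
                // self.clone().get_inner(query, true) and return the stale result.
            }
            // Queue-cache dedup + execution happens here for both paths.
            unimplemented!("sketch only")
        })
    }
}
```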

Changes

| File | Change |
| --- | --- |
| config/mod.rs | Added query_cache_stale_while_revalidate_secs to ConfigObj trait and ConfigObjImpl, parsed from CUBESTORE_QUERY_CACHE_STALE_WHILE_REVALIDATE env var |
| sql/cache.rs | Added stale_cache (Moka cache with byte weigher, keyed by SqlQueueCacheKey), extracted get_inner() with is_background_refresh flag, stale-while-revalidate lookup, background refresh via tokio::spawn calling back into get_inner(true) |
| app_metrics.rs | Added DATA_QUERIES_CACHE_STALE_HIT counter metric (cs.sql.query.data.cache.stale_hit) |

New metric

  • cs.sql.query.data.cache.stale_hit — counts how many queries were served from the stale cache with a background refresh triggered

Unit tests

| Test | What it verifies |
| --- | --- |
| stale_while_revalidate | Stale result served immediately on exact cache miss; background refresh updates cache |
| stale_while_revalidate_background_failure | Stale entry preserved when background refresh fails; subsequent requests still get the stale result |
| stale_while_revalidate_timeout_expiry | After the timeout expires, the stale path is bypassed and fresh execution runs |

Add CUBESTORE_QUERY_CACHE_STALE_WHILE_REVALIDATE env config param that
enables stale-while-revalidate behavior for CubeStore's SQL result cache.

When enabled (set to a number of seconds, e.g. 30), the cache maintains
a secondary lookup keyed only by SQL query text + inline tables (ignoring
partition/chunk IDs). If a query arrives and the exact cache key (which
includes current partition/chunk snapshot) misses, but a stale entry
exists and was created within the configured timeout, the stale result is
served immediately while a background task refreshes the cache with fresh
data.

This is disabled by default (0 = disabled). The feature reduces query
latency during compaction or partition changes when the underlying data
hasn't materially changed.

Changes:
- config/mod.rs: Add query_cache_stale_while_revalidate_secs to ConfigObj
  trait and ConfigObjImpl, parsed from CUBESTORE_QUERY_CACHE_STALE_WHILE_REVALIDATE
- sql/cache.rs: Add stale_cache (Moka cache keyed by SqlQueueCacheKey),
  stale-while-revalidate lookup in get(), background refresh via tokio::spawn
- app_metrics.rs: Add DATA_QUERIES_CACHE_STALE_HIT counter metric
- Unit test for stale-while-revalidate behavior

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
github-actions bot added the cube store (Issues relating to Cube Store) and rust (Pull requests that update Rust code) labels May 5, 2026

claude Bot commented May 5, 2026

Claude finished @paveltiunov's task in 1m 38s


Code Review: Stale-While-Revalidate Cache

  • Read and analyze all changed files
  • Review code quality and correctness
  • Check for security and performance issues
  • Evaluate test coverage
  • Post detailed review feedback

Overall this is a well-structured feature with a clean design. The stale-while-revalidate pattern is a good fit for the compaction-induced cache invalidation problem described in the PR. The code is readable, the config plumbing is correct, and the test covers the core happy path. I found one likely bug and a couple of design considerations worth discussing.


Bug: Stale cache has no weigher — max_capacity counts entries, not bytes

rust/cubestore/cubestore/src/sql/cache.rs:107-112

```rust
let stale_cache = stale_while_revalidate_timeout.map(|timeout| {
    Cache::builder()
        .time_to_idle(timeout * 2)
        .max_capacity(capacity_bytes)  // ← problem
        .build()
});
```

The result_cache is configured with .weigher(sql_result_cache_sizeof), which makes max_capacity operate in bytes. The stale_cache has no weigher, so Moka treats max_capacity as a maximum entry count. When capacity_bytes is, say, 536870912 (512 MiB), this allows ~537 million entries — effectively unbounded.

This should either use a weigher (note: the key type is SqlQueueCacheKey, not SqlResultCacheKey, so sql_result_cache_sizeof won't work directly) or use a separate, reasonable entry-count limit. Since each stale entry stores Arc<DataFrame> (cheap pointer) plus an Instant, a simple entry count cap (e.g., a fraction of queue_cache_max_capacity) may be the pragmatic fix here.
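A minimal sketch of the entry-count option (the bound shown is illustrative, not a recommendation of a specific value):

```rust
// Sketch: cap by entry count instead of bytes, since each stale entry is
// only an Arc pointer plus an Instant. With no weigher configured, Moka
// interprets max_capacity as a number of entries.
let stale_cache = stale_while_revalidate_timeout.map(|timeout| {
    Cache::builder()
        .time_to_idle(timeout * 2)
        .max_capacity(queue_cache_max_capacity) // entries, not bytes
        .build()
});
```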



Design consideration: No deduplication of concurrent background refreshes

rust/cubestore/cubestore/src/sql/cache.rs:195-232

When a stale hit occurs, the code immediately spawns a background refresh via tokio::spawn. However, this path bypasses the queue_cache deduplication that exists for the normal "exact miss" path (lines 235-263). If N requests arrive simultaneously for the same query text with a stale-but-valid entry, all N will trigger independent background re-executions.

For a single query this is low-risk since the results are idempotent and the extra work is bounded by the stale timeout window. But under high concurrency on a popular query during compaction, this could amplify load. Consider either:

  • Inserting the queue key into queue_cache before spawning, so subsequent requests coalesce
  • Using an AtomicBool/flag per queue key to skip if a refresh is already in flight

This isn't a blocker — the feature is disabled by default, and the timeout window limits exposure — but worth considering for production use at scale.
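As a sketch of the first option, assuming a HashMap-style queue cache of watch::Senders and a hypothetical refresh() helper (the real queue_cache structure may differ):

```rust
// Sketch only: coalesce concurrent stale hits by registering the refresh
// in the queue cache before spawning.
let mut queue = self.queue_cache.lock().await;
if !queue.contains_key(&queue_key) {
    let (sender, _receiver) = tokio::sync::watch::channel(());
    queue.insert(queue_key.clone(), sender);
    drop(queue); // release the lock before spawning
    let this = self.clone();
    tokio::spawn(async move {
        let _ = this.refresh(queue_key).await; // refresh() is a stand-in
        // Dropping the sender on task exit closes the channel; waiters
        // observe the close and re-check the caches.
    });
}
// Whether or not we spawned, serve the stale result immediately.
```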


Minor observations

1. Clippy: redundant closure in map (rust/cubestore/cubestore/src/sql/cache.rs:105)

```rust
stale_while_revalidate_secs.map(|s| Duration::from_secs(s))
```

Can be simplified to:

```rust
stale_while_revalidate_secs.map(Duration::from_secs)
```

2. Dropped JoinHandle (rust/cubestore/cubestore/src/sql/cache.rs:205)

The tokio::spawn return value is dropped. If the background task panics, it will be silently swallowed. Consider either assigning it to let _ = for explicitness, or logging panics by wrapping with a catch_unwind-style handler. The current Err branch at line 228 handles normal errors well, but panics are a different path.
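A hedged sketch of the supervisor-task approach, where background_refresh() is a stand-in for the spawned refresh future:

```rust
// Sketch: await the JoinHandle in a supervisor task so panics are logged
// instead of silently dropped.
let handle = tokio::spawn(background_refresh());
tokio::spawn(async move {
    if let Err(e) = handle.await {
        if e.is_panic() {
            log::error!("stale-while-revalidate background refresh panicked: {:?}", e);
        }
    }
});
```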

3. Test coverage (rust/cubestore/cubestore/src/sql/cache.rs:487-580)

The test validates the happy path well. A few edge cases that would strengthen confidence:

  • Background refresh failure (exec returns Err) — verify the stale entry is still served on subsequent requests and doesn't get corrupted
  • Stale timeout expiration — verify that after the timeout elapses, the stale path is no longer used and the query goes through normal execution

Summary

| Area | Assessment |
| --- | --- |
| Config plumbing | Clean, follows existing patterns, env var parsing is correct |
| Core cache logic | Sound design; stale lookup → background refresh → update is well-structured |
| Metrics | Appropriate stale_hit counter added |
| Memory safety | Bug: stale cache max_capacity is effectively unbounded without a weigher |
| Concurrency | Background refresh thundering herd possible under high concurrency |
| Tests | Good happy-path coverage; edge cases (failure, expiry) not covered |
| Backwards compatibility | Disabled by default, all existing call sites updated correctly |

The stale cache weigher issue should be fixed before merge. The concurrency deduplication is a nice-to-have that could be addressed as a follow-up.

@paveltiunov paveltiunov marked this pull request as ready for review May 5, 2026 19:02
@paveltiunov paveltiunov requested a review from a team as a code owner May 5, 2026 19:02

claude Bot commented May 5, 2026

Claude finished @paveltiunov's task in 3m 46s


Code Review: Stale-While-Revalidate Cache

  • Read and analyze all changed files
  • Review code quality, correctness, and error handling
  • Check for security and performance issues
  • Evaluate test coverage
  • Post detailed review feedback

Overall this is a well-designed feature — the stale-while-revalidate pattern is a great fit for the compaction-induced cache invalidation problem. The config plumbing follows existing patterns cleanly, the metric is appropriate, and the core get() flow is logical. However, I found one critical test issue and one memory bug that should be addressed before merge.


Critical: Test does not exercise the stale-while-revalidate path

rust/cubestore/cubestore/src/sql/cache.rs:487-580

The stale_while_revalidate test uses the same plan (cloned) for all three cache.get() calls. Since this plan has PlanningMeta { indices: Vec::new(), ... }, the SqlResultCacheKey::from_plan() method (lines 29-50) produces identical result_keys for all calls — the partition_ids and chunk_ids are both empty vectors derived from the (empty) index snapshots.

This means:

  1. First call (lines 516-518): Cache miss → executes → inserts Int(1) into both result_cache and stale_cache
  2. Second call (lines 535-537): Exact hit on result_cache.get(&result_key) at line 187 → returns Int(1) immediately. The stale path at line 195 is never reached. exec2 is never called.
  3. Third call (lines 561-563): Exact hit again → returns Int(1). The assertion val == TableValue::Int(2) || val == TableValue::Int(3) at lines 573-577 should fail.

The key difference between SqlResultCacheKey (includes partition_ids/chunk_ids) and SqlQueueCacheKey (only query + inline_tables) is exactly what this feature relies on — but the test doesn't exercise it because all plans are identical.

To properly test this, the second call needs a plan with different index snapshots (producing a different result_key) but the same query text (producing the same queue_key). For example, construct a second plan with a PlanningMeta containing a dummy IndexSnapshot with different partition/chunk IDs.



Bug: Stale cache max_capacity is effectively unbounded

rust/cubestore/cubestore/src/sql/cache.rs:107-112

```rust
let stale_cache = stale_while_revalidate_timeout.map(|timeout| {
    Cache::builder()
        .time_to_idle(timeout * 2)
        .max_capacity(capacity_bytes)  // ← problem
        .build()
});
```

The result_cache (lines 117-119) is configured with .weigher(sql_result_cache_sizeof), making max_capacity operate in bytes. The stale_cache has no weigher, so Moka treats max_capacity as a maximum entry count. With a default capacity_bytes of 536870912 (512 MiB), this allows ~537 million entries — effectively unbounded.

Since each StaleEntry stores Arc<DataFrame> (cheap pointer) + Instant, a simple entry-count cap (e.g., a fraction of queue_cache_max_capacity) or a weigher based on the DataFrame's deep size would be more appropriate.
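A sketch of the weigher option, assuming SqlQueueCacheKey and StaleEntry implement DeepSizeOf (observation 3 below notes StaleEntry currently does not):

```rust
// Sketch: weigh stale entries in bytes so max_capacity matches the
// result_cache semantics. Requires a DeepSizeOf impl on StaleEntry.
let stale_cache = stale_while_revalidate_timeout.map(|timeout| {
    Cache::builder()
        .time_to_idle(timeout * 2)
        .max_capacity(capacity_bytes)
        .weigher(|key: &SqlQueueCacheKey, entry: &StaleEntry| {
            // Moka weighers return u32, so saturate the deep size.
            (key.deep_size_of() + entry.result.deep_size_of())
                .min(u32::MAX as usize) as u32
        })
        .build()
});
```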



Design: No deduplication of concurrent background refreshes

rust/cubestore/cubestore/src/sql/cache.rs:195-232

When a stale hit occurs, the code immediately spawns a background refresh via tokio::spawn. This bypasses the queue_cache deduplication that exists for the normal "exact miss" path (lines 235-263). If N concurrent requests arrive for the same query text with a stale-but-valid entry, all N will trigger independent background re-executions.

This isn't a blocker (the feature is opt-in and the timeout window limits exposure), but under high concurrency on a popular query during compaction, this could amplify load. Consider either:

  • Inserting the queue key into queue_cache before spawning, so subsequent requests coalesce
  • Using an AtomicBool/flag per queue key to skip if a refresh is already in flight

Minor observations

1. Clippy: redundant closure (cache.rs:105)

```rust
stale_while_revalidate_secs.map(|s| Duration::from_secs(s))
// can be:
stale_while_revalidate_secs.map(Duration::from_secs)
```

2. Dropped JoinHandle (cache.rs:205)

The tokio::spawn return value is dropped. If the background task panics, it will be silently swallowed. Consider wrapping with error handling or assigning to let _ = for explicitness.

3. StaleEntry does not derive DeepSizeOf (cache.rs:68-72)

Unlike SqlResultCacheKey, StaleEntry doesn't derive DeepSizeOf. If a weigher is added to the stale cache (as recommended above), this trait will be needed.


Summary

| Area | Assessment |
| --- | --- |
| Config plumbing | Clean, follows existing patterns, env var parsing is correct |
| Core cache logic | Sound design; stale lookup → background refresh → update is well-structured |
| Metrics | Appropriate stale_hit counter added |
| Memory safety | Stale cache max_capacity is effectively unbounded without a weigher |
| Test correctness | Test does not exercise the stale path — all calls hit exact cache due to identical plans |
| Concurrency | Background refresh thundering herd possible under high concurrency |
| Backwards compatibility | Disabled by default, all existing call sites updated correctly |

The test issue is the most critical finding — it means the core feature behavior has no automated validation. The stale cache capacity bug should also be fixed to avoid unbounded memory growth. The concurrency deduplication could be addressed as a follow-up.

cursoragent and others added 4 commits May 5, 2026 23:18
…idate refresh

The background refresh spawned by stale-while-revalidate now integrates
with the queue cache dedup mechanism:

- Extract try_register_background() which atomically checks if a query is
  already in flight in the queue cache and registers a new entry if not.
- On stale hit, try_register_background() is called first. If another
  execution is already in flight (from a previous stale hit or a
  concurrent foreground request), we skip spawning entirely.
- The background spawn registers a watch::Sender in the queue cache, so
  any concurrent foreground get() for the same SQL that doesn't hit the
  stale cache will join as a waiter instead of starting a duplicate
  execution.
- Queue cache entry cleanup happens lazily: sender is dropped when the
  background task completes, closing the channel. The next get() for that
  key detects the closed channel via has_changed().is_err() and pops it.

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
…e-revalidate

Extract get_inner() with is_background_refresh flag from get(). The
background spawn now calls get_inner(true) which skips the stale cache
check but reuses the entire queue cache dedup flow — no separate
implementation for the background path.

- get() is a thin wrapper: delegates to get_inner(false)
- get_inner(is_background_refresh=true) skips stale check, goes
  straight to queue cache logic for natural dedup
- get() now takes self: &Arc<Self> so the spawn can clone the Arc
- get_inner() takes self: Arc<Self> and returns Pin<Box<dyn Future>>
  to satisfy Send + 'static bounds for tokio::spawn
- Removed try_register_background() and all duplicated cache update
  logic from previous approach

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
… change

The test now invalidates the exact result cache before the second get()
call to simulate a partition/chunk change that would make the exact cache
key miss while the stale entry (keyed by SQL only) still matches.

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
- Add weigher to stale_cache so max_capacity is in bytes, not entry count
- Fix clippy: use Duration::from_secs as method reference instead of closure
- Add test: background refresh failure preserves stale entry for next request
- Add test: stale timeout expiry falls through to fresh execution

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>

codecov Bot commented May 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.09%. Comparing base (2baf0bc) to head (2d99902).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
```
@@           Coverage Diff           @@
##           master   #10822   +/-   ##
=======================================
  Coverage   58.09%   58.09%           
=======================================
  Files         216      216           
  Lines       16899    16899           
  Branches     3426     3426           
=======================================
  Hits         9818     9818           
  Misses       6578     6578           
  Partials      503      503           
```
| Flag | Coverage Δ |
| --- | --- |
| cube-backend | 58.09% <ø> (ø) |

Flags with carried forward coverage won't be shown.

