feat(cubestore): add stale-while-revalidate timeout for SQL query cache#10822
feat(cubestore): add stale-while-revalidate timeout for SQL query cache#10822paveltiunov wants to merge 5 commits intomasterfrom
Conversation
Add CUBESTORE_QUERY_CACHE_STALE_WHILE_REVALIDATE env config param that enables stale-while-revalidate behavior for CubeStore's SQL result cache. When enabled (set to a number of seconds, e.g. 30), the cache maintains a secondary lookup keyed only by SQL query text + inline tables (ignoring partition/chunk IDs). If a query arrives and the exact cache key (which includes current partition/chunk snapshot) misses, but a stale entry exists and was created within the configured timeout, the stale result is served immediately while a background task refreshes the cache with fresh data. This is disabled by default (0 = disabled). The feature reduces query latency during compaction or partition changes when the underlying data hasn't materially changed. Changes: - config/mod.rs: Add query_cache_stale_while_revalidate_secs to ConfigObj trait and ConfigObjImpl, parsed from CUBESTORE_QUERY_CACHE_STALE_WHILE_REVALIDATE - sql/cache.rs: Add stale_cache (Moka cache keyed by SqlQueueCacheKey), stale-while-revalidate lookup in get(), background refresh via tokio::spawn - app_metrics.rs: Add DATA_QUERIES_CACHE_STALE_HIT counter metric - Unit test for stale-while-revalidate behavior Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
|
Claude finished @paveltiunov's task in 1m 38s —— View job Code Review: Stale-While-Revalidate Cache
Overall this is a well-structured feature with a clean design. The stale-while-revalidate pattern is a good fit for the compaction-induced cache invalidation problem described in the PR. The code is readable, the config plumbing is correct, and the test covers the core happy path. I found one likely bug and a couple of design considerations worth discussing. Bug: Stale cache has no weigher —
|
| Area | Assessment |
|---|---|
| Config plumbing | Clean, follows existing patterns, env var parsing is correct |
| Core cache logic | Sound design; stale lookup → background refresh → update is well-structured |
| Metrics | Appropriate stale_hit counter added |
| Memory safety | Bug: stale cache max_capacity is effectively unbounded without a weigher |
| Concurrency | Background refresh thundering herd possible under high concurrency |
| Tests | Good happy-path coverage; edge cases (failure, expiry) not covered |
| Backwards compatibility | Disabled by default, all existing call sites updated correctly |
The stale cache weigher issue should be fixed before merge. The concurrency deduplication is a nice-to-have that could be addressed as a follow-up.
|
Claude finished @paveltiunov's task in 3m 46s —— View job Code Review: Stale-While-Revalidate Cache
Overall this is a well-designed feature — the stale-while-revalidate pattern is a great fit for the compaction-induced cache invalidation problem. The config plumbing follows existing patterns cleanly, the metric is appropriate, and the core Critical: Test does not exercise the stale-while-revalidate path
The This means:
The key difference between To properly test this, the second call needs a plan with different index snapshots (producing a different Bug: Stale cache
|
| Area | Assessment |
|---|---|
| Config plumbing | Clean, follows existing patterns, env var parsing is correct |
| Core cache logic | Sound design; stale lookup → background refresh → update is well-structured |
| Metrics | Appropriate stale_hit counter added |
| Memory safety | Stale cache max_capacity is effectively unbounded without a weigher |
| Test correctness | Test does not exercise the stale path — all calls hit exact cache due to identical plans |
| Concurrency | Background refresh thundering herd possible under high concurrency |
| Backwards compatibility | Disabled by default, all existing call sites updated correctly |
The test issue is the most critical finding — it means the core feature behavior has no automated validation. The stale cache capacity bug should also be fixed to avoid unbounded memory growth. The concurrency deduplication could be addressed as a follow-up.
…idate refresh The background refresh spawned by stale-while-revalidate now integrates with the queue cache dedup mechanism: - Extract try_register_background() which atomically checks if a query is already in flight in the queue cache and registers a new entry if not. - On stale hit, try_register_background() is called first. If another execution is already in flight (from a previous stale hit or a concurrent foreground request), we skip spawning entirely. - The background spawn registers a watch::Sender in the queue cache, so any concurrent foreground get() for the same SQL that doesn't hit the stale cache will join as a waiter instead of starting a duplicate execution. - Queue cache entry cleanup happens lazily: sender is dropped when the background task completes, closing the channel. The next get() for that key detects the closed channel via has_changed().is_err() and pops it. Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
…e-revalidate Extract get_inner() with is_background_refresh flag from get(). The background spawn now calls get_inner(true) which skips the stale cache check but reuses the entire queue cache dedup flow — no separate implementation for the background path. - get() is a thin wrapper: delegates to get_inner(false) - get_inner(is_background_refresh=true) skips stale check, goes straight to queue cache logic for natural dedup - get() now takes self: &Arc<Self> so the spawn can clone the Arc - get_inner() takes self: Arc<Self> and returns Pin<Box<dyn Future>> to satisfy Send + 'static bounds for tokio::spawn - Removed try_register_background() and all duplicated cache update logic from previous approach Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
… change The test now invalidates the exact result cache before the second get() call to simulate a partition/chunk change that would make the exact cache key miss while the stale entry (keyed by SQL only) still matches. Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
- Add weigher to stale_cache so max_capacity is in bytes, not entry count - Fix clippy: use Duration::from_secs as method reference instead of closure - Add test: background refresh failure preserves stale entry for next request - Add test: stale timeout expiry falls through to fresh execution Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #10822 +/- ##
=======================================
Coverage 58.09% 58.09%
=======================================
Files 216 216
Lines 16899 16899
Branches 3426 3426
=======================================
Hits 9818 9818
Misses 6578 6578
Partials 503 503
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Check List
Description of Changes Made
Adds a new
CUBESTORE_QUERY_CACHE_STALE_WHILE_REVALIDATEenvironment variable to CubeStore that enables stale-while-revalidate behavior for the SQL result cache.Problem
CubeStore's SQL result cache key includes the current snapshot of partition and chunk IDs. When compaction or partition changes occur, the cache key changes even though the underlying query data may not have materially changed. This causes cache misses and forces queries to wait for full re-execution.
Solution
When enabled (set to a number of seconds, e.g.
30), the cache maintains a secondary lookup keyed only by SQL query text + inline tables (ignoring partition/chunk IDs). The behavior is:This is disabled by default (
0= disabled). Setting it to e.g.30allows serving stale results for up to 30 seconds while refreshing in the background.Implementation approach
The implementation uses a single
get_inner()method with anis_background_refreshflag:get()is a thin public wrapper that callsget_inner(false)get_inner(true)— which skips the stale cache check but goes through the entire queue cache dedup flow naturallyget()takesself: &Arc<Self>so the Arc can be cloned into the spawned taskget_inner()returnsPin<Box<dyn Future + Send>>to satisfytokio::spawnboundsChanges
config/mod.rsquery_cache_stale_while_revalidate_secstoConfigObjtrait andConfigObjImpl, parsed fromCUBESTORE_QUERY_CACHE_STALE_WHILE_REVALIDATEenv varsql/cache.rsstale_cache(Moka cache with byte weigher, keyed bySqlQueueCacheKey), extractedget_inner()withis_background_refreshflag, stale-while-revalidate lookup, background refresh viatokio::spawncalling back intoget_inner(true)app_metrics.rsDATA_QUERIES_CACHE_STALE_HITcounter metric (cs.sql.query.data.cache.stale_hit)New metric
cs.sql.query.data.cache.stale_hit— counts how many queries were served from the stale cache with a background refresh triggeredUnit tests
stale_while_revalidatestale_while_revalidate_background_failurestale_while_revalidate_timeout_expiry