From cb569e68721e89f435861c4cf3995206fbd03e45 Mon Sep 17 00:00:00 2001
From: Serhii Savchuk <ser.vasilich@hotmail.com>
Date: Fri, 5 Jun 2026 18:16:17 +0300
Subject: [PATCH] perf(group): gate the multi-key Top-K candidate finder on
 input size
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The Top-K candidate finder in exec_group is single-threaded: it builds
one SoA hash table sized to n_scan * 4/3 (cc[]/ck64[]/ck32[]),
scans the full input sequentially, and then refines aggregates for the
K winners in a second pass.  The shortcut pays off when n_groups is
much smaller than n_scan and the K winners absorb most of the rows:
Pass-2 then re-aggregates only K << n_groups rows' worth of state.

For uniform high-cardinality inputs (10M rows, ~10M distinct
composite keys) the SoA HT is hundreds of MB, every probe is an
L3/DRAM miss, the single-threaded scan is latency-bound, and Pass-2
gains nothing because nearly every group already has count = 1.
The parallel radix_v2 path with per-(worker, partition) shards runs
~3-4× faster on such inputs.

Add `n_scan <= 1000000` to the Top-K candidate gate so inputs above
1M rows fall through to the parallel radix_v2 path.  Smaller inputs
(where the single-threaded SoA HT fits in L2/L3 and Pass-1's
skip-the-other-aggs trade is worthwhile) keep the existing fast path.

ClickBench 10M:
  q32  ~890 ms → ~204 ms
---
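Reviewer note (not part of the commit message): a minimal sketch of the
sizing arithmetic and the gate described above.  The slot widths (8-byte
cc[], 8-byte ck64[], 4-byte ck32[]), the helper names and the
TOPK_MAX_SCAN_ROWS macro are assumptions made for illustration; only the
n_scan * 4/3 sizing and the 1000000 literal come from the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name; the value matches the literal used in the patch. */
#define TOPK_MAX_SCAN_ROWS 1000000

/* Estimated footprint of the single SoA hash table.
 * Assumed per-slot widths: 8-byte cc[], 8-byte ck64[], 4-byte ck32[].
 * The real element types in group.c may differ. */
static size_t topk_soa_ht_bytes(int64_t n_scan) {
    size_t slots = (size_t)n_scan * 4 / 3;   /* sizing stated in the patch */
    return slots * (sizeof(int64_t) + sizeof(uint64_t) + sizeof(uint32_t));
}

/* Mirrors the outer condition after this patch: the Top-K shortcut is
 * taken only for multi-key groupings over at most 1M scanned rows.
 * Parameter types are assumptions. */
static bool topk_gate_open(bool use_emit_filter, int64_t top_count_take,
                           int64_t n_keys, int64_t n_scan) {
    return use_emit_filter && top_count_take > 0 &&
           n_keys > 1 && n_scan <= TOPK_MAX_SCAN_ROWS;
}
```

Under these assumed widths, topk_soa_ht_bytes(10000000) comes to roughly
267 MB, in line with the "hundreds of MB" figure, while an input at the
1M threshold needs roughly 27 MB.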
 src/ops/group.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/src/ops/group.c b/src/ops/group.c
index 99eb4e92..003a6971 100644
--- a/src/ops/group.c
+++ b/src/ops/group.c
@@ -7387,8 +7387,18 @@ ht_path:;
     ray_t* part_hts_hdr = NULL;
     group_ht_t*  part_hts   = NULL;
 
+    /* Top-K candidate finder (this block) is single-threaded — it builds
+     * one SoA hash table sized to n_scan*4/3 and scans the full input
+     * sequentially.  For uniform high-cardinality inputs (e.g. ClickBench
+     * q32 — 10M rows, ~10M distinct composite keys) the SoA HT is hundreds
+     * of MB, every probe is an L3/DRAM miss, and the single-threaded scan
+     * is dominated by latency.  The parallel radix_v2 path below partitions
+     * the keys across workers with cache-resident shards and runs ~3-4×
+     * faster on such inputs.  Gate the TopK shortcut on input size so it
+     * only fires where Pass-1's skip-the-unneeded-aggs trade is worth a
+     * full single-threaded scan. */
     if (use_emit_filter && emit_filter.top_count_take > 0 &&
-        n_keys > 1) {
+        n_keys > 1 && n_scan <= 1000000) {
         bool top_count_nonselective = false;
         if (n_keys >= 2 && n_keys <= 5) {
             bool supported = true;