From cb569e68721e89f435861c4cf3995206fbd03e45 Mon Sep 17 00:00:00 2001
From: Serhii Savchuk
Date: Fri, 5 Jun 2026 18:16:17 +0300
Subject: [PATCH] perf(group): gate the multi-key Top-K candidate finder on input size
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The TopK candidate finder in exec_group is single-threaded: it builds
one SoA hash table sized to n_scan * 4/3 (cc[]/ck64[]/ck32[]) and scans
the full input sequentially, then refines aggregates for the K winners
in a second pass. The shortcut pays off when n_groups is much smaller
than n_scan and the K winners absorb most of the rows — Pass-2 then
re-aggregates only K << n_groups rows worth of state.

For uniform high-cardinality inputs (10M rows × ~10M distinct composite
keys) the SoA HT is hundreds of MB, every probe is an L3/DRAM miss, the
single-threaded scan is latency-bound, and Pass-2 gains nothing because
nearly every group already has count = 1. The parallel radix_v2 path
with per-(worker, partition) shards runs ~3-4× faster on such inputs.

Add `n_scan <= 1000000` to the TopK-candidate gate so large inputs fall
through to the parallel path. Smaller inputs (where the single-thread
SoA HT fits L2/L3 and Pass-1's skip-the-other-aggs trade is worthwhile)
keep the existing fast path.

ClickBench 10M: q32 ~890 → ~204 ms
---
 src/ops/group.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/src/ops/group.c b/src/ops/group.c
index 99eb4e92..003a6971 100644
--- a/src/ops/group.c
+++ b/src/ops/group.c
@@ -7387,8 +7387,18 @@ ht_path:;
     ray_t* part_hts_hdr = NULL;
     group_ht_t* part_hts = NULL;
 
+    /* Top-K candidate finder (this block) is single-threaded — it builds
+     * one SoA hash table sized to n_scan*4/3 and scans the full input
+     * sequentially. For uniform high-cardinality inputs (e.g. ClickBench
+     * q32 — 10M rows, ~10M distinct composite keys) the SoA HT is hundreds
+     * of MB, every probe is an L3/DRAM miss, and the single-threaded scan
+     * is dominated by latency. The parallel radix_v2 path below partitions
+     * the keys across workers with cache-resident shards and runs ~3-4×
+     * faster on such inputs. Gate the TopK shortcut on input size so it
+     * only fires where Pass-1's skip-the-unneeded-aggs trade is worth a
+     * full single-threaded scan. */
     if (use_emit_filter && emit_filter.top_count_take > 0 &&
-        n_keys > 1) {
+        n_keys > 1 && n_scan <= 1000000) {
         bool top_count_nonselective = false;
         if (n_keys >= 2 && n_keys <= 5) {
             bool supported = true;
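
Note on the sizing argument: the "hundreds of MB" figure can be checked with a rough calculation. The sketch below is not code from group.c. It assumes 20 bytes per slot (8 for cc, 8 for ck64, 4 for ck32), which the patch does not state, and it ignores any rounding of the slot count (for example to a power of two). The helper names soa_ht_slots, soa_ht_bytes and topk_shortcut_enabled are hypothetical, as are the parameter types; only the threshold and the shape of the condition come from the hunk above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Threshold taken from the patch: larger inputs skip the Top-K shortcut. */
#define TOPK_MAX_SCAN_ROWS 1000000

/* ASSUMPTION: per-slot widths. The patch names cc[]/ck64[]/ck32[] but
 * does not give their element types, so 8 + 8 + 4 bytes is a guess. */
enum { ASSUMED_SLOT_BYTES = 8 /* cc */ + 8 /* ck64 */ + 4 /* ck32 */ };

/* Slot count for a table sized to n_scan * 4/3, as the commit message
 * describes (integer division; no power-of-two rounding modelled). */
static inline uint64_t soa_ht_slots(uint64_t n_scan) {
    return n_scan * 4 / 3;
}

/* Estimated footprint of the SoA arrays for a given per-slot width. */
static inline uint64_t soa_ht_bytes(uint64_t n_scan, uint64_t bytes_per_slot) {
    assert(bytes_per_slot > 0);
    return soa_ht_slots(n_scan) * bytes_per_slot;
}

/* Hypothetical helper mirroring the gated condition in the hunk:
 * use_emit_filter && top_count_take > 0 && n_keys > 1 && n_scan <= 1000000. */
static inline bool topk_shortcut_enabled(bool use_emit_filter,
                                         int64_t top_count_take,
                                         int n_keys,
                                         int64_t n_scan) {
    return use_emit_filter && top_count_take > 0 &&
           n_keys > 1 && n_scan <= TOPK_MAX_SCAN_ROWS;
}
```

Under those assumptions, soa_ht_bytes(10000000, ASSUMED_SLOT_BYTES) comes to roughly 267 MB, consistent with the commit message, while an input at the 1,000,000-row threshold needs roughly 27 MB. Whether that fits in cache depends on the target machine's L3 size.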