perf(group): gate the multi-key Top-K candidate finder on input size#227
Merged
Conversation
The TopK candidate finder in exec_group is single-threaded: it builds one SoA hash table sized to n_scan * 4/3 (cc[]/ck64[]/ck32[]) and scans the full input sequentially, then refines aggregates for the K winners in a second pass. The shortcut pays off when n_groups is much smaller than n_scan and the K winners absorb most of the rows — Pass-2 then re-aggregates only K << n_groups rows worth of state. For uniform high-cardinality inputs (10M rows × ~10M distinct composite keys) the SoA HT is hundreds of MB, every probe is an L3/DRAM miss, the single-threaded scan is latency-bound, and Pass-2 gains nothing because nearly every group already has count = 1. The parallel radix_v2 path with per-(worker, partition) shards runs ~3-4× faster on such inputs. Add `n_scan <= 1000000` to the TopK-candidate gate so large inputs fall through to the parallel path. Smaller inputs (where the single-thread SoA HT fits L2/L3 and Pass-1's skip-the-other-aggs trade is worthwhile) keep the existing fast path. ClickBench 10M: q32 ~890 → ~204 ms
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The TopK candidate finder in
exec_group(theuse_emit_filter && top_count_take > 0 && n_keys > 1block) is single-threaded: itbuilds one SoA hash table sized to
n_scan * 4/3(cc[]/ck64[]/ck32[]) and scans the full input sequentially, then refinesaggregates for the K winners in a second pass. The shortcut pays
off when n_groups ≪ n_scan and the K winners absorb most of the rows
— Pass-2 then re-aggregates only K rows worth of state.
For uniform high-cardinality inputs (10M rows × ~10M distinct
composite keys, e.g. ClickBench q32 by
{WatchID, ClientIP}) theSoA HT is hundreds of MB, every probe is an L3/DRAM miss, the
single-threaded scan is latency-bound, and Pass-2 gains nothing
because nearly every group already has count = 1. The parallel
radix_v2_phase1_fnpath with per-(worker, partition) shards runs~3-4× faster on such inputs.
Add
n_scan <= 1000000to the TopK candidate gate so large inputsfall through to the parallel path. Smaller inputs (where the
single-thread SoA HT fits L2/L3 and Pass-1's skip-the-other-aggs
trade is still worth it) keep the existing fast path.
Profile of the hot loop before the gate (q32, 30 reps):
ClickBench 10M:
Tests: 3232/3234 pass (unchanged).