HIVE-29556: do not adjust "unknown" NDVs in extractNDVGroupingColumns #6418
konstantinb wants to merge 3 commits into apache:master
Conversation
…ne more "const" condition missed earlier
Refreshing parquet_types_non_dictionary_encoding_vectorization.q.out was missed in the original impacted-set run. The 14-line diff is in a side query (SELECT hex(cbinary), count(*) FROM parquet_types_n1 GROUP BY cbinary), not the test's main subject — cbinary is a binary column, and Hive's getColStatistics BINARY branch never populates countDistinct, so it arrives at
extractNDVGroupingColumns with the canonical (NDV=0, numNulls>0) unknown-NDV signature this PR targets.
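To make the signature concrete, here is a minimal sketch (hypothetical helper, not Hive's actual code) of the (NDV=0, numNulls>0) check described above. The class and method names are illustrative only:

```java
// Hypothetical sketch, not Hive's implementation: a populated stat for a
// non-empty column always reports NDV >= 1, so countDistinct == 0 alongside
// a positive null count can only mean the NDV was never written (as with
// the BINARY stats branch that skips countDistinct).
class UnknownNdvCheck {
    static boolean hasUnknownNdv(long countDistinct, long numNulls) {
        return countDistinct == 0 && numNulls > 0;
    }

    public static void main(String[] args) {
        System.out.println(hasUnknownNdv(0, 5));   // true: the unknown-NDV signature
        System.out.println(hasUnknownNdv(36, 1));  // false: NDV was actually collected
    }
}
```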
The new estimate is also empirically more accurate. The actual data in the test (visible from the SELECT rows in the .out) has 36 distinct non-NULL binary values + 1 NULL bucket = 37 actual GROUP
BY groups. Estimates:
| Estimate | Value | Error |
|---|---|---|
| Master (+1 of 0) | 1 | 37× under |
| This PR (heuristic fallback at hash) | 150 | ~4× over |
| This PR (heuristic at mergepartial) | 75 | ~2× over |
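The error column follows directly from the 37 actual groups (36 distinct values plus the NULL bucket). A quick arithmetic check (the helper method is illustrative, not Hive code):

```java
// Expresses each estimate's error as an over-/under-estimation factor
// relative to the 37 actual GROUP BY groups observed in the .q.out.
class EstimateError {
    static double errorRatio(double estimate, double actual) {
        // Symmetric factor: how many times off, in whichever direction.
        return estimate >= actual ? estimate / actual : actual / estimate;
    }

    public static void main(String[] args) {
        System.out.println(errorRatio(1, 37));    // master's +1-of-0 estimate: 37x under
        System.out.println(errorRatio(150, 37));  // heuristic at hash stage: ~4x over
        System.out.println(errorRatio(75, 37));   // heuristic at mergepartial: ~2x over
    }
}
```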
@zabetak — picking up on our earlier discussion in #6359, I'd like to walk through what this revision actually does and how it lines up with the concerns you raised there.

The PR's primary subject

What was done

A single in-memory boolean field

This directly closes the gotcha you flagged on April 2: "if in the CASE statement you have one branch with unknown NDV and another that is the NULL constant the result of the combiner will appear to

The only reader of the flag is

This applies the +1 for verified const-NULL keys (yielding the exact

Impact surface

Touched sites in production code:

Net flag-aware logic: 4 sites (3 producers + 1 combiner rule) + 1 reader. Everything else is one-time bookkeeping that any new boolean field on

On your March 25 concern

You wrote on March 25: "If we add a new field it means that we need to keep them up to date in various places where ColStatistics are used so I would prefer to avoid this if possible." That concern

A side benefit for #6359 itself

With

Net effect: #6359 can drop the API change and stay focused on its stated goal (the combiner improvement) — directly addressing your March 25 "There is no reason to add a default implementation for

Alternative considered

The architecturally complete fix is the one you yourself suggested on March 26: "everything would be simpler if we could use -1 for NDV to declare unknown as it happens for the other stats. This is

vs

I think this revision is the right balance for the cost/benefit ratio, and on March 26 in the same thread you indicated you'd accept this approach: "I am OK to revert to your original proposal of
@konstantinb Thanks for the detailed answer. NULL is not an actual value, so I'm not sure it should be reflected in the NDV. In other words, maybe we should not adjust the NDV at all, regardless of whether the key is constant or not. I think we discussed this at some point, but I don't remember whether we reached a conclusion. Have you explored this alternative?
@zabetak — sorry for not following up sooner on your question in #6359. You raised the possibility of dropping the +1 for the NDV > 0 case entirely rather than narrowing it. I wanted to show a concrete case where that matters before going that route.

The +1 dates to HIVE-5369 (Prasanth Jayachandran, 2013), introduced with the comment "if NULLs exists, add 1 to distinct count": `long dvs = cs.getCountDistint();`

The reason is GROUP BY semantics: a nullable column with NDV > 0 produces one extra output group for the NULL rows. Dropping the +1 means that extra group disappears from the estimate.

I added ql/src/test/queries/clientpositive/groupby_null_bucket.q to make this concrete. The scenario: two columns, each with NDV=1 but a large null fraction (the pattern you get from LEFT JOINs with many unmatched rows), grouped alongside a high-cardinality key:
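The HIVE-5369 adjustment quoted above amounts to the following (a paraphrased sketch with an illustrative class name, not the verbatim Hive code):

```java
// Paraphrased sketch of the HIVE-5369 adjustment ("if NULLs exists, add 1
// to distinct count"), not Hive's verbatim code: a nullable GROUP BY key
// contributes one extra output group for the NULL rows.
class NullGroupAdjustment {
    static long adjustedNdv(long countDistinct, long numNulls) {
        return numNulls > 0 ? countDistinct + 1 : countDistinct;
    }

    public static void main(String[] args) {
        System.out.println(adjustedNdv(36, 100)); // 37: NULL rows form one extra group
        System.out.println(adjustedNdv(0, 100));  // 1: the gotcha -- an unknown NDV (0) becomes "1"
        System.out.println(adjustedNdv(36, 0));   // 36: no nulls, no adjustment
    }
}
```

The second call is exactly the case this PR targets: applying the +1 to an unpopulated NDV of 0 turns "unknown" into a confident estimate of 1.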
With +1: NDV product = 100K × 2 × 2 = 400K → 13 MB → above 10 MB map join threshold → Merge Join

Plan change (key excerpts):

With +1:

Without +1:

groupby_null_bucket.q.txt

Could you please let me know whether you think this is a realistic scenario or an artificial corner case that can safely be ignored?
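The threshold arithmetic above can be sketched as follows. The 100K/2/2 NDVs and the 10 MB threshold come from the scenario; the ~34-byte row weight is an assumption chosen to reproduce the quoted ~13 MB figure, not one of Hive's real constants:

```java
// Back-of-envelope sketch of the plan-choice arithmetic in the scenario
// above. bytesPerRow = 34 is an assumed row weight, not a Hive constant.
class MapJoinThresholdSketch {
    static long groupEstimate(long keyNdv, long ndvA, long ndvB) {
        // NDV product across the GROUP BY keys (each NDV=1 column
        // already carries its +1 for the NULL bucket, hence 2).
        return keyNdv * ndvA * ndvB;
    }

    static double estimatedMb(long groups, long bytesPerRow) {
        return groups * bytesPerRow / (1024.0 * 1024);
    }

    public static void main(String[] args) {
        long groups = groupEstimate(100_000, 2, 2);        // 400,000 estimated groups
        double mb = estimatedMb(groups, 34);               // ~13 MB under the assumed row weight
        System.out.println(groups);
        System.out.println(mb > 10);                       // above the 10 MB map join threshold
    }
}
```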



What changes were proposed in this pull request?
HIVE-29556: do not adjust "unknown" NDVs in extractNDVGroupingColumns
Why are the changes needed?
Does this PR introduce any user-facing change?
How was this patch tested?