
HIVE-29556: do not adjust "unknown" NDVs in extractNDVGroupingColumns#6418

Open
konstantinb wants to merge 3 commits into apache:master from konstantinb:HIVE-29556

Conversation

@konstantinb (Contributor)

What changes were proposed in this pull request?

HIVE-29556: do not adjust "unknown" NDVs in extractNDVGroupingColumns

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

@konstantinb (Contributor Author)

Refreshing parquet_types_non_dictionary_encoding_vectorization.q.out was missed in the original impacted-set run. The 14-line diff is in a side query (SELECT hex(cbinary), count(*) FROM parquet_types_n1 GROUP BY cbinary), not the test's main subject — cbinary is a binary column, and Hive's getColStatistics BINARY branch never populates countDistinct, so it arrives at
extractNDVGroupingColumns with the canonical (NDV=0, numNulls>0) unknown-NDV signature this PR targets.

The new estimate is also empirically more accurate. The actual data in the test (visible from the SELECT rows in the .out) has 36 distinct non-NULL binary values plus 1 NULL bucket = 37 actual GROUP BY groups. Estimates:

| Scenario | Estimate | Error |
| --- | --- | --- |
| Master (+1 of 0) | 1 | 37× under |
| This PR (heuristic fallback at hash) | 150 | ~4× over |
| This PR (heuristic at mergepartial) | 75 | ~2× over |

@sonarqubecloud (Bot) commented May 1, 2026

@konstantinb (Contributor Author)

@zabetak — picking up on our earlier discussion in #6359, I'd like to walk through what this revision actually does and how it lines up with the concerns you raised there. The PR's primary subject
(the extractNDVGroupingColumns over-confident +1 for unknown NDV) is fixed in a way that resolves the const-NULL edge case you flagged, without an architectural refactor.

What was done

A single in-memory boolean field isConst is added to ColStatistics, set at exactly the sites that produce a verifiably-zero NDV:

1. `StatsUtils.buildColStatForConstant` — every constant projection (NULL or non-NULL literal).
2. `StatsUtils.getColStatistics` BOOLEAN branch, all-NULL sub-case (`numTrues == 0 && numFalses == 0`).
3. `StatsUtils.getColStatistics` BOOLEAN branch, verified single-value sub-case (numNulls == 0 with one of {numTrues, numFalses} verified positive) — covers all-TRUE and all-FALSE.
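The three producer conditions above can be sketched as a small predicate. This is an illustrative stand-in, not Hive's actual code; the method names and the `countsVerified` parameter are hypothetical:

```java
// Illustrative sketch of the producer-site conditions described above.
// NOT Hive's actual code; names here are hypothetical.
public final class ConstFlagSketch {

    // Site 1: constant projections (NULL or non-NULL literal) are always const.
    public static boolean isConstForConstantProjection() {
        return true;
    }

    // Sites 2 and 3: the BOOLEAN-column sub-cases. countsVerified means
    // numTrues/numFalses come from verified statistics.
    public static boolean isConstForBooleanColumn(long numTrues, long numFalses,
                                                  long numNulls, boolean countsVerified) {
        // Site 2: all-NULL column -- zero TRUEs and zero FALSEs.
        if (numTrues == 0 && numFalses == 0) {
            return true;
        }
        // Site 3: verified single-value column -- no NULLs and exactly one of
        // {numTrues, numFalses} positive (covers all-TRUE and all-FALSE).
        return countsVerified && numNulls == 0
            && ((numTrues > 0) != (numFalses > 0));
    }
}
```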

PessimisticStatCombiner.add AND-clears the flag across multi-branch combinations:

  result.setConst(result.isConst() && stat.isConst());

This directly closes the gotcha you flagged on April 2: "if in the CASE statement you have one branch with unknown NDV and another that is the NULL constant the result of the combiner will appear to other consumers as a NULL constant". With the AND-clear, an unknown branch combined with a NULL-literal branch produces a result with isConst = false, so downstream consumers see "unknown" rather than "verified const-NULL".
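The AND-clear semantics can be modeled with a toy combiner (a simplified stand-in operating on bare booleans, not on Hive's ColStatistics objects):

```java
// Toy model of the AND-clear rule described above; the real
// PessimisticStatCombiner folds the flag across ColStatistics branches.
public final class CombinerSketch {

    /** The combined result is const only if EVERY branch is const. */
    public static boolean combineConstFlags(boolean... branchIsConst) {
        boolean result = true;
        for (boolean b : branchIsConst) {
            result = result && b; // the AND-clear rule
        }
        return result;
    }
}
```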

The only reader of the flag is extractNDVGroupingColumns, which uses one unified condition for the +1 NULL-bucket adjustment:

  if (cs.getNumNulls() > 0 && (ndv > 0 || cs.isConst())) {
    ndv = StatsUtils.safeAdd(ndv, 1);
  }

This applies the +1 for verified const-NULL keys (yielding the exact 1 group) and skips for unknown-NDV columns (yielding 0 → caller falls back to parentNumRows / 2).
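The resulting three-way behavior can be checked with a small model of the reader's condition. This is illustrative only; `safeAdd` here is a simplified stand-in for StatsUtils.safeAdd:

```java
// Model of the unified +1 condition in extractNDVGroupingColumns.
public final class NdvAdjustSketch {

    // Simplified stand-in for StatsUtils.safeAdd: clamp instead of overflow.
    static long safeAdd(long a, long b) {
        long r = a + b;
        return r < a ? Long.MAX_VALUE : r;
    }

    /** Add the NULL bucket only when the NDV is known-positive or the
     *  column is a verified constant. */
    public static long adjustedNdv(long ndv, long numNulls, boolean isConst) {
        if (numNulls > 0 && (ndv > 0 || isConst)) {
            ndv = safeAdd(ndv, 1);
        }
        return ndv; // 0 means "unknown": caller falls back to parentNumRows / 2
    }
}
```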

Impact surface

ColStatistics is a plain Java POJO in org.apache.hadoop.hive.ql.plan — no Thrift inheritance, no Serializable, no metastore persistence. The field is purely in-memory query-plan state within a
single optimizer pass. The metastore Thrift schema (ColumnStatisticsObj, the per-type stats data classes) is unchanged; existing persisted column stats are unaffected; cross-engine consumers (Spark
/ Iceberg / Trino bridges) see the same wire format.

Touched sites in production code:

  • ColStatistics: new field + getter/setter, propagated in clone() and dumped in toString() — the same housekeeping every other boolean field on ColStatistics already carries (mirrors
    isEstimated / isFilteredColumn).
  • StatsUtils: 3 producer sites where setConst(true) is called — buildColStatForConstant (constant projections), and two sub-cases of the getColStatistics BOOLEAN branch (all-NULL + verified
    single-value).
  • PessimisticStatCombiner.add: 1 line — the AND-clear rule.
  • extractNDVGroupingColumns: 1 read site (the only consumer).

Net flag-aware logic: 4 sites (3 producers + 1 combiner rule) + 1 reader. Everything else is one-time bookkeeping that any new boolean field on ColStatistics carries.

On your March 25 concern

You wrote on March 25: "If we add a new field it means that we need to keep them up to date in various places where ColStatistics are used so I would prefer to avoid this if possible." That concern
is correct in principle — and worth weighing against the alternative below. In practice, with this PR's flag-aware surface bounded to 4 sites + 1 reader, and the default-safe failure mode (a missing
setConst(true) call leaves the recovery dormant rather than introducing wrong answers), the maintenance cost is one-time and finite rather than open-ended.

A side benefit for #6359 itself

With isConst in place, the StatEstimator interface refactor in #6359 (adding a numRows parameter to estimate(...)) becomes unnecessary. Two motivations for that refactor dissolve:

  • The min(rows, sum NDV) bound enforcement — your March 25 suggestion: "A GenericUDF can never have an NDV > numRows so instead of putting the responsibility to the implementors of StatEstimator
    we could enforce the bound here. This would avoid the changes in the existing APIs."
    Applying the cap at the call site keeps estimate(parents) unchanged.
  • The combiner's need to detect const-NULL via signature (which required parent rows internally to check numNulls == numRows) — closed by isConst being explicit in ColStatistics. The combiner
    reads the flag directly; no signature inference, no parent-row plumbing needed.

Net effect: #6359 can drop the API change and stay focused on its stated goal (the combiner improvement) — directly addressing your March 25 "There is no reason to add a default implementation for
this method. Leaving things as they are does not break anything."

Alternative considered

The architecturally complete fix is the one you yourself suggested on March 26: "everything would be simpler if we could use -1 for NDV to declare unknown as it happens for the other stats. This is
probably a bigger change to digest so let's not go into this direction for now."
I agree it's the right long-term answer; concretely the cost is:

  • ~20+ read sites of ColStatistics.getCountDistint() across optimizer/, stats/, etc. — most check <= 0 or == 0 under the implicit "0 means unknown" assumption and would each need
    case-by-case re-evaluation;
  • the metastore Thrift schema's per-type stats data (getNumDVs() returning long with implicit 0 default) — definitionally tangles "verified zero" with "unset";
  • backward compatibility for already-persisted column stats, where existing 0 values become ambiguous between "verified zero" and "old/unset";
  • cross-engine coordination (Spark / Iceberg / Trino consume Hive's column stats — semantic shift in NDV=0 is a wire-compatibility break);
  • regenerating likely hundreds of .q.out goldens that print NDV in EXPLAIN annotations.

vs isConst's 4 flag-aware sites / 1 reader / no schema or wire change.

I think this revision is the right balance for the cost/benefit ratio, and on March 26 in the same thread you indicated you'd accept this approach: "I am OK to revert to your original proposal of
adding a new boolean field in ColStatistics object for marking constants. Leaving the final choice to you :)"
Happy to discuss further if any of the above raises concerns.

konstantinb marked this pull request as ready for review May 1, 2026 21:50
@zabetak (Member)

zabetak commented May 4, 2026

@konstantinb Thanks for the detailed answer.

NULL is not an actual value, so I am not sure it should be reflected in the NDV. In other words, maybe we should not adjust the NDV at all, independent of whether it is constant or not. I think we have discussed this at some point but I don't remember if we concluded somewhere. Have you explored this alternative?

@konstantinb (Contributor Author)

konstantinb commented May 4, 2026

> @konstantinb Thanks for the detailed answer.
>
> NULL is not an actual value so not sure if it should be reflected in the NDV. In other words, maybe we should not adjust NDV at all independent if it is constant or not. I think we have discussed this at some point but don't remember if we concluded somewhere. Have you explored this alternative?

@zabetak — sorry for not following up sooner on your question in #6359. You raised the possibility of dropping the +1 for the NDV > 0 case entirely rather than narrowing it. I wanted to show a concrete case where that matters before going that route.

The +1 dates to HIVE-5369 (Prasanth Jayachandran, 2013), introduced with the comment "if NULLs exists, add 1 to distinct count":

  long dvs = cs.getCountDistint();
  // if NULLs exists, add 1 to distinct count
  if (cs.getNumNulls() > 0) {
    dvs += 1;
  }

The reason is GROUP BY semantics: a nullable column with NDV > 0 produces one extra output group for NULL rows. Dropping the +1 means that extra group disappears from the estimate.

I added ql/src/test/queries/clientpositive/groupby_null_bucket.q to make this concrete. The scenario: two columns each with NDV=1 but a large null fraction (the pattern you get from LEFT JOINs with many unmatched rows), grouped alongside a high-cardinality key:

| Column | NDV | numNulls |
| --- | --- | --- |
| product_id | 100,000 | 0 |
| funnel_step | 1 | 480,000,000 |
| device_type | 1 | 450,000,000 |

With +1: NDV product = 100K × 2 × 2 = 400K → 13 MB → above 10 MB map join threshold → Merge Join
Without +1: NDV product = 100K × 1 × 1 = 100K → 3.3 MB → below threshold → Map Join (4× underestimate, OOM risk on the actual ~400K-row result)
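The threshold arithmetic above can be reproduced directly. The ~33-byte average row size is back-computed from the data sizes in the plan excerpts, and the 10 MB threshold stands in for the map join size limit in the test config; both constants are illustrative:

```java
// Reproduces the with-+1 vs without-+1 group-count arithmetic above.
public final class ThresholdSketch {

    static final long MAP_JOIN_THRESHOLD_BYTES = 10L * 1000 * 1000; // illustrative
    static final long AVG_ROW_BYTES = 33; // back-computed from the plan's data sizes

    /** NDV product over grouping columns; each column is {ndv, numNulls}.
     *  plusOne applies the HIVE-5369 NULL-bucket adjustment. */
    public static long groupEstimate(boolean plusOne, long[][] columns) {
        long product = 1;
        for (long[] col : columns) {
            long ndv = col[0];
            if (plusOne && col[1] > 0) {
                ndv += 1; // extra GROUP BY output group for the NULL rows
            }
            product *= ndv;
        }
        return product;
    }
}
```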

What would be removed:

    long ndv = cs.getCountDistint();
  - if (cs.getNumNulls() > 0) {
  -   ndv = StatsUtils.safeAdd(ndv, 1);
  - }
    ndvValues.add(ndv);

Plan change (key excerpts):

With +1:
Edges:
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Reducer 3 <- Map 4 (SIMPLE_EDGE), Reducer 2 (SIMPLE_EDGE)
...
Group By Operator [mergepartial]
Statistics: Num rows: 400000 Data size: 13064238
...
Merge Join Operator

Without +1:
Edges:
Map 3 <- Reducer 2 (BROADCAST_EDGE)
Reducer 2 <- Map 1 (SIMPLE_EDGE)
...
Group By Operator [mergepartial]
Statistics: Num rows: 100000 Data size: 3266238
...
Map Join Operator

groupby_null_bucket.q.txt
groupby_null_bucket.q.out.WITHOUT_PLUS1.txt
groupby_null_bucket.q.out.WITH_PLUS1.txt

Could you please let me know whether you think this is a realistic scenario or an artificial corner case that can safely be ignored?
