
HIVE-29556: do not adjust "unknown" NDVs in extractNDVGroupingColumns#6418

Open
konstantinb wants to merge 3 commits into apache:master from konstantinb:HIVE-29556

Conversation

@konstantinb (Contributor)

What changes were proposed in this pull request?

HIVE-29556: do not adjust "unknown" NDVs in extractNDVGroupingColumns

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

@konstantinb (Contributor Author)

Refreshing parquet_types_non_dictionary_encoding_vectorization.q.out was missed in the original impacted-set run. The 14-line diff is in a side query (SELECT hex(cbinary), count(*) FROM parquet_types_n1 GROUP BY cbinary), not the test's main subject — cbinary is a binary column, and Hive's getColStatistics BINARY branch never populates countDistinct, so it arrives at
extractNDVGroupingColumns with the canonical (NDV=0, numNulls>0) unknown-NDV signature this PR targets.

The new estimate is also empirically more accurate. The actual data in the test (visible from the SELECT rows in the .out) has 36 distinct non-NULL binary values plus 1 NULL bucket = 37 actual GROUP BY groups. Estimates:

| Scenario | Estimate | Error |
| --- | --- | --- |
| Master (+1 of 0) | 1 | 37× under |
| This PR (heuristic fallback at hash) | 150 | ~4× over |
| This PR (heuristic at mergepartial) | 75 | ~2× over |

@sonarqubecloud (Bot) commented May 1, 2026

@konstantinb (Contributor Author)

@zabetak — picking up on our earlier discussion in #6359, I'd like to walk through what this revision actually does and how it lines up with the concerns you raised there. The PR's primary subject
(the extractNDVGroupingColumns over-confident +1 for unknown NDV) is fixed in a way that resolves the const-NULL edge case you flagged, without an architectural refactor.

What was done

A single in-memory boolean field isConst is added to ColStatistics, set at exactly the sites that produce a verifiably-zero NDV:

1. `StatsUtils.buildColStatForConstant` — every constant projection (NULL or non-NULL literal).
2. `StatsUtils.getColStatistics` BOOLEAN branch, all-NULL sub-case (`numTrues == 0 && numFalses == 0`).
3. `StatsUtils.getColStatistics` BOOLEAN branch, verified single-value sub-case (numNulls == 0 with one of {numTrues, numFalses} verified positive) — covers all-TRUE and all-FALSE.
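The three producer conditions above can be sketched as a small predicate. This is an illustrative stand-in, not Hive's actual code; the method names and the `countsVerified` parameter are hypothetical:

```java
// Illustrative sketch of the producer-site conditions described above.
// NOT Hive's actual code; names here are hypothetical.
public final class ConstFlagSketch {

    // Site 1: constant projections (NULL or non-NULL literal) are always const.
    public static boolean isConstForConstantProjection() {
        return true;
    }

    // Sites 2 and 3: the BOOLEAN-column sub-cases. countsVerified means
    // numTrues/numFalses come from verified statistics.
    public static boolean isConstForBooleanColumn(long numTrues, long numFalses,
                                                  long numNulls, boolean countsVerified) {
        // Site 2: all-NULL column -- zero TRUEs and zero FALSEs.
        if (numTrues == 0 && numFalses == 0) {
            return true;
        }
        // Site 3: verified single-value column -- no NULLs and exactly one of
        // {numTrues, numFalses} positive (covers all-TRUE and all-FALSE).
        return countsVerified && numNulls == 0
            && ((numTrues > 0) != (numFalses > 0));
    }
}
```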

PessimisticStatCombiner.add AND-clears the flag across multi-branch combinations:

  result.setConst(result.isConst() && stat.isConst());

This directly closes the gotcha you flagged on April 2: "if in the CASE statement you have one branch with unknown NDV and another that is the NULL constant the result of the combiner will appear to other consumers as a NULL constant". With the AND-clear, an unknown branch combined with a NULL-literal branch produces a result with isConst = false, so downstream consumers see "unknown" rather than "verified const-NULL".
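The AND-clear semantics can be modeled with a toy combiner (a simplified stand-in operating on bare booleans, not on Hive's ColStatistics objects):

```java
// Toy model of the AND-clear rule described above; the real
// PessimisticStatCombiner folds the flag across ColStatistics branches.
public final class CombinerSketch {

    /** The combined result is const only if EVERY branch is const. */
    public static boolean combineConstFlags(boolean... branchIsConst) {
        boolean result = true;
        for (boolean b : branchIsConst) {
            result = result && b; // the AND-clear rule
        }
        return result;
    }
}
```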

The only reader of the flag is extractNDVGroupingColumns, which uses one unified condition for the +1 NULL-bucket adjustment:

  if (cs.getNumNulls() > 0 && (ndv > 0 || cs.isConst())) {
    ndv = StatsUtils.safeAdd(ndv, 1);
  }

This applies the +1 for verified const-NULL keys (yielding the exact 1 group) and skips for unknown-NDV columns (yielding 0 → caller falls back to parentNumRows / 2).
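The resulting three-way behavior can be checked with a small model of the reader's condition. This is illustrative only; `safeAdd` here is a simplified stand-in for StatsUtils.safeAdd:

```java
// Model of the unified +1 condition in extractNDVGroupingColumns.
public final class NdvAdjustSketch {

    // Simplified stand-in for StatsUtils.safeAdd: clamp instead of overflow.
    static long safeAdd(long a, long b) {
        long r = a + b;
        return r < a ? Long.MAX_VALUE : r;
    }

    /** Add the NULL bucket only when the NDV is known-positive or the
     *  column is a verified constant. */
    public static long adjustedNdv(long ndv, long numNulls, boolean isConst) {
        if (numNulls > 0 && (ndv > 0 || isConst)) {
            ndv = safeAdd(ndv, 1);
        }
        return ndv; // 0 means "unknown": caller falls back to parentNumRows / 2
    }
}
```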

Impact surface

ColStatistics is a plain Java POJO in org.apache.hadoop.hive.ql.plan — no Thrift inheritance, no Serializable, no metastore persistence. The field is purely in-memory query-plan state within a
single optimizer pass. The metastore Thrift schema (ColumnStatisticsObj, the per-type stats data classes) is unchanged; existing persisted column stats are unaffected; cross-engine consumers (Spark
/ Iceberg / Trino bridges) see the same wire format.

Touched sites in production code:

  • ColStatistics: new field + getter/setter, propagated in clone() and dumped in toString() — the same housekeeping every other boolean field on ColStatistics already carries (mirrors
    isEstimated / isFilteredColumn).
  • StatsUtils: 3 producer sites where setConst(true) is called — buildColStatForConstant (constant projections), and two sub-cases of the getColStatistics BOOLEAN branch (all-NULL + verified
    single-value).
  • PessimisticStatCombiner.add: 1 line — the AND-clear rule.
  • extractNDVGroupingColumns: 1 read site (the only consumer).

Net flag-aware logic: 4 sites (3 producers + 1 combiner rule) + 1 reader. Everything else is one-time bookkeeping that any new boolean field on ColStatistics carries.

On your March 25 concern

You wrote on March 25: "If we add a new field it means that we need to keep them up to date in various places where ColStatistics are used so I would prefer to avoid this if possible." That concern
is correct in principle — and worth weighing against the alternative below. In practice, with this PR's flag-aware surface bounded to 4 sites + 1 reader, and the default-safe failure mode (a missing
setConst(true) call leaves the recovery dormant rather than introducing wrong answers), the maintenance cost is one-time and finite rather than open-ended.

A side benefit for #6359 itself

With isConst in place, the StatEstimator interface refactor in #6359 (adding a numRows parameter to estimate(...)) becomes unnecessary. Two motivations for that refactor dissolve:

  • The min(rows, sum NDV) bound enforcement — your March 25 suggestion: "A GenericUDF can never have an NDV > numRows so instead of putting the responsibility to the implementors of StatEstimator
    we could enforce the bound here. This would avoid the changes in the existing APIs."
    Applying the cap at the call site keeps estimate(parents) unchanged.
  • The combiner's need to detect const-NULL via signature (which required parent rows internally to check numNulls == numRows) — closed by isConst being explicit in ColStatistics. The combiner
    reads the flag directly; no signature inference, no parent-row plumbing needed.

Net effect: #6359 can drop the API change and stay focused on its stated goal (the combiner improvement) — directly addressing your March 25 "There is no reason to add a default implementation for
this method. Leaving things as they are does not break anything."

Alternative considered

The architecturally complete fix is the one you yourself suggested on March 26: "everything would be simpler if we could use -1 for NDV to declare unknown as it happens for the other stats. This is
probably a bigger change to digest so let's not go into this direction for now."
I agree it's the right long-term answer; concretely the cost is:

  • ~20+ read sites of ColStatistics.getCountDistint() across optimizer/, stats/, etc. — most check <= 0 or == 0 under the implicit "0 means unknown" assumption and would each need
    case-by-case re-evaluation;
  • the metastore Thrift schema's per-type stats data (getNumDVs() returning long with implicit 0 default) — definitionally tangles "verified zero" with "unset";
  • backward compatibility for already-persisted column stats, where existing 0 values become ambiguous between "verified zero" and "old/unset";
  • cross-engine coordination (Spark / Iceberg / Trino consume Hive's column stats — semantic shift in NDV=0 is a wire-compatibility break);
  • regenerating likely hundreds of .q.out goldens that print NDV in EXPLAIN annotations.

vs isConst's 4 flag-aware sites / 1 reader / no schema or wire change.

I think this revision is the right balance for the cost/benefit ratio, and on March 26 in the same thread you indicated you'd accept this approach: "I am OK to revert to your original proposal of
adding a new boolean field in ColStatistics object for marking constants. Leaving the final choice to you :)"
Happy to discuss further if any of the above raises concerns.

konstantinb marked this pull request as ready for review May 1, 2026 21:50
@zabetak (Member)

zabetak commented May 4, 2026

@konstantinb Thanks for the detailed answer.

NULL is not an actual value, so I am not sure it should be reflected in the NDV. In other words, maybe we should not adjust the NDV at all, independent of whether it is constant or not. I think we have discussed this at some point but I don't remember if we concluded somewhere. Have you explored this alternative?

@konstantinb (Contributor Author)

konstantinb commented May 4, 2026

> @konstantinb Thanks for the detailed answer.
>
> NULL is not an actual value so not sure if it should be reflected in the NDV. In other words, maybe we should not adjust NDV at all independent if it is constant or not. I think we have discussed this at some point but don't remember if we concluded somewhere. Have you explored this alternative?

@zabetak — sorry for not following up sooner on your question in #6359. You raised the possibility of dropping the +1 for the NDV > 0 case entirely rather than narrowing it. I wanted to show a concrete case where that matters before going that route.

The +1 dates to HIVE-5369 (Prasanth Jayachandran, 2013), introduced with the comment "if NULLs exists, add 1 to distinct count":

  long dvs = cs.getCountDistint();
  // if NULLs exists, add 1 to distinct count
  if (cs.getNumNulls() > 0) {
    dvs += 1;
  }

The reason is GROUP BY semantics: a nullable column with NDV > 0 produces one extra output group for NULL rows. Dropping the +1 means that extra group disappears from the estimate.

I added ql/src/test/queries/clientpositive/groupby_null_bucket.q to make this concrete. The scenario: two columns each with NDV=1 but a large null fraction (the pattern you get from LEFT JOINs with many unmatched rows), grouped alongside a high-cardinality key:

| Column | NDV | numNulls |
| --- | --- | --- |
| product_id | 100,000 | 0 |
| funnel_step | 1 | 480,000,000 |
| device_type | 1 | 450,000,000 |

With +1: NDV product = 100K × 2 × 2 = 400K → 13 MB → above 10 MB map join threshold → Merge Join
Without +1: NDV product = 100K × 1 × 1 = 100K → 3.3 MB → below threshold → Map Join (4× underestimate, OOM risk on the actual ~400K-row result)
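The threshold arithmetic above can be reproduced directly. The ~33-byte average row size is back-computed from the data sizes in the plan excerpts, and the 10 MB threshold stands in for the map join size limit in the test config; both constants are illustrative:

```java
// Reproduces the with-+1 vs without-+1 group-count arithmetic above.
public final class ThresholdSketch {

    static final long MAP_JOIN_THRESHOLD_BYTES = 10L * 1000 * 1000; // illustrative
    static final long AVG_ROW_BYTES = 33; // back-computed from the plan's data sizes

    /** NDV product over grouping columns; each column is {ndv, numNulls}.
     *  plusOne applies the HIVE-5369 NULL-bucket adjustment. */
    public static long groupEstimate(boolean plusOne, long[][] columns) {
        long product = 1;
        for (long[] col : columns) {
            long ndv = col[0];
            if (plusOne && col[1] > 0) {
                ndv += 1; // extra GROUP BY output group for the NULL rows
            }
            product *= ndv;
        }
        return product;
    }
}
```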

What would be removed:

    long ndv = cs.getCountDistint();
  - if (cs.getNumNulls() > 0) {
  -   ndv = StatsUtils.safeAdd(ndv, 1);
  - }
    ndvValues.add(ndv);

Plan change (key excerpts):

With +1:
Edges:
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Reducer 3 <- Map 4 (SIMPLE_EDGE), Reducer 2 (SIMPLE_EDGE)
...
Group By Operator [mergepartial]
Statistics: Num rows: 400000 Data size: 13064238
...
Merge Join Operator

Without +1:
Edges:
Map 3 <- Reducer 2 (BROADCAST_EDGE)
Reducer 2 <- Map 1 (SIMPLE_EDGE)
...
Group By Operator [mergepartial]
Statistics: Num rows: 100000 Data size: 3266238
...
Map Join Operator

groupby_null_bucket.q.txt
groupby_null_bucket.q.out.WITHOUT_PLUS1.txt
groupby_null_bucket.q.out.WITH_PLUS1.txt

Could you please let me know whether you think this is a realistic scenario or an artificial corner case that can safely be ignored?
