HIVE-29625: Disambiguate ColStatistics.countDistinct "unknown" from "verified zero"#6505
HIVE-29625: Disambiguate ColStatistics.countDistinct "unknown" from "verified zero"#6505konstantinb wants to merge 8 commits into
Conversation
|
@zabetak I believe that this PR is a much cleaner alternative to #6418 On top of that, it simplifies the fixes required for PessimisticStatCombiner. The original PR #6359 would become much smaller if this is accepted. |
|
@zabetak there are two more considerations. One relates to "const null" column statistics to which buildColStatForConstant() assigns an NDV of 0 while the Hive metastore saves such columns with an NDV of 1: results in the describe output of The second topic is about a truly narrow case when the NDV is unknown, but numNulls is known and is either equal to numRows or is equal to (numRows-1). Technically, the NDV can be accurately inferred as 0 or 1 in those cases, even for binary columns. The PR into this branch shows a possible approach: konstantinb#1 , but it seems excessive to me; I'd appreciate knowing your opinion on the matter |
|



What changes were proposed in this pull request?
HIVE-29625: Disambiguate ColStatistics.countDistinct "unknown" from "verified zero"
Establishes -1 as the unknown NDV sentinel for ColStatistics.countDistint (NDV/countDistinct). The proto-to-ColStatistics conversion emits -1 when the underlying NDV is unavailable (rather than the previous default of 0), and consumers of getCountDistint() are updated to apply an appropriate fallback for the unknown state instead of treating it as 0.
Why are the changes needed?
ColStatistics.countDistint == 0 was overloaded to mean both no distinct values and unknown NDV. The two states have opposite implications for cost-based planning — a real zero supports tight estimates, while a genuinely-unknown NDV needs a conservative fallback. Treating them identically led to inconsistent and sometimes catastrophic cardinality estimates whenever a column's NDV was unavailable. Disambiguating the sentinel lets each consumer apply the correct logic.
Does this PR introduce any user-facing change?
No. Query results, SQL syntax, and configuration are unchanged. Plan estimates in EXPLAIN output may differ for queries reading columns whose NDV is unavailable, since the planner now distinguishes those from columns with 0 distinct values.
How was this patch tested?
Added unit tests for each consumer-side change and updated .q.out goldens for queries whose plans shifted as a result of the disambiguation. All affected test classes pass.