HIVE-29625: Disambiguate ColStatistics.countDistinct "unknown" from "verified zero" by konstantinb · Pull Request #6505 · apache/hive

konstantinb · 2026-05-21T22:54:01Z

What changes were proposed in this pull request?

HIVE-29625: Disambiguate ColStatistics.countDistinct "unknown" from "verified zero"

Establishes -1 as the unknown NDV sentinel for ColStatistics.countDistint (NDV/countDistinct). The proto-to-ColStatistics conversion emits -1 when the underlying NDV is unavailable (rather than the previous default of 0), and consumers of getCountDistint() are updated to apply an appropriate fallback for the unknown state instead of treating it as 0.
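The sentinel convention described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the class and method names (`NdvSentinelSketch`, `fromProto`, `ndvOrFallback`) are hypothetical, a nullable `Long` stands in for the proto field, and falling back to the row count is only one possible consumer policy assumed here for the example.

```java
final class NdvSentinelSketch {
    /** Sentinel for "NDV is unknown"; the value -1 comes from the PR description. */
    static final long UNKNOWN_NDV = -1L;

    /**
     * Hypothetical stand-in for the proto-to-ColStatistics conversion:
     * a missing NDV becomes -1 instead of silently defaulting to 0.
     */
    static long fromProto(Long protoNdv) {
        return protoNdv == null ? UNKNOWN_NDV : protoNdv;
    }

    /**
     * Hypothetical consumer. A known NDV (including a verified 0) is used
     * as-is; an unknown NDV falls back to the row count. The choice of
     * fallback is an assumption of this sketch.
     */
    static long ndvOrFallback(long countDistinct, long numRows) {
        if (countDistinct < 0) {
            return Math.max(numRows, 0L);
        }
        return countDistinct;
    }

    public static void main(String[] args) {
        long unknown = fromProto(null);
        long verifiedZero = fromProto(0L);
        System.out.println(ndvOrFallback(unknown, 1000L));
        System.out.println(ndvOrFallback(verifiedZero, 1000L));
    }
}
```

The point of the sketch is that the two inputs take different branches, which is not possible when both are encoded as 0.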

Why are the changes needed?

ColStatistics.countDistint == 0 was overloaded to mean both no distinct values and unknown NDV. The two states have opposite implications for cost-based planning — a real zero supports tight estimates, while a genuinely-unknown NDV needs a conservative fallback. Treating them identically led to inconsistent and sometimes catastrophic cardinality estimates whenever a column's NDV was unavailable. Disambiguating the sentinel lets each consumer apply the correct logic.
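To make the difference between the two states concrete, here is a small sketch built on the textbook equi-join estimate |R ⋈ S| ≈ |R|·|S| / max(ndv(R.k), ndv(S.k)). It is not Hive's estimator: the class `JoinEstimateSketch`, the method `joinRows`, and the policy of substituting the row count for an unknown NDV are all assumptions made for illustration.

```java
final class JoinEstimateSketch {
    /** Sentinel for "NDV is unknown" (-1, as in the PR description). */
    static final long UNKNOWN_NDV = -1L;

    /**
     * Hypothetical join-cardinality estimate.
     * A verified NDV of 0 means the key column has no non-null values,
     * so no rows can match. An unknown NDV is replaced by the row count,
     * which is an assumption of this sketch.
     */
    static long joinRows(long leftRows, long rightRows, long leftNdv, long rightNdv) {
        if (leftNdv == 0L || rightNdv == 0L) {
            return 0L; // verified zero: tight estimate
        }
        long l = (leftNdv == UNKNOWN_NDV) ? leftRows : leftNdv;
        long r = (rightNdv == UNKNOWN_NDV) ? rightRows : rightNdv;
        long denom = Math.max(1L, Math.max(l, r)); // guard against division by zero
        return leftRows * rightRows / denom;
    }

    public static void main(String[] args) {
        System.out.println(joinRows(1000L, 500L, 0L, 100L));          // verified zero on the left
        System.out.println(joinRows(1000L, 500L, UNKNOWN_NDV, 100L)); // unknown on the left
        System.out.println(joinRows(1000L, 500L, 10L, 100L));         // both known
    }
}
```

If unknown were still encoded as 0, the first branch in this sketch would also fire for columns whose NDV is merely unavailable, and the join would be estimated at 0 rows.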

Does this PR introduce any user-facing change?

No. Query results, SQL syntax, and configuration are unchanged. Plan estimates in EXPLAIN output may differ for queries reading columns whose NDV is unavailable, since the planner now distinguishes those from columns with 0 distinct values.

How was this patch tested?

Added unit tests for each consumer-side change and updated .q.out goldens for queries whose plans shifted as a result of the disambiguation. All affected test classes pass.


konstantinb · 2026-05-29T22:50:11Z

@zabetak I believe this PR is a much cleaner alternative to #6418.
Although it changes more source files, it affects far fewer test outputs and cleanly separates "unknown" from "verified 0" NDV values. It also naturally fixes the extractNDVGroupingColumns() issue reported as HIVE-29556.

On top of that, it simplifies the fixes required for PessimisticStatCombiner. The original PR #6359 would become much smaller if this is accepted.

konstantinb · 2026-05-29T23:09:22Z

@zabetak there are two more considerations. The first concerns "const null" column statistics: buildColStatForConstant() assigns such columns an NDV of 0, while the Hive metastore stores them with an NDV of 1:

CREATE TABLE test_const_null_ndv (i INT, s STRING) STORED AS ORC;
INSERT INTO test_const_null_ndv VALUES (NULL, 'a'), (NULL, 'b');
DESCRIBE FORMATTED test_const_null_ndv i;

produces the following DESCRIBE output:

POSTHOOK: Input: default@test_const_null_ndv
col_name            	i                   
data_type           	int                 
min                 	                    
max                 	                    
num_nulls           	2                   
distinct_count      	1                   
avg_col_len         	                    
max_col_len         	                    
num_trues           	                    
num_falses          	                    
bit_vector          	HL                  
comment             	from deserializer   
COLUMN_STATS_ACCURATE	{\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"i\":\"true\",\"s\":\"true\"}}

The second concerns a truly narrow case: the NDV is unknown, but numNulls is known and equals either numRows or numRows - 1. Technically, the NDV can then be inferred exactly as 0 or 1 respectively, even for binary columns. The PR into this branch, konstantinb#1, shows a possible approach, but it seems excessive to me. I'd appreciate your opinion on the matter.
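The inference described in this comment can be sketched in a few lines. This is not the code from konstantinb#1: `NdvFromNullsSketch` and `inferNdv` are hypothetical names, and treating negative numRows or numNulls as "unknown" is an assumption of the sketch.

```java
final class NdvFromNullsSketch {
    /** Sentinel for "NDV is unknown" (-1, as in the PR description). */
    static final long UNKNOWN_NDV = -1L;

    /**
     * Returns the NDV if it is known or can be inferred from null counts,
     * otherwise UNKNOWN_NDV.
     * - numNulls == numRows     -> no non-null values, NDV is 0
     * - numNulls == numRows - 1 -> exactly one non-null value, NDV is 1
     */
    static long inferNdv(long countDistinct, long numNulls, long numRows) {
        if (countDistinct != UNKNOWN_NDV) {
            return countDistinct; // already known, nothing to infer
        }
        if (numRows < 0L || numNulls < 0L) {
            return UNKNOWN_NDV; // assumption: negative counts mean "unknown"
        }
        long nonNull = numRows - numNulls;
        if (nonNull == 0L) {
            return 0L;
        }
        if (nonNull == 1L) {
            return 1L;
        }
        return UNKNOWN_NDV; // two or more non-null values: NDV cannot be inferred
    }

    public static void main(String[] args) {
        System.out.println(inferNdv(UNKNOWN_NDV, 10L, 10L));
        System.out.println(inferNdv(UNKNOWN_NDV, 9L, 10L));
        System.out.println(inferNdv(UNKNOWN_NDV, 5L, 10L));
    }
}
```

Note that this follows the 0-or-1 inference stated in the comment; it does not attempt to reconcile the all-NULL case with the metastore's distinct_count of 1 shown in the DESCRIBE output.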

sonarqubecloud · 2026-05-30T16:41:44Z

Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

asf-ci-hive added tests pending tests unstable and removed tests pending tests unstable labels May 21, 2026

konstantinb added 3 commits May 25, 2026 08:31

HIVE-29625: Disambiguate ColStatistics.countDistinct "unknown" from "…

bed032d

…verified zero"

HIVE-29625: impacted .out files + reverting an unintended edit

124abe6

HIVE-29625: itest code, small tweaks + better code reuse

f5b19e9

konstantinb force-pushed the HIVE-29625 branch from 43f74c2 to f5b19e9 Compare May 25, 2026 15:48

asf-ci-hive added tests pending tests passed and removed tests unstable tests pending labels May 25, 2026

HIVE-29625: SQ feedback + better test code

349be2f

asf-ci-hive added tests pending tests passed and removed tests passed tests pending labels May 27, 2026

HIVE-29625: more SQ feedback

03960fd

asf-ci-hive added tests pending tests unstable and removed tests passed tests pending labels May 28, 2026

Merge branch 'master' into HIVE-29625

5b2bc13

asf-ci-hive added tests pending tests unstable and removed tests unstable tests pending labels May 28, 2026

Merge branch 'master' into HIVE-29625

a94dbe9

asf-ci-hive added tests pending and removed tests unstable labels May 29, 2026

konstantinb marked this pull request as ready for review May 29, 2026 22:19

konstantinb mentioned this pull request May 29, 2026

HIVE-29368: More accurate pessimistic stats combining #6359

Draft

asf-ci-hive added tests unstable and removed tests pending labels May 30, 2026

HIVE-29625: trigger a rebuild

98cff4e

asf-ci-hive added tests pending and removed tests unstable labels May 30, 2026

asf-ci-hive added tests passed and removed tests pending labels May 30, 2026

konstantinb wants to merge 8 commits into apache:master from konstantinb:HIVE-29625

2 participants
