Skip to content

HIVE-29625: Disambiguate ColStatistics.countDistinct "unknown" from "verified zero"#6505

Open
konstantinb wants to merge 8 commits into
apache:masterfrom
konstantinb:HIVE-29625
Open

HIVE-29625: Disambiguate ColStatistics.countDistinct "unknown" from "verified zero"#6505
konstantinb wants to merge 8 commits into
apache:masterfrom
konstantinb:HIVE-29625

Conversation

@konstantinb
Copy link
Copy Markdown
Contributor

@konstantinb konstantinb commented May 21, 2026

What changes were proposed in this pull request?

HIVE-29625: Disambiguate ColStatistics.countDistinct "unknown" from "verified zero"

Establishes -1 as the unknown NDV sentinel for ColStatistics.countDistint (NDV/countDistinct). The proto-to-ColStatistics conversion emits -1 when the underlying NDV is unavailable (rather than the previous default of 0), and consumers of getCountDistint() are updated to apply an appropriate fallback for the unknown state instead of treating it as 0.

Why are the changes needed?

ColStatistics.countDistint == 0 was overloaded to mean both no distinct values and unknown NDV. The two states have opposite implications for cost-based planning — a real zero supports tight estimates, while a genuinely-unknown NDV needs a conservative fallback. Treating them identically led to inconsistent and sometimes catastrophic cardinality estimates whenever a column's NDV was unavailable. Disambiguating the sentinel lets each consumer apply the correct logic.

Does this PR introduce any user-facing change?

No. Query results, SQL syntax, and configuration are unchanged. Plan estimates in EXPLAIN output may differ for queries reading columns whose NDV is unavailable, since the planner now distinguishes those from columns with 0 distinct values.

How was this patch tested?

Added unit tests for each consumer-side change and updated .q.out goldens for queries whose plans shifted as a result of the disambiguation. All affected test classes pass.

@konstantinb konstantinb marked this pull request as ready for review May 29, 2026 22:19
@konstantinb
Copy link
Copy Markdown
Contributor Author

@zabetak I believe that this PR is a much cleaner alternative to #6418
While changing more source files, it impacts much fewer test output results and provides clean separation between "unknown" and "verified 0" NDV values. It also naturally provides a fix for extractNDVGroupingColumns() reported as HIVE-29556

On top of that, it simplifies the fixes required for PessimisticStatCombiner. The original PR #6359 would become much smaller if this is accepted.

@konstantinb
Copy link
Copy Markdown
Contributor Author

konstantinb commented May 29, 2026

@zabetak there are two more considerations. One relates to "const null" column statistics to which buildColStatForConstant() assigns an NDV of 0 while the Hive metastore saves such columns with an NDV of 1:

CREATE TABLE test_const_null_ndv (i INT, s STRING) STORED AS ORC;
INSERT INTO test_const_null_ndv VALUES (NULL, 'a'), (NULL, 'b');
DESCRIBE FORMATTED test_const_null_ndv i;

results in the describe output of

POSTHOOK: Input: default@test_const_null_ndv
col_name            	i                   
data_type           	int                 
min                 	                    
max                 	                    
num_nulls           	2                   
distinct_count      	1                   
avg_col_len         	                    
max_col_len         	                    
num_trues           	                    
num_falses          	                    
bit_vector          	HL                  
comment             	from deserializer   
COLUMN_STATS_ACCURATE	{\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"i\":\"true\",\"s\":\"true\"}}

The second topic is about a truly narrow case when the NDV is unknown, but numNulls is known and is either equal to numRows or is equal to (numRows-1). Technically, the NDV can be accurately inferred as 0 or 1 in those cases, even for binary columns. The PR into this branch shows a possible approach: konstantinb#1 , but it seems excessive to me; I'd appreciate knowing your opinion on the matter

@sonarqubecloud
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants