Clarify histogram NDV metadata and add real NDV-based estimates #356

@KKould

Problem

HistogramMeta::number_of_distinct_value is named like an NDV (number of distinct values) statistic, but analyze assigns it from values_len. In practice, histogram range estimation uses it as a minimum selectivity denominator, not as true NDV metadata.

This naming can be misleading and may cause future optimizer rules to treat it as real NDV.

Possible Fix

  • Rename the current field to reflect its actual meaning, such as values_len, non_null_count, or selectivity_denominator.
  • Add separate real NDV metadata later, likely computed during analyze using HLL (HyperLogLog) or a strategy that counts exactly for small inputs and approximates for large ones.
  • Use real NDV for optimizer estimates where appropriate:
    • equality selectivity fallback: rows / ndv
    • group by / distinct cardinality
    • join cardinality
    • distinct/group cost
    • index scan equality-prefix fallback cost
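
The NDV-based estimates listed above could look roughly like the following. This is a minimal sketch using textbook formulas, not the project's existing code: the function names, signatures, and the clamping choices are all assumptions made for illustration.

```rust
/// Estimated rows matching `col = <constant>` when no per-value frequency
/// is available: rows / ndv, assuming values are uniformly distributed.
/// (Hypothetical helper; not part of the existing codebase.)
pub fn equality_rows(rows: u64, ndv: u64) -> f64 {
    if rows == 0 {
        return 0.0;
    }
    // A column cannot have more distinct values than rows, and a
    // zero NDV would divide by zero, so clamp into [1, rows].
    let ndv = ndv.clamp(1, rows);
    rows as f64 / ndv as f64
}

/// Estimated number of groups for GROUP BY / DISTINCT over several keys:
/// the product of per-key NDVs (assuming independent keys), capped by the
/// input row count. An empty key list yields at most one group.
pub fn group_by_cardinality(rows: u64, key_ndvs: &[u64]) -> u64 {
    let product = key_ndvs
        .iter()
        .fold(1u64, |acc, &n| acc.saturating_mul(n.max(1)));
    product.min(rows)
}

/// Estimated output rows of an equi-join on one column pair:
/// |L| * |R| / max(ndv_l, ndv_r), the usual containment assumption.
pub fn join_cardinality(left_rows: u64, left_ndv: u64, right_rows: u64, right_ndv: u64) -> f64 {
    let denom = left_ndv.max(right_ndv).max(1) as f64;
    left_rows as f64 * right_rows as f64 / denom
}
```

For example, under these assumptions a 1000-row column with 50 distinct values gives 20 estimated rows per equality predicate. The same per-predicate estimate could feed the distinct/group cost and the index scan equality-prefix fallback cost.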

Notes

CMS (count-min sketch) should still be preferred for concrete value frequency when available. The histogram should still drive range estimates. NDV should fill the broader cardinality/selectivity gaps.
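
One way to express this precedence for equality predicates is sketched below. The function name and its Option-based inputs are hypothetical; range predicates would still go through the histogram and are not modeled here.

```rust
/// Pick an equality row estimate by source priority:
/// 1. a per-value frequency (e.g. looked up from a count-min sketch),
/// 2. otherwise rows / ndv from real NDV metadata,
/// 3. otherwise `None`, leaving the default to the caller.
/// (Hypothetical helper; names and signature are assumptions.)
pub fn estimate_eq_rows(rows: u64, cms_count: Option<u64>, ndv: Option<u64>) -> Option<f64> {
    if let Some(count) = cms_count {
        // A sketch can overestimate, so never report more than the table holds.
        return Some(count.min(rows) as f64);
    }
    match ndv {
        // Cap NDV at the row count (at least 1) to keep the estimate >= 1 row
        // for non-empty tables and to avoid dividing by zero.
        Some(n) if n > 0 => Some(rows as f64 / n.min(rows.max(1)) as f64),
        _ => None,
    }
}
```

Returning `None` when neither statistic exists keeps the choice of a default selectivity out of this helper, so the optimizer rule that calls it can decide.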
