Skip to content

Fix strict NotEqualTo/NotIn pruning with partial nulls or NaNs#3521

Merged
kevinjqliu merged 3 commits into
apache:mainfrom
tanmayrauth:fix-3498-strict-evaluator-not-eq-not-in-null-overprune
Jun 22, 2026
Merged

Fix strict NotEqualTo/NotIn pruning with partial nulls or NaNs#3521
kevinjqliu merged 3 commits into
apache:mainfrom
tanmayrauth:fix-3498-strict-evaluator-not-eq-not-in-null-overprune

Conversation

@tanmayrauth

@tanmayrauth tanmayrauth commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Summary

Related to #3498

Fix strict metrics evaluation for NotEqualTo and NotIn so files are only proven to match when a column contains only nulls or only NaNs. Mixed null/NaN files now continue through the existing bounds checks instead of being treated as ROWS_MUST_MATCH.

Root Cause

The strict evaluator used _can_contain_nulls / _can_contain_nans for negative predicates. That is too broad: a file with values like [null, 5] and bounds 5..5 cannot be proven to match x != 5 or x not in {5} because the non-null row may still fail the predicate.

Java Parity

This matches Java's StrictMetricsEvaluator, which only short-circuits negative predicates when the column contains only nulls or only NaNs:

Validation

  • UV_CACHE_DIR=.cache/uv PYTHON_GIL=1 PYTHONPATH=. uv run pytest tests/expressions/test_evaluator.py -k "mixed_nulls_and_matching_bounds or mixed_nans_and_matching_bounds or all_nulls or all_nans or strict_integer_not_in"
  • UV_CACHE_DIR=.cache/uv PYTHON_GIL=1 PYTHONPATH=. uv run pytest tests/expressions/test_evaluator.py
  • UV_CACHE_DIR=.cache/uv PYTHON_GIL=1 PYTHONPATH=. uv run ruff check pyiceberg/expressions/visitors.py tests/expressions/test_evaluator.py
  • git diff --check

…with partial nulls

_StrictMetricsEvaluator.visit_not_equal and visit_not_in short-circuited on
_can_contain_nulls / _can_contain_nans (null/NaN count > 0) and returned
ROWS_MUST_MATCH without checking the value bounds. A file holding any null or
NaN was therefore reported as fully matching the predicate, even when a
non-null value inside the bounds did not match.

This drives _DeleteFiles (table/update/snapshot.py): ROWS_MUST_MATCH drops the
whole data file without rewriting it. So delete(NotEqualTo("x", 5)) against a
file with stats [null, 5] and bounds lower=upper=5 would delete the entire
file, silently losing the row with value 5 that should have survived.

Every other strict ROWS_MUST_MATCH path already guards on the "only" variants
(_contains_nulls_only / _contains_nans_only), matching the reference
StrictMetricsEvaluator. Switch both methods to the same guard so that an
all-null/all-NaN column still short-circuits to ROWS_MUST_MATCH (those rows
satisfy not-equal/not-in), while a partially-null column falls through to the
bounds check.

Update the existing NotIn-on-some-nulls test that encoded the buggy result and
add a regression test covering the [null, value] / bounds-include-literal case
for both NotEqualTo and NotIn.

Fixes apache#3498 (partially)
@tanmayrauth

tanmayrauth commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

@kevinjqliu @Fokko Can you please take a look when you have a moment?

Comment thread pyiceberg/expressions/visitors.py Outdated
Comment thread pyiceberg/expressions/visitors.py Outdated
Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>
@kevinjqliu kevinjqliu changed the title Fix strict metrics evaluator over-pruning files for NotEqualTo/NotIn with partial nulls Fix strict NotEqualTo/NotIn pruning with partial nulls or NaNs Jun 22, 2026
Comment thread tests/expressions/test_evaluator.py Outdated

@kevinjqliu kevinjqliu left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

i was working on this in #3547 as well. lets merge this one and i'll change #3547 as a follow up with more tests

@kevinjqliu kevinjqliu merged commit d0a9b91 into apache:main Jun 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants