Fix strict NotEqualTo/NotIn pruning with partial nulls or NaNs by tanmayrauth · Pull Request #3521 · apache/iceberg-python

tanmayrauth · 2026-06-17T07:03:08Z

Summary

Related to #3498

Fix strict metrics evaluation for NotEqualTo and NotIn so files are only proven to match when a column contains only nulls or only NaNs. Mixed null/NaN files now continue through the existing bounds checks instead of being treated as ROWS_MUST_MATCH.

Root Cause

The strict evaluator used _can_contain_nulls / _can_contain_nans for negative predicates. That is too broad: a file with values like [null, 5] and bounds 5..5 cannot be proven to match x != 5 or x not in {5} because the non-null row may still fail the predicate.
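The difference between the two guards can be sketched in a few lines. This is a simplified model, not pyiceberg's code: `ColumnStats`, `not_equal_can_contain_nulls`, `not_equal_nulls_only`, and `_bounds_exclude` are hypothetical names, and NaN handling is omitted. It only illustrates why "some nulls" is too weak a condition for `x != 5`.

```python
from dataclasses import dataclass
from typing import Optional

ROWS_MUST_MATCH = True
ROWS_MIGHT_NOT_MATCH = False


@dataclass
class ColumnStats:
    """Hypothetical stand-in for one column's per-file metrics."""
    value_count: int
    null_count: int
    lower: Optional[int]
    upper: Optional[int]


def _bounds_exclude(stats: ColumnStats, literal: int) -> bool:
    # Without bounds nothing can be proven about the non-null values.
    if stats.lower is None or stats.upper is None:
        return ROWS_MIGHT_NOT_MATCH
    # Every non-null value differs from the literal only if the
    # literal lies strictly outside [lower, upper].
    return literal < stats.lower or literal > stats.upper


def not_equal_can_contain_nulls(stats: ColumnStats, literal: int) -> bool:
    """Too-broad guard: any null short-circuits to ROWS_MUST_MATCH."""
    if stats.null_count > 0:
        return ROWS_MUST_MATCH
    return _bounds_exclude(stats, literal)


def not_equal_nulls_only(stats: ColumnStats, literal: int) -> bool:
    """Narrow guard: short-circuit only when every value is null."""
    if stats.null_count == stats.value_count:
        return ROWS_MUST_MATCH
    return _bounds_exclude(stats, literal)


# File with values [null, 5] and bounds 5..5, as in the example above.
mixed = ColumnStats(value_count=2, null_count=1, lower=5, upper=5)
broad_result = not_equal_can_contain_nulls(mixed, 5)
narrow_result = not_equal_nulls_only(mixed, 5)
```

With these definitions, `broad_result` claims every row matches `x != 5` even though the row holding 5 does not, while `narrow_result` falls through to the bounds check and reports that rows might not match.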

Java Parity

This matches Java's StrictMetricsEvaluator, which only short-circuits negative predicates when the column contains only nulls or only NaNs:

Validation

UV_CACHE_DIR=.cache/uv PYTHON_GIL=1 PYTHONPATH=. uv run pytest tests/expressions/test_evaluator.py -k "mixed_nulls_and_matching_bounds or mixed_nans_and_matching_bounds or all_nulls or all_nans or strict_integer_not_in"
UV_CACHE_DIR=.cache/uv PYTHON_GIL=1 PYTHONPATH=. uv run pytest tests/expressions/test_evaluator.py
UV_CACHE_DIR=.cache/uv PYTHON_GIL=1 PYTHONPATH=. uv run ruff check pyiceberg/expressions/visitors.py tests/expressions/test_evaluator.py
git diff --check

…with partial nulls

_StrictMetricsEvaluator.visit_not_equal and visit_not_in short-circuited on _can_contain_nulls / _can_contain_nans (null/NaN count > 0) and returned ROWS_MUST_MATCH without checking the value bounds. A file holding any null or NaN was therefore reported as fully matching the predicate, even when a non-null value inside the bounds did not match.

This drives _DeleteFiles (table/update/snapshot.py): ROWS_MUST_MATCH drops the whole data file without rewriting it. So delete(NotEqualTo("x", 5)) against a file with stats [null, 5] and bounds lower=upper=5 would delete the entire file, silently losing the row with value 5 that should have survived.

Every other strict ROWS_MUST_MATCH path already guards on the "only" variants (_contains_nulls_only / _contains_nans_only), matching the reference StrictMetricsEvaluator. Switch both methods to the same guard so that an all-null/all-NaN column still short-circuits to ROWS_MUST_MATCH (those rows satisfy not-equal/not-in), while a partially-null column falls through to the bounds check.

Update the existing NotIn-on-some-nulls test that encoded the buggy result and add a regression test covering the [null, value] / bounds-include-literal case for both NotEqualTo and NotIn.

Fixes apache#3498 (partially)
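The delete scenario in the commit message can be illustrated with a toy model for NotIn. This is a sketch under stated assumptions, not the `_DeleteFiles` implementation: `row_matches_not_in`, `strict_not_in`, and `delete_where_not_in` are hypothetical helpers, bounds are derived directly from the rows rather than read from file metrics, and null rows are treated as satisfying not-in, as the commit message states.

```python
from typing import List, Optional, Set


def row_matches_not_in(value: Optional[int], literals: Set[int]) -> bool:
    # Assumption taken from the commit message: null rows satisfy not-in.
    return value is None or value not in literals


def strict_not_in(values: List[Optional[int]], literals: Set[int]) -> bool:
    """Return True only if every row is proven to match `x not in literals`."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        # Nulls-only guard: all rows are null, so all rows match.
        return True
    lower, upper = min(non_null), max(non_null)
    # Bounds check: every literal must fall outside [lower, upper].
    return all(lit < lower or lit > upper for lit in literals)


def delete_where_not_in(values: List[Optional[int]], literals: Set[int]) -> List[Optional[int]]:
    """Return the rows that survive deleting all rows matching the predicate."""
    if strict_not_in(values, literals):
        return []  # whole file dropped without a rewrite
    # Otherwise the file is rewritten, keeping rows that do not match.
    return [v for v in values if not row_matches_not_in(v, literals)]


survivors_mixed = delete_where_not_in([None, 5], {5})
survivors_all_null = delete_where_not_in([None, None], {5})
```

In this model the `[null, 5]` file is rewritten and the row with value 5 survives, whereas an all-null file is dropped whole. A guard that fired on any null would instead return an empty result for `[null, 5]`, which is the data loss the commit message describes.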

tanmayrauth · 2026-06-17T16:49:30Z

@kevinjqliu @Fokko Can you please take a look when you have a moment?

Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>

kevinjqliu

LGTM

I was working on this in #3547 as well. Let's merge this one, and I'll change #3547 as a follow-up with more tests.

rambleraptor mentioned this pull request Jun 22, 2026

Add strict NotEqualTo/NotIn null and NaN tests #3547

Merged

kevinjqliu reviewed Jun 22, 2026

Comment thread pyiceberg/expressions/visitors.py Outdated

Comment thread pyiceberg/expressions/visitors.py Outdated

comment

be9dd55

Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>

kevinjqliu changed the title ~~Fix strict metrics evaluator over-pruning files for NotEqualTo/NotIn with partial nulls~~ Fix strict NotEqualTo/NotIn pruning with partial nulls or NaNs Jun 22, 2026

kevinjqliu reviewed Jun 22, 2026

Comment thread tests/expressions/test_evaluator.py Outdated

assert msg

a07ce7f

kevinjqliu approved these changes Jun 22, 2026

kevinjqliu merged commit d0a9b91 into apache:main Jun 22, 2026

kevinjqliu merged 3 commits into apache:main from tanmayrauth:fix-3498-strict-evaluator-not-eq-not-in-null-overprune
