Describe the enhancement requested
pyarrow.compute.Expression has no Python-accessible way to enumerate the
fields it references. The C++ side already exposes the underlying primitive
(arrow::compute::FieldsInExpression),
but the Python Expression class only surfaces cast, equals, is_null,
is_nan, is_valid, isin, and Substrait round-trip. Every downstream tool
that needs the column set of a predicate today either:
- Regex-parses str(expression) (fragile: quoted string literals and
  keywords like `and` leak into the result).
- Serializes to Substrait via to_substrait(schema) and walks the protobuf
  (heavy: requires a bound schema and a substrait dependency just to ask
  "which columns?").
- Maintains a parallel AST upstream of pc.Expression, like Ray Data's
  _PyArrowExpressionVisitor.
Exposing the existing C++ primitive removes all three workarounds.
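To make the fragility of the first workaround concrete, here is a minimal sketch. The predicate string is hand-written to resemble what str(expression) might print for a comparison against a string literal; the exact pyarrow output format is an assumption, and naive_columns is a hypothetical helper, not an existing API.

```python
import re

# Hand-written stand-in for str(expression); the real pyarrow format is assumed.
predicate_text = '((status == "on hold") and (amount > 0))'


def naive_columns(text):
    """Collect every identifier-like token, which is all a regex scan can see."""
    return sorted(set(re.findall(r"[A-Za-z_][A-Za-z_0-9]*", text)))


# Only "status" and "amount" are columns; the rest is leakage from the
# string literal ("on", "hold") and the boolean keyword ("and").
print(naive_columns(predicate_text))
```

Filtering out keywords and quoted spans can patch this particular case, but every such patch depends on the printed format staying stable.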
Motivating use cases
The recurring shape is: a library or end user has a pc.Expression in hand
and needs to decide which columns to read off disk before evaluating it.
- Column projection on cold storage. Wrapping pyarrow.dataset.Scanner or
  pyiceberg.Table.scan(...) with a user-supplied filter: the wrapper
  wants to set selected_fields = user_projection ∪ filter_refs to avoid
  pulling unused columns off S3 / disk.
- Conditional MERGE / upsert on Iceberg. PyIceberg's Table.upsert
  currently has no when_matched_condition parameter
  (apache/iceberg-python#1534 was explicitly scoped to "when matched
  update all / when not matched insert all" and directed users to Spark
  for predicate-based MERGE). Implementing a conditional upsert in Python
  requires projecting only the destination columns the predicate touches
  before joining and filtering, which needs field-ref introspection.
- Predicate splitting across two sources. Any library that accepts a
single user-facing predicate and routes it across a join (source ↔ target,
stream ↔ table, etc.) needs to bucket field references by side.
- Ray Data, delta-rs, Lance. Cross-engine routers that translate
pc.Expression to a non-Arrow execution engine all start with the same
question — which fields does it touch? — to decide which engine knows
about which columns and which side of a join to push the filter on.
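The projection and predicate-splitting cases above can be sketched in plain Python once the references are available. This assumes the list shape proposed later in this issue (str, int, or tuple entries); top_level_columns and bucket_by_side are hypothetical helper names, and the column names are made up for illustration.

```python
def top_level_columns(refs):
    """Map refs in the proposed field_refs() shape to top-level column keys."""
    columns = []
    for ref in refs:
        # A nested ref such as ("user", "city") only needs its root column read.
        root = ref[0] if isinstance(ref, tuple) else ref
        if root not in columns:
            columns.append(root)
    return columns


def bucket_by_side(refs, source_columns, target_columns):
    """Split refs by which input's column set contains their root column."""
    buckets = {"source": [], "target": [], "ambiguous": [], "unknown": []}
    for ref in refs:
        root = ref[0] if isinstance(ref, tuple) else ref
        in_source = root in source_columns
        in_target = root in target_columns
        if in_source and in_target:
            buckets["ambiguous"].append(ref)
        elif in_source:
            buckets["source"].append(ref)
        elif in_target:
            buckets["target"].append(ref)
        else:
            buckets["unknown"].append(ref)
    return buckets


# selected_fields = user_projection ∪ filter_refs, keeping first-seen order.
user_projection = ["id", "amount"]
filter_refs = ["amount", ("user", "city"), "amount"]
selected_fields = top_level_columns(user_projection + filter_refs)
print(selected_fields)

# Route a join predicate's refs to the side that owns each column.
sides = bucket_by_side(
    ["s_qty", ("t_meta", "ts"), "id"],
    source_columns={"id", "s_qty"},
    target_columns={"id", "t_meta"},
)
print(sides)
```

By-index references are left untouched here; a real router would need a schema to resolve them, which is outside this sketch.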
Prior discussion
The comment thread on the closed issue #27160, "[Python] Allow to create
field reference to nested field", records this as a known gap that was
never tracked:
bkietz:
"currently field_refs can only extract a field from the scanned dataset.
It'd be helpful if they could also extract a field from an Expression."
nealrichardson:
"Agree that it would be helpful (possibly necessary) to be able to extract
a field from an Expression more generally."
That thread closed on the inverse direction (constructing nested refs);
this issue tracks the missing direction.
Proposed API
def field_refs(self) -> list[str | int | tuple[str | int, ...]]:
    """
    Return the field references contained in this expression.

    Each reference is reported once per call site (matches the C++
    `FieldsInExpression` semantics). The returned value shape mirrors
    `pyarrow.compute.field()`'s input: by-name references come back as
    `str`, by-index as `int`, and nested references as `tuple`.
    """
Round-trip example:
>>> import pyarrow.compute as pc
>>> ((pc.field("a") > 0) & pc.field("b").is_null()).field_refs()
['a', 'b']
>>> pc.field("user", "city").field_refs()
[('user', 'city')]
>>> pc.scalar(5).field_refs()
[]
Open API decisions to settle before implementation
| Decision | Proposed | Rationale |
|---|---|---|
| Method name | field_refs() | Mirrors C++ free function FieldsInExpression and the existing singular accessor Expression.field_ref(). Alternatives: references(), referenced_fields(). |
| Return type | list[str \| int \| tuple] | Round-trip compatible with pc.field(*ref). Avoids introducing a new public FieldRef Python type (which would deserve its own design discussion, likely as a follow-up). |
| Dedup | No | Matches C++ FieldsInExpression. Callers do set(...) if desired. |
| Order | Traversal (left-to-right, depth-first) | Documented as "not part of the public contract" to leave room for later changes. |
| Single-element FieldPath | Plain int, not (int,) | Symmetric with pc.field(3) returning a non-nested ref. |
Happy to defer any of these to maintainer preference.
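As a rough illustration of two rows in the table (caller-side dedup and the single-element FieldPath convention), the following sketch works on plain Python values in the proposed return shape. dedup_refs and normalize_path are hypothetical names used only here; they are not part of pyarrow.

```python
def dedup_refs(refs):
    """Drop repeated refs while keeping whatever order the call returned."""
    # str, int, and tuple refs are all hashable, so dict keys work as a set
    # that remembers insertion order.
    return list(dict.fromkeys(refs))


def normalize_path(indices):
    """Collapse a one-element index path to a plain int; keep longer paths as tuples."""
    indices = tuple(indices)
    return indices[0] if len(indices) == 1 else indices


# One entry per call site, as the "Dedup: No" row proposes.
refs = ["a", ("user", "city"), "a", 3, ("user", "city")]
print(dedup_refs(refs))

print(normalize_path([3]))
print(normalize_path([0, 2]))
```

Using set(refs) also works when order does not matter, which fits the proposal that traversal order stays outside the public contract.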
Implementation outline
Small (~80 lines including tests). Three files touched:
- python/pyarrow/includes/libarrow.pxd: declare FieldsInExpression and
  the additional CFieldRef accessors (IsName, IsFieldPath, IsNested,
  field_path, nested_refs) needed for the conversion helper.
- python/pyarrow/_compute.pyx: add a _fieldref_to_python helper and a
  field_refs() method on Expression. Both small.
- python/pyarrow/tests/test_compute.py: coverage for the four FieldRef
  shapes (name / index / nested name / nested index), empty (constant
  expression), and round-trip through pc.field().

Plus one autosummary line in docs/source/python/api/compute.rst.
I'm happy to put up a PR once the API is agreed.
Related issues
- #27160: closed; this issue captures the unfiled follow-up.
- #34433: adjacent; asks for table.evaluate(expr) returning a boolean
  mask. Both are "more handles on Expression" requests but distinct in
  scope.
- #49885: adjacent; binding unresolved Substrait expressions.
  Complementary work on the Expression API.
Component(s)
Python