[Python] Expose Expression.field_refs() to enumerate referenced fields #50031

@paultmathew

Describe the enhancement requested

pyarrow.compute.Expression has no Python-accessible way to enumerate the
fields it references. The C++ side already exposes the underlying primitive
(arrow::compute::FieldsInExpression),
but the Python Expression class only surfaces cast, equals, is_null,
is_nan, is_valid, isin, and Substrait round-trip. Every downstream tool
that needs the column set of a predicate today either:

  1. Regex-parses str(expression) (fragile: quoted string literals and
    keywords such as "and" leak into the result).
  2. Serializes to Substrait via to_substrait(schema) and walks the protobuf
    (heavy — requires a bound schema and a substrait dependency just to ask
    "which columns?").
  3. Maintains a parallel AST upstream of pc.Expression, like
    Ray Data's _PyArrowExpressionVisitor.

Exposing the existing C++ primitive removes all three workarounds.
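
To make the fragility of workaround 1 concrete, here is a minimal sketch. The `rendered` string is an assumed, illustrative rendering of a predicate, not output captured from pyarrow; the exact text of str(expression) may differ between versions.

```python
import re

# Assumed, illustrative rendering of a predicate. This is not captured
# pyarrow output; the real str(expression) text may look different.
rendered = '((status == "open and pending") and (priority > 3))'

# Naive workaround: treat every identifier-like token as a column name.
tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", rendered)
naive_columns = sorted(set(tokens))

# Words inside the string literal and the boolean keyword cannot be told
# apart from real column names by the regex alone.
false_positives = set(naive_columns) - {"status", "priority"}
print(naive_columns)
print(false_positives)
```

Only `status` and `priority` are real columns here; everything in `false_positives` came from the literal or the keyword.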

Motivating use cases

The recurring shape is: a library or end user has a pc.Expression in hand
and needs to decide which columns to read off disk before evaluating it.

  1. Column projection on cold storage. Wrapping pyarrow.dataset.Scanner
    or pyiceberg.Table.scan(...) with a user-supplied filter — the wrapper
    wants to set selected_fields = user_projection ∪ filter_refs to avoid
    pulling unused columns off S3 / disk.
  2. Conditional MERGE / upsert on Iceberg. PyIceberg's Table.upsert
    currently has no when_matched_condition parameter
    (apache/iceberg-python#1534 was
    explicitly scoped to "when matched update all / when not matched insert
    all" and directed users to Spark for predicate-based MERGE). Implementing
    a conditional upsert in Python requires projecting only the destination
    columns the predicate touches before joining and filtering — which needs
    field-ref introspection.
  3. Predicate splitting across two sources. Any library that accepts a
    single user-facing predicate and routes it across a join (source ↔ target,
    stream ↔ table, etc.) needs to bucket field references by side.
  4. Ray Data, delta-rs, Lance. Cross-engine routers that translate
    pc.Expression to a non-Arrow execution engine all start with the same
    question — which fields does it touch? — to decide which engine knows
    about which columns and which side of a join to push the filter on.
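
As a rough sketch of how use cases 1 and 3 could consume the proposed return shape, the helpers below work on plain Python values only. The names `top_level`, `columns_to_read`, and `bucket_by_side` are hypothetical and not part of pyarrow; the input lists stand in for what field_refs() would return under this proposal.

```python
from typing import Iterable, Union

# One reference in the shape proposed in this issue: a name, an index, or a
# tuple of names/indices for a nested reference.
Ref = Union[str, int, tuple]


def top_level(ref: Ref) -> Union[str, int]:
    """Return the top-level column that a reference reads from."""
    return ref[0] if isinstance(ref, tuple) else ref


def columns_to_read(user_projection: Iterable[str], filter_refs: Iterable[Ref]) -> list:
    """Compute user_projection ∪ filter_refs, keeping first-seen order."""
    wanted = list(user_projection) + [top_level(r) for r in filter_refs]
    return list(dict.fromkeys(wanted))


def bucket_by_side(filter_refs, source_columns, target_columns) -> dict:
    """Group referenced top-level columns by the join side that owns them.

    Simplification: a column present on both sides is assigned to "source".
    """
    buckets = {"source": [], "target": [], "unknown": []}
    for col in dict.fromkeys(top_level(r) for r in filter_refs):
        if col in source_columns:
            buckets["source"].append(col)
        elif col in target_columns:
            buckets["target"].append(col)
        else:
            buckets["unknown"].append(col)
    return buckets


# Stand-in for the proposed expr.field_refs() result (duplicates allowed).
refs = ["price", ("user", "city"), "price"]
to_read = columns_to_read(["id", "price"], refs)
sides = bucket_by_side(refs, {"price"}, {"user"})
print(to_read)
print(sides)
```

Reducing nested references to their top-level column is a deliberate simplification: it is enough to decide which columns to read, but a real scanner might prefer to project the nested path itself.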

Prior discussion

The comment thread on the closed issue
#27160 ([Python] Allow to create field reference to nested field)
records this as a known gap that was never tracked:

bkietz:
"currently field_refs can only extract a field from the scanned dataset.
It'd be helpful if they could also extract a field from an Expression."

nealrichardson:
"Agree that it would be helpful (possibly necessary) to be able to extract
a field from an Expression more generally."

That thread closed on the inverse direction (constructing nested refs);
this issue tracks the missing direction.

Proposed API

def field_refs(self) -> list[str | int | tuple[str | int, ...]]:
    """
    Return the field references contained in this expression.

    Each occurrence of a field reference is reported, without deduplication
    (matching the C++ `FieldsInExpression` semantics). The returned value shape mirrors
    `pyarrow.compute.field()`'s input — by-name references come back as
    `str`, by-index as `int`, and nested references as `tuple`.
    """

Round-trip example:

>>> import pyarrow.compute as pc
>>> ((pc.field("a") > 0) & pc.field("b").is_null()).field_refs()
['a', 'b']
>>> pc.field("user", "city").field_refs()
[('user', 'city')]
>>> pc.scalar(5).field_refs()
[]
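
Because the proposed entries are either a scalar (str or int) or a tuple, a caller would need one small normalization step before splatting them back into pc.field(). The sketch below shows that step on plain values; `field_args` is a hypothetical helper, and the `refs` list stands in for a field_refs() result under this proposal.

```python
def field_args(ref):
    """Turn one proposed field_refs() entry into positional args for pc.field()."""
    # Nested references are already tuples; scalars become 1-tuples.
    return ref if isinstance(ref, tuple) else (ref,)


# Stand-in for a proposed field_refs() result.
refs = ["a", 3, ("user", "city")]
rebuilt_args = [field_args(r) for r in refs]

# With pyarrow installed, each entry could then be passed back as
# pc.field(*args) to rebuild the reference.
print(rebuilt_args)
```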

Open API decisions to settle before implementation

| Decision | Proposed | Rationale |
| --- | --- | --- |
| Method name | field_refs() | Mirrors C++ free function FieldsInExpression and the existing singular accessor Expression.field_ref(). Alternatives: references(), referenced_fields(). |
| Return type | list[str \| int \| tuple] | Round-trip compatible with pc.field(*ref). Avoids introducing a new public FieldRef Python type (which would deserve its own design discussion, likely a follow-up). |
| Dedup | No | Matches C++ FieldsInExpression. Callers do set(...) if desired. |
| Order | Traversal (left-to-right, depth-first) | Documented as "not part of the public contract" to leave room. |
| Single-element FieldPath | Plain int, not (int,) | Symmetric with pc.field(3) returning a non-nested ref. |

Happy to defer any of these to maintainer preference.
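
For the "Dedup: No" decision, caller-side deduplication could look like the sketch below. It relies only on the proposed entry types (str, int, tuple) all being hashable; the `refs` list is a stand-in for a field_refs() result, not real output.

```python
# Stand-in for a proposed field_refs() result with repeated references.
refs = ["a", ("user", "city"), "a", 2, ("user", "city")]

# Unordered deduplication, as suggested in the Dedup decision.
unique_any_order = set(refs)

# Order-preserving alternative for callers that care about first-seen order.
unique_first_seen = list(dict.fromkeys(refs))
print(unique_first_seen)
```

Returning nested references as tuples rather than lists is what keeps both of these one-liners working.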

Implementation outline

Small (~80 lines including tests). Three files touched:

  • python/pyarrow/includes/libarrow.pxd — declare FieldsInExpression and
    the additional CFieldRef accessors (IsName, IsFieldPath, IsNested,
    field_path, nested_refs) needed for the conversion helper.
  • python/pyarrow/_compute.pyx — add a _fieldref_to_python helper and a
    field_refs() method on Expression. Both small.
  • python/pyarrow/tests/test_compute.py — coverage for the four FieldRef
    shapes (name / index / nested name / nested index), empty (constant
    expression), and round-trip through pc.field().

Plus one autosummary line in docs/source/python/api/compute.rst.
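
The conversion rules the helper would need can be modeled in pure Python, as sketched below. This is not the Cython implementation: `NameRef`, `PathRef`, and `NestedRef` are hypothetical stand-ins for the C++ FieldRef shapes, and flattening nested references into a single flat tuple is an assumption about the intended output.

```python
from dataclasses import dataclass
from typing import Tuple, Union


# Hypothetical stand-ins for the shapes a C++ FieldRef can take.
@dataclass(frozen=True)
class NameRef:
    name: str


@dataclass(frozen=True)
class PathRef:
    indices: Tuple[int, ...]


@dataclass(frozen=True)
class NestedRef:
    children: Tuple[Union["NameRef", "PathRef", "NestedRef"], ...]


def fieldref_to_python(ref):
    """Convert a modeled reference to str, int, or tuple."""
    if isinstance(ref, NameRef):
        return ref.name
    if isinstance(ref, PathRef):
        # A single-element path collapses to a plain int, per the proposed
        # "Single-element FieldPath" decision.
        return ref.indices[0] if len(ref.indices) == 1 else tuple(ref.indices)
    if isinstance(ref, NestedRef):
        parts = []
        for child in ref.children:
            converted = fieldref_to_python(child)
            # Assumption: nested pieces are flattened into one flat tuple.
            parts.extend(converted if isinstance(converted, tuple) else (converted,))
        return tuple(parts)
    raise TypeError(f"unsupported reference: {ref!r}")


# The four shapes named in the test plan: name, index, nested name, nested index.
by_name = fieldref_to_python(NameRef("a"))
by_index = fieldref_to_python(PathRef((3,)))
nested_name = fieldref_to_python(NestedRef((NameRef("user"), NameRef("city"))))
nested_index = fieldref_to_python(PathRef((0, 1)))
print(by_name, by_index, nested_name, nested_index)
```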

I'm happy to put up a PR once the API is agreed.

Related issues

  • #27160 — closed; this
    issue captures the unfiled follow-up.
  • #34433 — adjacent;
    asks for table.evaluate(expr) returning a boolean mask. Both are
    "more handles on Expression" requests but distinct in scope.
  • #49885 — adjacent;
    binding unresolved Substrait expressions. Complementary work on the
    Expression API.

Component(s)

Python
