examples(polars): Polars × PyMEOS TemporalParquet round-trip example (depends on PyMEOS #84) #6

Open
estebanzimanyi wants to merge 1 commit into
MobilityDB:main from
estebanzimanyi:feat/polars-temporalparquet

Adds a worked-out example of consuming TemporalParquet files from the Polars DataFrame engine, zero-copy via PyMEOS' pymeos.io data-lake interchange layer.

What's in the example

PyMEOS_Examples/Polars_TemporalParquet.py — a single self-contained script demonstrating the full round-trip:

  1. Build a small temporal-point dataset using PyMEOS (3 trips, 4 instants each, EPSG:4326 near Brussels).
  2. Write to TemporalParquet via pymeos.io.write_temporal — opaque MEOS-WKB payload column + native-scalar sidecar columns (<col>__xmin/xmax/ymin/ymax/tmin/tmax) + self-describing temporal footer in the Parquet schema metadata. Byte-compatible with files written by MobilityDuck's temporalFooter() consumer recipe — files are portable across both tools.
  3. Read back with PyMEOS — full TGeomPointSeq object reconstruction.
  4. Consume the SAME file in Polars zero-copy via pl.from_arrow(pyarrow.parquet.read_table(path)). Polars sees the sidecar columns as native primitives, so its lazy / predicate-pushdown machinery works without decoding the MEOS-WKB payload. The temporal column itself stays an opaque BINARY column, which is enough for analysts who don't need MEOS-aware operations on it.
  5. Sidecar-driven predicate pushdown: pyarrow.parquet.read_table(path, filters=[("trip__xmax", "<", 4.45)]) prunes row groups before any per-row decode.
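
To make the sidecar idea in step 2 concrete, here is a minimal standard-library sketch of how the six `<col>__xmin/xmax/ymin/ymax/tmin/tmax` scalars could be derived from one trip's instants. The helper name `sidecar_bounds` and the sample coordinates are hypothetical, chosen for illustration; this is not how `pymeos.io.write_temporal` is implemented, only the bounding-box arithmetic the sidecar columns represent.

```python
from datetime import datetime, timezone


def sidecar_bounds(col, instants):
    """Return the six bounding scalars for one temporal point sequence.

    `instants` is a list of (x, y, t) tuples. The function name and the
    dict layout are assumptions made for this sketch.
    """
    xs = [x for x, _, _ in instants]
    ys = [y for _, y, _ in instants]
    ts = [t for _, _, t in instants]
    return {
        f"{col}__xmin": min(xs), f"{col}__xmax": max(xs),
        f"{col}__ymin": min(ys), f"{col}__ymax": max(ys),
        f"{col}__tmin": min(ts), f"{col}__tmax": max(ts),
    }


# One hypothetical trip: 4 instants in EPSG:4326 near Brussels.
trip = [
    (4.3517, 50.8466, datetime(2025, 1, 1, 8, 0, tzinfo=timezone.utc)),
    (4.3600, 50.8500, datetime(2025, 1, 1, 8, 5, tzinfo=timezone.utc)),
    (4.3712, 50.8450, datetime(2025, 1, 1, 8, 10, tzinfo=timezone.utc)),
    (4.3655, 50.8520, datetime(2025, 1, 1, 8, 15, tzinfo=timezone.utc)),
]

row = sidecar_bounds("trip", trip)
for name, value in row.items():
    print(name, value)
```

Because these values are plain floats and timestamps, any Arrow-aware engine can filter on them without understanding the MEOS-WKB payload.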

The example shows the dual consumption model that motivates the data-lake layer: PyMEOS for MEOS-aware reads, Polars (or any Arrow-aware engine) for native-column analytics, both reading the same on-disk file.

Install caveat

The pymeos.io module ships in PyMEOS PR #84 (feat/datalake-consumer, OPEN at time of writing). Until #84 merges into PyMEOS master, install PyMEOS from the branch directly:

pip install "git+https://github.com/MobilityDB/PyMEOS.git@feat/datalake-consumer#egg=pymeos[parquet]"
pip install polars pyarrow

After #84 merges, the standard install path works with zero code change:

pip install "pymeos[parquet]" polars pyarrow

The script itself doesn't reference any branch-specific path — only pymeos.io, which is the stable public surface in PR #84.
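
Since the only dependency is whether `pymeos.io` can be imported, a script could check for it up front and point users at the install caveat. This is a hedged sketch using only the standard library; the helper name `module_available` and the message text are illustrative and are not taken from the example script.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be located, without raising when a parent
    package is missing or fails to import."""
    try:
        return importlib.util.find_spec(name) is not None
    except ImportError:  # includes ModuleNotFoundError for a missing parent
        return False


if not module_available("pymeos.io"):
    print(
        "pymeos.io not found: until PyMEOS PR #84 merges, install PyMEOS "
        "from the feat/datalake-consumer branch."
    )
```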

Why this PR lands now rather than after #84 merges

Two reasons:

  1. The example is the verification that pymeos.io's public surface is genuinely Polars-compatible. Writing it now surfaces any contract gaps while PR #84 is still in review (the script uses to_arrow, from_arrow, write_temporal, read_temporal, temporal_footer — the full public surface).
  2. Adopters get an upfront recipe rather than waiting for a separate follow-up after #84 lands. Once #84 reaches master, the only change needed here is dropping the install caveat from the README.

The README's install instruction is explicit about the dependency, so users hitting the example before #84 lands aren't surprised.

File checklist

  • PyMEOS_Examples/Polars_TemporalParquet.py — the example script (~200 lines, single file, no other deps)
  • README.md — one new bullet indexing the example with the install caveat

What's NOT in scope here

  • Iceberg — Polars composes with Iceberg via pl.scan_iceberg, but that requires a live Iceberg catalog (e.g. Apache Polaris) and is gated on the MobilityDuck temporal_iceberg_scan UDF. Tracked separately per iceberg-readiness memo; an Iceberg-Polars composition example would land as a sibling here once those substrates exist.
  • Lazy scan_pyarrow_dataset: useful for multi-file Parquet datasets, but it adds complexity without changing the conceptual round-trip. An easy follow-up once adopters ask for it.

Adds PyMEOS_Examples/Polars_TemporalParquet.py demonstrating the
zero-copy bridge between PyMEOS' data-lake interchange layer
(`pymeos.io`) and the Polars DataFrame engine.

Round-trip covered:
1. Build a temporal-point dataset using PyMEOS (3 trips, 4 instants each)
2. Write to TemporalParquet via `pymeos.io.write_temporal` — opaque
   MEOS-WKB payload + native-scalar sidecar columns + self-describing
   `temporal` footer (byte-compatible with MobilityDuck's
   `temporalFooter()` consumer recipe)
3. Read back with PyMEOS — full PyMEOS object reconstruction
4. Consume the SAME file in Polars zero-copy via `pl.from_arrow` —
   Polars sees sidecar columns as native primitives
5. Sidecar-driven predicate pushdown via `pyarrow.parquet.read_table`
   `filters=[…]` — row-groups pruned before any per-row decode

Depends on the `pymeos.io` module shipping in PyMEOS PR #84
(`feat/datalake-consumer`). Until #84 reaches PyMEOS master,
adopters install PyMEOS from the branch directly:

  pip install "git+https://github.com/MobilityDB/PyMEOS.git@feat/datalake-consumer#egg=pymeos[parquet]"

After #84 merges, the standard `pip install pymeos[parquet]` path
works without code changes.

README updated to index the new example with the install caveat.