Skip to content

[python] Reject non-checkpoint-stable Python memory values at set()#839

Merged
wenjin272 merged 1 commit into
apache:mainfrom
weiqingy:723-pr2-impl
Jun 12, 2026
Merged

[python] Reject non-checkpoint-stable Python memory values at set()#839
wenjin272 merged 1 commit into
apache:mainfrom
weiqingy:723-pr2-impl

Conversation

@weiqingy

@weiqingy weiqingy commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator

Linked issue: #723

Purpose of change

Python memory values cross the Pemja boundary into Flink state when written via MemoryObject.set(). Values Pemja cannot materialize into native JVM types — Pydantic models, uuid.UUID, str/int Enums, custom classes, tuple/set, dicts with non-str keys — are stored as stale PyObject wrappers and SIGSEGV in JcpPyObject_FromJObject on state restore.

This adds a set()-time guard that rejects such values early with a clear TypeError (naming the offending location and a conversion), instead of letting them fail on restore. It follows #828 (built-in tool-context normalization) and covers arbitrary user values. Forward-looking: it does not migrate pre-fix checkpoints.

The accepted contract is recursive and exact-typed: None | bool | int | float | str | list[...] | dict[str, ...]. Exact-type (not isinstance) is required so a str/int Enum — which passes isinstance(x, str) yet is still Pemja-wrapped — cannot slip through. The same guard runs in both LocalMemoryObject.set and FlinkMemoryObject.set, so local execution fails the same way production would.

The memory unit tests no longer store a set/custom class, and the workflow/flink-integration e2e agents now materialize their Pydantic payloads (model_dump(mode="json") on write, model_validate on read) — they stored models on the real Flink path and were latent instances of this bug.

bytes is intentionally not yet accepted: the Python→Java bytes conversion through Pemja is unverified, and wrongly accepting an unsafe value would defeat the guard. A follow-up will verify it. The Python value-contract docs are a separate follow-up PR (disjoint file).

Tests

New runtime/tests/test_memory_value_validation.py covers the accepted types (incl. nested list/dict) and every rejection, plus that FlinkMemoryObject.set raises a raw TypeError (not MemoryObjectError). A true checkpoint-restore test can't run on the MiniCluster (in-place recovery doesn't recreate the JVM, so the Pemja path isn't crossed); the unit tests assert the accept/reject classification as the proxy, same as #828.

uv run --no-sync pytest flink_agents/runtime/tests flink_agents/plan/tests/actions -k "not e2e" passes; ruff check is clean.

API

Adds validate_memory_value(path, value) in flink_agents.api.memory_object. MemoryObject.set() now raises TypeError for values that are not recursively checkpoint-stable; previously they were accepted and failed only on restore.

Documentation

  • doc-needed
  • doc-not-needed
  • doc-included

Python memory values cross the Pemja boundary into Flink state. Values
Pemja cannot materialize into native JVM types (Pydantic models, UUID,
str/int Enums, custom classes, tuple/set, non-str dict keys) are stored
as stale PyObject wrappers and SIGSEGV in JcpPyObject_FromJObject on
checkpoint restore.

Add validate_memory_value() and call it from both LocalMemoryObject.set
and FlinkMemoryObject.set, rejecting any value not recursively composed
of None/bool/int/float/str/list/dict[str, ...] with a clear, actionable
TypeError. Exact-type matching (not isinstance) is required so a
str/int Enum cannot slip through. Conform the existing call sites that
stored non-stable values: drop the set/custom-class stores in the memory
unit tests, and materialize the Pydantic payloads in the workflow and
flink-integration e2e agents (model_dump on write, model_validate on
read).

This follows apache#828 (built-in tool-context normalization) and is
forward-looking: it does not migrate pre-fix checkpoints.

This closes part of apache#723.
@github-actions github-actions Bot added doc-needed Your PR changes impact docs. fixVersion/0.3.0 The feature or bug should be implemented/fixed in the 0.3.0 version. priority/major Default priority of the PR or issue. and removed doc-needed Your PR changes impact docs. labels Jun 11, 2026

@wenjin272 wenjin272 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this @weiqingy, LGTM.

@wenjin272 wenjin272 merged commit d7f8911 into apache:main Jun 12, 2026
27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

doc-needed Your PR changes impact docs. fixVersion/0.3.0 The feature or bug should be implemented/fixed in the 0.3.0 version. priority/major Default priority of the PR or issue.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants