[python] Normalize built-in tool-call context for checkpoint safety by weiqingy · Pull Request #828 · apache/flink-agents

weiqingy · 2026-06-10T05:46:17Z

Linked issue: #723

Purpose of change

The built-in chat-model action stored non-primitive Python objects in sensory memory: UUID values, an OutputSchema, and ChatMessage lists. Pemja wraps such objects as PyObject holders whose JNI pointers go stale after a TaskManager/Python restart, so restoring the checkpointed tool context crashes in JcpPyObject_FromJObject.

This normalizes those values to a primitive-only form before they reach memory and reconstructs the rich types on read, entirely inside the three tool-context helpers in chat_model_action.py (no caller or signature changes):

ChatMessage lists are stored via model_dump(mode="json") and reconstructed via ChatMessage.model_validate. mode="json" is required because MessageRole is a str, Enum that a bare model_dump() would leave as an enum member.
initial_request_id is stored as str and reconstructed to UUID.
output_schema is stored via OutputSchema.model_dump() and reconstructed via OutputSchema.model_validate (None-guarded).

Dict keys were already strings and the retry-stats context already holds only ints, so both are unchanged. prompt_args is user-supplied and already round-trips as a ChatRequestEvent attribute, so it is left as-is.

This is the first of the agreed changes on #723. A follow-up PR will add a set()-time validator for user-supplied memory values and document the Python memory value contract.

Tests

plan/tests/actions/test_chat_model_action.py: 9 new unit tests plus a recursive _assert_primitive helper that asserts the stored form is primitive-only (a checkpoint-safety proxy, since no Python checkpoint/restore harness exists), and that round-trips reconstruct UUID / OutputSchema / ChatMessage, preserve the UUID→str dict-key match and RowTypeInfo, handle a None output_schema, and keep model / prompt_args.

plan/tests/actions/test_chat_model_action_retry.py: updated the hand-seeded context in test_forwards_saved_prompt_args_to_chat to the new primitive-only stored form.

cd python
uv run --no-sync pytest flink_agents/plan/tests/actions/ -v   # 22 passed
uv run --no-sync pytest flink_agents -k "not e2e_tests" -q    # all green

API

No public API change. The normalization is fully encapsulated in the existing tool-context helpers; callers and method signatures are unchanged.

Documentation

doc-needed
doc-not-needed
doc-included

The built-in chat-model action stored non-primitive Python objects in sensory memory: UUID values, an OutputSchema, and ChatMessage lists. Pemja wraps such objects as PyObject holders whose JNI pointers go stale after a TaskManager/Python restart, so restoring the checkpointed tool context crashes in JcpPyObject_FromJObject. Normalize these values to a primitive-only form before they reach memory and reconstruct the rich types on read, fully inside the three tool-context helpers (no caller or signature changes): - ChatMessage lists are stored via model_dump(mode="json") and reconstructed via ChatMessage.model_validate. - initial_request_id is stored as str and reconstructed to UUID. - output_schema is stored via OutputSchema.model_dump() and reconstructed via OutputSchema.model_validate. Dict keys were already strings and the retry-stats context already holds only ints, so both are unchanged. prompt_args is user-supplied and already round-trips as a ChatRequestEvent attribute, so it is left as-is.

joeyutong · 2026-06-10T07:40:19Z

LGTM. One thing to note: this fixes assume the job is restarted from a clean state.

wenjin272

LGTM.

I tried to add an e2e case to truly verify that recovery from a checkpoint fails before the fix and is resolved after it. However, in the MiniCluster, triggering an in-place recovery by throwing an exception does not recreate the JVM, so the problematic code path is never exercised. I think we can, after #708 merge, add a case that manually kills the TM process in a standalone cluster to trigger recovery, and use that to verify this fix.

weiqingy · 2026-06-11T04:45:15Z

LGTM. One thing to note: this fixes assume the job is restarted from a clean state.

Thanks for the review @joeyutong .
Agreed — that's the right way to read it. The change makes the context written to memory primitive-only from here on, and the read path now expects that primitive layout, so it doesn't migrate state already checkpointed by the old code: a savepoint taken before this fix still holds the non-primitive form and would hit the original problem on restore. So it's forward-looking — correct for fresh runs and for checkpoints written after it lands, not for restoring a pre-fix checkpoint. That fits where the project is today (no state-migration guarantees on the built-in tool context yet); if we ever need pre-fix-checkpoint compatibility we can add a migration shim on the read path.

weiqingy · 2026-06-11T04:47:50Z

LGTM.

I tried to add an e2e case to truly verify that recovery from a checkpoint fails before the fix and is resolved after it. However, in the MiniCluster, triggering an in-place recovery by throwing an exception does not recreate the JVM, so the problematic code path is never exercised. I think we can, after #708 merge, add a case that manually kills the TM process in a standalone cluster to trigger recovery, and use that to verify this fix.

@wenjin272 Thanks for digging into the e2e angle. That matches what we found while scoping this — local/MiniCluster mode never crosses Pemja, so the SIGSEGV path can't be reproduced there, which is why the unit tests assert the stored form is recursively primitive as a checkpoint-safety proxy instead of driving a real restore. Killing the TM process in a standalone cluster after #708 is exactly the right way to get true before/after verification. I filed #836 to track it so it doesn't get lost.

github-actions Bot added doc-not-needed Your PR changes do not impact docs fixVersion/0.3.0 The feature or bug should be implemented/fixed in the 0.3.0 version. priority/major Default priority of the PR or issue. labels Jun 10, 2026

weiqingy mentioned this pull request Jun 10, 2026

[Bug] Python memory can persist Pemja-incompatible values and fail after state backend restore #723

Closed

2 tasks

wenjin272 approved these changes Jun 10, 2026

View reviewed changes

wenjin272 merged commit 06e7722 into apache:main Jun 10, 2026
26 checks passed

weiqingy mentioned this pull request Jun 11, 2026

[Tech Debt] Add end-to-end checkpoint-recovery test for built-in tool-context #836

Open

weiqingy mentioned this pull request Jun 11, 2026

[python] Reject non-checkpoint-stable Python memory values at set() #839

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[python] Normalize built-in tool-call context for checkpoint safety#828

[python] Normalize built-in tool-call context for checkpoint safety#828
wenjin272 merged 1 commit into
apache:mainfrom
weiqingy:723-impl

weiqingy commented Jun 10, 2026

Uh oh!

joeyutong commented Jun 10, 2026

Uh oh!

wenjin272 left a comment

Uh oh!

Uh oh!

weiqingy commented Jun 11, 2026

Uh oh!

weiqingy commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

weiqingy commented Jun 10, 2026

Purpose of change

Tests

API

Documentation

Uh oh!

joeyutong commented Jun 10, 2026

Uh oh!

wenjin272 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

weiqingy commented Jun 11, 2026

Uh oh!

weiqingy commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants