Skip to content

[python] Normalize built-in tool-call context for checkpoint safety#828

Merged
wenjin272 merged 1 commit into
apache:mainfrom
weiqingy:723-impl
Jun 10, 2026
Merged

[python] Normalize built-in tool-call context for checkpoint safety#828
wenjin272 merged 1 commit into
apache:mainfrom
weiqingy:723-impl

Conversation

@weiqingy

Copy link
Copy Markdown
Collaborator

Linked issue: #723

Purpose of change

The built-in chat-model action stored non-primitive Python objects in sensory memory: UUID values, an OutputSchema, and ChatMessage lists. Pemja wraps such objects as PyObject holders whose JNI pointers go stale after a TaskManager/Python restart, so restoring the checkpointed tool context crashes in JcpPyObject_FromJObject.

This normalizes those values to a primitive-only form before they reach memory and reconstructs the rich types on read, entirely inside the three tool-context helpers in chat_model_action.py (no caller or signature changes):

  • ChatMessage lists are stored via model_dump(mode="json") and reconstructed via ChatMessage.model_validate. mode="json" is required because MessageRole is a str, Enum that a bare model_dump() would leave as an enum member.
  • initial_request_id is stored as str and reconstructed to UUID.
  • output_schema is stored via OutputSchema.model_dump() and reconstructed via OutputSchema.model_validate (None-guarded).

Dict keys were already strings and the retry-stats context already holds only ints, so both are unchanged. prompt_args is user-supplied and already round-trips as a ChatRequestEvent attribute, so it is left as-is.

This is the first of the agreed changes on #723. A follow-up PR will add a set()-time validator for user-supplied memory values and document the Python memory value contract.

Tests

plan/tests/actions/test_chat_model_action.py: 9 new unit tests plus a recursive _assert_primitive helper that asserts the stored form is primitive-only (a checkpoint-safety proxy, since no Python checkpoint/restore harness exists), and that round-trips reconstruct UUID / OutputSchema / ChatMessage, preserve the UUIDstr dict-key match and RowTypeInfo, handle a None output_schema, and keep model / prompt_args.

plan/tests/actions/test_chat_model_action_retry.py: updated the hand-seeded context in test_forwards_saved_prompt_args_to_chat to the new primitive-only stored form.

cd python
uv run --no-sync pytest flink_agents/plan/tests/actions/ -v   # 22 passed
uv run --no-sync pytest flink_agents -k "not e2e_tests" -q    # all green

API

No public API change. The normalization is fully encapsulated in the existing tool-context helpers; callers and method signatures are unchanged.

Documentation

  • doc-needed
  • doc-not-needed
  • doc-included

The built-in chat-model action stored non-primitive Python objects in
sensory memory: UUID values, an OutputSchema, and ChatMessage lists.
Pemja wraps such objects as PyObject holders whose JNI pointers go stale
after a TaskManager/Python restart, so restoring the checkpointed tool
context crashes in JcpPyObject_FromJObject.

Normalize these values to a primitive-only form before they reach memory
and reconstruct the rich types on read, fully inside the three
tool-context helpers (no caller or signature changes):

- ChatMessage lists are stored via model_dump(mode="json") and
  reconstructed via ChatMessage.model_validate.
- initial_request_id is stored as str and reconstructed to UUID.
- output_schema is stored via OutputSchema.model_dump() and
  reconstructed via OutputSchema.model_validate.

Dict keys were already strings and the retry-stats context already holds
only ints, so both are unchanged. prompt_args is user-supplied and
already round-trips as a ChatRequestEvent attribute, so it is left as-is.
@github-actions github-actions Bot added doc-not-needed Your PR changes do not impact docs fixVersion/0.3.0 The feature or bug should be implemented/fixed in the 0.3.0 version. priority/major Default priority of the PR or issue. labels Jun 10, 2026
@joeyutong

Copy link
Copy Markdown
Collaborator

LGTM. One thing to note: this fixes assume the job is restarted from a clean state.

@wenjin272 wenjin272 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

I tried to add an e2e case to truly verify that recovery from a checkpoint fails before the fix and is resolved after it. However, in the MiniCluster, triggering an in-place recovery by throwing an exception does not recreate the JVM, so the problematic code path is never exercised. I think we can, after #708 merge, add a case that manually kills the TM process in a standalone cluster to trigger recovery, and use that to verify this fix.

@wenjin272 wenjin272 merged commit 06e7722 into apache:main Jun 10, 2026
26 checks passed
@weiqingy

Copy link
Copy Markdown
Collaborator Author

LGTM. One thing to note: this fixes assume the job is restarted from a clean state.

Thanks for the review @joeyutong .
Agreed — that's the right way to read it. The change makes the context written to memory primitive-only from here on, and the read path now expects that primitive layout, so it doesn't migrate state already checkpointed by the old code: a savepoint taken before this fix still holds the non-primitive form and would hit the original problem on restore. So it's forward-looking — correct for fresh runs and for checkpoints written after it lands, not for restoring a pre-fix checkpoint. That fits where the project is today (no state-migration guarantees on the built-in tool context yet); if we ever need pre-fix-checkpoint compatibility we can add a migration shim on the read path.

@weiqingy

Copy link
Copy Markdown
Collaborator Author

LGTM.

I tried to add an e2e case to truly verify that recovery from a checkpoint fails before the fix and is resolved after it. However, in the MiniCluster, triggering an in-place recovery by throwing an exception does not recreate the JVM, so the problematic code path is never exercised. I think we can, after #708 merge, add a case that manually kills the TM process in a standalone cluster to trigger recovery, and use that to verify this fix.

@wenjin272 Thanks for digging into the e2e angle. That matches what we found while scoping this — local/MiniCluster mode never crosses Pemja, so the SIGSEGV path can't be reproduced there, which is why the unit tests assert the stored form is recursively primitive as a checkpoint-safety proxy instead of driving a real restore. Killing the TM process in a standalone cluster after #708 is exactly the right way to get true before/after verification. I filed #836 to track it so it doesn't get lost.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

doc-not-needed Your PR changes do not impact docs fixVersion/0.3.0 The feature or bug should be implemented/fixed in the 0.3.0 version. priority/major Default priority of the PR or issue.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants