[python] Normalize built-in tool-call context for checkpoint safety#828
Conversation
The built-in chat-model action stored non-primitive Python objects in sensory memory: UUID values, an OutputSchema, and ChatMessage lists. Pemja wraps such objects as PyObject holders whose JNI pointers go stale after a TaskManager/Python restart, so restoring the checkpointed tool context crashes in JcpPyObject_FromJObject. Normalize these values to a primitive-only form before they reach memory and reconstruct the rich types on read, fully inside the three tool-context helpers (no caller or signature changes): - ChatMessage lists are stored via model_dump(mode="json") and reconstructed via ChatMessage.model_validate. - initial_request_id is stored as str and reconstructed to UUID. - output_schema is stored via OutputSchema.model_dump() and reconstructed via OutputSchema.model_validate. Dict keys were already strings and the retry-stats context already holds only ints, so both are unchanged. prompt_args is user-supplied and already round-trips as a ChatRequestEvent attribute, so it is left as-is.
|
LGTM. One thing to note: this fixes assume the job is restarted from a clean state. |
wenjin272
left a comment
There was a problem hiding this comment.
LGTM.
I tried to add an e2e case to truly verify that recovery from a checkpoint fails before the fix and is resolved after it. However, in the MiniCluster, triggering an in-place recovery by throwing an exception does not recreate the JVM, so the problematic code path is never exercised. I think we can, after #708 merge, add a case that manually kills the TM process in a standalone cluster to trigger recovery, and use that to verify this fix.
Thanks for the review @joeyutong . |
@wenjin272 Thanks for digging into the e2e angle. That matches what we found while scoping this — local/MiniCluster mode never crosses Pemja, so the SIGSEGV path can't be reproduced there, which is why the unit tests assert the stored form is recursively primitive as a checkpoint-safety proxy instead of driving a real restore. Killing the TM process in a standalone cluster after #708 is exactly the right way to get true before/after verification. I filed #836 to track it so it doesn't get lost. |
Linked issue: #723
Purpose of change
The built-in chat-model action stored non-primitive Python objects in sensory memory:
UUIDvalues, anOutputSchema, andChatMessagelists. Pemja wraps such objects asPyObjectholders whose JNI pointers go stale after a TaskManager/Python restart, so restoring the checkpointed tool context crashes inJcpPyObject_FromJObject.This normalizes those values to a primitive-only form before they reach memory and reconstructs the rich types on read, entirely inside the three tool-context helpers in
chat_model_action.py(no caller or signature changes):ChatMessagelists are stored viamodel_dump(mode="json")and reconstructed viaChatMessage.model_validate.mode="json"is required becauseMessageRoleis astr, Enumthat a baremodel_dump()would leave as an enum member.initial_request_idis stored asstrand reconstructed toUUID.output_schemais stored viaOutputSchema.model_dump()and reconstructed viaOutputSchema.model_validate(None-guarded).Dict keys were already strings and the retry-stats context already holds only
ints, so both are unchanged.prompt_argsis user-supplied and already round-trips as aChatRequestEventattribute, so it is left as-is.This is the first of the agreed changes on #723. A follow-up PR will add a
set()-time validator for user-supplied memory values and document the Python memory value contract.Tests
plan/tests/actions/test_chat_model_action.py: 9 new unit tests plus a recursive_assert_primitivehelper that asserts the stored form is primitive-only (a checkpoint-safety proxy, since no Python checkpoint/restore harness exists), and that round-trips reconstructUUID/OutputSchema/ChatMessage, preserve theUUID→strdict-key match andRowTypeInfo, handle aNoneoutput_schema, and keepmodel/prompt_args.plan/tests/actions/test_chat_model_action_retry.py: updated the hand-seeded context intest_forwards_saved_prompt_args_to_chatto the new primitive-only stored form.API
No public API change. The normalization is fully encapsulated in the existing tool-context helpers; callers and method signatures are unchanged.
Documentation
doc-neededdoc-not-neededdoc-included