feat: add checkpoint/resume for long document processing #227
Open
ag9920 wants to merge 1 commit into VectifyAI:main from
Summary
Add two-phase checkpoint support to the document processing pipeline,
enabling users to resume from the last completed stage instead of
restarting from scratch when processing is interrupted.
Closes #170
Problem
Processing large documents (100+ pages) through PageIndex is expensive
and time-consuming: the LLM calls in `tree_parser` and
`generate_summaries` can take 10-30 minutes and cost significant tokens.
If the process crashes at any point (API rate limit, network timeout,
context overflow), all progress is lost and users must start over.
Issue #170 raised this exact pain point: users need a way to recover
from failures without re-running the entire pipeline.
Solution
Introduce a two-phase checkpoint mechanism that saves intermediate
results after each expensive LLM stage:
- `{doc}_tree.json`: saved after `tree_parser` completes
- `{doc}_summary.json`: saved after `generate_summaries` completes
On resume, the pipeline automatically picks the latest available
checkpoint (summary > tree), skipping all completed LLM calls.
For Markdown documents, a single checkpoint (`{doc}_md_summary.json`)
is saved after summary generation, since tree construction is local
and doesn't require LLM calls.
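The selection rule above (summary > tree) can be sketched as follows; the checkpoint filenames are the ones this PR writes, while the function name and return shape are illustrative:

```python
import os

def pick_checkpoint(checkpoint_dir, doc):
    """Return (stage, path) for the latest available checkpoint, or (None, None).

    Summary checkpoints take priority over tree checkpoints, since the
    summary stage runs after (and therefore subsumes) the tree stage.
    """
    summary = os.path.join(checkpoint_dir, f"{doc}_summary.json")
    tree = os.path.join(checkpoint_dir, f"{doc}_tree.json")
    if os.path.exists(summary):
        return "summary", summary
    if os.path.exists(tree):
        return "tree", tree
    return None, None
```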
Usage
CLI:
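A sketch of the CLI flow; only `--checkpoint-dir` and `--resume` are introduced by this PR, and the input flag is assumed to follow the existing `run_pageindex.py` interface:

```shell
# First run: write checkpoints after each expensive LLM stage
python run_pageindex.py --pdf_path doc.pdf --checkpoint-dir ./checkpoints

# After an interruption: resume from the latest checkpoint
python run_pageindex.py --pdf_path doc.pdf --checkpoint-dir ./checkpoints --resume yes
```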
Python API:
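A sketch of the new parameters on `PageIndexClient.index()`; the constructor arguments and the positional document argument are assumptions about the existing client API, while `checkpoint_dir` and `resume` are the fields this PR threads through:

```python
from pageindex import PageIndexClient

client = PageIndexClient()
result = client.index(
    "doc.pdf",
    checkpoint_dir="./checkpoints",  # enable checkpointing (default: None)
    resume="yes",                    # pick up from the latest checkpoint
)
```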
Human-in-the-loop correction:
Users can manually edit checkpoint JSON files (e.g., fix an incorrect
page number identified by the LLM) before resuming, enabling a
human-in-the-loop workflow not previously possible.
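A correction pass over a checkpoint file might look like the sketch below; the flat node list with `title`/`page` keys is an assumed schema, since the real layout is whatever `tree_parser` wrote:

```python
import json

def fix_page_number(checkpoint_path, node_title, correct_page):
    """Patch one node's page number in a checkpoint file before resuming.

    Assumes a flat list of nodes with 'title' and 'page' keys; adjust
    to match the actual checkpoint schema.
    """
    with open(checkpoint_path) as f:
        nodes = json.load(f)
    for node in nodes:
        if node["title"] == node_title:
            node["page"] = correct_page
    with open(checkpoint_path, "w") as f:
        json.dump(nodes, f, indent=2)
```

After editing, rerunning with `--resume` picks up the corrected file instead of re-querying the LLM.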
Implementation Details
- Atomic writes: Checkpoints are written to a `.tmp` file first, then
`os.replace()` atomically swaps it into place. This prevents
corruption if the process crashes mid-write.
- Backward compatible: When `checkpoint_dir` is not set (default
`None`), behavior is identical to before, with zero overhead.
- Config integration: New `checkpoint_dir` (default: `null`) and
`resume` (default: `"no"`) fields added to `config.yaml`, validated
by the existing `ConfigLoader`.
- Error handling: `--resume` without `--checkpoint-dir` raises a
clear error. Resume with no checkpoint files found raises
`FileNotFoundError` with the expected file path.
Files Changed
- `pageindex/page_index.py`: `page_index_builder()` + `_save_checkpoint()` helper
- `pageindex/page_index_md.py`: `md_to_tree()` + `_save_checkpoint_md()` helper
- `pageindex/client.py`: pass `checkpoint_dir`/`resume` through `PageIndexClient.index()` for both PDF and MD
- `pageindex/config.yaml`: `checkpoint_dir: null` and `resume: "no"`
- `run_pageindex.py`: `--checkpoint-dir` and `--resume` CLI args, global validation, passed to both PDF/MD branches
Testing
- `_tree.json` and `_summary.json` checkpoints are saved correctly
- `--resume` successfully skips LLM calls and loads from checkpoint
- `--resume` without `--checkpoint-dir` raises a clear error