feat: add checkpoint/resume for long document processing#227

Open
ag9920 wants to merge 1 commit into VectifyAI:main from ag9920:feat_checkpoint_document

ag9920 commented Apr 11, 2026

Summary

Add two-phase checkpoint support to the document processing pipeline,
enabling users to resume from the last completed stage instead of
restarting from scratch when processing is interrupted.

Closes #170

Problem

Processing large documents (100+ pages) through PageIndex is expensive
and time-consuming: the LLM calls in tree_parser and
generate_summaries can take 10-30 minutes and consume a large number
of tokens. If the process crashes at any point (API rate limit, network
timeout, context overflow), all progress is lost and users must start over.

Issue #170 raised this exact pain point: users need a way to recover
from failures without re-running the entire pipeline.

Solution

Introduce a two-phase checkpoint mechanism that saves intermediate
results after each expensive LLM stage:

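| Checkpoint File    | Saved After                  | What It Contains                       |
|--------------------|------------------------------|----------------------------------------|
| {doc}_tree.json    | tree_parser completes        | Raw tree structure from TOC parsing    |
| {doc}_summary.json | generate_summaries completes | Tree structure with summaries attached |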

On resume, the pipeline automatically picks the latest available
checkpoint (summary > tree), skipping all completed LLM calls.
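
The "summary > tree" preference can be sketched roughly as follows. This is an illustration only, not the code in this PR: the function name find_latest_checkpoint and the doc_name argument are hypothetical, and only the two file name patterns come from the table above.

```python
import os


def find_latest_checkpoint(checkpoint_dir, doc_name):
    """Return (stage, path) for the most advanced checkpoint that exists.

    Hypothetical helper: the name and signature are assumptions made for
    this sketch; only the file name patterns come from the PR description.
    """
    # Ordered from most to least advanced stage (summary > tree).
    candidates = [
        ("summary", os.path.join(checkpoint_dir, f"{doc_name}_summary.json")),
        ("tree", os.path.join(checkpoint_dir, f"{doc_name}_tree.json")),
    ]
    for stage, path in candidates:
        if os.path.isfile(path):
            return stage, path
    # Nothing to resume from: report the earliest file that was expected.
    expected = candidates[-1][1]
    raise FileNotFoundError(f"No checkpoint found; expected at least: {expected}")
```

A caller would skip both LLM stages when the stage is "summary" and only the tree parsing stage when it is "tree".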

For Markdown documents, a single checkpoint ({doc}_md_summary.json)
is saved after summary generation, since tree construction is local
and doesn't require LLM calls.

Usage

CLI:

# First run: parse + save checkpoints
python run_pageindex.py --pdf_path doc.pdf --checkpoint-dir ./checkpoints

# If interrupted, resume from the latest checkpoint
python run_pageindex.py --pdf_path doc.pdf --checkpoint-dir ./checkpoints --resume

# Markdown works the same way
python run_pageindex.py --md_path doc.md --checkpoint-dir ./checkpoints --resume

Python API:

from pageindex import PageIndexClient

client = PageIndexClient(workspace="./workspace")

# Enable checkpointing
doc_id = client.index("doc.pdf", checkpoint_dir="./checkpoints")

# Resume after interruption
doc_id = client.index("doc.pdf", checkpoint_dir="./checkpoints", resume=True)

Human-in-the-loop correction:

Users can manually edit checkpoint JSON files (e.g., fix an incorrect
page number identified by the LLM) before resuming, enabling a
human-in-the-loop workflow not previously possible.
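
As a rough illustration of such a correction, the sketch below patches one field in a checkpoint file before resuming. The checkpoint schema is not described in this PR, so the "title" and "start_index" keys and both helper names are assumptions; adjust them to whatever the saved JSON actually contains.

```python
import json


def set_field_by_title(node, title, field, value):
    """Set `field` on every dict whose "title" equals `title`.

    Walks nested dicts and lists and returns the number of nodes changed.
    The "title" key is an assumed part of the checkpoint schema.
    """
    changed = 0
    if isinstance(node, dict):
        if node.get("title") == title:
            node[field] = value
            changed += 1
        for child in node.values():
            changed += set_field_by_title(child, title, field, value)
    elif isinstance(node, list):
        for child in node:
            changed += set_field_by_title(child, title, field, value)
    return changed


def patch_checkpoint(path, title, field, value):
    """Load a checkpoint, apply one correction, and write it back."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    changed = set_field_by_title(data, title, field, value)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    return changed
```

For example, patch_checkpoint("./checkpoints/doc_tree.json", "Introduction", "start_index", 7) would correct a wrong page number, after which the pipeline can be rerun with --resume.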

Implementation Details

  • Atomic writes: Checkpoints are written to a .tmp file first,
    then os.replace() atomically swaps it into place. This prevents
    corruption if the process crashes mid-write.

  • Backward compatible: When checkpoint_dir is not set (default
    None), behavior is identical to before, with no added overhead.

  • Config integration: New checkpoint_dir (default: null) and
    resume (default: "no") fields added to config.yaml, validated
    by the existing ConfigLoader.

  • Error handling: --resume without --checkpoint-dir raises a
    clear error. Resume with no checkpoint files found raises
    FileNotFoundError with the expected file path.
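
The atomic-write pattern described in the first bullet can be sketched as below. This is not the body of _save_checkpoint() from the PR: the function name and signature are hypothetical, and the flush/fsync step is an extra durability measure added in this sketch that the description does not mention.

```python
import json
import os


def save_checkpoint_atomically(data, path):
    """Write `data` as JSON to `path` via a temporary file plus os.replace().

    Hypothetical helper illustrating the pattern, not the PR's own code.
    """
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
        # Assumption: push bytes to disk before the rename so a crash
        # cannot leave a renamed but incomplete file behind.
        f.flush()
        os.fsync(f.fileno())
    # os.replace() overwrites the destination in one step when both
    # paths are on the same filesystem.
    os.replace(tmp_path, path)
```

Because the temporary file lives next to the target, the final rename stays on one filesystem, and a reader sees either the old checkpoint or the complete new one.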

Files Changed

| File | Change |
|------|--------|
| pageindex/page_index.py | Two-phase checkpoint in page_index_builder() + _save_checkpoint() helper |
| pageindex/page_index_md.py | Checkpoint after summary generation in md_to_tree() + _save_checkpoint_md() helper |
| pageindex/client.py | Pass checkpoint_dir/resume through PageIndexClient.index() for both PDF and MD |
| pageindex/config.yaml | Add checkpoint_dir: null and resume: "no" |
| run_pageindex.py | Add --checkpoint-dir and --resume CLI args, global validation, pass to both PDF/MD branches |

Testing

  • Verified with a 21-page PDF using Kimi K2.5: both _tree.json and
    _summary.json checkpoints are saved correctly
  • --resume successfully skips LLM calls and loads from checkpoint
  • --resume without --checkpoint-dir raises a clear error
  • Default behavior (no checkpoint_dir) unchanged
  • AST syntax check passes for all modified files
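
The check that --resume requires --checkpoint-dir could look roughly like this. It is a sketch, not the code in run_pageindex.py: the flag names come from the usage section above, while treating --resume as a boolean switch and reporting the problem through parser.error() are assumptions.

```python
import argparse


def build_parser():
    """Build a parser with only the flags relevant to checkpointing."""
    parser = argparse.ArgumentParser(description="Checkpoint flag validation sketch")
    parser.add_argument("--pdf_path", default=None)
    parser.add_argument("--md_path", default=None)
    parser.add_argument("--checkpoint-dir", dest="checkpoint_dir", default=None)
    # Assumption: --resume is a boolean switch, as the usage examples suggest.
    parser.add_argument("--resume", action="store_true")
    return parser


def parse_args(argv=None):
    """Parse argv and reject --resume when no checkpoint directory is given."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.resume and not args.checkpoint_dir:
        # parser.error() prints the message and exits with status 2.
        parser.error("--resume requires --checkpoint-dir")
    return args
```

Running the validation once, right after parsing, lets the PDF and Markdown branches share the same check.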

@claude claude bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Development

Successfully merging this pull request may close these issues.

Breakpoint error debugging and correction
