feat: add checkpoint/resume for long document processing by ag9920 · Pull Request #227 · VectifyAI/PageIndex

ag9920 · 2026-04-11T05:52:08Z

Summary

Add two-phase checkpoint support to the document processing pipeline,
enabling users to resume from the last completed stage instead of
restarting from scratch when processing is interrupted.

Closes #170

Problem

Processing large documents (100+ pages) through PageIndex is expensive
and time-consuming: the LLM calls in tree_parser and
generate_summaries can take 10-30 minutes and consume a large number
of tokens. If the process crashes at any point (API rate limit, network
timeout, context overflow), all progress is lost and users must start over.

Issue #170 raised this exact pain point: users need a way to recover
from failures without re-running the entire pipeline.

Solution

Introduce a two-phase checkpoint mechanism that saves intermediate
results after each expensive LLM stage:

| Checkpoint File | Saved After | What It Contains |
| --- | --- | --- |
| `{doc}_tree.json` | `tree_parser` completes | Raw tree structure from TOC parsing |
| `{doc}_summary.json` | `generate_summaries` completes | Tree structure with summaries attached |

On resume, the pipeline automatically picks the latest available
checkpoint (summary > tree), skipping all completed LLM calls.
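
The "summary > tree" preference can be sketched as a simple ordered lookup. This is a minimal illustration, not the PR's actual code: the function name `find_latest_checkpoint` and its return shape are assumptions, and only the file-name patterns come from the table above.

```python
import json
import os


def find_latest_checkpoint(checkpoint_dir, doc_name):
    """Return (stage, data) for the most advanced checkpoint on disk.

    Hypothetical helper: the name and return shape are assumptions made
    for illustration; only the file-name patterns come from the PR text.
    """
    # Ordered from most to least advanced stage: summary > tree.
    candidates = [
        ("summary", f"{doc_name}_summary.json"),
        ("tree", f"{doc_name}_tree.json"),
    ]
    for stage, filename in candidates:
        path = os.path.join(checkpoint_dir, filename)
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                return stage, json.load(f)
    # Nothing found: report the earliest-stage file a resume would need.
    expected = os.path.join(checkpoint_dir, candidates[-1][1])
    raise FileNotFoundError(f"No checkpoint found; expected e.g. {expected}")
```

If the returned stage is `"summary"`, no LLM stage needs to run again; if it is `"tree"`, only summary generation remains.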

For Markdown documents, a single checkpoint ({doc}_md_summary.json)
is saved after summary generation, since tree construction is local
and doesn't require LLM calls.

Usage

CLI:

# First run: parse + save checkpoints
python run_pageindex.py --pdf_path doc.pdf --checkpoint-dir ./checkpoints

# If interrupted, resume from the latest checkpoint
python run_pageindex.py --pdf_path doc.pdf --checkpoint-dir ./checkpoints --resume

# Markdown works the same way
python run_pageindex.py --md_path doc.md --checkpoint-dir ./checkpoints --resume

Python API:

from pageindex import PageIndexClient

client = PageIndexClient(workspace="./workspace")

# Enable checkpointing
doc_id = client.index("doc.pdf", checkpoint_dir="./checkpoints")

# Resume after interruption
doc_id = client.index("doc.pdf", checkpoint_dir="./checkpoints", resume=True)

Human-in-the-loop correction:

Users can manually edit checkpoint JSON files (e.g., fix an incorrect
page number identified by the LLM) before resuming, enabling a
human-in-the-loop workflow not previously possible.
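
As a rough sketch of such a correction, the script below patches one node's page number in a checkpoint file before a resumed run. The checkpoint schema is not described in this PR, so the field names `title`, `start_index`, and `nodes`, as well as the helper `fix_page_number`, are assumptions for illustration only; adjust them to match the actual JSON.

```python
import json


def fix_page_number(checkpoint_path, title, new_start_index):
    """Set start_index on the first node whose title matches.

    Assumed schema (not confirmed by the PR): each node is a dict with
    "title", "start_index", and an optional "nodes" list of children.
    Returns True if a node was updated, False otherwise.
    """
    with open(checkpoint_path, encoding="utf-8") as f:
        data = json.load(f)

    def visit(node):
        # Depth-first search; any() stops at the first match.
        if isinstance(node, dict):
            if node.get("title") == title:
                node["start_index"] = new_start_index
                return True
            return any(visit(child) for child in node.get("nodes", []))
        if isinstance(node, list):
            return any(visit(item) for item in node)
        return False

    found = visit(data)
    if found:
        with open(checkpoint_path, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2, ensure_ascii=False)
    return found
```

After the edit, running with `--resume` would pick up the corrected file instead of calling the LLM again.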

Implementation Details

- Atomic writes: Checkpoints are written to a .tmp file first,
  then os.replace() atomically swaps it into place. This prevents
  corruption if the process crashes mid-write.
- Backward compatible: When checkpoint_dir is not set (default
  None), behavior is identical to before, with zero overhead.
- Config integration: New checkpoint_dir (default: null) and
  resume (default: "no") fields added to config.yaml, validated
  by the existing ConfigLoader.
- Error handling: --resume without --checkpoint-dir raises a
  clear error. Resume with no checkpoint files found raises
  FileNotFoundError with the expected file path.
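
The atomic-write pattern described above can be sketched as follows. This is an illustration under stated assumptions rather than the PR's `_save_checkpoint()` helper: the function name `save_checkpoint`, its signature, and the explicit `fsync` call are not taken from the diff.

```python
import json
import os


def save_checkpoint(data, checkpoint_dir, filename):
    """Write data as JSON to checkpoint_dir/filename via a temp file.

    Hypothetical helper; the real helper's signature may differ.
    """
    os.makedirs(checkpoint_dir, exist_ok=True)
    final_path = os.path.join(checkpoint_dir, filename)
    tmp_path = final_path + ".tmp"

    # Write the full payload to the temporary file first.
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
        f.flush()
        # Assumption: flushing to disk is an extra safety step that the
        # PR text does not mention.
        os.fsync(f.fileno())

    # os.replace() swaps the file into place in one step, so readers see
    # either the old checkpoint or the complete new one.
    os.replace(tmp_path, final_path)
    return final_path
```

Because the temporary file lives in the same directory as the target, the rename stays on one filesystem, which is what makes the swap atomic.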

Files Changed

| File | Change |
| --- | --- |
| `pageindex/page_index.py` | Two-phase checkpoint in `page_index_builder()` + `_save_checkpoint()` helper |
| `pageindex/page_index_md.py` | Checkpoint after summary generation in `md_to_tree()` + `_save_checkpoint_md()` helper |
| `pageindex/client.py` | Pass `checkpoint_dir`/`resume` through `PageIndexClient.index()` for both PDF and MD |
| `pageindex/config.yaml` | Add `checkpoint_dir: null` and `resume: "no"` |
| `run_pageindex.py` | Add `--checkpoint-dir` and `--resume` CLI args, global validation, pass to both PDF/MD branches |

Testing

- Verified with a 21-page PDF using Kimi K2.5: both _tree.json and
  _summary.json checkpoints are saved correctly
- --resume successfully skips LLM calls and loads from checkpoint
- --resume without --checkpoint-dir raises clear error
- Default behavior (no checkpoint_dir) unchanged
- AST syntax check passes for all modified files
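
The "--resume without --checkpoint-dir" check above could be reproduced with a small argparse sketch like the one below. The flag names match the usage examples in this description; treating `--resume` as a boolean flag and reporting the problem through `parser.error()` are assumptions, since the actual validation code in `run_pageindex.py` is not shown here.

```python
import argparse


def parse_args(argv):
    """Parse the checkpoint-related flags and validate their combination."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--pdf_path")
    parser.add_argument("--md_path")
    parser.add_argument("--checkpoint-dir", default=None)
    # Assumption: --resume is a boolean switch, as the CLI examples suggest.
    parser.add_argument("--resume", action="store_true")
    args = parser.parse_args(argv)

    # Resuming only makes sense when there is a directory to load from.
    if args.resume and not args.checkpoint_dir:
        # parser.error() prints the message and exits with status 2.
        parser.error("--resume requires --checkpoint-dir")
    return args
```

Calling `parse_args(["--pdf_path", "doc.pdf", "--resume"])` exits with a usage error, while adding `--checkpoint-dir ./checkpoints` returns the parsed arguments.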

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

feat: add checkpoint/resume for long document processing

93b6ac6

claude bot reviewed Apr 11, 2026

ag9920 wants to merge 1 commit into VectifyAI:main from ag9920:feat_checkpoint_document
