feat(text): extract text from form XObjects #85

Open
l-ajeti wants to merge 3 commits into LibPDF-js:main from l-ajeti:feat/form-xobject-text-extraction
l-ajeti commented Jul 3, 2026

Text extraction only processed a page's top-level content stream, so pages that draw all their text inside form XObjects (common in tax/accounting and reporting PDFs, e.g. IRS Form 8879-PE) yielded empty strings, making them indistinguishable from scanned images.

TextExtractor now handles the Do operator: on a /Subtype /Form XObject it recurses into the form's content stream using the form's own /Resources, with the form's /Matrix concatenated onto the CTM. Image XObjects resolve to null and are skipped.
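As a rough illustration of the `/Matrix` step, the sketch below concatenates a form matrix onto a CTM using the usual PDF six-number matrix layout. The names `Matrix`, `concat`, and `applyToPoint` and the sample values are hypothetical and not taken from this PR; the form matrix is assumed to be applied first, then the existing CTM.

```typescript
// A PDF matrix [a b c d e f] stands for the 3x3 matrix
// [[a b 0], [c d 0], [e f 1]] applied to row vectors [x y 1].
type Matrix = [number, number, number, number, number, number];

// Returns m × n: the transform that applies m first, then n.
function concat(m: Matrix, n: Matrix): Matrix {
  const [a1, b1, c1, d1, e1, f1] = m;
  const [a2, b2, c2, d2, e2, f2] = n;
  return [
    a1 * a2 + b1 * c2,
    a1 * b2 + b1 * d2,
    c1 * a2 + d1 * c2,
    c1 * b2 + d1 * d2,
    e1 * a2 + f1 * c2 + e2,
    e1 * b2 + f1 * d2 + f2,
  ];
}

// Maps a point from the matrix's source space to its target space.
function applyToPoint(m: Matrix, x: number, y: number): [number, number] {
  const [a, b, c, d, e, f] = m;
  return [a * x + c * y + e, b * x + d * y + f];
}

// Hypothetical values: the page scales by 2, the form translates by (100, 200).
const ctm: Matrix = [2, 0, 0, 2, 0, 0];
const formMatrix: Matrix = [1, 0, 0, 1, 100, 200];

// Assumed order: form space -> form /Matrix -> existing CTM.
const ctmInsideForm = concat(formMatrix, ctm);
const placed = applyToPoint(ctmInsideForm, 10, 10);
console.log(ctmInsideForm, placed);
```

With these values, text drawn at (10, 10) in form space lands at (220, 420) in device space, so glyph positions extracted inside the form remain comparable with positions from the page's own stream.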

- Add ResourceResolver/FormXObject abstraction (fonts + XObjects), scoped per content stream; PDFPage builds and memoizes resolvers by Resources-dict identity (matching _resourceCache/_annotationCache).
- TextState gains captureState/restoreState that snapshot the full state and graphics-stack depth, so unbalanced q/Q inside a form cannot corrupt the rest of the page (lenient malformed-PDF handling).
- Guard nested/cyclic forms with a depth cap.
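The state-isolation and depth-cap ideas from the list above can be sketched with a toy interpreter. Everything here is hypothetical (`Op`, `Form`, `State`, `extract`, and the cap of 3) and is not the PR's actual code; for simplicity the snapshot copies the whole save stack rather than recording only its depth.

```typescript
// Toy operator model; a Do with form === null stands in for an image XObject.
type Op =
  | { kind: "q" }
  | { kind: "Q" }
  | { kind: "setSize"; size: number }
  | { kind: "text"; value: string }
  | { kind: "Do"; form: Form | null };

interface Form {
  ops: Op[];
}

interface Snapshot {
  fontSize: number;
  stack: number[];
}

const MAX_FORM_DEPTH = 3; // assumed cap, chosen only for this example

class State {
  fontSize = 12;
  private stack: number[] = [];

  save(): void {
    this.stack.push(this.fontSize);
  }
  restore(): void {
    const saved = this.stack.pop();
    if (saved !== undefined) this.fontSize = saved; // tolerate a stray Q
  }
  capture(): Snapshot {
    return { fontSize: this.fontSize, stack: [...this.stack] };
  }
  restoreTo(snap: Snapshot): void {
    this.fontSize = snap.fontSize;
    this.stack = [...snap.stack];
  }
}

function extract(ops: Op[], state: State, out: string[], depth = 0): void {
  for (const op of ops) {
    switch (op.kind) {
      case "q": state.save(); break;
      case "Q": state.restore(); break;
      case "setSize": state.fontSize = op.size; break;
      case "text": out.push(`${op.value}@${state.fontSize}`); break;
      case "Do": {
        // Skip images and anything nested past the cap (covers cycles).
        if (op.form === null || depth >= MAX_FORM_DEPTH) break;
        const snap = state.capture();
        extract(op.form.ops, state, out, depth + 1);
        state.restoreTo(snap); // undo whatever the form left behind
        break;
      }
    }
  }
}

// A form with an unbalanced q: it never restores its larger font size.
const leaky: Form = {
  ops: [{ kind: "q" }, { kind: "setSize", size: 30 }, { kind: "text", value: "B" }],
};
const pageOut: string[] = [];
extract(
  [{ kind: "text", value: "A" }, { kind: "Do", form: leaky }, { kind: "text", value: "C" }],
  new State(),
  pageOut,
);

// A form that invokes itself; the cap stops the recursion.
const loop: Form = { ops: [] };
loop.ops.push({ kind: "text", value: "X" }, { kind: "Do", form: loop });
const loopOut: string[] = [];
extract([{ kind: "Do", form: loop }], new State(), loopOut);
console.log(pageOut, loopOut);
```

In this sketch the text after the leaky form is still reported at size 12, and the self-referencing form contributes its text once per permitted nesting level before the cap stops it.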

Tests: unit coverage for nested extraction, form-scoped fonts, state isolation, /Matrix application, cycle safety, and back-compat, plus an integration fixture (form-xobject-text.pdf). The rtl-placed-text fixture is regenerated to drop a duplicate text layer that conflicted with the now-correct form recursion; its RTL content stream (the test subject) is preserved byte-for-byte.

Plan: .agents/plans/046-form-xobject-text-extraction.md

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
vercel Bot commented Jul 3, 2026

@l-ajeti is attempting to deploy a commit to the mythie's projects Team on Vercel.

A member of the Team first needs to authorize it.
