feat(text): extract text from form XObjects by l-ajeti · Pull Request #85 · LibPDF-js/core

l-ajeti · 2026-07-03T11:07:39Z

Text extraction only processed a page's top-level content stream, so pages that draw all their text inside form XObjects (common in tax/accounting and reporting PDFs, e.g. IRS Form 8879-PE) extracted as empty strings — indistinguishable from a scanned image.

TextExtractor now handles the Do operator: on a /Subtype /Form XObject it recurses into the form's content stream using the form's own /Resources, with the form's /Matrix concatenated onto the CTM. Image XObjects resolve to null and are skipped.
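The recursion described above could look roughly like the following. This is a minimal sketch, not the code in this PR: the `Op`, `Placed`, `FormXObject`, and `ResourceResolver` shapes, the `extract` and `concat` helpers, and the `MAX_FORM_DEPTH` value are all assumptions made for illustration. Only the behaviour (recurse on form XObjects with their own resources, concatenate /Matrix onto the CTM, skip XObjects that resolve to null, cap the depth) comes from the description.

```typescript
// Hypothetical types for illustration only; not the library's actual API.
type Matrix = [number, number, number, number, number, number]; // [a b c d e f]

type Op =
  | { kind: "text"; value: string; x: number; y: number }
  | { kind: "Do"; name: string };

interface ResourceResolver {
  // Assumed contract: returns null for anything that is not a
  // /Subtype /Form XObject (for example an image).
  getForm(name: string): FormXObject | null;
}

interface FormXObject {
  matrix: Matrix;              // the form's /Matrix
  ops: Op[];                   // the form's parsed content stream
  resources: ResourceResolver; // the form's own /Resources
}

interface Placed { text: string; x: number; y: number }

const IDENTITY: Matrix = [1, 0, 0, 1, 0, 0];
const MAX_FORM_DEPTH = 16; // assumed value; the PR only says "a depth cap"

// Returns m x ctm in PDF's row-vector convention, i.e. m is applied first.
function concat(m: Matrix, ctm: Matrix): Matrix {
  const [a1, b1, c1, d1, e1, f1] = m;
  const [a2, b2, c2, d2, e2, f2] = ctm;
  return [
    a1 * a2 + b1 * c2,
    a1 * b2 + b1 * d2,
    c1 * a2 + d1 * c2,
    c1 * b2 + d1 * d2,
    e1 * a2 + f1 * c2 + e2,
    e1 * b2 + f1 * d2 + f2,
  ];
}

function extract(
  ops: Op[],
  resources: ResourceResolver,
  ctm: Matrix = IDENTITY,
  depth = 0,
  out: Placed[] = [],
): Placed[] {
  for (const op of ops) {
    if (op.kind === "text") {
      // Map the text position through the current transformation matrix.
      const [a, b, c, d, e, f] = ctm;
      out.push({
        text: op.value,
        x: a * op.x + c * op.y + e,
        y: b * op.x + d * op.y + f,
      });
      continue;
    }
    // Do operator: stop descending once the cap is reached (cycle guard).
    if (depth >= MAX_FORM_DEPTH) continue;
    const form = resources.getForm(op.name);
    if (form === null) continue; // image or unknown XObject: skipped
    // Recurse with the form's own resources and /Matrix applied before the CTM.
    extract(form.ops, form.resources, concat(form.matrix, ctm), depth + 1, out);
  }
  return out;
}
```

In this sketch a form that draws itself terminates after `MAX_FORM_DEPTH` levels instead of recursing forever, and names are always looked up in the resolver of the stream currently being interpreted, which is what keeps form-scoped fonts and XObjects from leaking into the page.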

- Add ResourceResolver/FormXObject abstraction (fonts + XObjects), scoped per content stream; PDFPage builds and memoizes resolvers by Resources-dict identity (matching _resourceCache/_annotationCache).
- TextState gains captureState/restoreState that snapshot the full state and graphics-stack depth, so unbalanced q/Q inside a form cannot corrupt the rest of the page (lenient malformed-PDF handling).
- Guard nested/cyclic forms with a depth cap.
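To illustrate the state-isolation point, here is a minimal sketch of a snapshot that records the current state together with the graphics-stack depth. Only the names `TextState`, `captureState`, and `restoreState` come from the description; the fields, the `save`/`restore` methods standing in for `q`/`Q`, and the `clone` helper are assumptions, and the real class presumably tracks far more state.

```typescript
// Hypothetical, heavily simplified state; field names are assumptions.
interface GraphicsState {
  ctm: number[];
  fontName: string | null;
  fontSize: number;
}

interface Snapshot {
  current: GraphicsState;
  stackDepth: number;
}

function clone(g: GraphicsState): GraphicsState {
  return { ...g, ctm: [...g.ctm] };
}

class TextState {
  private current: GraphicsState = {
    ctm: [1, 0, 0, 1, 0, 0],
    fontName: null,
    fontSize: 0,
  };
  private stack: GraphicsState[] = [];

  get depth(): number {
    return this.stack.length;
  }

  get fontSize(): number {
    return this.current.fontSize;
  }

  setFont(name: string, size: number): void {
    this.current.fontName = name;
    this.current.fontSize = size;
  }

  // q: push a copy of the current state.
  save(): void {
    this.stack.push(clone(this.current));
  }

  // Q: pop the saved state; a Q on an empty stack is ignored (lenient).
  restore(): void {
    const prev = this.stack.pop();
    if (prev !== undefined) this.current = prev;
  }

  // Taken before entering a form.
  captureState(): Snapshot {
    return { current: clone(this.current), stackDepth: this.stack.length };
  }

  // Called after the form finishes, however unbalanced its q/Q were.
  restoreState(snapshot: Snapshot): void {
    // Discard any states the form pushed with q but never popped.
    this.stack.length = Math.min(this.stack.length, snapshot.stackDepth);
    this.current = clone(snapshot.current);
  }
}
```

A caller would wrap each form in `captureState()` and `restoreState(...)`. One limitation of this sketch: recording only the depth lets it drop surplus `q` entries, but it cannot bring back entries that a form removed with surplus `Q` operators. Handling that case would need a copy of the stack or a floor below which `Q` is ignored, and the description does not say which approach the PR takes.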

Tests: unit coverage for nested extraction, form-scoped fonts, state isolation, /Matrix application, cycle safety, and back-compat; an integration fixture (form-xobject-text.pdf). The rtl-placed-text fixture is regenerated to drop a redundant duplicate text layer that conflicted with now-correct form recursion; its RTL content stream (the test subject) is preserved byte-for-byte.

Plan: .agents/plans/046-form-xobject-text-extraction.md

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

vercel · 2026-07-03T11:07:44Z

@l-ajeti is attempting to deploy a commit to the mythie's projects Team on Vercel.

A member of the Team first needs to authorize it.

l-ajeti added 2 commits July 3, 2026 13:14

release: v0.4.2

b681eb2

Merge branch main into feat/form-xobject-text-extraction

72d95b2

l-ajeti wants to merge 3 commits into LibPDF-js:main from l-ajeti:feat/form-xobject-text-extraction
