LibPDF-js · l-ajeti · Jun 30, 2026 · Jul 3, 2026 · Jul 3, 2026
diff --git a/.agents/plans/046-form-xobject-text-extraction.md b/.agents/plans/046-form-xobject-text-extraction.md
@@ -0,0 +1,197 @@
+# 046: Form XObject Text Extraction
+
+## Problem Statement
+
+Text extraction (plan [035](./035-text-extraction.md)) only processes a page's
+top-level content stream. Many real-world PDFs — especially those produced by
+tax/accounting software (e.g. IRS Form 8879-PE), reporting tools, and design
+tools — draw all or part of their text inside **form XObjects** that the page
+content stream merely paints with the `Do` operator:
+
+```
+q /Fm0 Do Q          % page content: paints form XObjects, no text operators
+```
+
+The text (`BT ... Tj ... ET`) lives inside `/Fm0`, which carries its **own**
+`/Resources/Font` dictionary. Because `TextExtractor` has no `Do` handler, these
+pages extract as empty strings, which a caller cannot distinguish from a scanned
+image. That is a correctness gap, not a layout nicety.
+
+This is a follow-up to plan 035 (Tier 3 text features per GOALS.md), closing the
+gap between "page has no top-level text operators" and "page has no text."
+
+## Scope
+
+### In Scope
+
+- Recurse into form XObjects (`Subtype /Form`) invoked via `Do`
+- Resolve fonts and nested XObjects against each form's own `/Resources`
+- Apply the form's `/Matrix` to nested text positions
+- Isolate the caller's graphics/text state across a form invocation, tolerating
+  malformed (unbalanced) `q`/`Q` inside the form
+- Guard against cyclic form references
+- Reuse the existing line-grouping, span, and search pipeline unchanged
+
+### Out of Scope
+
+- Image XObjects (resolve to `null`; no OCR — consistent with plan 035)
+- Tiling patterns and Type 3 font glyph procedures (separate content streams)
+- Annotation appearance streams (separate feature — plan 037)
+- Deduplicating overlapping visible + invisible ("ActualText"-style) text layers
+- Marked content / tagged-PDF logical structure
+
+## Dependencies
+
+- **Content stream parser** — `src/content/parsing/content-stream-parser.ts`
+  (reused as-is to parse form content)
+- **Font layer** — `src/fonts/` (`parseFont`, ToUnicode) — reused per form
+- **TextExtractor / TextState** — `src/text/` (extended, not replaced)
+- **COS accessors** — `PdfDict.getDict/getArray/getName`, `PdfStream.getDecodedData`
+
+No new external dependencies.
+
+## Desired API
+
+No public API change. The existing entry points transparently gain form
+coverage:
+
+```typescript
+const pdf = await PDF.load(bytes);
+const page = pdf.getPage(0); // 0-based index, as in the integration test
+
+// Previously returned "" for form-drawn pages; now returns the real text.
+const { text, lines } = page.extractText();
+
+// findText (page- and document-wide) benefits automatically since it
+// delegates to extractText().
+const matches = page.findText(/\{\{\s*\w+\s*\}\}/g);
+```
+
+## Architecture
+
+### Components
+
+```
+PDFPage.extractText()
+        │
+        ├─► createResourceResolver(pageResources)  ──► ResourceResolver
+        │        ├─ createFontResolver   (per-Resources font cache)
+        │        └─ createXObjectResolver(per-Resources, lazy, memoized)
+        ▼
+TextExtractor (constructed with the page-level resolver)
+        │
+        ├─► ContentStreamParser (existing)
+        │
+        ├─► TextState (existing; + captureState/restoreState)
+        │
+        └─► on `Do`: runForm()
+                 ├─ snapshot state + push form /Matrix onto CTM
+                 ├─ swap active ResourceResolver to the form's
+                 ├─ recurse over the form's content (depth-guarded)
+                 └─ restore snapshot + resolver
+```
+
+### Key abstraction: `ResourceResolver`
+
+A form's resources are scoped to the form, so font/XObject lookup cannot be a
+single page-wide callback. `ResourceResolver` bundles the two lookups for one
+content stream:
+
+```typescript
+interface ResourceResolver {
+  resolveFont: (name: string) => PdfFont | null;
+  resolveXObject: (name: string) => FormXObject | null; // null for images
+}
+
+interface FormXObject {
+  bytes: Uint8Array; // decoded content
+  matrix?: readonly [number, number, number, number, number, number];
+  resources: ResourceResolver; // the form's own
+}
+```
+
+`TextExtractor` tracks the _active_ resolver and swaps it while inside a form.
+`PDFPage` builds resolvers from COS dictionaries and memoizes them by
+dictionary identity (`_resourceResolverCache`), matching the existing
+`_resourceCache` / `_annotationCache` pattern on the class.
+
+### State isolation
+
+Per PDF spec §8.10.1, painting a form behaves as if wrapped in `q`/`Q` with the
+form's `/Matrix` concatenated onto the CTM. `TextState` gains
+`captureState()` / `restoreState()` that snapshot the full text+graphics state
+_and the graphics-stack depth_, so a form with unbalanced `q`/`Q` (lenient
+handling per the project's malformed-PDF principle) cannot corrupt the rest of
+the page.
+
+### Cycle safety
+
+A `formDepth` counter in the extractor caps nesting at `MAX_FORM_DEPTH` (16).
+Combined with identity memoization of resolvers, a form that paints itself
+terminates instead of recursing forever.
+
+## Test Plan
+
+### Unit (`src/text/text-extractor.test.ts`)
+
+- Extract text nested one level inside a form invoked by `Do`
+- Unresolvable / image XObject (`Do` is a no-op)
+- Form uses its **own** font resources (prove via a font that shifts codes)
+- State isolation: form with stray `Q` operators leaves later page text intact
+- `/Matrix` translation offsets nested text position
+- Cyclic self-referential form terminates without throwing
+- No `resolveXObject` provided → `Do` ignored (back-compat)
+
+### Integration (`src/integration/text/text-extraction.test.ts`)
+
+- New fixture `fixtures/text/form-xobject-text.pdf`: a page whose only text is
+  drawn via a form XObject with its own font → `extractText().text` contains it
+
+### Regression
+
+- `rtl-placed-ltr-text` fixture regenerated to drop a redundant duplicate text layer
+  (a clean-LTR form copy that real design-tool exports don't carry and that
+  conflicted with now-correct form recursion); the RTL content stream — the
+  actual test subject — is preserved byte-for-byte
+
+### Full suite
+
+- `bun run test:run` (all files), `bun run typecheck`, `bun run lint` green
+
+## Open Questions
+
+1. **Overlapping visible + invisible text** — When a PDF carries both a visible
+   layer and an invisible logical-order layer for the same words, extraction now
+   surfaces both. Real-world dedup (by position + content) is deferred; it is a
+   broader feature than form recursion. _Current approach_: extract everything,
+   matching pdf.js behavior.
+
+2. **Render mode 3 (invisible) text** — Kept in output, as before, because it is
+   the canonical layer for searchable/scanned PDFs. Not changed here.
+
+## Risks
+
+- **Double-counting** in the rare visible+invisible duplicate-layer case (see
+  Open Question 1). The case is uncommon in generated PDFs and is flagged for a
+  future dedup pass.
+- **Performance**: each distinct form's fonts are parsed once and memoized, so
+  resolving a repeated `Do` of the same form is an O(1) cache hit after the first.
+
+## Implementation Phases
+
+### Phase 1: Resolver abstraction
+
+- Add `ResourceResolver` / `FormXObject` to `text-extractor.ts`
+- Refactor `PDFPage.createFontResolver` → `createResourceResolver` +
+  `createFontResolver(dict)` + `createXObjectResolver(dict)` + `readMatrix`
+
+### Phase 2: Extractor recursion
+
+- Track active resolver + `formDepth` in `TextExtractor`
+- Add `Do` handler → `runForm()` (snapshot, matrix, swap, recurse, restore)
+- Add `TextState.captureState` / `restoreState`
+
+### Phase 3: Tests & fixtures
+
+- Unit tests, integration fixture, regenerate `rtl-placed-ltr-text.pdf`
+- Verify full suite, typecheck, lint
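
Note on the plan's `runForm()` steps: the snapshot, matrix push, depth guard, and restore described under "State isolation" and "Cycle safety" can be sketched with a toy interpreter. This is a minimal model, not the PR's `TextExtractor`. `Op`, `Form`, `State`, `run`, and `extract` are hypothetical names, the `text` op stands in for `BT ... Tj ... ET`, and copying the whole graphics stack in `capture()` is an assumption about how a snapshot could tolerate stray `Q` operators. Only `MAX_FORM_DEPTH = 16` is taken from the plan.

```typescript
// Toy model of form recursion; every name here is hypothetical, not the library's API.
type Matrix = readonly [number, number, number, number, number, number];
type Op =
  | { op: "q" }
  | { op: "Q" }
  | { op: "cm"; m: Matrix }
  | { op: "text"; x: number; y: number; value: string } // stands in for BT ... Tj ... ET
  | { op: "Do"; name: string };

interface Form {
  ops: Op[];
  matrix?: Matrix;
  forms: Record<string, Form>; // the form's own XObject resources
}

interface Char { value: string; x: number; y: number }
interface Snapshot { ctm: Matrix; stack: Matrix[] }

const IDENTITY: Matrix = [1, 0, 0, 1, 0, 0];
const MAX_FORM_DEPTH = 16; // value from the plan

// Row-vector convention used by PDF: the result applies `m` first, then `n`.
function multiply(m: Matrix, n: Matrix): Matrix {
  return [
    m[0] * n[0] + m[1] * n[2],
    m[0] * n[1] + m[1] * n[3],
    m[2] * n[0] + m[3] * n[2],
    m[2] * n[1] + m[3] * n[3],
    m[4] * n[0] + m[5] * n[2] + n[4],
    m[4] * n[1] + m[5] * n[3] + n[5],
  ];
}

class State {
  ctm: Matrix = IDENTITY;
  private stack: Matrix[] = [];

  save(): void {
    this.stack.push(this.ctm);
  }

  // A stray Q on an empty stack is ignored (lenient handling).
  restore(): void {
    const top = this.stack.pop();
    if (top !== undefined) this.ctm = top;
  }

  // Assumption: the snapshot copies the stack so a form cannot pop the caller's entries.
  capture(): Snapshot {
    return { ctm: this.ctm, stack: [...this.stack] };
  }

  restoreSnapshot(s: Snapshot): void {
    this.ctm = s.ctm;
    this.stack = [...s.stack];
  }
}

function run(ops: Op[], forms: Record<string, Form>, state: State, depth: number, out: Char[]): void {
  for (const op of ops) {
    if (op.op === "q") state.save();
    else if (op.op === "Q") state.restore();
    else if (op.op === "cm") state.ctm = multiply(op.m, state.ctm);
    else if (op.op === "text") {
      const [a, b, c, d, e, f] = state.ctm;
      out.push({ value: op.value, x: a * op.x + c * op.y + e, y: b * op.x + d * op.y + f });
    } else {
      const form = forms[op.name];
      if (!form || depth >= MAX_FORM_DEPTH) continue; // image, missing, or too deep
      const snapshot = state.capture();
      if (form.matrix) state.ctm = multiply(form.matrix, state.ctm);
      run(form.ops, form.forms, state, depth + 1, out); // the form's own resources are active
      state.restoreSnapshot(snapshot);
    }
  }
}

function extract(ops: Op[], forms: Record<string, Form>): Char[] {
  const out: Char[] = [];
  run(ops, forms, new State(), 0, out);
  return out;
}

// A form that paints itself and ends with two stray Q operators.
const fm0: Form = {
  matrix: [1, 0, 0, 1, 100, 50],
  ops: [{ op: "text", x: 10, y: 20, value: "inside" }, { op: "Do", name: "Fm0" }, { op: "Q" }, { op: "Q" }],
  forms: {},
};
fm0.forms["Fm0"] = fm0;

const chars = extract(
  [
    { op: "q" },
    { op: "cm", m: [1, 0, 0, 1, 0, 500] },
    { op: "Do", name: "Fm0" },
    { op: "text", x: 5, y: 5, value: "after" },
    { op: "Q" },
  ],
  { Fm0: fm0 },
);
```

In this model the first `inside` lands at (110, 570), since the form's `/Matrix` is concatenated onto the page's `cm`. The self-painting form runs 16 times and then stops, and `after` keeps the page's offset at (5, 505) even though the form popped the caller's saved state.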
diff --git a/fixtures/text/form-xobject-text.pdf b/fixtures/text/form-xobject-text.pdf
diff --git a/fixtures/text/rtl-placed-ltr-text.pdf b/fixtures/text/rtl-placed-ltr-text.pdf
diff --git a/package.json b/package.json
@@ -1,6 +1,6 @@
 {
   "name": "@libpdf/core",
-  "version": "0.4.1",
+  "version": "0.4.2",
   "description": "A modern PDF library for TypeScript - parsing and generation",
   "keywords": [
     "digital-signature",

diff --git a/src/api/pdf-page.ts b/src/api/pdf-page.ts
@@ -117,6 +117,7 @@ import {
   showText,
 } from "#src/helpers/operators";
 import * as operatorHelpers from "#src/helpers/operators";
+import type { RefResolver } from "#src/helpers/types";
 import type { PDFImage } from "#src/images/pdf-image";
 import { PdfArray } from "#src/objects/pdf-array";
 import { PdfDict } from "#src/objects/pdf-dict";
@@ -126,7 +127,7 @@ import { PdfRef } from "#src/objects/pdf-ref";
 import { PdfStream } from "#src/objects/pdf-stream";
 import { PdfString } from "#src/objects/pdf-string";
 import { getPlainText, groupCharsIntoLines } from "#src/text/line-grouper";
-import { TextExtractor } from "#src/text/text-extractor";
+import { type FormXObject, type ResourceResolver, TextExtractor } from "#src/text/text-extractor";
 import { searchPage } from "#src/text/text-search";
 import type { ExtractTextOptions, FindTextOptions, PageText, TextMatch } from "#src/text/types";
 
@@ -2817,11 +2818,14 @@ export class PDFPage {
     // Get content stream bytes
     const contentBytes = this.getContentBytes();
 
-    // Create font resolver
-    const resolveFont = this.createFontResolver();
+    // Build a resource resolver for fonts and form XObjects
+    const resources = this.createResourceResolver(this.resolveInheritedResources());
 
     // Extract characters
-    const extractor = new TextExtractor({ resolveFont });
+    const extractor = new TextExtractor({
+      resolveFont: resources.resolveFont,
+      resolveXObject: resources.resolveXObject,
+    });
     const chars = extractor.extract(contentBytes);
 
     // Group into lines and spans
@@ -2959,16 +2963,45 @@ export class PDFPage {
   }
 
   /**
-   * Create a font resolver function for text extraction.
+   * Memoized resource resolvers, keyed by Resources dictionary identity.
+   * Shared across nested form XObjects to avoid rebuilding font caches and to
+   * break cyclic XObject references.
    */
-  private createFontResolver(): (name: string) => PdfFont | null {
-    // Get the page's Font resources (may be a ref or inherited from parent)
-    const resourcesDict = this.resolveInheritedResources();
+  private readonly _resourceResolverCache = new Map<PdfDict, ResourceResolver>();
 
+  /**
+   * Build a resource resolver (fonts + form XObjects) for a Resources dict.
+   *
+   * Form XObjects carry their own Resources, so resolvers are scoped per
+   * Resources dictionary. Resolvers are memoized by dictionary identity, both
+   * to avoid rebuilding font caches for repeated XObjects and so that cyclic
+   * resource references resolve to the same instance. (The XObject resolver
+   * recurses lazily, so building one resolver never builds another.)
+   */
+  private createResourceResolver(resourcesDict: PdfDict | null): ResourceResolver {
     if (!resourcesDict) {
-      return () => null;
+      return { resolveFont: () => null, resolveXObject: () => null };
     }
 
+    const cached = this._resourceResolverCache.get(resourcesDict);
+
+    if (cached) {
+      return cached;
+    }
+
+    const resolver: ResourceResolver = {
+      resolveFont: this.createFontResolver(resourcesDict),
+      resolveXObject: this.createXObjectResolver(resourcesDict),
+    };
+    this._resourceResolverCache.set(resourcesDict, resolver);
+
+    return resolver;
+  }
+
+  /**
+   * Create a font resolver function for a given Resources dictionary.
+   */
+  private createFontResolver(resourcesDict: PdfDict): (name: string) => PdfFont | null {
     const font = resourcesDict.getDict("Font", this.ctx.resolve.bind(this.ctx));
 
     if (!font) {
@@ -3015,4 +3048,85 @@ export class PDFPage {
       return fontCache.get(name) ?? null;
     };
   }
+
+  /**
+   * Create a form-XObject resolver for a given Resources dictionary.
+   *
+   * Only form XObjects (Subtype /Form) carry extractable text; image XObjects
+   * resolve to null so the extractor skips them.
+   */
+  private createXObjectResolver(resourcesDict: PdfDict): (name: string) => FormXObject | null {
+    const resolve = this.ctx.resolve.bind(this.ctx);
+    const xobjects = resourcesDict.getDict("XObject", resolve);
+
+    if (!xobjects) {
+      return () => null;
+    }
+
+    const cache = new Map<string, FormXObject | null>();
+
+    return (name: string): FormXObject | null => {
+      const existing = cache.get(name);
+
+      if (existing !== undefined) {
+        return existing;
+      }
+
+      let result: FormXObject | null = null;
+      const entry = xobjects.get(name, resolve);
+
+      if (entry instanceof PdfStream && entry.getName("Subtype", resolve)?.value === "Form") {
+        let bytes: Uint8Array;
+
+        try {
+          bytes = entry.getDecodedData();
+        } catch {
+          // Undecodable stream — treat as empty rather than throwing.
+          bytes = new Uint8Array(0);
+        }
+
+        // A form's content is processed with its own Resources, falling back to
+        // the enclosing resources when the form omits them (lenient handling).
+        const formResources = entry.getDict("Resources", resolve) ?? resourcesDict;
+
+        result = {
+          bytes,
+          matrix: this.readMatrix(entry, resolve),
+          resources: this.createResourceResolver(formResources),
+        };
+      }
+
+      cache.set(name, result);
+
+      return result;
+    };
+  }
+
+  /**
+   * Read a 6-element /Matrix from an XObject dictionary, if present and valid.
+   */
+  private readMatrix(
+    dict: PdfDict,
+    resolve: RefResolver,
+  ): [number, number, number, number, number, number] | undefined {
+    const array = dict.getArray("Matrix", resolve);
+
+    if (!array || array.length !== 6) {
+      return undefined;
+    }
+
+    const values: number[] = [];
+
+    for (let i = 0; i < 6; i++) {
+      const value = array.at(i, resolve);
+
+      if (value?.type !== "number") {
+        return undefined;
+      }
+
+      values.push(value.value);
+    }
+
+    return [values[0], values[1], values[2], values[3], values[4], values[5]];
+  }
 }
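
Note on `createResourceResolver` / `createXObjectResolver` above: a reduced model shows why memoizing by dictionary identity, combined with lazy XObject lookup, copes with cyclic resources. This is a sketch under assumptions. `Resources`, `Resolver`, and `ResolverFactory` are hypothetical stand-ins (a plain object instead of `PdfDict`, strings instead of `PdfFont`), and the form stream between a name and its Resources is skipped for brevity.

```typescript
// Reduced model; all names are hypothetical stand-ins for the PR's types.
interface Resources {
  fonts: Set<string>;
  xobjects: Map<string, Resources>; // each form's own Resources
}

interface Resolver {
  resolveFont(name: string): string | null;
  resolveXObject(name: string): Resolver | null;
}

class ResolverFactory {
  private readonly cache = new Map<Resources, Resolver>();
  built = 0; // how many resolvers were actually constructed

  create(resources: Resources): Resolver {
    const cached = this.cache.get(resources);
    if (cached) return cached;

    this.built++;
    const children = new Map<string, Resolver | null>();
    const resolver: Resolver = {
      resolveFont: (name) => (resources.fonts.has(name) ? name : null),
      // Lazy: nested resolvers are built on first lookup, never during create().
      resolveXObject: (name) => {
        const existing = children.get(name);
        if (existing !== undefined) return existing;
        const nested = resources.xobjects.get(name);
        const result = nested ? this.create(nested) : null;
        children.set(name, result);
        return result;
      },
    };
    // Registered before any lookup runs, so a cycle resolves to this same instance.
    this.cache.set(resources, resolver);
    return resolver;
  }
}

// A Resources dictionary whose form XObject points back at the same dictionary.
const pageResources: Resources = { fonts: new Set(["F1"]), xobjects: new Map() };
pageResources.xobjects.set("Fm0", pageResources);

const factory = new ResolverFactory();
const root = factory.create(pageResources);
const nestedResolver = root.resolveXObject("Fm0");
```

Here `nestedResolver` is the same object as `root` and only one resolver is ever built. Identity memoization alone does not bound how often the content is interpreted, which is why the extractor still needs its depth cap.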
diff --git a/src/integration/text/text-extraction.test.ts b/src/integration/text/text-extraction.test.ts
@@ -70,6 +70,20 @@ describe("Text Extraction Integration", () => {
     });
   });
 
+  describe("form XObjects", () => {
+    it("extracts text nested inside a form XObject", async () => {
+      // The page draws all of its text via a form XObject (/Fm0 Do) that
+      // carries its own font resources, so extraction must recurse into it.
+      const bytes = await loadFixture("text", "form-xobject-text.pdf");
+      const pdf = await PDF.load(bytes);
+      const page = pdf.getPage(0);
+
+      const pageText = page!.extractText();
+
+      expect(pageText.text).toContain("FormXObjectText");
+    });
+  });
+
   describe("document-wide extractText", () => {
     it("extracts text from all pages", async () => {
       const bytes = await loadFixture("text", "openoffice-test-document.pdf");