197 changes: 197 additions & 0 deletions .agents/plans/046-form-xobject-text-extraction.md
@@ -0,0 +1,197 @@
# 046: Form XObject Text Extraction

## Problem Statement

Text extraction (plan [035](./035-text-extraction.md)) only processes a page's
top-level content stream. Many real-world PDFs — especially those produced by
tax/accounting software (e.g. IRS Form 8879-PE), reporting tools, and design
tools — draw all or part of their text inside **form XObjects** that the page
content stream merely paints with the `Do` operator:

```
q /Fm0 Do Q % page content: paints form XObjects, no text operators
```

The text (`BT ... Tj ... ET`) lives inside `/Fm0`, which carries its **own**
`/Resources/Font` dictionary. Because `TextExtractor` had no `Do` handler, these
pages extracted as empty strings. To a caller, that is indistinguishable from a
scanned image: a correctness gap, not a layout nicety.

This is a follow-up to plan 035 (Tier 3 text features per GOALS.md), closing the
gap between "page has no top-level text operators" and "page has no text."

## Scope

### In Scope

- Recurse into form XObjects (`Subtype /Form`) invoked via `Do`
- Resolve fonts and nested XObjects against each form's own `/Resources`
- Apply the form's `/Matrix` to nested text positions
- Isolate the caller's graphics/text state across a form invocation, tolerating
malformed (unbalanced) `q`/`Q` inside the form
- Guard against cyclic form references
- Reuse the existing line-grouping, span, and search pipeline unchanged

### Out of Scope

- Image XObjects (resolve to `null`; no OCR — consistent with plan 035)
- Tiling patterns and Type 3 font glyph procedures (separate content streams)
- Annotation appearance streams (separate feature — plan 037)
- Deduplicating overlapping visible + invisible ("ActualText"-style) text layers
- Marked content / tagged-PDF logical structure

## Dependencies

- **Content stream parser** — `src/content/parsing/content-stream-parser.ts`
(reused as-is to parse form content)
- **Font layer** — `src/fonts/` (`parseFont`, ToUnicode) — reused per form
- **TextExtractor / TextState** — `src/text/` (extended, not replaced)
- **COS accessors** — `PdfDict.getDict/getArray/getName`, `PdfStream.getDecodedData`

No new external dependencies.

## Desired API

No public API change. The existing entry points transparently gain form
coverage:

```typescript
const pdf = await PDF.load(bytes);
const page = pdf.getPage(1);
if (!page) throw new Error("page not found"); // the integration test uses `page!`, so guard here

// Previously returned "" for form-drawn pages; now returns the real text.
const { text, lines } = page.extractText();

// findText (page- and document-wide) benefits automatically since it
// delegates to extractText().
const matches = page.findText(/\{\{\s*\w+\s*\}\}/g);
```

## Architecture

### Components

```
PDFPage.extractText()
  ├─► createResourceResolver(pageResources) ──► ResourceResolver
  │     ├─ createFontResolver (per-Resources font cache)
  │     └─ createXObjectResolver (per-Resources, lazy, memoized)
TextExtractor (constructed with the page-level resolver)
  ├─► ContentStreamParser (existing)
  ├─► TextState (existing; + captureState/restoreState)
  └─► on `Do`: runForm()
        ├─ snapshot state + push form /Matrix onto CTM
        ├─ swap active ResourceResolver to the form's
        ├─ recurse over the form's content (depth-guarded)
        └─ restore snapshot + resolver
```

### Key abstraction: `ResourceResolver`

A form's resources are scoped to the form, so font/XObject lookup cannot be a
single page-wide callback. `ResourceResolver` bundles the two lookups for one
content stream:

```typescript
interface ResourceResolver {
  resolveFont: (name: string) => PdfFont | null;
  resolveXObject: (name: string) => FormXObject | null; // null for images
}

interface FormXObject {
  bytes: Uint8Array; // decoded content
  matrix?: readonly [number, number, number, number, number, number];
  resources: ResourceResolver; // the form's own
}
```

`TextExtractor` tracks the _active_ resolver and swaps it while inside a form.
`PDFPage` builds resolvers from COS dictionaries and memoizes them by
dictionary identity (`_resourceResolverCache`), matching the existing
`_resourceCache` / `_annotationCache` pattern on the class.
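
The memoization described above can be sketched as follows. This is a minimal illustration, not the library's code: `Dict`, `Resolver`, `buildResolver`, and `buildCount` are hypothetical stand-ins for the COS dictionary and resolver types, and fonts are reduced to strings.

```typescript
// Hypothetical stand-in for a Resources dictionary; not the library's PdfDict.
type Dict = { fonts: string[]; xobjects: Record<string, Dict> };

interface Resolver {
  resolveFont(name: string): string | null;
  resolveXObject(name: string): Resolver | null;
}

// Keyed by object identity, so the same dictionary always yields the same resolver.
const resolverCache = new Map<Dict, Resolver>();
let buildCount = 0;

function buildResolver(dict: Dict): Resolver {
  const cached = resolverCache.get(dict);
  if (cached) {
    return cached;
  }

  buildCount++;
  const resolver: Resolver = {
    resolveFont: (name) => (dict.fonts.includes(name) ? `font:${name}` : null),
    // Lazy: a nested resolver is only built when a name is actually looked up,
    // so constructing one resolver never recurses into another.
    resolveXObject: (name) => {
      const child = dict.xobjects[name];
      return child ? buildResolver(child) : null;
    },
  };
  resolverCache.set(dict, resolver);
  return resolver;
}

// A resources dictionary whose form XObject points back at the same dictionary.
const cyclic: Dict = { fonts: ["F1"], xobjects: {} };
cyclic.xobjects["Fm0"] = cyclic;

const top = buildResolver(cyclic);
const nested = top.resolveXObject("Fm0");
```

Because the cache is checked before anything is built, the cyclic lookup returns the resolver that already exists. In this sketch `nested` is the same object as `top`.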

### State isolation

Per the PDF specification (ISO 32000-1 §8.10.1), painting a form behaves as if
wrapped in `q`/`Q` with the form's `/Matrix` concatenated onto the CTM.
`TextState` gains `captureState()` / `restoreState()`, which snapshot the full
text and graphics state _and the graphics-stack depth_. A form with unbalanced
`q`/`Q` (handled leniently, per the project's malformed-PDF principle) therefore
cannot corrupt the rest of the page.
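
A minimal sketch of this isolation, under stated assumptions: `MiniTextState`, `Frame`, `Snapshot`, and `multiply` are hypothetical names, the state is reduced to a CTM and a font size, and the snapshot copies the whole graphics stack rather than tracking only its depth, which is simpler to show but may differ from the real `TextState`.

```typescript
type Matrix = [number, number, number, number, number, number];

// PDF row-vector convention: the result applies `m` first, then `n`.
function multiply(m: Matrix, n: Matrix): Matrix {
  return [
    m[0] * n[0] + m[1] * n[2],
    m[0] * n[1] + m[1] * n[3],
    m[2] * n[0] + m[3] * n[2],
    m[2] * n[1] + m[3] * n[3],
    m[4] * n[0] + m[5] * n[2] + n[4],
    m[4] * n[1] + m[5] * n[3] + n[5],
  ];
}

interface Frame {
  ctm: Matrix;
  fontSize: number;
}

interface Snapshot extends Frame {
  stack: Frame[];
}

class MiniTextState {
  ctm: Matrix = [1, 0, 0, 1, 0, 0];
  fontSize = 12;
  private stack: Frame[] = [];

  get depth(): number {
    return this.stack.length;
  }

  private frame(): Frame {
    return { ctm: [...this.ctm] as Matrix, fontSize: this.fontSize };
  }

  // `q`
  save(): void {
    this.stack.push(this.frame());
  }

  // `Q`; a stray Q on an empty stack is ignored rather than throwing.
  restore(): void {
    const top = this.stack.pop();
    if (!top) return;
    this.ctm = top.ctm;
    this.fontSize = top.fontSize;
  }

  // `cm`, also used for a form's /Matrix: new CTM = m × CTM.
  concat(m: Matrix): void {
    this.ctm = multiply(m, this.ctm);
  }

  captureState(): Snapshot {
    return { ...this.frame(), stack: this.stack.map((f) => ({ ...f })) };
  }

  restoreState(s: Snapshot): void {
    this.ctm = s.ctm;
    this.fontSize = s.fontSize;
    this.stack = s.stack;
  }
}

const state = new MiniTextState();
state.save(); // page: q
state.concat([1, 0, 0, 1, 10, 20]); // page: 1 0 0 1 10 20 cm

const snapshot = state.captureState();
state.concat([1, 0, 0, 1, 100, 50]); // form /Matrix
const ctmInsideForm = [...state.ctm];
state.fontSize = 8; // form sets its own font size
state.restore(); // malformed form: stray Q pops the page's frame
state.restore(); // ... and another stray Q on an empty stack
state.save(); // ... and an unmatched q
state.restoreState(snapshot);
```

Inside the form, the translation in `/Matrix` adds to the page's translation. After `restoreState`, the CTM, font size, and stack depth are back to what the page had before `Do`, regardless of what the form did to the stack.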

### Cycle safety

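A `formDepth` counter in the extractor caps nesting at `MAX_FORM_DEPTH` (16), so
a form that paints itself, directly or through other forms, terminates instead
of recursing forever. Identity memoization of resolvers keeps the cyclic case
cheap: each revisit reuses the cached resolver rather than building a new one.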
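
The depth guard can be illustrated with a small sketch. The `Op` and `Forms` types and the `extract` function are hypothetical simplifications (a content stream is modelled as a list of text and `Do` operations); only the cap value of 16 comes from the plan.

```typescript
const MAX_FORM_DEPTH = 16; // value stated in the plan

// Hypothetical content model: each op either shows text or paints a named form.
type Op = { kind: "text"; value: string } | { kind: "do"; name: string };
type Forms = Record<string, Op[]>;

function extract(ops: Op[], forms: Forms, depth = 0, out: string[] = []): string[] {
  for (const op of ops) {
    if (op.kind === "text") {
      out.push(op.value);
    } else if (depth < MAX_FORM_DEPTH) {
      const form = forms[op.name];
      // Unknown names (for example image XObjects) are skipped.
      if (form) {
        extract(form, forms, depth + 1, out);
      }
    }
    // At the cap, `Do` is silently skipped rather than throwing.
  }
  return out;
}

// A form that paints itself.
const forms: Forms = {
  Fm0: [
    { kind: "text", value: "hi" },
    { kind: "do", name: "Fm0" },
  ],
};

const result = extract([{ kind: "do", name: "Fm0" }], forms);
```

In this sketch the self-referential form runs once per allowed nesting level and then stops, so `result` holds 16 copies of the text. Whether the real extractor emits the repeated text or suppresses it is not specified here.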

## Test Plan

### Unit (`src/text/text-extractor.test.ts`)

- Extract text nested one level inside a form invoked by `Do`
- Unresolvable / image XObject (`Do` is a no-op)
- Form uses its **own** font resources (prove via a font that shifts codes)
- State isolation: form with stray `Q` operators leaves later page text intact
- `/Matrix` translation offsets nested text position
- Cyclic self-referential form terminates without throwing
- No `resolveXObject` provided → `Do` ignored (back-compat)

### Integration (`src/integration/text/text-extraction.test.ts`)

- New fixture `fixtures/text/form-xobject-text.pdf`: a page whose only text is
drawn via a form XObject with its own font → `extractText().text` contains it

### Regression

- `rtl-placed-ltr-text` fixture regenerated to drop a redundant duplicate text layer
(a clean-LTR form copy that real design-tool exports don't carry and that
conflicted with now-correct form recursion); the RTL content stream — the
actual test subject — is preserved byte-for-byte

### Full suite

- `bun run test:run` (all files), `bun run typecheck`, `bun run lint` green

## Open Questions

1. **Overlapping visible + invisible text** — When a PDF carries both a visible
layer and an invisible logical-order layer for the same words, extraction now
surfaces both. Real-world dedup (by position + content) is deferred; it is a
broader feature than form recursion. _Current approach_: extract everything,
matching pdf.js behavior.

2. **Render mode 3 (invisible) text** — Kept in output, as before, because it is
the canonical layer for searchable/scanned PDFs. Not changed here.

## Risks

- **Double-counting** in the rare visible+invisible duplicate-layer case (see
Open Question 1). This case is uncommon in generated PDFs and is flagged for a
future dedup pass.
- **Performance** — Each distinct form's fonts are parsed once and memoized;
resolving the same form on a repeated `Do` is O(1) after the first lookup (its
content stream is still interpreted on each invocation).

## Implementation Phases

### Phase 1: Resolver abstraction

- Add `ResourceResolver` / `FormXObject` to `text-extractor.ts`
- Refactor `PDFPage.createFontResolver` → `createResourceResolver` +
`createFontResolver(dict)` + `createXObjectResolver(dict)` + `readMatrix`

### Phase 2: Extractor recursion

- Track active resolver + `formDepth` in `TextExtractor`
- Add `Do` handler → `runForm()` (snapshot, matrix, swap, recurse, restore)
- Add `TextState.captureState` / `restoreState`

### Phase 3: Tests & fixtures

- Unit tests, integration fixture, regenerate `rtl-placed-ltr-text.pdf`
- Verify full suite, typecheck, lint
Binary file added fixtures/text/form-xobject-text.pdf
Binary file not shown.
Binary file modified fixtures/text/rtl-placed-ltr-text.pdf
Binary file not shown.
2 changes: 1 addition & 1 deletion package.json
@@ -1,6 +1,6 @@
 {
   "name": "@libpdf/core",
-  "version": "0.4.1",
+  "version": "0.4.2",
   "description": "A modern PDF library for TypeScript - parsing and generation",
   "keywords": [
     "digital-signature",
132 changes: 123 additions & 9 deletions src/api/pdf-page.ts
@@ -117,6 +117,7 @@ import {
   showText,
 } from "#src/helpers/operators";
 import * as operatorHelpers from "#src/helpers/operators";
+import type { RefResolver } from "#src/helpers/types";
 import type { PDFImage } from "#src/images/pdf-image";
 import { PdfArray } from "#src/objects/pdf-array";
 import { PdfDict } from "#src/objects/pdf-dict";
@@ -126,7 +127,7 @@ import { PdfRef } from "#src/objects/pdf-ref";
 import { PdfStream } from "#src/objects/pdf-stream";
 import { PdfString } from "#src/objects/pdf-string";
 import { getPlainText, groupCharsIntoLines } from "#src/text/line-grouper";
-import { TextExtractor } from "#src/text/text-extractor";
+import { type FormXObject, type ResourceResolver, TextExtractor } from "#src/text/text-extractor";
 import { searchPage } from "#src/text/text-search";
 import type { ExtractTextOptions, FindTextOptions, PageText, TextMatch } from "#src/text/types";

@@ -2817,11 +2818,14 @@ export class PDFPage {
     // Get content stream bytes
     const contentBytes = this.getContentBytes();

-    // Create font resolver
-    const resolveFont = this.createFontResolver();
+    // Build a resource resolver for fonts and form XObjects
+    const resources = this.createResourceResolver(this.resolveInheritedResources());

     // Extract characters
-    const extractor = new TextExtractor({ resolveFont });
+    const extractor = new TextExtractor({
+      resolveFont: resources.resolveFont,
+      resolveXObject: resources.resolveXObject,
+    });
     const chars = extractor.extract(contentBytes);

     // Group into lines and spans
@@ -2959,16 +2963,45 @@ export class PDFPage {
   }

   /**
-   * Create a font resolver function for text extraction.
+   * Memoized resource resolvers, keyed by Resources dictionary identity.
+   * Shared across nested form XObjects to avoid rebuilding font caches and to
+   * break cyclic XObject references.
    */
-  private createFontResolver(): (name: string) => PdfFont | null {
-    // Get the page's Font resources (may be a ref or inherited from parent)
-    const resourcesDict = this.resolveInheritedResources();
+  private readonly _resourceResolverCache = new Map<PdfDict, ResourceResolver>();

+  /**
+   * Build a resource resolver (fonts + form XObjects) for a Resources dict.
+   *
+   * Form XObjects carry their own Resources, so resolvers are scoped per
+   * Resources dictionary. Resolvers are memoized by dictionary identity, both
+   * to avoid rebuilding font caches for repeated XObjects and so that cyclic
+   * resource references resolve to the same instance. (The XObject resolver
+   * recurses lazily, so building one resolver never builds another.)
+   */
+  private createResourceResolver(resourcesDict: PdfDict | null): ResourceResolver {
     if (!resourcesDict) {
-      return () => null;
+      return { resolveFont: () => null, resolveXObject: () => null };
     }

+    const cached = this._resourceResolverCache.get(resourcesDict);
+
+    if (cached) {
+      return cached;
+    }
+
+    const resolver: ResourceResolver = {
+      resolveFont: this.createFontResolver(resourcesDict),
+      resolveXObject: this.createXObjectResolver(resourcesDict),
+    };
+    this._resourceResolverCache.set(resourcesDict, resolver);
+
+    return resolver;
+  }
+
+  /**
+   * Create a font resolver function for a given Resources dictionary.
+   */
+  private createFontResolver(resourcesDict: PdfDict): (name: string) => PdfFont | null {
     const font = resourcesDict.getDict("Font", this.ctx.resolve.bind(this.ctx));

     if (!font) {
@@ -3015,4 +3048,85 @@ export class PDFPage {
       return fontCache.get(name) ?? null;
     };
   }
+
+  /**
+   * Create a form-XObject resolver for a given Resources dictionary.
+   *
+   * Only form XObjects (Subtype /Form) carry extractable text; image XObjects
+   * resolve to null so the extractor skips them.
+   */
+  private createXObjectResolver(resourcesDict: PdfDict): (name: string) => FormXObject | null {
+    const resolve = this.ctx.resolve.bind(this.ctx);
+    const xobjects = resourcesDict.getDict("XObject", resolve);
+
+    if (!xobjects) {
+      return () => null;
+    }
+
+    const cache = new Map<string, FormXObject | null>();
+
+    return (name: string): FormXObject | null => {
+      const existing = cache.get(name);
+
+      if (existing !== undefined) {
+        return existing;
+      }
+
+      let result: FormXObject | null = null;
+      const entry = xobjects.get(name, resolve);
+
+      if (entry instanceof PdfStream && entry.getName("Subtype", resolve)?.value === "Form") {
+        let bytes: Uint8Array;
+
+        try {
+          bytes = entry.getDecodedData();
+        } catch {
+          // Undecodable stream — treat as empty rather than throwing.
+          bytes = new Uint8Array(0);
+        }
+
+        // A form's content is processed with its own Resources, falling back to
+        // the enclosing resources when the form omits them (lenient handling).
+        const formResources = entry.getDict("Resources", resolve) ?? resourcesDict;
+
+        result = {
+          bytes,
+          matrix: this.readMatrix(entry, resolve),
+          resources: this.createResourceResolver(formResources),
+        };
+      }
+
+      cache.set(name, result);
+
+      return result;
+    };
+  }
+
+  /**
+   * Read a 6-element /Matrix from an XObject dictionary, if present and valid.
+   */
+  private readMatrix(
+    dict: PdfDict,
+    resolve: RefResolver,
+  ): [number, number, number, number, number, number] | undefined {
+    const array = dict.getArray("Matrix", resolve);
+
+    if (!array || array.length !== 6) {
+      return undefined;
+    }
+
+    const values: number[] = [];
+
+    for (let i = 0; i < 6; i++) {
+      const value = array.at(i, resolve);
+
+      if (value?.type !== "number") {
+        return undefined;
+      }
+
+      values.push(value.value);
+    }
+
+    return [values[0], values[1], values[2], values[3], values[4], values[5]];
+  }
 }
14 changes: 14 additions & 0 deletions src/integration/text/text-extraction.test.ts
@@ -70,6 +70,20 @@ describe("Text Extraction Integration", () => {
     });
   });

+  describe("form XObjects", () => {
+    it("extracts text nested inside a form XObject", async () => {
+      // The page draws all of its text via a form XObject (/Fm0 Do) that
+      // carries its own font resources, so extraction must recurse into it.
+      const bytes = await loadFixture("text", "form-xobject-text.pdf");
+      const pdf = await PDF.load(bytes);
+      const page = pdf.getPage(0);
+
+      const pageText = page!.extractText();
+
+      expect(pageText.text).toContain("FormXObjectText");
+    });
+  });
+
   describe("document-wide extractText", () => {
     it("extracts text from all pages", async () => {
       const bytes = await loadFixture("text", "openoffice-test-document.pdf");