diff --git a/.agents/plans/046-form-xobject-text-extraction.md b/.agents/plans/046-form-xobject-text-extraction.md new file mode 100644 index 0000000..400a670 --- /dev/null +++ b/.agents/plans/046-form-xobject-text-extraction.md @@ -0,0 +1,197 @@ +# 046: Form XObject Text Extraction + +## Problem Statement + +Text extraction (plan [035](./035-text-extraction.md)) only processes a page's +top-level content stream. Many real-world PDFs — especially those produced by +tax/accounting software (e.g. IRS Form 8879-PE), reporting tools, and design +tools — draw all or part of their text inside **form XObjects** that the page +content stream merely paints with the `Do` operator: + +``` +q /Fm0 Do Q % page content: paints form XObjects, no text operators +``` + +The text (`BT ... Tj ... ET`) lives inside `/Fm0`, which carries its **own** +`/Resources/Font` dictionary. Because `TextExtractor` had no `Do` handler, these +pages extracted as empty strings, which is indistinguishable from a scanned +image to a caller — a correctness gap, not a layout nicety. + +This is a follow-up to plan 035 (Tier 3 text features per GOALS.md), closing the +gap between "page has no top-level text operators" and "page has no text." + +## Scope + +### In Scope + +- Recurse into form XObjects (`Subtype /Form`) invoked via `Do` +- Resolve fonts and nested XObjects against each form's own `/Resources` +- Apply the form's `/Matrix` to nested text positions +- Isolate the caller's graphics/text state across a form invocation, tolerating + malformed (unbalanced) `q`/`Q` inside the form +- Guard against cyclic form references +- Reuse the existing line-grouping, span, and search pipeline unchanged + +### Out of Scope + +- Image XObjects (resolve to `null`; no OCR — consistent with plan 035) +- Tiling patterns and Type 3 font glyph procedures (separate content streams) +- Annotation appearance streams (separate feature — plan 037) +- Deduplicating overlapping visible + invisible ("ActualText"-style) text layers +- Marked content / tagged-PDF logical structure + +## Dependencies + +- **Content stream parser** — `src/content/parsing/content-stream-parser.ts` + (reused as-is to parse form content) +- **Font layer** — `src/fonts/` (`parseFont`, ToUnicode) — reused per form +- **TextExtractor / TextState** — `src/text/` (extended, not replaced) +- **COS accessors** — `PdfDict.getDict/getArray/getName`, `PdfStream.getDecodedData` + +No new external dependencies. + +## Desired API + +No public API change. The existing entry points transparently gain form +coverage: + +```typescript +const pdf = await PDF.load(bytes); +const page = pdf.getPage(1); + +// Previously returned "" for form-drawn pages; now returns the real text. +const { text, lines } = page.extractText(); + +// findText (page- and document-wide) benefits automatically since it +// delegates to extractText(). +const matches = page.findText(/\{\{\s*\w+\s*\}\}/g); +``` + +## Architecture + +### Components + +``` +PDFPage.extractText() + │ + ├─► createResourceResolver(pageResources) ──► ResourceResolver + │ ├─ createFontResolver (per-Resources font cache) + │ └─ createXObjectResolver(per-Resources, lazy, memoized) + ▼ +TextExtractor (constructed with the page-level resolver) + │ + ├─► ContentStreamParser (existing) + │ + ├─► TextState (existing; + captureState/restoreState) + │ + └─► on `Do`: runForm() + ├─ snapshot state + push form /Matrix onto CTM + ├─ swap active ResourceResolver to the form's + ├─ recurse over the form's content (depth-guarded) + └─ restore snapshot + resolver +``` + +### Key abstraction: `ResourceResolver` + +A form's resources are scoped to the form, so font/XObject lookup cannot be a +single page-wide callback. `ResourceResolver` bundles the two lookups for one +content stream: + +```typescript +interface ResourceResolver { + resolveFont: (name: string) => PdfFont | null; + resolveXObject: (name: string) => FormXObject | null; // null for images +} + +interface FormXObject { + bytes: Uint8Array; // decoded content + matrix?: readonly [number, number, number, number, number, number]; + resources: ResourceResolver; // the form's own +} +``` + +`TextExtractor` tracks the _active_ resolver and swaps it while inside a form. +`PDFPage` builds resolvers from COS dictionaries and memoizes them by +dictionary identity (`_resourceResolverCache`), matching the existing +`_resourceCache` / `_annotationCache` pattern on the class. + +### State isolation + +Per PDF spec §8.10.1, painting a form behaves as if wrapped in `q`/`Q` with the +form's `/Matrix` concatenated onto the CTM. `TextState` gains +`captureState()` / `restoreState()` that snapshot the full text+graphics state +_and the graphics-stack depth_, so a form with unbalanced `q`/`Q` (lenient +handling per the project's malformed-PDF principle) cannot corrupt the rest of +the page. + +### Cycle safety + +A `formDepth` counter in the extractor caps nesting at `MAX_FORM_DEPTH` (16). +Combined with identity memoization of resolvers, a form that paints itself +terminates instead of recursing forever. + +## Test Plan + +### Unit (`src/text/text-extractor.test.ts`) + +- Extract text nested one level inside a form invoked by `Do` +- Unresolvable / image XObject (`Do` is a no-op) +- Form uses its **own** font resources (prove via a font that shifts codes) +- State isolation: form with stray `Q` operators leaves later page text intact +- `/Matrix` translation offsets nested text position +- Cyclic self-referential form terminates without throwing +- No `resolveXObject` provided → `Do` ignored (back-compat) + +### Integration (`src/integration/text/text-extraction.test.ts`) + +- New fixture `fixtures/text/form-xobject-text.pdf`: a page whose only text is + drawn via a form XObject with its own font → `extractText().text` contains it + +### Regression + +- `rtl-placed-text` fixture regenerated to drop a redundant duplicate text layer + (a clean-LTR form copy that real design-tool exports don't carry and that + conflicted with now-correct form recursion); the RTL content stream — the + actual test subject — is preserved byte-for-byte + +### Full suite + +- `bun run test:run` (all files), `bun run typecheck`, `bun run lint` green + +## Open Questions + +1. **Overlapping visible + invisible text** — When a PDF carries both a visible + layer and an invisible logical-order layer for the same words, extraction now + surfaces both. Real-world dedup (by position + content) is deferred; it is a + broader feature than form recursion. _Current approach_: extract everything, + matching pdf.js behavior. + +2. **Render mode 3 (invisible) text** — Kept in output, as before, because it is + the canonical layer for searchable/scanned PDFs. Not changed here. + +## Risks + +- **Double-counting** in the rare visible+invisible duplicate-layer case (see + Open Question 1). Mitigated by it being uncommon in generated PDFs; flagged + for a future dedup pass. +- **Performance** — Each distinct form's fonts are parsed once and memoized; + repeated `Do` of the same form is O(1) after first resolve. + +## Implementation Phases + +### Phase 1: Resolver abstraction + +- Add `ResourceResolver` / `FormXObject` to `text-extractor.ts` +- Refactor `PDFPage.createFontResolver` → `createResourceResolver` + + `createFontResolver(dict)` + `createXObjectResolver(dict)` + `readMatrix` + +### Phase 2: Extractor recursion + +- Track active resolver + `formDepth` in `TextExtractor` +- Add `Do` handler → `runForm()` (snapshot, matrix, swap, recurse, restore) +- Add `TextState.captureState` / `restoreState` + +### Phase 3: Tests & fixtures + +- Unit tests, integration fixture, regenerate `rtl-placed-ltr-text.pdf` +- Verify full suite, typecheck, lint diff --git a/fixtures/text/form-xobject-text.pdf b/fixtures/text/form-xobject-text.pdf new file mode 100644 index 0000000..4a020df Binary files /dev/null and b/fixtures/text/form-xobject-text.pdf differ diff --git a/fixtures/text/rtl-placed-ltr-text.pdf b/fixtures/text/rtl-placed-ltr-text.pdf index ca42157..2c7790b 100644 Binary files a/fixtures/text/rtl-placed-ltr-text.pdf and b/fixtures/text/rtl-placed-ltr-text.pdf differ diff --git a/package.json b/package.json index dfd76f2..10cbe49 100644 --- a/package.json +++ b/package.json @@ -1,6 +1,6 @@ { "name": "@libpdf/core", - "version": "0.4.1", + "version": "0.4.2", "description": "A modern PDF library for TypeScript - parsing and generation", "keywords": [ "digital-signature", diff --git a/src/api/pdf-page.ts b/src/api/pdf-page.ts index 628c15c..40b64bd 100644 --- a/src/api/pdf-page.ts +++ b/src/api/pdf-page.ts @@ -117,6 +117,7 @@ import { showText, } from "#src/helpers/operators"; import * as operatorHelpers from "#src/helpers/operators"; +import type { RefResolver } from "#src/helpers/types"; import type { PDFImage } from "#src/images/pdf-image"; import { PdfArray } from "#src/objects/pdf-array"; import { PdfDict } from "#src/objects/pdf-dict"; @@ -126,7 +127,7 @@ import { PdfRef } from "#src/objects/pdf-ref"; import { PdfStream } from "#src/objects/pdf-stream"; import { PdfString } from "#src/objects/pdf-string"; import { getPlainText, groupCharsIntoLines } from "#src/text/line-grouper"; -import { TextExtractor } from "#src/text/text-extractor"; +import { type FormXObject, type ResourceResolver, TextExtractor } from "#src/text/text-extractor"; import { searchPage } from "#src/text/text-search"; import type { ExtractTextOptions, FindTextOptions, PageText, TextMatch } from "#src/text/types"; @@ -2817,11 +2818,14 @@ export class PDFPage { // Get content stream bytes const contentBytes = this.getContentBytes(); - // Create font resolver - const resolveFont = this.createFontResolver(); + // Build a resource resolver for fonts and form XObjects + const resources = this.createResourceResolver(this.resolveInheritedResources()); // Extract characters - const extractor = new TextExtractor({ resolveFont }); + const extractor = new TextExtractor({ + resolveFont: resources.resolveFont, + resolveXObject: resources.resolveXObject, + }); const chars = extractor.extract(contentBytes); // Group into lines and spans @@ -2959,16 +2963,45 @@ export class PDFPage { } /** - * Create a font resolver function for text extraction. + * Memoized resource resolvers, keyed by Resources dictionary identity. + * Shared across nested form XObjects to avoid rebuilding font caches and to + * break cyclic XObject references. */ - private createFontResolver(): (name: string) => PdfFont | null { - // Get the page's Font resources (may be a ref or inherited from parent) - const resourcesDict = this.resolveInheritedResources(); + private readonly _resourceResolverCache = new Map(); + /** + * Build a resource resolver (fonts + form XObjects) for a Resources dict. + * + * Form XObjects carry their own Resources, so resolvers are scoped per + * Resources dictionary. Resolvers are memoized by dictionary identity, both + * to avoid rebuilding font caches for repeated XObjects and so that cyclic + * resource references resolve to the same instance. (The XObject resolver + * recurses lazily, so building one resolver never builds another.) + */ + private createResourceResolver(resourcesDict: PdfDict | null): ResourceResolver { if (!resourcesDict) { - return () => null; + return { resolveFont: () => null, resolveXObject: () => null }; } + const cached = this._resourceResolverCache.get(resourcesDict); + + if (cached) { + return cached; + } + + const resolver: ResourceResolver = { + resolveFont: this.createFontResolver(resourcesDict), + resolveXObject: this.createXObjectResolver(resourcesDict), + }; + this._resourceResolverCache.set(resourcesDict, resolver); + + return resolver; + } + + /** + * Create a font resolver function for a given Resources dictionary. + */ + private createFontResolver(resourcesDict: PdfDict): (name: string) => PdfFont | null { const font = resourcesDict.getDict("Font", this.ctx.resolve.bind(this.ctx)); if (!font) { @@ -3015,4 +3048,85 @@ export class PDFPage { return fontCache.get(name) ?? null; }; } + + /** + * Create a form-XObject resolver for a given Resources dictionary. + * + * Only form XObjects (Subtype /Form) carry extractable text; image XObjects + * resolve to null so the extractor skips them. + */ + private createXObjectResolver(resourcesDict: PdfDict): (name: string) => FormXObject | null { + const resolve = this.ctx.resolve.bind(this.ctx); + const xobjects = resourcesDict.getDict("XObject", resolve); + + if (!xobjects) { + return () => null; + } + + const cache = new Map(); + + return (name: string): FormXObject | null => { + const existing = cache.get(name); + + if (existing !== undefined) { + return existing; + } + + let result: FormXObject | null = null; + const entry = xobjects.get(name, resolve); + + if (entry instanceof PdfStream && entry.getName("Subtype", resolve)?.value === "Form") { + let bytes: Uint8Array; + + try { + bytes = entry.getDecodedData(); + } catch { + // Undecodable stream — treat as empty rather than throwing. + bytes = new Uint8Array(0); + } + + // A form's content is processed with its own Resources, falling back to + // the enclosing resources when the form omits them (lenient handling). + const formResources = entry.getDict("Resources", resolve) ?? resourcesDict; + + result = { + bytes, + matrix: this.readMatrix(entry, resolve), + resources: this.createResourceResolver(formResources), + }; + } + + cache.set(name, result); + + return result; + }; + } + + /** + * Read a 6-element /Matrix from an XObject dictionary, if present and valid. + */ + private readMatrix( + dict: PdfDict, + resolve: RefResolver, + ): [number, number, number, number, number, number] | undefined { + const array = dict.getArray("Matrix", resolve); + + if (!array || array.length !== 6) { + return undefined; + } + + const values: number[] = []; + + for (let i = 0; i < 6; i++) { + const value = array.at(i, resolve); + + if (value?.type !== "number") { + return undefined; + } + + values.push(value.value); + } + + return [values[0], values[1], values[2], values[3], values[4], values[5]]; + } } diff --git a/src/integration/text/text-extraction.test.ts b/src/integration/text/text-extraction.test.ts index 65ff72b..6b68fdd 100644 --- a/src/integration/text/text-extraction.test.ts +++ b/src/integration/text/text-extraction.test.ts @@ -70,6 +70,20 @@ describe("Text Extraction Integration", () => { }); }); + describe("form XObjects", () => { + it("extracts text nested inside a form XObject", async () => { + // The page draws all of its text via a form XObject (/Fm0 Do) that + // carries its own font resources, so extraction must recurse into it. + const bytes = await loadFixture("text", "form-xobject-text.pdf"); + const pdf = await PDF.load(bytes); + const page = pdf.getPage(0); + + const pageText = page!.extractText(); + + expect(pageText.text).toContain("FormXObjectText"); + }); + }); + describe("document-wide extractText", () => { it("extracts text from all pages", async () => { const bytes = await loadFixture("text", "openoffice-test-document.pdf"); diff --git a/src/text/index.ts b/src/text/index.ts index e7f2703..4ae390b 100644 --- a/src/text/index.ts +++ b/src/text/index.ts @@ -6,7 +6,12 @@ */ export { getPlainText, groupCharsIntoLines, type LineGrouperOptions } from "./line-grouper"; -export { TextExtractor, type TextExtractorOptions } from "./text-extractor"; +export { + type FormXObject, + type ResourceResolver, + TextExtractor, + type TextExtractorOptions, +} from "./text-extractor"; export { searchPage, searchPages } from "./text-search"; export { TextState } from "./text-state"; export * from "./types"; diff --git a/src/text/text-extractor.test.ts b/src/text/text-extractor.test.ts new file mode 100644 index 0000000..6bbb670 --- /dev/null +++ b/src/text/text-extractor.test.ts @@ -0,0 +1,157 @@ +import type { PdfFont } from "#src/fonts/pdf-font"; +import { describe, expect, it } from "vitest"; + +import { type FormXObject, type ResourceResolver, TextExtractor } from "./text-extractor"; + +/** + * Minimal font stub: each byte code maps to its ASCII character and a fixed + * advance width. Enough to exercise the extractor's text-showing path. + */ +const stubFont = { + subtype: "Type1", + baseFontName: "StubFont", + descriptor: null, + getWidth: () => 500, + toUnicode: (code: number) => String.fromCharCode(code), +} as unknown as PdfFont; + +function bytes(content: string): Uint8Array { + return new TextEncoder().encode(content); +} + +function fontOnly(): ResourceResolver { + return { resolveFont: () => stubFont, resolveXObject: () => null }; +} + +function extract(content: string, options: Partial = {}): string { + const extractor = new TextExtractor({ + resolveFont: options.resolveFont ?? (() => stubFont), + resolveXObject: options.resolveXObject ?? (() => null), + }); + + return extractor + .extract(bytes(content)) + .map(c => c.char) + .join(""); +} + +describe("TextExtractor form XObjects", () => { + it("extracts text nested inside a form XObject invoked with Do", () => { + const form: FormXObject = { + bytes: bytes("BT /F1 12 Tf (Hello) Tj ET"), + resources: fontOnly(), + }; + + const text = extract("/Fm0 Do", { + resolveXObject: name => (name === "Fm0" ? form : null), + }); + + expect(text).toBe("Hello"); + }); + + it("ignores Do when the XObject cannot be resolved (e.g. an image)", () => { + const text = extract("/Im0 Do", { resolveXObject: () => null }); + + expect(text).toBe(""); + }); + + it("uses the form's own resources to resolve fonts", () => { + const formFont = { + subtype: "Type1", + baseFontName: "FormFont", + descriptor: null, + getWidth: () => 500, + // Shift every code by one so we can prove the form's font was used. + toUnicode: (code: number) => String.fromCharCode(code + 1), + } as unknown as PdfFont; + + const form: FormXObject = { + bytes: bytes("BT /FF 12 Tf (AB) Tj ET"), + resources: { resolveFont: () => formFont, resolveXObject: () => null }, + }; + + const extractor = new TextExtractor({ + resolveFont: () => stubFont, + resolveXObject: name => (name === "Fm0" ? form : null), + }); + + const text = extractor + .extract(bytes("/Fm0 Do")) + .map(c => c.char) + .join(""); + + // "AB" shifted by the form's font becomes "BC". + expect(text).toBe("BC"); + }); + + it("isolates the caller's state from imbalanced q/Q inside the form", () => { + // The form pops more graphics states than it pushes; this must not corrupt + // the text drawn on the page after the form returns. + const form: FormXObject = { + bytes: bytes("Q Q Q BT /F1 12 Tf (X) Tj ET"), + resources: fontOnly(), + }; + + const extractor = new TextExtractor({ + resolveFont: () => stubFont, + resolveXObject: () => form, + }); + + const chars = extractor.extract( + bytes("q BT /F1 12 Tf (A) Tj ET Q /Fm0 Do q BT /F1 12 Tf (B) Tj ET Q"), + ); + + expect(chars.map(c => c.char).join("")).toBe("AXB"); + // "B" is drawn after the form returned; its position should be unaffected + // by the form's stray Q operators. + const b = chars[chars.length - 1]; + expect(Number.isFinite(b.bbox.x)).toBe(true); + expect(Number.isFinite(b.bbox.y)).toBe(true); + }); + + it("applies the form /Matrix to nested text position", () => { + const form: FormXObject = { + bytes: bytes("BT /F1 10 Tf (A) Tj ET"), + matrix: [1, 0, 0, 1, 100, 200], + resources: fontOnly(), + }; + + const extractor = new TextExtractor({ + resolveFont: () => stubFont, + resolveXObject: () => form, + }); + + const withMatrix = extractor.extract(bytes("/Fm0 Do")); + + const plain = new TextExtractor({ + resolveFont: () => stubFont, + resolveXObject: () => ({ ...form, matrix: undefined }), + }).extract(bytes("/Fm0 Do")); + + expect(withMatrix[0].bbox.x).toBeCloseTo(plain[0].bbox.x + 100); + expect(withMatrix[0].bbox.y).toBeCloseTo(plain[0].bbox.y + 200); + }); + + it("stops recursing on cyclic form references without throwing", () => { + // A form that paints itself would recurse forever without a depth guard. + const self: FormXObject = { + bytes: bytes("BT /F1 12 Tf (Z) Tj ET /Fm0 Do"), + resources: fontOnly(), + }; + + const extractor = new TextExtractor({ + resolveFont: () => stubFont, + resolveXObject: () => self, + }); + + expect(() => extractor.extract(bytes("/Fm0 Do"))).not.toThrow(); + }); + + it("ignores Do when no XObject resolver is provided", () => { + const extractor = new TextExtractor({ resolveFont: () => stubFont }); + + const chars = extractor.extract(bytes("/Fm0 Do")); + + expect(chars).toEqual([]); + }); +}); diff --git a/src/text/text-extractor.ts b/src/text/text-extractor.ts index 4a9c41f..bfad415 100644 --- a/src/text/text-extractor.ts +++ b/src/text/text-extractor.ts @@ -16,6 +16,41 @@ import type { PdfFont } from "#src/fonts/pdf-font"; import { TextState } from "./text-state"; import type { ExtractedChar } from "./types"; +/** Maximum form XObject nesting depth, to guard against cyclic references. */ +const MAX_FORM_DEPTH = 16; + +/** + * Resolves the named resources of a content stream (fonts and XObjects). + * + * Each form XObject carries its own resource dictionary, so a resolver is + * scoped to a single content stream. + */ +export interface ResourceResolver { + /** + * Resolve a font name to a PdfFont object. + * Font names are keys in the /Resources/Font dictionary (e.g., "F1", "TT0"). + */ + resolveFont: (name: string) => PdfFont | null; + + /** + * Resolve an XObject name (key in /Resources/XObject) to a form XObject. + * Returns null for image XObjects or names that cannot be resolved. + */ + resolveXObject: (name: string) => FormXObject | null; +} + +/** + * A form XObject whose content stream should be processed inline. + */ +export interface FormXObject { + /** Decoded content stream bytes of the form. */ + bytes: Uint8Array; + /** Optional /Matrix mapping form space into the current coordinate space. */ + matrix?: readonly [number, number, number, number, number, number]; + /** Resources scoped to the form's own content stream. */ + resources: ResourceResolver; +} + /** * Options for text extraction. */ @@ -25,18 +60,32 @@ export interface TextExtractorOptions { * Font names are keys in the /Resources/Font dictionary (e.g., "F1", "TT0"). */ resolveFont: (name: string) => PdfFont | null; + + /** + * Resolve an XObject name to a form XObject so its text can be extracted. + * Optional — when omitted, `Do` operators are ignored. + */ + resolveXObject?: (name: string) => FormXObject | null; } /** * Extracts text from PDF content streams. */ export class TextExtractor { - private readonly resolveFont: (name: string) => PdfFont | null; private readonly state: TextState; private readonly chars: ExtractedChar[] = []; + /** Resources for the content stream currently being processed. */ + private resources: ResourceResolver; + + /** Current form XObject nesting depth. */ + private formDepth = 0; + constructor(options: TextExtractorOptions) { - this.resolveFont = options.resolveFont; + this.resources = { + resolveFont: options.resolveFont, + resolveXObject: options.resolveXObject ?? (() => null), + }; this.state = new TextState(); } @@ -47,14 +96,21 @@ export class TextExtractor { * @returns Array of extracted characters with positions */ extract(contentBytes: Uint8Array): ExtractedChar[] { + this.runContent(contentBytes); + + return this.chars; + } + + /** + * Parse and process a content stream's operations with the active resources. + */ + private runContent(contentBytes: Uint8Array): void { const parser = new ContentStreamParser(contentBytes); const { operations } = parser.parse(); for (const op of operations) { this.processOperation(op); } - - return this.chars; } /** @@ -169,6 +225,74 @@ export class TextExtractor { this.state.moveToNextLine(); this.handleTj([operands[2]]); break; + + // XObject invocation + case "Do": + this.handleDo(operands); + break; + } + } + + /** + * Handle Do (paint XObject) operator. + * + * Form XObjects carry their own content stream and resources, so any text + * inside them is extracted by processing the form inline. Image XObjects + * resolve to null and are skipped. + */ + private handleDo(operands: ContentToken[]): void { + const name = this.getName(operands[0]); + + if (!name) { + return; + } + + const form = this.resources.resolveXObject(name); + + if (!form) { + return; + } + + this.runForm(form); + } + + /** + * Process a form XObject's content stream inline. + * + * Per the PDF spec (8.10.1), invoking a form is equivalent to wrapping its + * content in q/Q with the form's /Matrix concatenated onto the CTM. The + * caller's state is fully snapshotted and restored so that imbalanced q/Q or + * leftover text state inside the form cannot affect the rest of the page. + */ + private runForm(form: FormXObject): void { + if (this.formDepth >= MAX_FORM_DEPTH) { + return; + } + + this.formDepth += 1; + + const snapshot = this.state.captureState(); + const previousResources = this.resources; + + if (form.matrix) { + this.state.concatMatrix( + form.matrix[0], + form.matrix[1], + form.matrix[2], + form.matrix[3], + form.matrix[4], + form.matrix[5], + ); + } + + this.resources = form.resources; + + try { + this.runContent(form.bytes); + } finally { + this.resources = previousResources; + this.state.restoreState(snapshot); + this.formDepth -= 1; } } @@ -194,7 +318,7 @@ export class TextExtractor { const fontSize = this.getNumber(operands[1]); if (fontName) { - const font = this.resolveFont(fontName); + const font = this.resources.resolveFont(fontName); this.state.font = font; } diff --git a/src/text/text-state.ts b/src/text/text-state.ts index 82f4911..5bcfba6 100644 --- a/src/text/text-state.ts +++ b/src/text/text-state.ts @@ -37,6 +37,26 @@ export interface TextStateParams { renderMode: number; } +/** + * A full snapshot of the rendering state, used to isolate nested content + * (e.g. form XObjects) so that imbalanced q/Q inside them cannot corrupt the + * caller's state. + */ +export interface StateSnapshot { + ctm: Matrix; + tm: Matrix; + tlm: Matrix; + font: PdfFont | null; + fontSize: number; + charSpacing: number; + wordSpacing: number; + horizontalScale: number; + leading: number; + rise: number; + renderMode: number; + graphicsStackDepth: number; +} + /** * Tracks all text rendering state during content stream processing. */ @@ -332,6 +352,54 @@ export class TextState { }; } + /** + * Capture a full snapshot of the current rendering state. + * + * Used to isolate nested content streams (form XObjects): the snapshot + * records the graphics-state stack depth so it can be unwound even if the + * nested content has unbalanced q/Q operators. + */ + captureState(): StateSnapshot { + return { + ctm: this.ctm.clone(), + tm: this.tm.clone(), + tlm: this.tlm.clone(), + font: this.font, + fontSize: this.fontSize, + charSpacing: this.charSpacing, + wordSpacing: this.wordSpacing, + horizontalScale: this.horizontalScale, + leading: this.leading, + rise: this.rise, + renderMode: this.renderMode, + graphicsStackDepth: this.graphicsStateStack.length, + }; + } + + /** + * Restore a snapshot captured by {@link captureState}. + * + * Any graphics-state entries pushed since the snapshot are discarded, so a + * nested stream with extra q (or missing Q) operators cannot leak state. + */ + restoreState(snapshot: StateSnapshot): void { + this.ctm = snapshot.ctm; + this.tm = snapshot.tm; + this.tlm = snapshot.tlm; + this.font = snapshot.font; + this.fontSize = snapshot.fontSize; + this.charSpacing = snapshot.charSpacing; + this.wordSpacing = snapshot.wordSpacing; + this.horizontalScale = snapshot.horizontalScale; + this.leading = snapshot.leading; + this.rise = snapshot.rise; + this.renderMode = snapshot.renderMode; + + if (this.graphicsStateStack.length > snapshot.graphicsStackDepth) { + this.graphicsStateStack.length = snapshot.graphicsStackDepth; + } + } + /** * Clone the current text state. */