Fix scanned PDF OCR fallback when only page headers are extracted by dcaayushd · Pull Request #1920 · microsoft/markitdown

dcaayushd · 2026-05-29T12:32:21Z

Fixes scanned/image-only PDF conversion in markitdown-ocr where generated page headers made the output look non-empty, causing full-page OCR fallback to be skipped. This could produce Markdown containing only ## Page N headings.

Changes:

- Track whether real text or OCR content was extracted separately from page headers.
- Run pdfminer and full-page OCR fallback when page headers are the only content.
- Add a regression test for the header-only extraction path.
- Update the multipage PDF test expectation for current pdfplumber behavior.
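
For reviewers, here is a minimal sketch of the idea behind the first two changes. It is not the code in this PR: `pages_to_markdown`, `page_texts`, and the `ocr_page` callback are hypothetical names used only for illustration, and the intermediate pdfminer step is left out for brevity. The point is that the fallback decision depends on a flag that only real text sets, not on whether the assembled output is non-empty.

```python
from typing import Callable, List, Optional


def pages_to_markdown(
    page_texts: List[Optional[str]],
    ocr_page: Callable[[int], str],
) -> str:
    """Build Markdown with one '## Page N' header per page.

    page_texts holds the text-layer extraction for each page (None or
    whitespace for image-only pages). ocr_page is a hypothetical callback
    that runs full-page OCR for a 1-based page number.
    """
    parts: List[str] = []
    has_real_content = False  # tracked separately from the generated headers

    for number, text in enumerate(page_texts, start=1):
        parts.append(f"## Page {number}")
        if text and text.strip():
            parts.append(text.strip())
            has_real_content = True

    if has_real_content:
        return "\n\n".join(parts)

    # Only headers were produced, so the output merely looks non-empty.
    # Rebuild it using full-page OCR instead of returning bare headers.
    parts = []
    for number in range(1, len(page_texts) + 1):
        parts.append(f"## Page {number}")
        ocr_text = ocr_page(number).strip()
        if ocr_text:
            parts.append(ocr_text)
    return "\n\n".join(parts)
```

Checking `has_real_content` rather than the truthiness of the joined string is what keeps a document consisting solely of `## Page N` headings from skipping the fallback.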

dcaayushd · 2026-05-29T12:39:14Z

@afourney @gagb Could you please review this PR and approve the workflow when you have a chance? Thanks!

Fix OCR fallback for scanned PDFs returning only page headers

b5851a7

dcaayushd wants to merge 1 commit into microsoft:main from dcaayushd:fix/ocr-skipped-due-to-page-header
