Fix scanned PDF OCR fallback when only page headers are extracted #1920

Open
dcaayushd wants to merge 1 commit into microsoft:main from dcaayushd:fix/ocr-skipped-due-to-page-header

Conversation

@dcaayushd

Fixes #1863

Fixes scanned/image-only PDF conversion in markitdown-ocr, where the generated page headers made the output look non-empty and caused the full-page OCR fallback to be skipped. The result could be Markdown containing only ## Page N headings.

Changes:

Track whether real text or OCR content was extracted separately from page headers.
Run pdfminer and full-page OCR fallback when page headers are the only content.
Add a regression test for the header-only extraction path.
Update the multipage PDF test expectation for current pdfplumber behavior.
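
For reviewers, the idea behind the first two changes can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `render_pages`, `page_texts`, and `ocr_fallback` are hypothetical names, and the pdfminer step is omitted for brevity. The point is that the fallback decision is driven by a separate flag rather than by whether the output is empty, since the generated headers always make it non-empty.

```python
from typing import Callable, List


def render_pages(page_texts: List[str], ocr_fallback: Callable[[int], str]) -> str:
    """Join per-page text under '## Page N' headers.

    Hypothetical helper: falls back to OCR when no page yielded real text.
    """
    parts: List[str] = []
    has_real_content = False  # tracked separately from the generated headers

    for number, text in enumerate(page_texts, start=1):
        parts.append(f"## Page {number}")  # header is always emitted
        body = text.strip()
        if body:
            parts.append(body)
            has_real_content = True

    if has_real_content:
        return "\n\n".join(parts)

    # Only headers were produced. Testing `parts` for emptiness here would
    # wrongly conclude that extraction succeeded and skip OCR.
    fallback_parts: List[str] = []
    for number in range(1, len(page_texts) + 1):
        fallback_parts.append(f"## Page {number}")
        ocr_text = ocr_fallback(number).strip()  # assumed: OCR text for page N
        if ocr_text:
            fallback_parts.append(ocr_text)
    return "\n\n".join(fallback_parts)
```

With an empty text layer on every page, `ocr_fallback` is called once per page; as soon as any page has real text, it is not called at all.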

@dcaayushd

@afourney @gagb Could you please review this PR and approve the workflow when you have a chance? Thanks!

Development

Successfully merging this pull request may close these issues.

markitdown-ocr process non-text layer PDFs generate an MD document containing only page numbers