feat: add clean_newline utility for hyphenated line breaks (#2513) #4339

Open
DevAbdullah90 wants to merge 1 commit into Unstructured-IO:main from DevAbdullah90:DevAbdullah90/feat/clean-newline

DevAbdullah90 commented Apr 16, 2026

Problem

Issue #2513 identified the need for a utility function that rejoins hyphenated words split across newlines (e.g., "re- \nsearch" → "research"). This is a common problem in document partitioning, where layout-preserving text extraction introduces artificial breaks inside words.

Solution

This PR adds the clean_newline function to unstructured/cleaners/core.py.

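  • Logic: Uses the regex r"(\w+)-\s+(\w+)" to rejoin hyphenated words, keeping both word fragments and dropping the hyphen together with the whitespace that follows it.
  • Flexibility: \s+ matches one or more whitespace characters, so spaces, tabs, and newlines (in any combination) between the hyphen and the word continuation are all handled.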
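For reviewers who want to try the behavior without checking out the branch, here is a minimal sketch of the approach described above. It assumes the function is a single `re.sub` call with the replacement `r"\1\2"`; the pattern is the one quoted in this description, but the constant name `HYPHEN_BREAK_RE` and the exact function body are illustrative and may differ from the code in the diff.

```python
import re

# Pattern as quoted in the PR description. The constant name is hypothetical.
HYPHEN_BREAK_RE = re.compile(r"(\w+)-\s+(\w+)")


def clean_newline(text: str) -> str:
    """Rejoin words that were hyphenated across whitespace or a line break."""
    # Keep both word fragments; drop the hyphen and the whitespace after it.
    return HYPHEN_BREAK_RE.sub(r"\1\2", text)


print(clean_newline("re- \nsearch"))  # research
```

Because `\s+` requires at least one whitespace character, an ordinary compound such as "well-known" is left untouched. A compound that happens to be followed by whitespace ("well- known") would be joined into "wellknown", which is a trade-off of this pattern worth keeping in mind.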

Changes

  • Added clean_newline to unstructured/cleaners/core.py.
  • Added test cases to test_unstructured/cleaners/test_core.py covering various indentation and newline scenarios.

Verification

  • Added parameterized unit tests in test_unstructured/cleaners/test_core.py.
  • Verified all core cleaning tests pass (91 passed).
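The kind of scenarios the tests are meant to cover can be sketched as a small table of input/expected pairs. The cases below are hypothetical examples, not the test data from this PR, and `clean_newline` here is a stand-in built from the regex quoted in the description. The same pairs could be passed to `pytest.mark.parametrize`.

```python
import re


def clean_newline(text: str) -> str:
    # Stand-in for the function under test, assuming the regex from the PR text.
    return re.sub(r"(\w+)-\s+(\w+)", r"\1\2", text)


# Hypothetical (input, expected) pairs; the PR's real test data is not shown here.
CASES = [
    ("re- \nsearch", "research"),          # space, then newline
    ("docu-\nment", "document"),           # bare newline
    ("para-\n    graph", "paragraph"),     # newline followed by indentation
    ("tab-\tbreak", "tabbreak"),           # tab only
    ("well-known", "well-known"),          # no whitespace after hyphen: unchanged
    ("no hyphen here", "no hyphen here"),  # nothing to rejoin
]

# Collect every case whose actual output differs from the expected output.
failures = [(src, clean_newline(src), want) for src, want in CASES if clean_newline(src) != want]
assert not failures, failures
```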

Fixes #2513

uv run python -m pytest test_unstructured/cleaners/test_core.py

DevAbdullah90 force-pushed the DevAbdullah90/feat/clean-newline branch from b5cde03 to bf18801 on June 6, 2026 00:07
Development

Successfully merging this pull request may close these issues.

feat/clean_newline

2 participants