fix: avoid false-positive Title classification for long no-space text by claytonlin1110 · Pull Request #4348 · Unstructured-IO/unstructured

claytonlin1110 · 2026-04-28T20:44:55Z

Summary

add a no-space length guard in is_possible_title() to handle CJK text better
introduce UNSTRUCTURED_TITLE_MAX_NO_SPACE_LENGTH (default: 40) for tuning
add regression tests for Chinese text without spaces (short heading vs long body)
Closes bug/Incorrectly classifying Chinese text as "title" #3930

Why

Chinese OCR output often has no spaces, so long narrative text could be interpreted as a single "word" and incorrectly classified as Title. This change reduces those false positives while preserving expected short-title behavior.

Test plan

Added unit test coverage in test_unstructured/partition/test_text_type.py
Run: pytest -q test_unstructured/partition/test_text_type.py -k "title or chinese"
Verify OCR-only PDF flow no longer maps long Chinese body text to Title
Sanity-check English title classification is unchanged

Note

Medium Risk
Changes is_possible_title() heuristics for Han-ideograph text, which can affect downstream element classification and chunking behavior. The default threshold differs between the documentation and the code, which could surprise users who rely on the default implicitly.

Overview
Reduces false-positive Title classification for long whitespace-free CJK/Han text by adding a Han-ideograph regex and a max-length guard in is_possible_title() (configurable via UNSTRUCTURED_TITLE_MAX_NO_SPACE_LENGTH).
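The guard described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the helper name exceeds_no_space_guard and the character range in HAN_IDEOGRAPH_RE are assumptions, and the default of 40 is taken from the PR description.

```python
import os
import re
from typing import Optional

# Assumption: the basic CJK Unified Ideographs block; the PR's real pattern may be broader.
HAN_IDEOGRAPH_RE = re.compile(r"[\u4e00-\u9fff]")


def exceeds_no_space_guard(text: str, max_length: Optional[int] = None) -> bool:
    """Hypothetical helper: True for whitespace-free Han text longer than max_length."""
    if max_length is None:
        # 40 is the default given in the PR description.
        max_length = int(os.environ.get("UNSTRUCTURED_TITLE_MAX_NO_SPACE_LENGTH", "40"))
    stripped = text.strip()
    if any(ch.isspace() for ch in stripped):
        return False  # text with spaces is left to the existing word-based heuristics
    if HAN_IDEOGRAPH_RE.search(stripped) is None:
        return False  # the guard is scoped to Han-ideograph text
    return len(stripped) > max_length


short_heading = "第一章概述"
long_body = "这是一段没有空格的正文" * 10  # 110 characters, no whitespace

print(exceeds_no_space_guard(short_heading, max_length=40))  # False
print(exceeds_no_space_guard(long_body, max_length=40))      # True
```

In this sketch, a title check would return False early whenever the helper returns True, and fall through to the existing heuristics otherwise.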

Adds regression tests covering short Chinese headings vs long Chinese body text and ensuring the new guard doesn’t apply to other no-space scripts. Bumps version to 0.22.28 and documents the fix in CHANGELOG.md.

^{Reviewed by Cursor Bugbot for commit eda9fe4. Bugbot is set up for automated code reviews on this repo. Configure here.}

cursor · 2026-04-28T21:47:26Z

+
+    long_non_han_no_space = "የሰው፡ልጅ፡ሁሉ፡ሲወለድ፡ነጻና፡በክብርና፡በመብትም፡እኩልነት፡ያለው፡ነው።"
+
+    assert text_type.is_possible_title(long_non_han_no_space) is True


Non-Han guard test text too short to exercise guard

Low Severity

test_is_possible_title_does_not_apply_no_space_guard_to_non_han_text doesn't actually test the Han-scoping behavior. The Amharic string is ~44 characters, well below the default title_max_no_space_length of 120. The len(text) > 120 condition is always False here, so the guard never fires regardless of script. The test would pass even if the HAN_IDEOGRAPH_RE check were removed entirely, providing false confidence against regressions.
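One way to make such a test meaningful is to use a non-Han string that is longer than the threshold, so the length condition is true and only the script check decides the outcome. The sketch below uses a hypothetical stand-in, blocked_by_guard, with an assumed Han pattern and a han_only switch that simulates removing the script check; it is not the library's function.

```python
import re

# Assumed pattern for illustration; the PR's HAN_IDEOGRAPH_RE may differ.
HAN_IDEOGRAPH_RE = re.compile(r"[\u4e00-\u9fff]")


def blocked_by_guard(text: str, max_length: int = 120, han_only: bool = True) -> bool:
    """Hypothetical stand-in for the guard; han_only=False simulates dropping the script check."""
    if any(ch.isspace() for ch in text) or len(text) <= max_length:
        return False
    if han_only:
        return HAN_IDEOGRAPH_RE.search(text) is not None
    return True


amharic = "የሰው፡ልጅ፡ሁሉ፡ሲወለድ፡ነጻና፡በክብርና፡በመብትም፡እኩልነት፡ያለው፡ነው።"
long_non_han = amharic * 4  # repeated so the length exceeds the threshold

assert len(amharic) <= 120 < len(long_non_han)

# The original string cannot tell the two variants apart: neither blocks it.
assert blocked_by_guard(amharic) == blocked_by_guard(amharic, han_only=False)

# The longer string can: it passes only when the guard is scoped to Han text.
assert blocked_by_guard(long_non_han) is False
assert blocked_by_guard(long_non_han, han_only=False) is True
```

The last two assertions are the ones that would fail if the Han scoping were removed from the real guard.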

^{Reviewed by Cursor Bugbot for commit 33b7dfb. Configure here.}

claytonlin1110 · 2026-05-05T03:48:21Z

@SudSampath Would you please review?

claytonlin1110 · 2026-05-06T06:15:52Z

@cragwolfe Would you please review?

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).


^{Reviewed by Cursor Bugbot for commit eda9fe4. Configure here.}

cursor · 2026-05-06T10:52:45Z

    )
+    title_max_no_space_length = int(
+        os.environ.get("UNSTRUCTURED_TITLE_MAX_NO_SPACE_LENGTH", 120),
+    )


Default threshold 120 contradicts documented value of 40

Medium Severity

The code sets the default for UNSTRUCTURED_TITLE_MAX_NO_SPACE_LENGTH to 120, but both the CHANGELOG and PR description explicitly state the default is 40. With a threshold of 120, Chinese body text between 41–120 characters without spaces will still be misclassified as Title, significantly reducing the effectiveness of the fix that this PR is intended to deliver.
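The effect of the two defaults can be illustrated with a small sketch. The guard_fires helper and its Han pattern are assumptions made for this example; only the thresholds 40 and 120 come from the discussion above.

```python
import re

# Assumed pattern for illustration only.
HAN_IDEOGRAPH_RE = re.compile(r"[\u4e00-\u9fff]")


def guard_fires(text: str, max_length: int) -> bool:
    """Hypothetical guard: whitespace-free Han text longer than max_length."""
    return (
        not any(ch.isspace() for ch in text)
        and HAN_IDEOGRAPH_RE.search(text) is not None
        and len(text) > max_length
    )


body = "文" * 80  # an 80-character whitespace-free Chinese string

print(guard_fires(body, max_length=40))   # True: rejected as a Title candidate
print(guard_fires(body, max_length=120))  # False: still eligible to be a Title

# Lengths that the documented default (40) catches but the coded default (120) misses.
missed = [n for n in range(1, 201) if guard_fires("文" * n, 40) and not guard_fires("文" * n, 120)]
print(missed[0], missed[-1])  # 41 120
```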

Additional Locations (1)

CHANGELOG.md#L4-L5

^{Reviewed by Cursor Bugbot for commit eda9fe4. Configure here.}

claytonlin1110 force-pushed the fix/chinese-title-classification-ocr branch from c9f1665 to c2452b9 Compare April 28, 2026 21:00

cursor Bot reviewed Apr 28, 2026


claytonlin1110 force-pushed the fix/chinese-title-classification-ocr branch from 33b7dfb to 18a14dd Compare May 6, 2026 10:43

fix: avoid false-positive Title classification for long no-space text

f66ef05

claytonlin1110 force-pushed the fix/chinese-title-classification-ocr branch from 18a14dd to f66ef05 Compare May 6, 2026 10:45

cursor Bot reviewed May 6, 2026


Comment thread CHANGELOG.md Outdated

fix: update

eda9fe4

cursor Bot reviewed May 6, 2026


claytonlin1110 wants to merge 2 commits into Unstructured-IO:main from claytonlin1110:fix/chinese-title-classification-ocr



