fix: avoid false-positive Title classification for long no-space text #4348

Open
claytonlin1110 wants to merge 2 commits into Unstructured-IO:main from claytonlin1110:fix/chinese-title-classification-ocr

claytonlin1110 (Contributor) commented Apr 28, 2026

Summary

  • add a no-space length guard in is_possible_title() so that long, whitespace-free CJK text is not classified as Title
  • introduce UNSTRUCTURED_TITLE_MAX_NO_SPACE_LENGTH (default: 40) for tuning
  • add regression tests for Chinese text without spaces (short heading vs long body)

Closes bug/Incorrectly classifying Chinese text as "title" #3930

Why

Chinese OCR output often has no spaces, so long narrative text could be interpreted as a single "word" and incorrectly classified as Title. This change reduces those false positives while preserving expected short-title behavior.
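
To make the failure mode concrete, here is a minimal sketch of how whitespace tokenization sees space-free text. It is not the library's code: the sample sentences and the `word_count` helper are illustrative assumptions.

```python
# Illustrative only: whitespace tokenization of spaced vs. space-free text.
# `word_count` is a hypothetical helper, not part of unstructured.


def word_count(text: str) -> int:
    """Count whitespace-delimited tokens, as a naive word heuristic would."""
    return len(text.split())


english_body = "This is an ordinary sentence with several words in it."
chinese_body = "这是一段没有任何空格的中文正文它在按空白切分时只会得到一个词"

print(word_count(english_body))  # several tokens
print(word_count(chinese_body))  # a single token, however long the text is
```

Any heuristic that rejects titles by word count alone would therefore let the Chinese paragraph through, which is the gap the length guard is meant to close.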

Test plan

  • Added unit test coverage in test_unstructured/partition/test_text_type.py
  • Run: pytest -q test_unstructured/partition/test_text_type.py -k "title or chinese"
  • Verify OCR-only PDF flow no longer maps long Chinese body text to Title
  • Sanity-check English title classification is unchanged

Note

Medium Risk
Changes is_possible_title() heuristics for Han-ideograph text, which can affect downstream element classification and chunking behavior; the mismatch between the documented default (40) and the default in code (120) could surprise users who rely on the default implicitly.

Overview
Reduces false-positive Title classification for long whitespace-free CJK/Han text by adding a Han-ideograph regex and a max-length guard in is_possible_title() (configurable via UNSTRUCTURED_TITLE_MAX_NO_SPACE_LENGTH).
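
The guard described above could look roughly like the sketch below. This is an assumption-laden illustration rather than the PR's diff: the regex range (basic CJK Unified Ideographs only) and the helper name `exceeds_no_space_limit` are hypothetical, and the threshold of 40 is the default documented in the PR description.

```python
import re

# Assumption: the basic CJK Unified Ideographs block; the PR's actual
# HAN_IDEOGRAPH_RE pattern is not shown in this thread.
HAN_IDEOGRAPH_RE = re.compile(r"[\u4e00-\u9fff]")


def exceeds_no_space_limit(text: str, max_no_space_length: int = 40) -> bool:
    """Return True when text looks like long, space-free Han body text.

    Hypothetical helper; 40 is the default documented in the PR description.
    """
    has_whitespace = any(ch.isspace() for ch in text)
    return (
        not has_whitespace
        and len(text) > max_no_space_length
        and HAN_IDEOGRAPH_RE.search(text) is not None
    )


short_heading = "第一章概述"
long_body = "这是正文内容" * 10  # 60 characters, no spaces

print(exceeds_no_space_limit(short_heading))  # short heading: guard stays off
print(exceeds_no_space_limit(long_body))      # long body: guard fires
```

In this sketch a title check would return False early whenever the helper returns True, leaving short headings and non-Han scripts on the existing code path.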

Adds regression tests covering short Chinese headings vs long Chinese body text and ensuring the new guard doesn’t apply to other no-space scripts. Bumps version to 0.22.28 and documents the fix in CHANGELOG.md.

Reviewed by Cursor Bugbot for commit eda9fe4.

claytonlin1110 force-pushed the fix/chinese-title-classification-ocr branch from c9f1665 to c2452b9 on April 28, 2026 21:00

long_non_han_no_space = "የሰው፡ልጅ፡ሁሉ፡ሲወለድ፡ነጻና፡በክብርና፡በመብትም፡እኩልነት፡ያለው፡ነው።"

assert text_type.is_possible_title(long_non_han_no_space) is True

Non-Han guard test text too short to exercise guard

Low Severity

test_is_possible_title_does_not_apply_no_space_guard_to_non_han_text doesn't actually test the Han-scoping behavior. The Amharic string is ~44 characters, well below the default title_max_no_space_length of 120. The len(text) > 120 condition is always False here, so the guard never fires regardless of script. The test would pass even if the HAN_IDEOGRAPH_RE check were removed entirely, providing false confidence against regressions.
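
One way to make the test meaningful is to use a non-Han string that is longer than the threshold, so that removing the script check would change the outcome. The sketch below uses two hypothetical stand-in guards (`han_scoped_guard`, `script_blind_guard`) and an assumed regex range; it does not call the real is_possible_title().

```python
import re

HAN_IDEOGRAPH_RE = re.compile(r"[\u4e00-\u9fff]")  # assumed pattern
THRESHOLD = 120  # the default in the code at the reviewed commit


def han_scoped_guard(text: str) -> bool:
    """Hypothetical guard: fires only for long, space-free text containing Han."""
    no_space = not any(ch.isspace() for ch in text)
    return no_space and len(text) > THRESHOLD and bool(HAN_IDEOGRAPH_RE.search(text))


def script_blind_guard(text: str) -> bool:
    """Same guard with the Han check removed (the regression to catch)."""
    no_space = not any(ch.isspace() for ch in text)
    return no_space and len(text) > THRESHOLD


base = "የሰው፡ልጅ፡ሁሉ፡ሲወለድ፡ነጻና፡በክብርና፡በመብትም፡እኩልነት፡ያለው፡ነው።"
long_non_han_no_space = base * 4  # repeated so the length exceeds THRESHOLD

assert len(long_non_han_no_space) > THRESHOLD
assert han_scoped_guard(long_non_han_no_space) is False
assert script_blind_guard(long_non_han_no_space) is True
```

Because the two guards now disagree on this input, a test built on it would fail if the Han scoping were dropped.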

Reviewed by Cursor Bugbot for commit 33b7dfb.

claytonlin1110 (Contributor, Author) commented:

@SudSampath Would you please review?

claytonlin1110 (Contributor, Author) commented:

@cragwolfe Would you please review?

claytonlin1110 force-pushed the fix/chinese-title-classification-ocr branch from 33b7dfb to 18a14dd on May 6, 2026 10:43
claytonlin1110 force-pushed the fix/chinese-title-classification-ocr branch from 18a14dd to f66ef05 on May 6, 2026 10:45
Comment thread CHANGELOG.md Outdated

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit eda9fe4.

)
title_max_no_space_length = int(
    os.environ.get("UNSTRUCTURED_TITLE_MAX_NO_SPACE_LENGTH", 120),
)

Default threshold 120 contradicts documented value of 40

Medium Severity

The code sets the default for UNSTRUCTURED_TITLE_MAX_NO_SPACE_LENGTH to 120, but both the CHANGELOG and PR description explicitly state the default is 40. With a threshold of 120, Chinese body text between 41–120 characters without spaces will still be misclassified as Title, significantly reducing the effectiveness of the fix that this PR is intended to deliver.
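
The practical effect of the two defaults can be sketched as follows. The helpers `read_threshold` and `guard_fires` are hypothetical simplifications (the Han check is omitted, and the environment is passed in as a mapping so the example does not depend on the real process environment).

```python
from typing import Mapping

ENV_NAME = "UNSTRUCTURED_TITLE_MAX_NO_SPACE_LENGTH"


def read_threshold(env: Mapping[str, str], default: int) -> int:
    """Hypothetical helper mirroring int(os.environ.get(ENV_NAME, default))."""
    return int(env.get(ENV_NAME, default))


def guard_fires(text: str, threshold: int) -> bool:
    """Simplified guard: long text with no whitespace (Han check omitted)."""
    return not any(ch.isspace() for ch in text) and len(text) > threshold


body_80_chars = "这是一段正文" * 13 + "内容"  # 6 * 13 + 2 = 80 characters

coded = read_threshold({}, default=120)      # default in the reviewed code
documented = read_threshold({}, default=40)  # default in CHANGELOG and PR text
overridden = read_threshold({ENV_NAME: "40"}, default=120)

print(guard_fires(body_80_chars, coded))       # guard does not fire at 120
print(guard_fires(body_80_chars, documented))  # guard fires at 40
print(guard_fires(body_80_chars, overridden))  # env override restores 40
```

Under these assumptions, an 80-character space-free paragraph is only rejected as a title when the threshold is 40, either by default or through the environment variable.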

Additional Locations (1)
Reviewed by Cursor Bugbot for commit eda9fe4.

Development

Successfully merging this pull request may close these issues.

bug/Incorrectly classifying Chinese text as "title"

1 participant