fix: avoid false-positive Title classification for long no-space text #4348
claytonlin1110 wants to merge 2 commits into
Force-pushed from c9f1665 to c2452b9
long_non_han_no_space = "የሰው፡ልጅ፡ሁሉ፡ሲወለድ፡ነጻና፡በክብርና፡በመብትም፡እኩልነት፡ያለው፡ነው።"
assert text_type.is_possible_title(long_non_han_no_space) is True
Non-Han guard test text too short to exercise guard
Low Severity
test_is_possible_title_does_not_apply_no_space_guard_to_non_han_text doesn't actually test the Han-scoping behavior. The Amharic string is ~44 characters, well below the default title_max_no_space_length of 120. The len(text) > 120 condition is always False here, so the guard never fires regardless of script. The test would pass even if the HAN_IDEOGRAPH_RE check were removed entirely, providing false confidence against regressions.
Reviewed by Cursor Bugbot for commit 33b7dfb.
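The point about string length can be checked with a small stand-in for the guard. The sketch below is not the PR's code: `HAN_RE`, `no_space_guard_fires`, and the `require_han` flag are hypothetical names, and the Han range used (U+4E00 to U+9FFF) is an assumption about what HAN_IDEOGRAPH_RE matches. It illustrates that the 44-character Amharic string gives the same result with or without the script check, while a string longer than the threshold separates the two variants.

```python
import re

# Assumption: basic CJK Unified Ideographs range; the PR's HAN_IDEOGRAPH_RE may differ.
HAN_RE = re.compile(r"[\u4e00-\u9fff]")


def no_space_guard_fires(text: str, max_len: int = 120, require_han: bool = True) -> bool:
    """Hypothetical stand-in: True when text is long, whitespace-free and (optionally) Han."""
    if len(text) <= max_len or any(ch.isspace() for ch in text):
        return False
    return bool(HAN_RE.search(text)) if require_han else True


short_amharic = "የሰው፡ልጅ፡ሁሉ፡ሲወለድ፡ነጻና፡በክብርና፡በመብትም፡እኩልነት፡ያለው፡ነው።"
long_amharic = short_amharic * 4  # repeated so the length exceeds the 120 threshold

# Short string: the length condition is False, so the script check is never exercised.
short_results = (
    no_space_guard_fires(short_amharic, require_han=True),
    no_space_guard_fires(short_amharic, require_han=False),
)

# Long string: the two variants disagree, so removing the Han check would be detected.
long_results = (
    no_space_guard_fires(long_amharic, require_han=True),
    no_space_guard_fires(long_amharic, require_han=False),
)
```

Under these assumptions, a regression test for the Han scoping would need a non-Han, whitespace-free input longer than the configured threshold, such as `long_amharic` here.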
@SudSampath Would you please review?
@cragwolfe Would you please review?
Force-pushed from 33b7dfb to 18a14dd
Force-pushed from 18a14dd to f66ef05
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit eda9fe4.
)
title_max_no_space_length = int(
    os.environ.get("UNSTRUCTURED_TITLE_MAX_NO_SPACE_LENGTH", 120),
)
Default threshold 120 contradicts documented value of 40
Medium Severity
The code sets the default for UNSTRUCTURED_TITLE_MAX_NO_SPACE_LENGTH to 120, but both the CHANGELOG and PR description explicitly state the default is 40. With a threshold of 120, Chinese body text between 41–120 characters without spaces will still be misclassified as Title, significantly reducing the effectiveness of the fix that this PR is intended to deliver.
Reviewed by Cursor Bugbot for commit eda9fe4.
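To make the effect of the two defaults concrete, here is a minimal sketch. `read_max_no_space_length` is a hypothetical helper that only mirrors the `int(os.environ.get(...))` pattern from the diff, and the sample text is made up; it is not the library's implementation.

```python
import os

ENV_VAR = "UNSTRUCTURED_TITLE_MAX_NO_SPACE_LENGTH"


def read_max_no_space_length(default: int) -> int:
    """Hypothetical helper mirroring the int(os.environ.get(...)) pattern in the diff."""
    return int(os.environ.get(ENV_VAR, default))


# Unset the variable so that only the defaults are being compared.
os.environ.pop(ENV_VAR, None)

body_text = "测试文本" * 15  # 60 Han characters, no spaces (made-up sample)

exceeds_with_120 = len(body_text) > read_max_no_space_length(120)  # default in the code
exceeds_with_40 = len(body_text) > read_max_no_space_length(40)  # default in the docs

# Setting the variable overrides whichever default is compiled in.
os.environ[ENV_VAR] = "40"
exceeds_with_override = len(body_text) > read_max_no_space_length(120)
os.environ.pop(ENV_VAR, None)
```

A 60-character passage falls in the 41 to 120 range the review describes: it is over the limit only when the effective threshold is 40, whether that comes from the default or from the environment variable.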


Summary
- is_possible_title() to handle CJK text better
- UNSTRUCTURED_TITLE_MAX_NO_SPACE_LENGTH (default: 40) for tuning

Closes bug/Incorrectly classifying Chinese text as "title" #3930

Why
Chinese OCR output often has no spaces, so long narrative text could be interpreted as a single "word" and incorrectly classified as Title. This change reduces those false positives while preserving expected short-title behavior.

Test plan
- test_unstructured/partition/test_text_type.py
- pytest -q test_unstructured/partition/test_text_type.py -k "title or chinese"
- Title

Note
Medium Risk
Changes is_possible_title() heuristics for Han-ideograph text, which can affect downstream element classification and chunking behavior; env-default tuning mismatch could surprise users if relied upon implicitly.

Overview
Reduces false-positive Title classification for long whitespace-free CJK/Han text by adding a Han-ideograph regex and a max-length guard in is_possible_title() (configurable via UNSTRUCTURED_TITLE_MAX_NO_SPACE_LENGTH).

Adds regression tests covering short Chinese headings vs long Chinese body text and ensuring the new guard doesn't apply to other no-space scripts. Bumps version to 0.22.28 and documents the fix in CHANGELOG.md.

Reviewed by Cursor Bugbot for commit eda9fe4. Bugbot is set up for automated code reviews on this repo.
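The single "word" effect mentioned in the description can be seen with plain whitespace splitting. This is only an illustration with made-up sample strings; it does not reproduce how is_possible_title() tokenizes text.

```python
# Made-up samples: an English heading and a whitespace-free Chinese sentence.
heading = "Annual Report Summary"
chinese_body = "人人生而自由在尊严和权利上一律平等他们赋有理性和良心并应以兄弟关系的精神相对待"

english_tokens = heading.split()
chinese_tokens = chinese_body.split()

# Whitespace splitting sees the whole Chinese sentence as one token,
# so a word-count heuristic alone cannot tell it apart from a one-word title.
print(len(english_tokens), len(chinese_tokens))
```

This is why a character-length limit, rather than a word count, is the kind of signal that can separate a short Chinese heading from a long Chinese paragraph.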