Use java.text.BreakIterator in DefaultTextDoubleClickStrategy#3990

Open
vogella wants to merge 2 commits into eclipse-platform:master from vogella:icu-jfacetext

Conversation

@vogella
Contributor

@vogella vogella commented May 8, 2026

Replaces com.ibm.icu.text.BreakIterator with java.text.BreakIterator in DefaultTextDoubleClickStrategy and drops the matching Import-Package: com.ibm.icu.text from the bundle manifest. The JDK BreakIterator exposes the same API (getWordInstance, preceding, following, isBoundary, setText, DONE), and the existing POSIX-locale workaround for '.' not being treated as a word boundary continues to work. This removes the last com.ibm.icu reference from org.eclipse.jface.text.

Planned for 4.41
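
As a rough illustration of the API surface listed above, the following minimal sketch finds the word-iterator segment around an offset using only java.text.BreakIterator. The names WordAtOffsetSketch and segmentAt are hypothetical and not taken from DefaultTextDoubleClickStrategy, Locale.ROOT is an assumption, and the POSIX-locale workaround is not reproduced here.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper, not the code from this PR.
final class WordAtOffsetSketch {

    /** Returns {start, end} of the word-iterator segment that contains offset. */
    static int[] segmentAt(String text, int offset) {
        // Assumption: Locale.ROOT; the real strategy may pick a different locale.
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        // If offset already sits on a boundary, the segment starts there;
        // otherwise step back to the closest boundary before it.
        int start = it.isBoundary(offset) ? offset : it.preceding(offset);
        int end = it.following(offset);
        if (start == BreakIterator.DONE) {
            start = 0;
        }
        if (end == BreakIterator.DONE) {
            end = text.length();
        }
        return new int[] { start, end };
    }

    public static void main(String[] args) {
        String text = "hello world";
        int[] seg = segmentAt(text, 7);
        System.out.println(text.substring(seg[0], seg[1])); // prints "world"
    }
}
```

Because com.ibm.icu.text.BreakIterator offers methods with the same names, a swap like this can compile unchanged; whether the returned positions match for every script is a separate question that only tests can answer.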

@github-actions
Contributor

github-actions Bot commented May 8, 2026

Test Results

864 files ±0, 864 suites ±0, 51m 40s ⏱️ -50s

| | total | ✅ | 💤 | ❌ |
|---|---|---|---|---|
| tests | 7 996 (+8) | 7 753 (+8) | 243 (±0) | 0 (±0) |
| runs | 20 442 (+24) | 19 787 (+24) | 655 (±0) | 0 (±0) |

Results for commit 58af83a. ± Comparison against base commit 1e6a7c9.

♻️ This comment has been updated with latest results.

@vogella vogella force-pushed the icu-jfacetext branch 2 times, most recently from e9be22e to 40485b6 May 8, 2026 16:30

@HannesWell
Member

I'm looking forward to having ICU entirely removed.
But IIRC, difficulties in fully replacing it have been mentioned in the past because of differing behavior. I searched for the corresponding bugs/sources but didn't find them. Do you have them at hand?

@vogella
Contributor Author

vogella commented May 9, 2026

I don't have a reference, but the behavior is partially covered by unit tests, which I extended. This was therefore not a drop-in replacement.

@vogella vogella marked this pull request as ready for review May 31, 2026 11:57
@eclipse-platform-bot
Contributor

This pull request changes some projects for the first time in this development cycle.
Therefore the following files need a version increment:

bundles/org.eclipse.jface.text/META-INF/MANIFEST.MF

An additional commit containing all the necessary changes was pushed to the top of this PR's branch. To obtain these changes (for example if you want to push more changes) either fetch from your fork or apply the git patch.

Git patch
From 22852b586e8d3640f26e7fd0997422f5c39057df Mon Sep 17 00:00:00 2001
From: Eclipse Platform Bot <platform-bot@eclipse.org>
Date: Sun, 31 May 2026 12:01:55 +0000
Subject: [PATCH] Version bump(s) for 4.41 stream


diff --git a/bundles/org.eclipse.jface.text/META-INF/MANIFEST.MF b/bundles/org.eclipse.jface.text/META-INF/MANIFEST.MF
index 78461c220e..ef7567b50e 100644
--- a/bundles/org.eclipse.jface.text/META-INF/MANIFEST.MF
+++ b/bundles/org.eclipse.jface.text/META-INF/MANIFEST.MF
@@ -2,7 +2,7 @@ Manifest-Version: 1.0
 Bundle-ManifestVersion: 2
 Bundle-Name: %pluginName
 Bundle-SymbolicName: org.eclipse.jface.text
-Bundle-Version: 3.31.0.qualifier
+Bundle-Version: 3.31.100.qualifier
 Bundle-Vendor: %providerName
 Bundle-Localization: plugin
 Export-Package: 
-- 
2.54.0

Further information is available in Common Build Issues - Missing version increments.

@HannesWell
Member

HannesWell commented May 31, 2026

> I don't have a reference, but the behavior is partially covered by unit tests, which I extended. This was therefore not a drop-in replacement.

I'm not deeply familiar with that part of the code, but could it be that the ICU BreakIterator handles non-Roman scripts better?
I asked Copilot about it and it provided the following answer:

Can com.ibm.icu.text.BreakIterator safely be replaced by java.text.BreakIterator, even for non-Eurocentric languages?


Not safely in general.

`com.ibm.icu.text.BreakIterator` and `java.text.BreakIterator` are related in purpose, but ICU’s implementation is usually more complete and more up to date for international text segmentation, especially for non-European scripts and edge cases like emoji, grapheme clusters, dictionary-based word breaking, and evolving Unicode rules.

A practical rule of thumb:

- **If you only handle simple Latin-script text** and basic sentence/word boundaries, replacement may work.
- **If you care about correctness across many languages**, especially Thai, Lao, Khmer, Burmese, CJK, Indic scripts, mixed-script text, or modern emoji behavior, **do not assume they are interchangeable**.
- **If your code currently depends on ICU**, it often does so for a reason.

Key differences to watch for:

1. **Unicode data freshness**
   - ICU is generally updated faster and more regularly for Unicode segmentation behavior.
   - JDK behavior depends on the Java version and may lag.

2. **Dictionary-based segmentation**
   - Languages without spaces between words, like **Thai** or **Japanese** in some contexts, are more sensitive.
   - ICU tends to perform better here.

3. **Grapheme cluster handling**
   - User-perceived characters are tricky:
     - emoji with modifiers
     - ZWJ emoji sequences
     - combining marks
     - regional-indicator flags
   - ICU is usually safer if cursor movement, deletion, highlighting, or UI text operations matter.

4. **Locale behavior**
   - Even when both APIs expose similar methods, actual boundary results can differ by locale and JDK/ICU version.

5. **API/behavior mismatch**
   - Even if code compiles after a swap, tests may fail because break positions differ.

So the honest answer is:

- **For euro-centric languages only:** maybe.
- **For non euro-centric languages:** **not reliably/safely without testing**.

What I’d recommend:
- Build a regression corpus with text in:
  - English, German, French
  - Arabic, Hebrew
  - Hindi
  - Thai
  - Chinese, Japanese, Korean
  - emoji-heavy social text
- Compare actual break positions for:
  - character boundaries
  - word boundaries
  - sentence boundaries
  - line boundaries if used
- Only replace ICU if your tests show identical or acceptable behavior for your supported locales and Java version.

If you want, I can also give you:
1. a **risk matrix by language/script**, or  
2. a **small Java test harness** to compare ICU vs JDK `BreakIterator` on sample strings.

Of course I'm not sure how correct that is. But if it's true, we risk breaking the existing translation e.g. for Asian languages.
Maybe some of our fellow developers from India can provide test text in Hindi?

@vogella
Contributor Author

vogella commented May 31, 2026

The Copilot dump you pasted is generic and over-broad for this specific change.
The platform already uses java.text.BreakIterator for text handling:

  • org.eclipse.ui.editors → AbstractDecoratedTextEditor
  • org.eclipse.ui.forms → FormUtil, TextSegment

So the JDK iterator is already the de-facto choice in the editor stack. After this PR, DefaultTextDoubleClickStrategy is the last com.ibm.icu.text.BreakIterator reference in org.eclipse.jface.text.

Happy to add a CJK/Thai sample to the test to document the behavior if that helps.

@vogella
Contributor Author

vogella commented May 31, 2026

Also note the scope here: this change only affects double-click word selection

@HannesWell
Member

> The Copilot dump you pasted is generic and over-broad for this specific change.

Yes, of course. But my intent was to understand whether it is generally a safe replacement or whether one has to pay more attention.

> So the JDK iterator is already the de-facto choice in the editor stack. After this PR, DefaultTextDoubleClickStrategy is the last com.ibm.icu.text.BreakIterator reference in org.eclipse.jface.text.

Don't get me wrong, I'm glad when ICU is finally removed as a dependency, but we should check whether this severely breaks existing users, assess the risk, and then decide whether it's acceptable.

> Happy to add a CJK/Thai sample to the test to document the behavior if that helps.

Sure, that would help. But I don't have one at hand, do you? If not, I can ask around whether somebody else can provide one.

@vogella
Contributor Author

vogella commented May 31, 2026

I will use these:

| Sample | Language | Meaning |
|---|---|---|
| こんにちは世界 | Japanese | "Hello, world" |
| foo 我是 bar | Chinese | 我是 = "I am" |
| foo ไทย bar | Thai | ไทย = "Thai" |
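
A minimal sketch for inspecting how the JDK word iterator segments these samples. The names BoundaryDumpSketch and wordBoundaries are hypothetical, and Locale.ROOT is an assumption. No expected break positions are given, since they may differ between JDK versions and locales.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for eyeballing JDK word boundaries; not part of this PR.
final class BoundaryDumpSketch {

    /** Collects every boundary offset the JDK word iterator reports for text. */
    static List<Integer> wordBoundaries(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<Integer> result = new ArrayList<>();
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            result.add(b);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] samples = { "こんにちは世界", "foo 我是 bar", "foo ไทย bar" };
        for (String s : samples) {
            // The printed offsets are UTF-16 indices into the sample string.
            System.out.println(s + " -> " + wordBoundaries(s, Locale.ROOT));
        }
    }
}
```

Running the same loop against com.ibm.icu.text.BreakIterator, where ICU is still on the classpath, would give a side-by-side comparison of the two implementations.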

vogella and others added 2 commits May 31, 2026 15:48
Replaces com.ibm.icu.text.BreakIterator with java.text.BreakIterator
in DefaultTextDoubleClickStrategy and drops the matching
Import-Package: com.ibm.icu.text from the bundle manifest. The JDK
BreakIterator exposes the same API (getWordInstance, preceding,
following, isBoundary, setText, DONE) and the existing POSIX-locale
workaround for '.' not being treated as a word boundary continues to
work with java.text.

Removes the last com.ibm.icu reference from org.eclipse.jface.text.
@vogella
Contributor Author

vogella commented May 31, 2026

The scope here is narrow: this only uses getWordInstance() for double-click word selection, and the new identifier fast-path means the locale-aware iterator is only a fallback for non-ASCII text.

I added CJK/Thai tests to make it concrete.
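
One way to picture the fast-path described above is the sketch below. It is an assumption about the general shape, not the code from this PR: only isIdentifierPart is copied from the review snippet in this conversation, while IdentifierFastPathSketch, select, and the choice of Locale.ROOT are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical sketch of "ASCII identifier first, BreakIterator as fallback".
final class IdentifierFastPathSketch {

    // Copied from the snippet under review in this PR.
    static boolean isIdentifierPart(char c) {
        return c == '_' || (c < 128 && (c >= '0' && c <= '9' || c >= 'A' && c <= 'Z' || c >= 'a' && c <= 'z'));
    }

    /** Returns the text a double click at offset would select in line. */
    static String select(String line, int offset) {
        if (offset < 0 || offset >= line.length()) {
            return "";
        }
        if (isIdentifierPart(line.charAt(offset))) {
            // Fast path: grow the selection over the surrounding ASCII identifier run.
            int start = offset;
            int end = offset + 1;
            while (start > 0 && isIdentifierPart(line.charAt(start - 1))) {
                start--;
            }
            while (end < line.length() && isIdentifierPart(line.charAt(end))) {
                end++;
            }
            return line.substring(start, end);
        }
        // Fallback: let the locale-aware word iterator decide (assumed Locale.ROOT).
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(line);
        int start = it.isBoundary(offset) ? offset : it.preceding(offset);
        int end = it.following(offset);
        if (start == BreakIterator.DONE) {
            start = 0;
        }
        if (end == BreakIterator.DONE) {
            end = line.length();
        }
        return line.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(select("int foo_bar1 = 0;", 6)); // prints "foo_bar1"
        System.out.println(select("foo 我是 bar", 4));       // falls back to BreakIterator
    }
}
```

With this shape, typical source-code identifiers never reach the iterator, so any ICU/JDK difference could only show up when the clicked character is outside the ASCII identifier set.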

@merks
Contributor

merks commented May 31, 2026

It's nice to see the added test cases!

Member

@HannesWell HannesWell left a comment

> The scope here is narrow: this only uses getWordInstance() for double-click word selection, and the new identifier fast-path means the locale-aware iterator is only a fallback for non-ASCII text.
>
> I added CJK/Thai tests to make it concrete.

Thanks for the update. Then this is probably fine, or at least safe enough from my POV.

Comment on lines +268 to +270
private static boolean isIdentifierPart(char c) {
    return c == '_' || (c < 128 && (c >= '0' && c <= '9' || c >= 'A' && c <= 'Z' || c >= 'a' && c <= 'z'));
}
Member

Is this related to Character.isJavaIdentifierPart(char)? Or would it make sense to reuse it, although it's not identical at the moment?
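
To make the "not identical" part concrete, here is a small sketch that prints both predicates for a few probe characters. The class name IdentifierPartComparison and the choice of probes are hypothetical; isIdentifierPart is copied from the snippet under review. Two differences it demonstrates: the JDK method accepts '$' and non-ASCII letters such as 'é', while the ASCII-only helper rejects both.

```java
// Hypothetical comparison harness; only isIdentifierPart comes from the PR snippet.
final class IdentifierPartComparison {

    static boolean isIdentifierPart(char c) {
        return c == '_' || (c < 128 && (c >= '0' && c <= '9' || c >= 'A' && c <= 'Z' || c >= 'a' && c <= 'z'));
    }

    public static void main(String[] args) {
        // '\u00e9' is the letter e with acute accent, written as an escape
        // so the source stays ASCII-only.
        char[] probes = { 'a', '_', '9', '$', '\u00e9', '-' };
        for (char c : probes) {
            System.out.printf("'%c' helper=%b jdk=%b%n",
                    c, isIdentifierPart(c), Character.isJavaIdentifierPart(c));
        }
    }
}
```

Reusing the JDK method would therefore widen the fast-path to non-ASCII letters, which may or may not be desired given that the iterator is meant to handle those.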


4 participants