Use java.text.BreakIterator in DefaultTextDoubleClickStrategy by vogella · Pull Request #3990 · eclipse-platform/eclipse.platform.ui

vogella · 2026-05-08T11:25:49Z

Replaces com.ibm.icu.text.BreakIterator with java.text.BreakIterator in DefaultTextDoubleClickStrategy and drops the matching Import-Package: com.ibm.icu.text from the bundle manifest. The JDK BreakIterator exposes the same API (getWordInstance, preceding, following, isBoundary, setText, DONE) and the existing POSIX-locale workaround for . not being treated as a word boundary continues to work. This removes the last com.ibm.icu reference from org.eclipse.jface.text.

Planned for 4.41
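
For readers unfamiliar with the API named above, here is a minimal sketch of how `getWordInstance`, `setText`, `isBoundary`, `preceding`, `following` and `DONE` can be combined to find the word around an offset using only `java.text.BreakIterator`. The class name `WordAtOffsetSketch` and the helper `wordAt` are hypothetical and not taken from `DefaultTextDoubleClickStrategy`; the real strategy works on an `IDocument` and may treat boundary positions differently. The offset is assumed to lie within `0..text.length()`.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class WordAtOffsetSketch {

    /** Returns {start, end} of the word segment containing offset (hypothetical helper). */
    static int[] wordAt(String text, int offset) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        // If the offset is itself a boundary the segment starts there; otherwise walk back.
        int start = it.isBoundary(offset) ? offset : it.preceding(offset);
        int end = it.following(offset);
        // DONE signals that there is no boundary in that direction.
        if (start == BreakIterator.DONE) {
            start = 0;
        }
        if (end == BreakIterator.DONE) {
            end = text.length();
        }
        return new int[] { start, end };
    }

    public static void main(String[] args) {
        String text = "hello world";
        int[] r = wordAt(text, 2);
        System.out.println(text.substring(r[0], r[1])); // prints "hello"
    }
}
```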

github-actions · 2026-05-08T12:08:10Z

Test Results

864 files ±0, 864 suites ±0, 51m 40s ⏱️ -50s
7 996 tests +8: 7 753 ✅ +8, 243 💤 ±0, 0 ❌ ±0
20 442 runs +24: 19 787 ✅ +24, 655 💤 ±0, 0 ❌ ±0

Results for commit 58af83a. ± Comparison against base commit 1e6a7c9.

♻️ This comment has been updated with latest results.

HannesWell · 2026-05-09T06:31:38Z

I'm looking forward to having ICU removed entirely.
But IIRC, difficulties in fully replacing it have been mentioned in the past because of differences in behavior. I searched for corresponding bugs/sources but didn't find them. Do you have them at hand?

vogella · 2026-05-09T06:36:52Z

I don't have a reference, but the behavior is partially covered by unit tests, which I extended. This was therefore not a drop-in replacement.

eclipse-platform-bot · 2026-05-31T12:02:13Z

This pull request changes some projects for the first time in this development cycle.
Therefore the following files need a version increment:

bundles/org.eclipse.jface.text/META-INF/MANIFEST.MF

An additional commit containing all the necessary changes was pushed to the top of this PR's branch. To obtain these changes (for example if you want to push more changes) either fetch from your fork or apply the git patch.

Git patch

From 22852b586e8d3640f26e7fd0997422f5c39057df Mon Sep 17 00:00:00 2001
From: Eclipse Platform Bot <platform-bot@eclipse.org>
Date: Sun, 31 May 2026 12:01:55 +0000
Subject: [PATCH] Version bump(s) for 4.41 stream


diff --git a/bundles/org.eclipse.jface.text/META-INF/MANIFEST.MF b/bundles/org.eclipse.jface.text/META-INF/MANIFEST.MF
index 78461c220e..ef7567b50e 100644
--- a/bundles/org.eclipse.jface.text/META-INF/MANIFEST.MF
+++ b/bundles/org.eclipse.jface.text/META-INF/MANIFEST.MF
@@ -2,7 +2,7 @@ Manifest-Version: 1.0
 Bundle-ManifestVersion: 2
 Bundle-Name: %pluginName
 Bundle-SymbolicName: org.eclipse.jface.text
-Bundle-Version: 3.31.0.qualifier
+Bundle-Version: 3.31.100.qualifier
 Bundle-Vendor: %providerName
 Bundle-Localization: plugin
 Export-Package: 
-- 
2.54.0

Further information is available in Common Build Issues - Missing version increments.

HannesWell · 2026-05-31T12:27:55Z

> I don't have a reference but the behavior is partially covered by unit tests which I extended. This was therefore not a drop in replacement.

I'm not deeply familiar with that part of the code, but could it be that the ICU BreakIterator handles non-Roman languages better?
I asked Copilot about it and it provided the following answer:

> Can com.ibm.icu.text.BreakIterator safely be replaced by java.text.BreakIterator, even for non-Euro-centric languages?


Not safely in general.

`com.ibm.icu.text.BreakIterator` and `java.text.BreakIterator` are related in purpose, but ICU’s implementation is usually more complete and more up to date for international text segmentation, especially for non-European scripts and edge cases like emoji, grapheme clusters, dictionary-based word breaking, and evolving Unicode rules.

A practical rule of thumb:

- **If you only handle simple Latin-script text** and basic sentence/word boundaries, replacement may work.
- **If you care about correctness across many languages**, especially Thai, Lao, Khmer, Burmese, CJK, Indic scripts, mixed-script text, or modern emoji behavior, **do not assume they are interchangeable**.
- **If your code currently depends on ICU**, it often does so for a reason.

Key differences to watch for:

1. **Unicode data freshness**
   - ICU is generally updated faster and more regularly for Unicode segmentation behavior.
   - JDK behavior depends on the Java version and may lag.

2. **Dictionary-based segmentation**
   - Languages without spaces between words, like **Thai** or **Japanese** in some contexts, are more sensitive.
   - ICU tends to perform better here.

3. **Grapheme cluster handling**
   - User-perceived characters are tricky:
     - emoji with modifiers
     - ZWJ emoji sequences
     - combining marks
     - regional-indicator flags
   - ICU is usually safer if cursor movement, deletion, highlighting, or UI text operations matter.

4. **Locale behavior**
   - Even when both APIs expose similar methods, actual boundary results can differ by locale and JDK/ICU version.

5. **API/behavior mismatch**
   - Even if code compiles after a swap, tests may fail because break positions differ.

So the honest answer is:

- **For euro-centric languages only:** maybe.
- **For non euro-centric languages:** **not reliably/safely without testing**.

What I’d recommend:
- Build a regression corpus with text in:
  - English, German, French
  - Arabic, Hebrew
  - Hindi
  - Thai
  - Chinese, Japanese, Korean
  - emoji-heavy social text
- Compare actual break positions for:
  - character boundaries
  - word boundaries
  - sentence boundaries
  - line boundaries if used
- Only replace ICU if your tests show identical or acceptable behavior for your supported locales and Java version.

If you want, I can also give you:
1. a **risk matrix by language/script**, or  
2. a **small Java test harness** to compare ICU vs JDK `BreakIterator` on sample strings.

Of course I'm not sure how correct that is. But if it's true, we risk breaking the existing translation e.g. for Asian languages.
Maybe some of our fellow developers from India can provide test text in Hindi?

vogella · 2026-05-31T12:34:35Z

The Copilot dump you pasted is generic and over-broad for this specific change.
The platform already uses java.text.BreakIterator for text handling:

org.eclipse.ui.editors → AbstractDecoratedTextEditor
org.eclipse.ui.forms → FormUtil, TextSegment

So the JDK iterator is already the de-facto choice in the editor stack. After this PR, DefaultTextDoubleClickStrategy is the last com.ibm.icu.text.BreakIterator reference in org.eclipse.jface.text.

Happy to add a CJK/Thai sample to the test to document the behavior if that helps.

vogella · 2026-05-31T12:35:27Z

Also note the scope here: this change only affects double-click word selection.

HannesWell · 2026-05-31T13:35:22Z

> The Copilot dump you pasted is generic and over-broad for this specific change.

Yes, of course. But my intent was to understand whether it is generally a safe replacement or whether one has to pay more attention.

> So the JDK iterator is already the de-facto choice in the editor stack. After this PR, DefaultTextDoubleClickStrategy is the last com.ibm.icu.text.BreakIterator reference in org.eclipse.jface.text.

Don't get me wrong, I'm glad when ICU is finally removed as a dependency, but we should check whether that severely breaks existing users, assess the risk, and then decide whether it's acceptable.

> Happy to add a CJK/Thai sample to the test to document the behavior if that helps.

Sure, that would help. But I don't have one at hand, do you? If not, I can ask around whether somebody else can provide one.

vogella · 2026-05-31T13:45:09Z

I will use these:

- こんにちは世界 (Japanese): "Hello, world"
- foo 我是 bar (Chinese): 我是 = "I am"
- foo ไทย bar (Thai): ไทย = "Thai"
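
To document what the JDK iterator does with these samples, a small dump of the word boundaries can be useful. This is a rough sketch, not the test added in the PR: the class `WordBoundaryDump` and the method `wordBoundaries` are made-up names, and the positions printed for the CJK and Thai samples depend on the JDK version and default locale, so no expected output is given for them.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

final class WordBoundaryDump {

    /** Collects every boundary offset reported by the JDK word iterator (hypothetical helper). */
    static List<Integer> wordBoundaries(String text) {
        BreakIterator it = BreakIterator.getWordInstance();
        it.setText(text);
        List<Integer> result = new ArrayList<>();
        // first() returns the start of the text; next() returns DONE after the last boundary.
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            result.add(b);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] samples = { "こんにちは世界", "foo 我是 bar", "foo ไทย bar" };
        for (String s : samples) {
            System.out.println(s + " -> " + wordBoundaries(s));
        }
    }
}
```

Running the same loop against `com.ibm.icu.text.BreakIterator` on the old code base would show where the two implementations diverge for these strings.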

Replaces com.ibm.icu.text.BreakIterator with java.text.BreakIterator in DefaultTextDoubleClickStrategy and drops the matching Import-Package: com.ibm.icu.text from the bundle manifest. The JDK BreakIterator exposes the same API (getWordInstance, preceding, following, isBoundary, setText, DONE) and the existing POSIX-locale workaround for '.' not being treated as a word boundary continues to work with java.text. Removes the last com.ibm.icu reference from org.eclipse.jface.text.

vogella · 2026-05-31T14:06:06Z

The scope here is narrow: this only uses getWordInstance() for double-click word selection, and the new identifier fast-path means the locale-aware iterator is only a fallback for non-ASCII text.

I added CJK/Thai tests to make it concrete.
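
The thread does not show the fast-path itself, so the following is only a guess at its shape: if the clicked character is an ASCII identifier character, expand over its neighbours directly; otherwise fall back to the locale-aware word iterator. `isIdentifierPart` is copied from the helper quoted in the review; `DoubleClickFastPathSketch` and `selectWord` are hypothetical names, and working on a plain `String` instead of an `IDocument` is an assumption.

```java
import java.text.BreakIterator;

final class DoubleClickFastPathSketch {

    // Same predicate as the helper quoted in the review: ASCII letters, digits and '_'.
    static boolean isIdentifierPart(char c) {
        return c == '_' || (c < 128 && (c >= '0' && c <= '9' || c >= 'A' && c <= 'Z' || c >= 'a' && c <= 'z'));
    }

    /** Returns {start, end} for a click at offset, or null if the offset is out of range. */
    static int[] selectWord(String line, int offset) {
        if (offset < 0 || offset >= line.length()) {
            return null;
        }
        if (isIdentifierPart(line.charAt(offset))) {
            // Fast path: grow the selection over adjacent ASCII identifier characters.
            int start = offset;
            int end = offset + 1;
            while (start > 0 && isIdentifierPart(line.charAt(start - 1))) {
                start--;
            }
            while (end < line.length() && isIdentifierPart(line.charAt(end))) {
                end++;
            }
            return new int[] { start, end };
        }
        // Fallback: let the locale-aware JDK word iterator decide.
        BreakIterator it = BreakIterator.getWordInstance();
        it.setText(line);
        int start = it.isBoundary(offset) ? offset : it.preceding(offset);
        int end = it.following(offset);
        if (start == BreakIterator.DONE) {
            start = 0;
        }
        if (end == BreakIterator.DONE) {
            end = line.length();
        }
        return new int[] { start, end };
    }

    public static void main(String[] args) {
        String line = "foo.bar_baz qux";
        int[] r = selectWord(line, 5);
        System.out.println(line.substring(r[0], r[1])); // prints "bar_baz"
    }
}
```

With this shape, `.` ends an identifier without any locale workaround, because the iterator is never consulted for ASCII identifier characters.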

merks · 2026-05-31T14:28:39Z

It's nice to see the added test cases!

HannesWell

> The scope here is narrow: this only uses getWordInstance() for double-click word selection, and the new identifier fast-path means the locale-aware iterator is only a fallback for non-ASCII text.
>
> I added CJK/Thai tests to make it concrete.

Thanks for the update. Then this is probably fine, or at least safe enough from my POV.

HannesWell · 2026-05-31T15:39:49Z

+	private static boolean isIdentifierPart(char c) {
+		return c == '_' || (c < 128 && (c >= '0' && c <= '9' || c >= 'A' && c <= 'Z' || c >= 'a' && c <= 'z'));
+	}


Is this related to Character.isJavaIdentifierPart(char)? Or would it make sense to reuse it, although the two are not identical at the moment?
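
To make the "not identical" part concrete, here is a small comparison sketch. `isIdentifierPart` is copied from the snippet above; the class `IdentifierPartComparison` and the method `disagrees` are hypothetical names used only for illustration. `Character.isJavaIdentifierPart` also accepts currency symbols such as `$`, non-ASCII letters and digits, and ignorable control characters, which the ASCII-only helper rejects.

```java
final class IdentifierPartComparison {

    // Helper from the diff under review: ASCII letters, digits and '_' only.
    static boolean isIdentifierPart(char c) {
        return c == '_' || (c < 128 && (c >= '0' && c <= '9' || c >= 'A' && c <= 'Z' || c >= 'a' && c <= 'z'));
    }

    /** True if the helper and Character.isJavaIdentifierPart classify c differently. */
    static boolean disagrees(char c) {
        return isIdentifierPart(c) != Character.isJavaIdentifierPart(c);
    }

    public static void main(String[] args) {
        // '\u00e9' is e with acute accent, '\u4e16' is a CJK ideograph.
        char[] probes = { 'a', '7', '_', '$', '.', '\u00e9', '\u4e16' };
        for (char c : probes) {
            System.out.printf("U+%04X helper=%b jdk=%b%n",
                    (int) c, isIdentifierPart(c), Character.isJavaIdentifierPart(c));
        }
    }
}
```

Reusing the JDK method would therefore route `$` and non-ASCII letters through the fast path instead of the `BreakIterator` fallback, which changes behavior for exactly the text the CJK/Thai tests cover.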

vogella force-pushed the icu-jfacetext branch 2 times, most recently from e9be22e to 40485b6 Compare May 8, 2026 16:30

vogella marked this pull request as ready for review May 31, 2026 11:57

vogella force-pushed the icu-jfacetext branch from 40485b6 to 3855f14 Compare May 31, 2026 11:57

vogella and others added 2 commits May 31, 2026 15:48

Version bump(s) for 4.41 stream

58af83a

vogella force-pushed the icu-jfacetext branch from 58b340e to 58af83a Compare May 31, 2026 13:48

HannesWell reviewed May 31, 2026


4 participants
