OPENNLP-1837: Add BertTokenizer with BERT basic tokenization by krickert · Pull Request #1073 · apache/opennlp

krickert · 2026-06-11T04:07:26Z

What

New opennlp.tools.tokenize.BertTokenizer: the full BERT tokenization pipeline (basic tokenization / normalization, then wordpiece). Lowercasing and accent stripping are on by default for uncased models; cased models opt out via a constructor flag.
Direct fixes to WordpieceTokenizer: per-character Unicode-aware punctuation splitting, whole-word unknown-token replacement for partially matched words (matching the reference implementation), and tokenizePos now throws UnsupportedOperationException instead of returning null.

Why

See OPENNLP-1837. Without basic tokenization, uncased models (including both models recommended by the opennlp-dl README) receive [UNK] for every capitalized or accented word. Measured embedding fidelity vs. the Python reference was cosine 0.09-0.57; with this fix it exceeds 0.999999.
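For readers unfamiliar with the fidelity metric quoted above, here is a minimal sketch of cosine similarity between two embedding vectors. It is not the measurement harness used for this PR; the class and method names are hypothetical.

```java
// Hypothetical helper, not part of OpenNLP: compares two embedding vectors.
final class CosineSketch {

  /** Returns the cosine of the angle between a and b; 1.0 means same direction. */
  static double cosine(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vectors must have the same length");
    }
    double dot = 0.0;
    double normA = 0.0;
    double normB = 0.0;
    for (int i = 0; i < a.length; i++) {
      dot += (double) a[i] * b[i];
      normA += (double) a[i] * a[i];
      normB += (double) b[i] * b[i];
    }
    // An all-zero vector yields NaN here; callers should handle that case themselves.
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }
}
```

A value near 1.0 means the Java and Python embeddings point in nearly the same direction, which is what identical token sequences should produce.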

Recommendation

The opennlp-dl components (SentenceVectorsDL, DocumentCategorizerDL, NameFinderDL) should adopt BertTokenizer as their default tokenization in a follow-up, so uncased models work correctly out of the box.

Validation

All expected token sequences in the new tests were generated with the HuggingFace tokenizers reference implementation. BertTokenizer was additionally verified byte-identical to the reference on the real bert-base-uncased vocabulary across a corpus covering capitalization, diacritics, punctuation runs, CJK, URLs and mixed whitespace (12/12 sentences).

WordpieceTokenizer performs only the wordpiece stage, so uncased models map every capitalized or accented word to the unknown token. The new BertTokenizer adds the missing normalization stage: control character cleanup, whitespace normalization, CJK isolation, optional lower casing with accent stripping, and per-character punctuation splitting. Also fixes three WordpieceTokenizer defects: punctuation runs were split as one token, partially matched words emitted prefix pieces instead of a single unknown token, and tokenizePos returned null.
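As an illustration of the lowercasing and accent-stripping step described above, here is a minimal JDK-only sketch. It is not the code from this PR: the class and method names are hypothetical, and it assumes the usual approach of NFD decomposition followed by dropping nonspacing marks (Unicode category Mn).

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical sketch, not the BertTokenizer implementation.
final class AccentStripSketch {

  /** Lowercases, decomposes to NFD, then drops combining marks (category Mn). */
  static String lowerAndStripAccents(String text) {
    String lowered = text.toLowerCase(Locale.ROOT);
    String decomposed = Normalizer.normalize(lowered, Normalizer.Form.NFD);
    StringBuilder out = new StringBuilder(decomposed.length());
    decomposed.codePoints()
        .filter(cp -> Character.getType(cp) != Character.NON_SPACING_MARK)
        .forEach(out::appendCodePoint);
    return out.toString();
  }
}
```

In this sketch "Über" becomes "uber", while the ß in "Straße" is kept because it has no canonical decomposition into a base letter plus a combining mark.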

Copilot

Pull request overview

This PR introduces a full BERT-compatible tokenization pipeline to OpenNLP by adding a new BertTokenizer (basic tokenization/normalization + wordpiece) and aligning WordpieceTokenizer behavior more closely with the reference BERT implementation.

Changes:

Added opennlp.tools.tokenize.BertTokenizer implementing BERT basic tokenization (whitespace/control cleanup, CJK isolation, optional lowercasing + accent stripping, punctuation isolation) followed by wordpiece tokenization.
Updated WordpieceTokenizer to split punctuation runs into single-character tokens, replace partially matchable words with a single [UNK], and make tokenizePos explicitly unsupported.
Added JUnit tests covering the new BERT pipeline and the corrected wordpiece behaviors.
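The whole-word [UNK] behavior can be sketched as greedy longest-match-first wordpiece over a single word. This is a simplified illustration, not the WordpieceTokenizer code: the class name, method signature, and the "##" continuation prefix handling are assumptions, and it indexes by char rather than by code point.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of greedy wordpiece with whole-word unknown replacement.
final class WordpieceSketch {

  static List<String> wordpiece(String word, Set<String> vocab, String unk) {
    List<String> pieces = new ArrayList<>();
    int start = 0;
    while (start < word.length()) {
      int end = word.length();
      String match = null;
      // Try the longest remaining substring first, shrinking from the right.
      while (start < end) {
        String candidate = (start > 0 ? "##" : "") + word.substring(start, end);
        if (vocab.contains(candidate)) {
          match = candidate;
          break;
        }
        end--;
      }
      if (match == null) {
        // Any unmatched remainder discards the prefix pieces found so far.
        return List.of(unk);
      }
      pieces.add(match);
      start = end;
    }
    return pieces;
  }
}
```

With a vocabulary of "un", "##aff" and "##able", the word "unaffable" yields three pieces, while "unaffably" yields a single unknown token rather than "un", "##aff" followed by an unknown piece.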

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/tokenize/WordpieceTokenizerTest.java | Adds tests for punctuation-run splitting, partial-match handling, and `tokenizePos` unsupported behavior. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/tokenize/BertTokenizerTest.java | New test suite validating BERT basic tokenization + wordpiece output against reference expectations. |
| opennlp-api/src/main/java/opennlp/tools/tokenize/WordpieceTokenizer.java | Adjusts punctuation handling, unknown-token behavior for partial matches, adds constructor overload, and throws on `tokenizePos`. |
| opennlp-api/src/main/java/opennlp/tools/tokenize/BertTokenizer.java | New tokenizer implementing the full BERT tokenization pipeline with optional uncased normalization. |

rzo1

Nice work — this is a well-motivated and faithful port of the reference BasicTokenizer. The codepoint-correct handling, the ß-vs-ü accent test, and the byte-identical validation against HuggingFace all give me confidence in the normalization logic. A few things I'd like to resolve before merge:

1. Breaking changes to WordpieceTokenizer need to be called out. WordpieceTokenizer is public API and opennlp-dl's AbstractDL.createTokenizer builds it directly, so three behaviors change for existing callers regardless of BertTokenizer:

punctuation runs ("...") now split into individual tokens (and Unicode punctuation now splits, where the old \p{Punct}+ was ASCII-only);
partially-matched words now collapse to a single [UNK] instead of emitting the matched prefix pieces;
tokenizePos now throws instead of returning null.

All three are correct and match the reference, but they'll change real embedding output for current opennlp-dl users. Can we document these in the changelog and confirm this lands in a minor/major (not a patch) release?
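To make the first behavior change concrete, here is a minimal sketch of per-character, Unicode-aware punctuation splitting. It is not the code from this PR: the class and method names are hypothetical, and the predicate assumes the reference rule of treating the ASCII symbol ranges plus every Unicode P* category as punctuation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the OpenNLP implementation.
final class PunctuationSplitSketch {

  static boolean isPunctuation(int cp) {
    // ASCII symbol ranges count as punctuation even where Unicode says otherwise (e.g. '$').
    if ((cp >= 33 && cp <= 47) || (cp >= 58 && cp <= 64)
        || (cp >= 91 && cp <= 96) || (cp >= 123 && cp <= 126)) {
      return true;
    }
    switch (Character.getType(cp)) {
      case Character.CONNECTOR_PUNCTUATION:
      case Character.DASH_PUNCTUATION:
      case Character.START_PUNCTUATION:
      case Character.END_PUNCTUATION:
      case Character.INITIAL_QUOTE_PUNCTUATION:
      case Character.FINAL_QUOTE_PUNCTUATION:
      case Character.OTHER_PUNCTUATION:
        return true;
      default:
        return false;
    }
  }

  /** Emits every punctuation code point as its own token. */
  static List<String> split(String token) {
    List<String> out = new ArrayList<>();
    StringBuilder word = new StringBuilder();
    for (int i = 0; i < token.length(); ) {
      int cp = token.codePointAt(i);
      i += Character.charCount(cp);
      if (isPunctuation(cp)) {
        if (word.length() > 0) {
          out.add(word.toString());
          word.setLength(0);
        }
        out.add(new String(Character.toChars(cp)));
      } else {
        word.appendCodePoint(cp);
      }
    }
    if (word.length() > 0) {
      out.add(word.toString());
    }
    return out;
  }
}
```

Under these assumptions "wait..." splits into "wait" followed by three separate "." tokens, and a non-ASCII dash such as U+2014 is also isolated, which a default Java `\p{Punct}+` pattern would not match.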

2. Dependency direction. WordpieceTokenizer.tokenize now calls BertTokenizer.isolatePunctuation, so the lower-level stage depends on the higher-level class. Works fine (same package), but it's a bit surprising. Would you consider hoisting the shared punctuation/whitespace/control predicates into a small package-private helper both classes use? Optional.

3. isControl is narrower than the reference. The reference _is_control treats all C* categories as control (cat.startswith("C")), but here we only check CONTROL (Cc) and FORMAT (Cf) — so private-use (Co) and unassigned (Cn) codepoints aren't stripped where the reference would strip them. Surrogates are moot here, but Co/Cn are a real (if rare) divergence from "byte-identical." Could we either match the reference or note the intentional difference in a comment?
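A minimal sketch of the wider check, assuming the rule described in this comment (every C* category counts as control, with tab, newline and carriage return excluded because the reference handles them as whitespace). The class name is hypothetical, and results for unassigned code points depend on the Unicode version of the running JDK.

```java
// Hypothetical sketch, not the code from this PR.
final class ControlCheckSketch {

  static boolean isControl(int cp) {
    // Assumption: the reference treats these three as whitespace, not control.
    if (cp == '\t' || cp == '\n' || cp == '\r') {
      return false;
    }
    switch (Character.getType(cp)) {
      case Character.CONTROL:      // Cc
      case Character.FORMAT:       // Cf
      case Character.SURROGATE:    // Cs
      case Character.PRIVATE_USE:  // Co
      case Character.UNASSIGNED:   // Cn, which includes noncharacters
        return true;
      default:
        return false;
    }
  }
}
```

A check limited to Cc and Cf would return false for U+E000 (private use) and U+FDD0 (noncharacter); this version returns true for both.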

Minor: the lowerCase flag couples lowercasing and accent stripping — that matches the reference default, just flagging that a cased+accent-stripping model can't be expressed. And it'd be great to file/link the follow-up JIRA for switching the *DL components to BertTokenizer, since uncased models stay broken at the default path until then.

Approving in principle pending (1) and a decision on (3).

LGTM Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Treat all C* categories as control characters matching the reference implementation, hoist shared character predicates into a package-private BertNormalization helper, validate constructor arguments, and document the WordpieceTokenizer behavior changes.

krickert · 2026-06-11T12:24:39Z

Thanks for the thorough review — all points addressed in the latest commit:

1. Breaking changes: Agreed on all three. I've added an explicit "As of OpenNLP 3.0.0" paragraph to the WordpieceTokenizer class Javadoc listing each behavior change (punctuation-run splitting, whole-word [UNK] replacement, tokenizePos throwing). Since this branch targets 3.0.0-SNAPSHOT they land in a major release; happy to also tag the JIRA for release notes if that's the convention here.

2. Dependency direction: Done - the shared character predicates (isControl, isWhitespace, isPunctuation, isCjk) and isolatePunctuation now live in a package-private BertNormalization helper used by both classes, so WordpieceTokenizer no longer depends on BertTokenizer.

3. isControl: Good catch. I verified against HuggingFace tokenizers first (it strips Co/Cn the same way the Python reference does) and widened the check to all C* categories (Cc, Cf, Cs, Co, Cn), with a new test covering U+E000 (private use) and U+FDD0 (noncharacter). The full-vocabulary parity harness now passes 13/13, including a sentence with embedded C* characters.

Minor (lowercase/accent coupling): lowerCase intentionally mirrors the reference default coupling (strip_accents follows do_lower_case unless overridden); this is now noted in the class Javadoc. A decoupled stripAccents flag is easy to add later without breaking the constructor surface if a cased+accent-stripped model ever shows up.

Also addressed the two Copilot notes: maxTokenLength is now validated (IllegalArgumentException on negative values) and the BertTokenizer constructor null-checks all three special tokens, both with tests.
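The kind of argument validation described above might look like the following sketch. The class, field names, and parameter order are hypothetical and do not reflect the actual BertTokenizer or WordpieceTokenizer constructors; the choice of [CLS], [SEP] and [UNK] as the three special tokens is also an assumption.

```java
import java.util.Objects;

// Hypothetical holder class used only to illustrate fail-fast validation.
final class TokenizerConfigSketch {

  final String clsToken;
  final String sepToken;
  final String unkToken;
  final int maxTokenLength;

  TokenizerConfigSketch(String clsToken, String sepToken, String unkToken, int maxTokenLength) {
    // Reject null special tokens at construction time instead of failing later in tokenize().
    this.clsToken = Objects.requireNonNull(clsToken, "clsToken must not be null");
    this.sepToken = Objects.requireNonNull(sepToken, "sepToken must not be null");
    this.unkToken = Objects.requireNonNull(unkToken, "unkToken must not be null");
    if (maxTokenLength < 0) {
      throw new IllegalArgumentException("maxTokenLength must not be negative: " + maxTokenLength);
    }
    this.maxTokenLength = maxTokenLength;
  }
}
```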

Follow-up JIRA for switching the opennlp-dl components to BertTokenizer - I'll create the JIRA ticket now.

krickert · 2026-06-11T12:41:14Z

https://issues.apache.org/jira/projects/OPENNLP/issues/OPENNLP-1838 was created to handle:

SentenceVectorsDL
DocumentCategorizerDL
NameFinderDL

krickert · 2026-06-12T03:28:51Z

@rzo1 LMK if I addressed everything when you have time (no rush). I'll land the other 2 PRs right now. All of these are going to help me get the gRPC service working end-to-end.

mawiesne

Thx @krickert for the PR. I'm happy with this PR "as is".

krickert requested review from Copilot and rzo1 June 11, 2026 10:55

Copilot started reviewing on behalf of krickert June 11, 2026 10:56

krickert requested review from atarora, jzonthemtn and mawiesne June 11, 2026 10:56

Copilot AI reviewed Jun 11, 2026

Comment thread opennlp-api/src/main/java/opennlp/tools/tokenize/WordpieceTokenizer.java

Comment thread opennlp-api/src/main/java/opennlp/tools/tokenize/BertTokenizer.java

rzo1 reviewed Jun 11, 2026

krickert and others added 3 commits June 11, 2026 07:55

Potential fix for pull request finding

5db3c65

LGTM Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

krickert mentioned this pull request Jun 12, 2026

OPENNLP-1838: Adopt BertTokenizer in opennlp-dl components #1075

Merged

rzo1 approved these changes Jun 12, 2026

mawiesne assigned krickert Jun 12, 2026

mawiesne added the java Pull requests that update Java code label Jun 12, 2026

mawiesne approved these changes Jun 12, 2026

mawiesne merged commit e7e1189 into apache:main Jun 12, 2026
9 checks passed

mawiesne merged 4 commits into apache:main from ai-pipestream:OPENNLP-1837

4 participants
