
OPENNLP-1837: Add BertTokenizer with BERT basic tokenization #1073

Merged
mawiesne merged 4 commits into apache:main from ai-pipestream:OPENNLP-1837 on Jun 12, 2026

Conversation

@krickert

Contributor

What

  • New opennlp.tools.tokenize.BertTokenizer: the full BERT tokenization pipeline (basic tokenization / normalization, then wordpiece). Lower casing + accent stripping on by default for uncased models, cased models opt out via constructor flag.
  • Direct fixes to WordpieceTokenizer: per-character Unicode-aware punctuation splitting, whole-word unknown-token replacement for partially matched words (matching the reference implementation), and tokenizePos now throws UnsupportedOperationException instead of returning null.
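The default lower casing plus accent stripping can be pictured with a small standalone sketch. It is not the PR's code: the class and method names are hypothetical, and it assumes the usual reference behavior of lower casing, NFD decomposition, and dropping combining marks (category Mn).

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper for illustration only; not the BertTokenizer source.
final class UncasedNormalizationSketch {

  /** Lower-cases the text, decomposes it (NFD), and drops non-spacing marks (Mn). */
  static String lowerCaseAndStripAccents(String text) {
    String decomposed =
        Normalizer.normalize(text.toLowerCase(Locale.ROOT), Normalizer.Form.NFD);
    StringBuilder out = new StringBuilder(decomposed.length());
    decomposed.codePoints()
        .filter(cp -> Character.getType(cp) != Character.NON_SPACING_MARK)
        .forEach(out::appendCodePoint);
    return out.toString();
  }
}
```

Under these assumptions "Müller" becomes "muller", while "Straße" keeps its ß because ß has no canonical decomposition into a base letter plus a combining mark.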

Why

See OPENNLP-1837. Without basic tokenization, uncased models (including both models recommended by the opennlp-dl README) receive [UNK] for every capitalized or accented word. Measured embedding fidelity vs. the Python reference was cosine 0.09-0.57; with this fix it exceeds 0.999999.
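For readers unfamiliar with the fidelity metric, cosine similarity between two embedding vectors can be computed as in the sketch below. The class name is hypothetical and this is not the measurement harness used for the numbers above, only the standard formula.

```java
// Hypothetical helper for illustration; not the harness used in this PR.
final class CosineSketch {

  /** Returns dot(a, b) / (|a| * |b|); the result is NaN if either vector is all zeros. */
  static double cosine(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("Vectors must have the same length");
    }
    double dot = 0.0;
    double normA = 0.0;
    double normB = 0.0;
    for (int i = 0; i < a.length; i++) {
      dot += (double) a[i] * b[i];
      normA += (double) a[i] * a[i];
      normB += (double) b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }
}
```

A value near 1.0 means the Java and Python embeddings point in almost the same direction; values such as 0.09 indicate nearly unrelated vectors.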

Recommendation

The opennlp-dl components (SentenceVectorsDL, DocumentCategorizerDL, NameFinderDL) should adopt BertTokenizer as their default tokenization in a follow-up, so uncased models work correctly out of the box.

Validation

All expected token sequences in the new tests were generated with the HuggingFace tokenizers reference implementation. BertTokenizer was additionally verified byte-identical to the reference on the real bert-base-uncased vocabulary across a corpus covering capitalization, diacritics, punctuation runs, CJK, URLs and mixed whitespace (12/12 sentences).

WordpieceTokenizer performs only the wordpiece stage, so uncased models
map every capitalized or accented word to the unknown token. The new
BertTokenizer adds the missing normalization stage: control character
cleanup, whitespace normalization, CJK isolation, optional lower casing
with accent stripping, and per-character punctuation splitting.
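Of the stages listed above, CJK isolation is perhaps the least obvious. A minimal sketch follows, assuming the code point ranges used by the original Python BERT BasicTokenizer; the class and method names are hypothetical and the PR's own helper may differ.

```java
// Hypothetical sketch; ranges follow the original Python BERT BasicTokenizer.
final class CjkIsolationSketch {

  /** True for CJK ideograph blocks; Hiragana, Katakana and Hangul are not included. */
  static boolean isCjk(int cp) {
    return (cp >= 0x4E00 && cp <= 0x9FFF)
        || (cp >= 0x3400 && cp <= 0x4DBF)
        || (cp >= 0x20000 && cp <= 0x2A6DF)
        || (cp >= 0x2A700 && cp <= 0x2B73F)
        || (cp >= 0x2B740 && cp <= 0x2B81F)
        || (cp >= 0x2B820 && cp <= 0x2CEAF)
        || (cp >= 0xF900 && cp <= 0xFAFF)
        || (cp >= 0x2F800 && cp <= 0x2FA1F);
  }

  /** Surrounds every CJK code point with spaces so it becomes its own token. */
  static String isolateCjk(String text) {
    StringBuilder out = new StringBuilder(text.length() + 8);
    text.codePoints().forEach(cp -> {
      if (isCjk(cp)) {
        out.append(' ').appendCodePoint(cp).append(' ');
      } else {
        out.appendCodePoint(cp);
      }
    });
    return out.toString();
  }
}
```

Iterating over code points rather than `char` values keeps supplementary-plane ideographs intact; a later whitespace split then yields one token per ideograph.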

Also fixes three WordpieceTokenizer defects: punctuation runs were
split as one token, partially matched words emitted prefix pieces
instead of a single unknown token, and tokenizePos returned null.
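The partial-match fix is easiest to see in code. The sketch below is a simplified greedy longest-match wordpiece loop, not the WordpieceTokenizer source: the names are hypothetical, the vocabulary is a plain `Set`, and it indexes by `char` rather than by code point for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified sketch of greedy longest-match-first wordpiece.
final class WordpieceSketch {

  /** Splits one word into pieces, or returns a single unknown token if any part is unmatched. */
  static List<String> wordpiece(String word, Set<String> vocab, String unknownToken) {
    List<String> pieces = new ArrayList<>();
    int start = 0;
    while (start < word.length()) {
      int end = word.length();
      String match = null;
      // Try the longest remaining substring first, shrinking from the right.
      while (start < end) {
        String candidate = (start > 0 ? "##" : "") + word.substring(start, end);
        if (vocab.contains(candidate)) {
          match = candidate;
          break;
        }
        end--;
      }
      if (match == null) {
        // Whole-word replacement: discard any prefix pieces already collected.
        return List.of(unknownToken);
      }
      pieces.add(match);
      start = end;
    }
    return pieces;
  }
}
```

With a vocabulary of `un`, `##aff`, `##able`, the word "unaffable" yields three pieces, while "unaffix" yields a single `[UNK]` rather than `un`, `##aff` followed by an unknown remainder.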

Copilot AI left a comment

Contributor


Pull request overview

This PR introduces a full BERT-compatible tokenization pipeline to OpenNLP by adding a new BertTokenizer (basic tokenization/normalization + wordpiece) and aligning WordpieceTokenizer behavior more closely with the reference BERT implementation.

Changes:

  • Added opennlp.tools.tokenize.BertTokenizer implementing BERT basic tokenization (whitespace/control cleanup, CJK isolation, optional lowercasing + accent stripping, punctuation isolation) followed by wordpiece tokenization.
  • Updated WordpieceTokenizer to split punctuation runs into single-character tokens, replace partially matchable words with a single [UNK], and make tokenizePos explicitly unsupported.
  • Added JUnit tests covering the new BERT pipeline and the corrected wordpiece behaviors.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/tokenize/WordpieceTokenizerTest.java | Adds tests for punctuation-run splitting, partial-match handling, and tokenizePos unsupported behavior. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/tokenize/BertTokenizerTest.java | New test suite validating BERT basic tokenization + wordpiece output against reference expectations. |
| opennlp-api/src/main/java/opennlp/tools/tokenize/WordpieceTokenizer.java | Adjusts punctuation handling, unknown-token behavior for partial matches, adds constructor overload, and throws on tokenizePos. |
| opennlp-api/src/main/java/opennlp/tools/tokenize/BertTokenizer.java | New tokenizer implementing the full BERT tokenization pipeline with optional uncased normalization. |


Comment thread opennlp-api/src/main/java/opennlp/tools/tokenize/BertTokenizer.java

@rzo1 rzo1 left a comment

Contributor


Nice work — this is a well-motivated and faithful port of the reference BasicTokenizer. The codepoint-correct handling, the ß-vs-ü accent test, and the byte-identical validation against HuggingFace all give me confidence in the normalization logic. A few things I'd like to resolve before merge:

1. Breaking changes to WordpieceTokenizer need to be called out. WordpieceTokenizer is public API and opennlp-dl's AbstractDL.createTokenizer builds it directly, so three behaviors change for existing callers regardless of BertTokenizer:

  • punctuation runs ("...") now split into individual tokens (and Unicode punctuation now splits, where the old \p{Punct}+ was ASCII-only);
  • partially-matched words now collapse to a single [UNK] instead of emitting the matched prefix pieces;
  • tokenizePos now throws instead of returning null.

All three are correct and match the reference, but they'll change real embedding output for current opennlp-dl users. Can we document these in the changelog and confirm this lands in a minor/major (not a patch) release?
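To make the first behavior change concrete, here is a sketch of per-character, Unicode-aware punctuation splitting. It assumes the reference rule (the four ASCII symbol ranges plus every Unicode P* category); the class and method names are hypothetical and are not the PR's helper.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the reference punctuation rule; not the PR's code.
final class PunctuationSplitSketch {

  /** ASCII symbol ranges count as punctuation, as does any Unicode P* category. */
  static boolean isPunctuation(int cp) {
    if ((cp >= 33 && cp <= 47) || (cp >= 58 && cp <= 64)
        || (cp >= 91 && cp <= 96) || (cp >= 123 && cp <= 126)) {
      return true;
    }
    switch (Character.getType(cp)) {
      case Character.CONNECTOR_PUNCTUATION:
      case Character.DASH_PUNCTUATION:
      case Character.START_PUNCTUATION:
      case Character.END_PUNCTUATION:
      case Character.INITIAL_QUOTE_PUNCTUATION:
      case Character.FINAL_QUOTE_PUNCTUATION:
      case Character.OTHER_PUNCTUATION:
        return true;
      default:
        return false;
    }
  }

  /** Emits every punctuation code point as its own token; other runs stay together. */
  static List<String> splitOnPunctuation(String token) {
    List<String> out = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    token.codePoints().forEach(cp -> {
      if (isPunctuation(cp)) {
        if (current.length() > 0) {
          out.add(current.toString());
          current.setLength(0);
        }
        out.add(new String(Character.toChars(cp)));
      } else {
        current.appendCodePoint(cp);
      }
    });
    if (current.length() > 0) {
      out.add(current.toString());
    }
    return out;
  }
}
```

Under this rule "wait..." becomes `wait`, `.`, `.`, `.`, and a non-ASCII mark such as ¿ (U+00BF) is split off as well, which an ASCII-only `\p{Punct}+` pattern would not do.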

2. Dependency direction. WordpieceTokenizer.tokenize now calls BertTokenizer.isolatePunctuation, so the lower-level stage depends on the higher-level class. Works fine (same package), but it's a bit surprising. Would you consider hoisting the shared punctuation/whitespace/control predicates into a small package-private helper both classes use? Optional.

3. isControl is narrower than the reference. The reference _is_control treats all C* categories as control (cat.startswith("C")), but here we only check CONTROL (Cc) and FORMAT (Cf) — so private-use (Co) and unassigned (Cn) codepoints aren't stripped where the reference would strip them. Surrogates are moot here, but Co/Cn are a real (if rare) divergence from "byte-identical." Could we either match the reference or note the intentional difference in a comment?

Minor: the lowerCase flag couples lowercasing and accent stripping — that matches the reference default, just flagging that a cased+accent-stripping model can't be expressed. And it'd be great to file/link the follow-up JIRA for switching the *DL components to BertTokenizer, since uncased models stay broken at the default path until then.

Approving in principle pending (1) and a decision on (3).

krickert and others added 3 commits June 11, 2026 07:55
LGTM

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Treat all C* categories as control characters matching the reference
implementation, hoist shared character predicates into a package-private
BertNormalization helper, validate constructor arguments, and document
the WordpieceTokenizer behavior changes.
@krickert

Contributor Author

Thanks for the thorough review — all points addressed in the latest commit:

1. Breaking changes: Agreed on all three. I've added an explicit "As of OpenNLP 3.0.0" paragraph to the WordpieceTokenizer class Javadoc listing each behavior change (punctuation-run splitting, whole-word [UNK] replacement, tokenizePos throwing). Since this branch targets 3.0.0-SNAPSHOT they land in a major release; happy to also tag the JIRA for release notes if that's the convention here.

2. Dependency direction: Done - the shared character predicates (isControl, isWhitespace, isPunctuation, isCjk) and isolatePunctuation now live in a package-private BertNormalization helper used by both classes, so WordpieceTokenizer no longer depends on BertTokenizer.

3. isControl: Good catch. I verified against HuggingFace tokenizers first (it strips Co/Cn the same way the Python reference does) and widened the check to all C* categories (Cc, Cf, Cs, Co, Cn), with a new test covering U+E000 (private use) and U+FDD0 (noncharacter). The full-vocabulary parity harness now passes 13/13, including a sentence with embedded C* characters.
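A widened check along these lines might look like the sketch below. It is an illustration under the assumption that tab, newline and carriage return are treated as whitespace (as in the Python reference), and the class name is hypothetical. Which code points count as unassigned depends on the Unicode version bundled with the JDK.

```java
// Hypothetical sketch of a C*-wide control check; not the BertNormalization source.
final class ControlCheckSketch {

  /** True for any code point in a Unicode C* category, except tab, LF and CR. */
  static boolean isControl(int cp) {
    if (cp == '\t' || cp == '\n' || cp == '\r') {
      return false; // handled as whitespace instead
    }
    switch (Character.getType(cp)) {
      case Character.CONTROL:      // Cc
      case Character.FORMAT:       // Cf
      case Character.SURROGATE:    // Cs
      case Character.PRIVATE_USE:  // Co
      case Character.UNASSIGNED:   // Cn
        return true;
      default:
        return false;
    }
  }
}
```

With this check, U+E000 (private use) and U+FDD0 (noncharacter, category Cn) are both reported as control characters, matching the two code points named in the new test.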

Minor (lowercase/accent coupling): lowerCase intentionally mirrors the reference default coupling (strip_accents follows do_lower_case unless overridden); this is now noted in the class Javadoc. A decoupled stripAccents flag is easy to add later without breaking the constructor surface if a cased+accent-stripped model ever shows up.

Also addressed the two Copilot notes: maxTokenLength is now validated (IllegalArgumentException on negative values) and the BertTokenizer constructor null-checks all three special tokens, both with tests.
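The kind of argument validation described here can be sketched as follows. The class, field names and the choice of [CLS], [SEP] and [UNK] as the three special tokens are assumptions for illustration; the actual constructor signatures are not shown in this conversation.

```java
import java.util.Objects;

// Hypothetical holder class; the real constructors may be shaped differently.
final class TokenizerArgsSketch {

  final String classificationToken;
  final String separatorToken;
  final String unknownToken;
  final int maxTokenLength;

  TokenizerArgsSketch(String classificationToken, String separatorToken,
                      String unknownToken, int maxTokenLength) {
    // Fail fast with a clear message instead of a late NullPointerException.
    this.classificationToken =
        Objects.requireNonNull(classificationToken, "classificationToken must not be null");
    this.separatorToken =
        Objects.requireNonNull(separatorToken, "separatorToken must not be null");
    this.unknownToken =
        Objects.requireNonNull(unknownToken, "unknownToken must not be null");
    if (maxTokenLength < 0) {
      throw new IllegalArgumentException(
          "maxTokenLength must not be negative: " + maxTokenLength);
    }
    this.maxTokenLength = maxTokenLength;
  }
}
```

Validating in the constructor means a misconfigured tokenizer is rejected when it is built rather than on the first call to tokenize.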

Follow-up JIRA for switching the opennlp-dl components to BertTokenizer - I'll create the JIRA ticket now.

@krickert

Contributor Author

https://issues.apache.org/jira/projects/OPENNLP/issues/OPENNLP-1838 was created to handle the switch to BertTokenizer for:

  • SentenceVectorsDL
  • DocumentCategorizerDL
  • NameFinderDL

@krickert

Contributor Author

@rzo1 LMK if I addressed everything when you have time (no rush). I'll land the other 2 PRs right now. All of these are going to help me get the gRPC service working end-to-end.

@mawiesne mawiesne added the java label (Pull requests that update Java code) on Jun 12, 2026

@mawiesne mawiesne left a comment

Contributor


Thx @krickert for the PR. I'm happy with this PR "as is".

@mawiesne mawiesne merged commit e7e1189 into apache:main Jun 12, 2026
9 checks passed

4 participants