OPENNLP-1837: Add BertTokenizer with BERT basic tokenization#1073
Conversation
WordpieceTokenizer performs only the wordpiece stage, so uncased models map every capitalized or accented word to the unknown token. The new BertTokenizer adds the missing normalization stage: control character cleanup, whitespace normalization, CJK isolation, optional lower casing with accent stripping, and per-character punctuation splitting. Also fixes three WordpieceTokenizer defects: punctuation runs were split as one token, partially matched words emitted prefix pieces instead of a single unknown token, and tokenizePos returned null.
There was a problem hiding this comment.
Pull request overview
This PR introduces a full BERT-compatible tokenization pipeline to OpenNLP by adding a new BertTokenizer (basic tokenization/normalization + wordpiece) and aligning WordpieceTokenizer behavior more closely with the reference BERT implementation.
Changes:
- Added
opennlp.tools.tokenize.BertTokenizerimplementing BERT basic tokenization (whitespace/control cleanup, CJK isolation, optional lowercasing + accent stripping, punctuation isolation) followed by wordpiece tokenization. - Updated
WordpieceTokenizerto split punctuation runs into single-character tokens, replace partially matchable words with a single[UNK], and maketokenizePosexplicitly unsupported. - Added JUnit tests covering the new BERT pipeline and the corrected wordpiece behaviors.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/tokenize/WordpieceTokenizerTest.java | Adds tests for punctuation-run splitting, partial-match handling, and tokenizePos unsupported behavior. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/tokenize/BertTokenizerTest.java | New test suite validating BERT basic tokenization + wordpiece output against reference expectations. |
| opennlp-api/src/main/java/opennlp/tools/tokenize/WordpieceTokenizer.java | Adjusts punctuation handling, unknown-token behavior for partial matches, adds constructor overload, and throws on tokenizePos. |
| opennlp-api/src/main/java/opennlp/tools/tokenize/BertTokenizer.java | New tokenizer implementing the full BERT tokenization pipeline with optional uncased normalization. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
rzo1
left a comment
There was a problem hiding this comment.
Nice work — this is a well-motivated and faithful port of the reference BasicTokenizer. The codepoint-correct handling, the ß-vs-ü accent test, and the byte-identical validation against HuggingFace all give me confidence in the normalization logic. A few things I'd like to resolve before merge:
1. Breaking changes to WordpieceTokenizer need to be called out. WordpieceTokenizer is public API and opennlp-dl's AbstractDL.createTokenizer builds it directly, so three behaviors change for existing callers regardless of BertTokenizer:
- punctuation runs (
"...") now split into individual tokens (and Unicode punctuation now splits, where the old\p{Punct}+was ASCII-only); - partially-matched words now collapse to a single
[UNK]instead of emitting the matched prefix pieces; tokenizePosnow throws instead of returningnull.
All three are correct and match the reference, but they'll change real embedding output for current opennlp-dl users. Can we document these in the changelog and confirm this lands in a minor/major (not a patch) release?
2. Dependency direction. WordpieceTokenizer.tokenize now calls BertTokenizer.isolatePunctuation, so the lower-level stage depends on the higher-level class. Works fine (same package), but it's a bit surprising. Would you consider hoisting the shared punctuation/whitespace/control predicates into a small package-private helper both classes use? Optional.
3. isControl is narrower than the reference. The reference _is_control treats all C* categories as control (cat.startswith("C")), but here we only check CONTROL (Cc) and FORMAT (Cf) — so private-use (Co) and unassigned (Cn) codepoints aren't stripped where the reference would strip them. Surrogates are moot here, but Co/Cn are a real (if rare) divergence from "byte-identical." Could we either match the reference or note the intentional difference in a comment?
Minor: the lowerCase flag couples lowercasing and accent stripping — that matches the reference default, just flagging that a cased+accent-stripping model can't be expressed. And it'd be great to file/link the follow-up JIRA for switching the *DL components to BertTokenizer, since uncased models stay broken at the default path until then.
Approving in principle pending (1) and a decision on (3).
LGTM Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Treat all C* categories as control characters matching the reference implementation, hoist shared character predicates into a package-private BertNormalization helper, validate constructor arguments, and document the WordpieceTokenizer behavior changes.
Treat all C* categories as control characters matching the reference implementation, hoist shared character predicates into a package-private BertNormalization helper, validate constructor arguments, and document the WordpieceTokenizer behavior changes.
|
Thanks for the thorough review — all points addressed in the latest commit: 1. Breaking changes: Agreed on all three. I've added an explicit "As of OpenNLP 3.0.0" paragraph to the 2. Dependency direction: Done - the shared character predicates ( 3. Minor (lowercase/accent coupling): Also addressed the two Copilot notes: Follow-up JIRA for switching the |
|
https://issues.apache.org/jira/projects/OPENNLP/issues/OPENNLP-1838 was created to handle
|
|
@rzo1 LMK if I addressed everything when you have time (no rush) I'll land the other 2 PRs right now. All of these are going to help me get the grpc service end-to-end |
What
opennlp.tools.tokenize.BertTokenizer: the full BERT tokenization pipeline (basic tokenization / normalization, then wordpiece). Lower casing + accent stripping on by default for uncased models, cased models opt out via constructor flag.WordpieceTokenizer: per-character Unicode-aware punctuation splitting, whole-word unknown-token replacement for partially matched words (matching the reference implementation), andtokenizePosnow throwsUnsupportedOperationExceptioninstead of returningnull.Why
See OPENNLP-1837. Without basic tokenization, uncased models (including both models recommended by the opennlp-dl README) receive
[UNK]for every capitalized or accented word. Measured embedding fidelity vs. the Python reference was cosine 0.09-0.57; with this fix it exceeds 0.999999.Recommendation
The opennlp-dl components (
SentenceVectorsDL,DocumentCategorizerDL,NameFinderDL) should adoptBertTokenizeras their default tokenization in a follow-up, so uncased models work correctly out of the box.Validation
All expected token sequences in the new tests were generated with the HuggingFace
tokenizersreference implementation.BertTokenizerwas additionally verified byte-identical to the reference on the real bert-base-uncased vocabulary across a corpus covering capitalization, diacritics, punctuation runs, CJK, URLs and mixed whitespace (12/12 sentences).