OPENNLP-1836: Fix input encoding in SentenceVectorsDL by krickert · Pull Request #1072 · apache/opennlp

krickert · 2026-06-10T11:59:31Z

See https://issues.apache.org/jira/browse/OPENNLP-1836

SentenceVectorsDL sent an all-zero attention_mask and all-one token_type_ids to the ONNX model, so the encoder attended to nothing. This fixes the encoding to the standard single-segment BERT convention (mask=1, types=0), consistent with DocumentCategorizerDL, and additionally:

closes the OnnxTensor inputs and OrtSession.Result (native memory leak)
replaces the NPE on a vocabulary miss with a descriptive IllegalArgumentException
adds a unit test for the encoding (tokenize is now package-private static, no ONNX session needed)
updates SentenceVectorsDLEval expectations

Eval values were verified empirically: the unfixed code reproduces the previously pinned values exactly against the public sentence-transformers/all-MiniLM-L6-v2 ONNX export, and the corrected encoding produces the new pinned values (dimension 384).

Note: this is a behavioral fix - vectors persisted from the old encoding are not comparable with the corrected output and should be re-embedded.
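
As a rough illustration of the encoding described above, here is a minimal sketch of the single-segment convention (mask=1, types=0). The class and method names are hypothetical and are not taken from SentenceVectorsDL. It assumes there are no padding tokens, so every position is a real token.

```java
import java.util.Arrays;

// Hypothetical helper, not the actual SentenceVectorsDL code.
final class SingleSegmentEncodingSketch {

  // attention_mask: 1 for every real token, so the encoder attends to all of them.
  // (The bug described above sent all zeros here.)
  static long[] attentionMask(int tokenCount) {
    long[] mask = new long[tokenCount];
    Arrays.fill(mask, 1L);
    return mask;
  }

  // token_type_ids: 0 for every token, because a single sentence is one segment.
  // (The bug described above sent all ones here.)
  static long[] tokenTypeIds(int tokenCount) {
    return new long[tokenCount]; // Java zero-initializes long arrays
  }

  public static void main(String[] args) {
    // Assumed example: "[CLS] hello [SEP]" is three tokens.
    System.out.println(Arrays.toString(attentionMask(3))); // prints [1, 1, 1]
    System.out.println(Arrays.toString(tokenTypeIds(3)));  // prints [0, 0, 0]
  }
}
```

Both arrays have the same length as the token id array. With padding, the padded positions would normally get a mask value of 0, which this sketch does not cover.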

SentenceVectorsDL sent an all-zero attention_mask and all-one token_type_ids, so the model attended to nothing. Use the standard single-segment BERT encoding (mask=1, types=0), consistent with DocumentCategorizerDL. Also close OnnxTensor/Result resources, replace the NPE on a vocabulary miss with a descriptive exception, add a unit test for the encoding, and update the eval test expectations (verified against the same MiniLM ONNX export). Vectors produced by the previous encoding are not comparable with the corrected output.

Copilot

Pull request overview

Fixes SentenceVectorsDL’s ONNX input encoding so sentence-transformer models receive standard single-segment BERT inputs (attention mask = 1 for real tokens, token type ids = 0), aligning behavior with other DL components and updating expected eval outputs accordingly.

Changes:

Corrects SentenceVectorsDL token encoding (mask/types) and improves vocabulary-miss handling with a descriptive exception.
Prevents native-memory leaks by closing ONNX tensors and OrtSession.Result.
Adds unit tests for tokenization/encoding and updates SentenceVectorsDLEval pinned vector expectations.
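
The resource handling mentioned in the list above is typically done with try-with-resources. The sketch below illustrates that pattern with a stand-in `TrackedResource` class (hypothetical, used here in place of OnnxTensor and OrtSession.Result so the example runs without the ONNX Runtime dependency). It is not the code from this pull request.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; the real code would hold ONNX tensors and a session result.
final class ResourceClosingSketch {

  // Records when it is closed, so the closing order can be inspected.
  static final class TrackedResource implements AutoCloseable {
    private final String name;
    private final List<String> log;

    TrackedResource(String name, List<String> log) {
      this.name = name;
      this.log = log;
    }

    @Override
    public void close() {
      log.add("closed " + name);
    }
  }

  // Simulates one inference call; optionally fails in the middle of it.
  static List<String> runOnce(boolean failDuringInference) {
    List<String> log = new ArrayList<>();
    try (TrackedResource ids = new TrackedResource("input_ids", log);
         TrackedResource mask = new TrackedResource("attention_mask", log);
         TrackedResource result = new TrackedResource("result", log)) {
      if (failDuringInference) {
        throw new IllegalStateException("simulated inference failure");
      }
      log.add("read output");
    } catch (IllegalStateException e) {
      // Resources are already closed by the time this handler runs.
      log.add("caught " + e.getMessage());
    }
    return log;
  }

  public static void main(String[] args) {
    System.out.println(runOnce(false));
    System.out.println(runOnce(true));
  }
}
```

The resources are closed in reverse order of declaration, on both the normal and the failing path. That property is what prevents a native-memory leak when inference throws.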

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File	Description
opennlp-eval-tests/src/test/java/opennlp/dl/vectors/SentenceVectorsDLEval.java	Updates pinned expected vector values for the corrected encoding.
opennlp-core/opennlp-ml/opennlp-dl/src/test/java/opennlp/dl/vectors/SentenceVectorsDLTest.java	Adds unit tests validating single-segment BERT encoding and vocabulary/UNK behavior.
opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/vectors/SentenceVectorsDL.java	Fixes mask/types encoding, closes ONNX resources, and improves vocab-mismatch error reporting.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

krickert · 2026-06-10T15:50:57Z

Ran Copilot against this. It did a decent job: its only finding was that the expected and actual arguments were reversed.

rzo1

Two call-outs:

1. Behavioral change, needs to be loud in the release notes. This changes the produced vectors. Anyone who persisted embeddings from the buggy encoding must re-embed, since old and new vectors are not comparable. The PR body already notes this; please make sure it lands in the changelog / release notes so downstream users aren't silently broken.

2. Follow-up for DocumentCategorizerDL (out of scope here). While DocumentCategorizerDL already uses the correct mask=1/types=0 encoding, it still shares the other two problems this PR fixes:

the input OnnxTensors and OrtSession.Result are not closed (same native-memory leak), and
ids[x] = vocab.get(tokens[x]) will NPE on a vocabulary miss rather than throwing a descriptive error.

Worth a separate ticket to apply the same two fixes there. Also worth porting to 2.x?

LGTM, approving.
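
For illustration, the vocabulary-miss problem called out in the review above (`ids[x] = vocab.get(tokens[x])` unboxing a `null`) can be avoided roughly as follows. This is a minimal sketch with hypothetical names. The merged fix may differ, for example by trying an unknown-token fallback before throwing.

```java
import java.util.Map;

// Hypothetical helper, not the actual OpenNLP implementation.
final class VocabLookupSketch {

  // Maps tokens to vocabulary ids, failing with a descriptive message
  // instead of a NullPointerException when a token is missing.
  static long[] toIds(String[] tokens, Map<String, Integer> vocab) {
    long[] ids = new long[tokens.length];
    for (int x = 0; x < tokens.length; x++) {
      Integer id = vocab.get(tokens[x]);
      if (id == null) {
        throw new IllegalArgumentException(
            "Token '" + tokens[x] + "' at position " + x
                + " is not in the vocabulary; check that the vocabulary matches the model.");
      }
      ids[x] = id;
    }
    return ids;
  }

  public static void main(String[] args) {
    // Assumed toy vocabulary for demonstration only.
    Map<String, Integer> vocab = Map.of("[CLS]", 101, "hello", 7592, "[SEP]", 102);
    long[] ids = toIds(new String[] {"[CLS]", "hello", "[SEP]"}, vocab);
    System.out.println(ids.length); // prints 3
  }
}
```

Looking up into an `Integer` first and checking for `null` keeps the failure at the point of the lookup, where the offending token is still known.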

krickert · 2026-06-11T13:17:24Z

Thanks and addressed:

1. Release notes: The breaking change is now documented in three places: the PR body, a Release Note paragraph on the OPENNLP-1836 JIRA (please paste the text below into the JIRA Release Note field), and the SentenceVectorsDL class Javadoc (Release note (OpenNLP 3.0.0) paragraph in commit 33b15e4). Targeting 3.0.0, not a patch release.

2. DocumentCategorizerDL follow-up: Filed as OPENNLP-1839 (leak + vocab-miss handling; 2.x backport of leak/error-handling only to discuss there).

Copilot: assertEquals argument order was already corrected in 15fc495.

mawiesne · 2026-06-12T11:56:42Z

@krickert Do I understand correctly, that this MR should target the 3.0.0 release and not the related M4 milestone? Pls clarify, wdyt?

rzo1 · 2026-06-12T11:58:36Z

I think it can land in M4 (because breaking can be expected). It just needs to be in the rel-notes.

krickert · 2026-06-12T12:04:27Z

I would love it if these PRs can make it in M4 - it'll make the grpc sandbox code usable and testable right away.

The pinned values are the [CLS]-position hidden state of the all-MiniLM-L6-v2 ONNX export and can be reproduced independently with the HuggingFace tokenizers and onnxruntime Python packages; the comment includes the recipe.

SentenceVectorsDL sent an all-zero attention_mask and all-one token_type_ids, so the model attended to nothing. Use the standard single-segment BERT encoding (mask=1, types=0), consistent with DocumentCategorizerDL. Also close OnnxTensor/Result resources, replace the NPE on a vocabulary miss with a descriptive exception, add a unit test for the encoding, and update the eval test expectations (verified against the same MiniLM ONNX export). Vectors produced by the previous encoding are not comparable with the corrected output.

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* OPENNLP-1836 - Document breaking vector change in SentenceVectorsDL Javadoc

Add a release-note paragraph so downstream users know persisted embeddings from the previous encoding must be re-embedded. Addresses review on #1072.

* OPENNLP-1836 - Document provenance of expected eval vector values

The pinned values are the [CLS]-position hidden state of the all-MiniLM-L6-v2 ONNX export and can be reproduced independently with the HuggingFace tokenizers and onnxruntime Python packages; the comment includes the recipe.

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

(cherry picked from commit 97c77b7)

mawiesne · 2026-06-12T14:10:49Z

🍒 -picked to opennlp-2.x

krickert requested review from Copilot, mawiesne and rzo1 June 10, 2026 12:02

Copilot started reviewing on behalf of krickert June 10, 2026 13:27

Copilot AI reviewed Jun 10, 2026

Comment thread opennlp-eval-tests/src/test/java/opennlp/dl/vectors/SentenceVectorsDLEval.java Outdated

Potential fix for pull request finding

15fc495

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

mawiesne requested a review from jzonthemtn June 10, 2026 18:04

mawiesne changed the title ~~OPENNLP-1836 - Fix input encoding in SentenceVectorsDL~~ OPENNLP-1836: Fix input encoding in SentenceVectorsDL Jun 10, 2026

rzo1 approved these changes Jun 11, 2026

OPENNLP-1836 - Document breaking vector change in SentenceVectorsDL J…

33b15e4

…avadoc Add a release-note paragraph so downstream users know persisted embeddings from the previous encoding must be re-embedded. Addresses review on apache#1072.

This was referenced Jun 12, 2026

OPENNLP-1839: Fix native memory leak and vocabulary NPE in DocumentCategorizerDL #1074

Merged

OPENNLP-1838: Adopt BertTokenizer in opennlp-dl components #1075

Open

OPENNLP-1840: Fix native memory leak and vocabulary NPE in NameFinderDL #1076

Merged

mawiesne assigned krickert Jun 12, 2026

mawiesne approved these changes Jun 12, 2026

Comment thread opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/vectors/SentenceVectorsDL.java

Comment thread opennlp-eval-tests/src/test/java/opennlp/dl/vectors/SentenceVectorsDLEval.java

mawiesne merged commit 97c77b7 into apache:main Jun 12, 2026
9 checks passed

mawiesne added the java Pull requests that update Java code label Jun 12, 2026

mawiesne merged 4 commits into apache:main from ai-pipestream:OPENNLP-1836
