OPENNLP-1836: Fix input encoding in SentenceVectorsDL#1072
Conversation
SentenceVectorsDL sent an all-zero attention_mask and all-one token_type_ids, so the model attended to nothing. Use the standard single-segment BERT encoding (mask=1, types=0), consistent with DocumentCategorizerDL. Also close OnnxTensor/Result resources, replace the NPE on a vocabulary miss with a descriptive exception, add a unit test for the encoding, and update the eval test expectations (verified against the same MiniLM ONNX export). Vectors produced by the previous encoding are not comparable with the corrected output.
There was a problem hiding this comment.
Pull request overview
Fixes SentenceVectorsDL’s ONNX input encoding so sentence-transformer models receive standard single-segment BERT inputs (attention mask = 1 for real tokens, token type ids = 0), aligning behavior with other DL components and updating expected eval outputs accordingly.
Changes:
- Corrects SentenceVectorsDL token encoding (mask/types) and improves vocabulary-miss handling with a descriptive exception.
- Prevents native-memory leaks by closing ONNX tensors and
OrtSession.Result. - Adds unit tests for tokenization/encoding and updates SentenceVectorsDLEval pinned vector expectations.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| opennlp-eval-tests/src/test/java/opennlp/dl/vectors/SentenceVectorsDLEval.java | Updates pinned expected vector values for the corrected encoding. |
| opennlp-core/opennlp-ml/opennlp-dl/src/test/java/opennlp/dl/vectors/SentenceVectorsDLTest.java | Adds unit tests validating single-segment BERT encoding and vocabulary/UNK behavior. |
| opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/vectors/SentenceVectorsDL.java | Fixes mask/types encoding, closes ONNX resources, and improves vocab-mismatch error reporting. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
|
Ran copilot against this. It didn't do a bad job because it only said that the expected vs actual were reversed. |
rzo1
left a comment
There was a problem hiding this comment.
Two call-outs:
1. Behavioral change, needs to be loud in the release notes. This changes the produced vectors. Anyone who persisted embeddings from the buggy encoding must re-embed, since old and new vectors are not comparable. The PR body already notes this; please make sure it lands in the changelog / release notes so downstream users aren't silently broken.
2. Follow-up for DocumentCategorizerDL (out of scope here). While DocumentCategorizerDL already uses the correct mask=1/types=0 encoding, it still shares the other two problems this PR fixes:
- the input
OnnxTensors andOrtSession.Resultare not closed (same native-memory leak), and ids[x] = vocab.get(tokens[x])will NPE on a vocabulary miss rather than throwing a descriptive error.
Worth a separate ticket to apply the same two fixes there. Also worth porting to 2.x?
LGTM, approving.
…avadoc Add a release-note paragraph so downstream users know persisted embeddings from the previous encoding must be re-embedded. Addresses review on apache#1072.
|
Thanks and addressed: 1. Release notes: The breaking change is now documented in three places: the PR body, a Release Note paragraph on the OPENNLP-1836 JIRA (please paste the text below into the JIRA Release Note field), and the 2. DocumentCategorizerDL follow-up: Filed as OPENNLP-1839 (leak + vocab-miss handling; 2.x backport of leak/error-handling only to discuss there). Copilot: |
|
@krickert Do I understand correctly, that this MR should target the 3.0.0 release and not the related M4 milestone? Pls clarify, wdyt? |
|
I think it can land in M4 (because breaking can be expected). It just needs to be in the rel-notes. |
|
I would love it if these PRs can make it in M4 - it'll make the grpc sandbox code usable and testable right away. |
The pinned values are the [CLS]-position hidden state of the all-MiniLM-L6-v2 ONNX export and can be reproduced independently with the HuggingFace tokenizers and onnxruntime Python packages; the comment includes the recipe.
SentenceVectorsDL sent an all-zero attention_mask and all-one token_type_ids, so the model attended to nothing. Use the standard single-segment BERT encoding (mask=1, types=0), consistent with DocumentCategorizerDL. Also close OnnxTensor/Result resources, replace the NPE on a vocabulary miss with a descriptive exception, add a unit test for the encoding, and update the eval test expectations (verified against the same MiniLM ONNX export). Vectors produced by the previous encoding are not comparable with the corrected output. * Potential fix for pull request finding Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> * OPENNLP-1836 - Document breaking vector change in SentenceVectorsDL Javadoc Add a release-note paragraph so downstream users know persisted embeddings from the previous encoding must be re-embedded. Addresses review on #1072. * OPENNLP-1836 - Document provenance of expected eval vector values The pinned values are the [CLS]-position hidden state of the all-MiniLM-L6-v2 ONNX export and can be reproduced independently with the HuggingFace tokenizers and onnxruntime Python packages; the comment includes the recipe. --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> (cherry picked from commit 97c77b7)
|
🍒 -picked to |
See https://issues.apache.org/jira/browse/OPENNLP-1836
SentenceVectorsDL sent an all-zero attention_mask and all-one token_type_ids to the ONNX model, so the encoder attended to nothing. This fixes the encoding to the standard single-segment BERT convention (mask=1, types=0), consistent with DocumentCategorizerDL, and additionally:
Eval values were verified empirically: the unfixed code reproduces the previously pinned values exactly against the public sentence-transformers/all-MiniLM-L6-v2 ONNX export, and the corrected encoding produces the new pinned values (dimension 384).
Note: this is a behavioral fix - vectors persisted from the old encoding are not comparable with the corrected output and should be re-embedded.