OPENNLP-1839: Fix native memory leak and vocabulary NPE in DocumentCategorizerDL by krickert · Pull Request #1074 · apache/opennlp

krickert · 2026-06-12T03:35:37Z

What

categorize() leaked native memory on every call: the OnnxTensor inputs and the OrtSession.Result were never closed. Tensors are now released in a finally block and the result via try-with-resources (getValue() copies into Java arrays first, so this is safe).
A token missing from the vocabulary caused vocab.get(...) to auto-unbox null into an int, throwing an opaque NullPointerException that the broad catch in categorize() swallowed into an empty score array. The mapping loop is now a testable tokenIds() helper that throws IllegalArgumentException naming the missing token, which indicates the vocabulary file does not match the model.
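The vocabulary fix can be sketched as follows. This is a minimal illustration, not the merged code: the class name `TokenIdsSketch`, the `tokenIds(List<String>, Map<String, Integer>)` signature, and the message wording are assumptions made for the example. The point is that the lookup result stays boxed and is checked for null before it is assigned to a primitive.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and signature are assumptions, not the OpenNLP API.
final class TokenIdsSketch {

  /** Maps tokens to vocabulary ids, failing loudly on the first unknown token. */
  static long[] tokenIds(List<String> tokens, Map<String, Integer> vocab) {
    long[] ids = new long[tokens.size()];
    for (int i = 0; i < tokens.size(); i++) {
      String token = tokens.get(i);
      // Keep the result boxed: assigning a null Integer to an int or long
      // would auto-unbox and throw a NullPointerException with no context.
      Integer id = vocab.get(token);
      if (id == null) {
        throw new IllegalArgumentException("Token '" + token
            + "' is not in the vocabulary; the vocabulary file may not match the model.");
      }
      ids[i] = id;
    }
    return ids;
  }

  public static void main(String[] args) {
    Map<String, Integer> vocab = Map.of("[CLS]", 101, "hello", 7592, "[SEP]", 102);
    System.out.println(Arrays.toString(tokenIds(List.of("[CLS]", "hello", "[SEP]"), vocab)));
    try {
      tokenIds(List.of("[CLS]", "unseen"), vocab);
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Because the helper takes the token list and the vocabulary as plain arguments, it can be unit-tested without loading a model.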

Why

See OPENNLP-1839. Long-running services calling categorize() repeatedly accumulate off-heap allocations until the process is killed. This applies the same resource-management pattern as the SentenceVectorsDL fix (OPENNLP-1836, #1072).
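The resource-management pattern can be sketched without the ONNX Runtime dependency. In the example below, `NativeHandle` is a hypothetical stand-in for native-backed objects such as `OnnxTensor` and `OrtSession.Result`, and `run(...)` stands in for the session call; none of these names come from the PR. Inputs are closed in a `finally` block, and the result is closed by try-with-resources after its values have been copied into a Java array.

```java
// Hypothetical sketch of the close-in-finally / try-with-resources pattern.
final class NativeResourceSketch {

  /** Stand-in for a native-backed object; tracks how many handles remain open. */
  static final class NativeHandle implements AutoCloseable {
    static int openCount = 0; // handles created but not yet closed
    private final float[] data;
    private boolean closed = false;

    NativeHandle(float[] data) {
      this.data = data;
      openCount++;
    }

    /** Copies the contents into a plain Java array so the handle can be closed safely. */
    float[] copyToJava() {
      if (closed) {
        throw new IllegalStateException("handle already closed");
      }
      return data.clone();
    }

    @Override
    public void close() {
      if (!closed) {
        closed = true;
        openCount--;
      }
    }
  }

  /** Stand-in for a session run: rejects empty input, otherwise returns a new handle. */
  static NativeHandle run(NativeHandle ids, NativeHandle mask) {
    float[] values = ids.copyToJava();
    if (values.length == 0) {
      throw new IllegalArgumentException("empty input");
    }
    float sum = 0f;
    for (float v : values) {
      sum += v;
    }
    return new NativeHandle(new float[] {sum});
  }

  static float[] infer(float[] ids) {
    NativeHandle idTensor = null;
    NativeHandle maskTensor = null;
    try {
      idTensor = new NativeHandle(ids);
      maskTensor = new NativeHandle(new float[ids.length]);
      // The result is closed when this block exits; the copy is made first.
      try (NativeHandle result = run(idTensor, maskTensor)) {
        return result.copyToJava();
      }
    } finally {
      // Runs on both the normal and the exceptional path.
      if (maskTensor != null) maskTensor.close();
      if (idTensor != null) idTensor.close();
    }
  }

  public static void main(String[] args) {
    infer(new float[] {1f, 2f, 3f});
    System.out.println("open handles after infer: " + NativeHandle.openCount);
  }
}
```

The null checks in the `finally` block matter when creating the second input fails: the first one has already been allocated and still needs to be released.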

Validation

New DocumentCategorizerDLTest covers the token-id mapping and the vocabulary-miss error. All existing opennlp-dl tests pass.

…ategorizerDL Every categorize() call leaked the OnnxTensor inputs and the OrtSession.Result for each document chunk. Tensors are now closed in a finally block and the result with try-with-resources. Tokens absent from the vocabulary caused an opaque NullPointerException through auto-unboxing, which the broad catch in categorize() swallowed. The token-to-id mapping now throws IllegalArgumentException naming the missing token, indicating a vocabulary/model mismatch.

rzo1 · 2026-06-12T09:55:58Z

Might need to be back-ported to 2.x as well

…ategorizerDL (#1074) Every categorize() call leaked the OnnxTensor inputs and the OrtSession.Result for each document chunk. Tensors are now closed in a finally block and the result with try-with-resources. Tokens absent from the vocabulary caused an opaque NullPointerException through auto-unboxing, which the broad catch in categorize() swallowed. The token-to-id mapping now throws IllegalArgumentException naming the missing token, indicating a vocabulary/model mismatch. (cherry picked from commit b6af875)

mawiesne · 2026-06-12T12:45:29Z

🍒 -picked to opennlp-2.x

This was referenced Jun 12, 2026

OPENNLP-1838: Adopt BertTokenizer in opennlp-dl components #1075 (Open)

OPENNLP-1840: Fix native memory leak and vocabulary NPE in NameFinderDL #1076 (Merged)

rzo1 requested review from mawiesne and rzo1 and removed request for mawiesne June 12, 2026 09:55

rzo1 approved these changes Jun 12, 2026


mawiesne approved these changes Jun 12, 2026


mawiesne merged commit b6af875 into apache:main Jun 12, 2026
9 checks passed

mawiesne assigned krickert Jun 12, 2026

mawiesne added the java label (Pull requests that update Java code) Jun 12, 2026

mawiesne merged 1 commit into apache:main from ai-pipestream:OPENNLP-1839


3 participants
