OPENNLP-1839: Fix native memory leak and vocabulary NPE in DocumentCategorizerDL#1074

Merged
mawiesne merged 1 commit into apache:main from ai-pipestream:OPENNLP-1839 on Jun 12, 2026

@krickert

Contributor

What

  • categorize() leaked native memory on every call: the OnnxTensor inputs and the OrtSession.Result were never closed. Tensors are now released in a finally block and the result via try-with-resources (getValue() copies into Java arrays first, so this is safe).
  • A token missing from the vocabulary caused vocab.get(...) to auto-unbox null into an int, throwing an opaque NullPointerException that the broad catch in categorize() swallowed into an empty score array. The mapping loop is now a testable tokenIds() helper that throws IllegalArgumentException naming the missing token, which indicates the vocabulary file does not match the model.
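The vocabulary-miss behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual OpenNLP source: the class name `TokenIdSketch`, the method signature, and the exception message are assumptions; only the helper name `tokenIds()` and the choice of `IllegalArgumentException` come from the description.

```java
import java.util.Map;

/** Hypothetical stand-in for the token-to-id mapping; not the real DocumentCategorizerDL code. */
final class TokenIdSketch {

  private TokenIdSketch() {
  }

  /**
   * Maps each token to its vocabulary id.
   * Throws IllegalArgumentException naming the first token that is absent from the vocabulary.
   */
  static int[] tokenIds(String[] tokens, Map<String, Integer> vocab) {
    final int[] ids = new int[tokens.length];
    for (int i = 0; i < tokens.length; i++) {
      // Keep the boxed Integer here: assigning vocab.get(...) straight to an int
      // would auto-unbox null and throw a NullPointerException on a miss.
      final Integer id = vocab.get(tokens[i]);
      if (id == null) {
        // Assumed wording; the point is that the message names the offending token.
        throw new IllegalArgumentException("Token '" + tokens[i]
            + "' is not in the vocabulary; the vocabulary file may not match the model.");
      }
      ids[i] = id;
    }
    return ids;
  }
}
```

Called with a vocabulary that contains every token, the helper returns the ids in token order; a single unknown token fails fast with a message that names it instead of surfacing later as an empty score array.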

Why

See OPENNLP-1839. Long-running services calling categorize() repeatedly accumulate off-heap allocations until the process is killed. This applies the same resource-management pattern as the SentenceVectorsDL fix (OPENNLP-1836, #1072).
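The resource-management pattern mentioned here can be illustrated without ONNX Runtime on the classpath. In the sketch below, `Handle` is a hypothetical stand-in for native-backed objects such as `OnnxTensor` and `OrtSession.Result`, and `run()` is an invented method; the real `categorize()` differs in detail. What the sketch shows is the shape of the fix: inputs released in a `finally` block, the result in try-with-resources, and values copied to Java arrays before the result is closed.

```java
import java.util.List;

/** Hypothetical illustration of the close-in-finally plus try-with-resources pattern. */
final class NativeResourceSketch {

  private NativeResourceSketch() {
  }

  /** Stand-in for a native-backed handle; records its name when closed. */
  static final class Handle implements AutoCloseable {
    private final String name;
    private final List<String> closedLog;
    private final float[] data;

    Handle(String name, List<String> closedLog, float... data) {
      this.name = name;
      this.closedLog = closedLog;
      this.data = data;
    }

    /** Copies the values into a fresh Java array, so they outlive close(). */
    float[] copyValue() {
      return data.clone();
    }

    @Override
    public void close() {
      closedLog.add(name);
    }
  }

  /** Creates two inputs and one result, and releases all three on every path. */
  static float[] run(List<String> closedLog) {
    Handle inputIds = null;
    Handle attentionMask = null;
    try {
      inputIds = new Handle("inputIds", closedLog);
      attentionMask = new Handle("attentionMask", closedLog);
      // The result is closed automatically, after its values have been copied.
      try (Handle result = new Handle("result", closedLog, 0.25f, 0.75f)) {
        return result.copyValue();
      }
    } finally {
      // Inputs are released even if creating a later input or running inference throws.
      if (attentionMask != null) {
        attentionMask.close();
      }
      if (inputIds != null) {
        inputIds.close();
      }
    }
  }
}
```

The null checks in the `finally` block matter because the second input may fail to allocate after the first one already holds native memory.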

Validation

New DocumentCategorizerDLTest covers the token-id mapping and the vocabulary-miss error. All existing opennlp-dl tests pass.

…ategorizerDL

Every categorize() call leaked the OnnxTensor inputs and the
OrtSession.Result for each document chunk. Tensors are now closed in a
finally block and the result with try-with-resources.

Tokens absent from the vocabulary caused an opaque NullPointerException
through auto-unboxing, which the broad catch in categorize() swallowed.
The token-to-id mapping now throws IllegalArgumentException naming the
missing token, indicating a vocabulary/model mismatch.
@rzo1

rzo1 commented Jun 12, 2026

Contributor

Might need to be back-ported to 2.x as well

@mawiesne mawiesne merged commit b6af875 into apache:main Jun 12, 2026
9 checks passed
@mawiesne mawiesne added the java Pull requests that update Java code label Jun 12, 2026
mawiesne pushed a commit that referenced this pull request Jun 12, 2026
…ategorizerDL (#1074)

(cherry picked from commit b6af875)
@mawiesne

mawiesne commented Jun 12, 2026

Contributor

🍒 -picked to opennlp-2.x
