[SPARK-57183][SS] Close LRUCache on RocksDB.close() in unbounded memory mode #56234

Open
kete1987 wants to merge 1 commit into apache:master from kete1987:SPARK-57183-rocksdb-lrucache-leak

kete1987 commented May 31, 2026

What changes were proposed in this pull request?

In unbounded memory mode (the default, boundedMemoryUsage = false), RocksDBMemoryManager creates a new LRUCache per RocksDB instance, but RocksDB.close() never calls lruCache.close():

    def close(): Unit = {
      // Acquire DB instance lock and release at the end to allow for synchronized access
      try {
        closeDB()
        readOptions.close()
        writeOptions.close()
        flushOptions.close()
        nativeStats.close()
        rocksDbOptions.close()
        dbLogger.close()

The Java LRUCache wrapper holds a C++ shared_ptr<Cache>, so the native object is only freed when the JVM GC finalizes the wrapper — which rarely happens under low heap pressure. This causes native memory to accumulate until GC eventually runs, leading to OOM kills in long-running processes or CI runs with many RocksDB-heavy test suites.

The fix adds an explicit lruCache.close() call in RocksDB.close() for unbounded mode. In bounded mode the cache is a shared singleton managed by RocksDBMemoryManager and must not be closed per instance.
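
The ownership rule described above can be sketched with stand-in classes. This is a minimal illustration, not the code in this patch: `NativeCacheStub`, `CacheProvider`, and `DbStub` are hypothetical names that only model "close the cache if this instance owns it", and no RocksDB API is used.

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical stand-in for a native-backed cache wrapper (not org.rocksdb.LRUCache).
final class NativeCacheStub extends AutoCloseable {
  private val owning = new AtomicBoolean(true)
  def isOwningHandle: Boolean = owning.get()
  // Idempotent: after the first call the (pretend) native handle is released.
  override def close(): Unit = owning.set(false)
}

// Hypothetical stand-in for the memory manager's cache selection.
object CacheProvider {
  // Bounded mode: one cache shared by every instance.
  val shared = new NativeCacheStub
  // Unbounded mode: a fresh cache per instance.
  def cacheFor(boundedMemoryUsage: Boolean): NativeCacheStub =
    if (boundedMemoryUsage) shared else new NativeCacheStub
}

// Hypothetical stand-in for the state store's RocksDB wrapper.
final class DbStub(boundedMemoryUsage: Boolean) extends AutoCloseable {
  val cache: NativeCacheStub = CacheProvider.cacheFor(boundedMemoryUsage)
  override def close(): Unit = {
    // Close only a cache this instance owns; the shared one must outlive it.
    if (!boundedMemoryUsage) cache.close()
  }
}
```

With these stand-ins, closing an unbounded `DbStub` releases its own cache, while closing a bounded one leaves the shared cache usable by other instances.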

This is a separate issue from SPARK-56523 (Statistics native memory leak), which was already fixed.

Why are the changes needed?

Without explicit close(), each RocksDB instance in unbounded mode leaks one LRUCache worth of native memory (blockCacheSizeMB, default 8 MB) for as long as GC does not run. The memory is never reclaimed deterministically.

A standalone reproducer tool confirms ~8.5 MB of native memory growth per open/close cycle in leak mode vs flat memory in fixed mode:
https://github.com/kete1987/rocksdb-leak-tool

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added a test in RocksDBSuite (SPARK-57183: LRUCache is closed on RocksDB.close() in unbounded memory mode) that verifies the native handle is released after close() via LRUCache.isOwningHandle().
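
For readers unfamiliar with the handle check, the idea can be sketched on a standalone cache. This is not the RocksDBSuite test from this patch; it assumes the rocksdbjni artifact is on the classpath, and `LruCacheHandleCheck` and `ownershipAroundClose` are hypothetical names chosen for illustration.

```scala
import org.rocksdb.{LRUCache, RocksDB}

object LruCacheHandleCheck {
  // Returns (owning before close, owning after close) for a standalone cache.
  def ownershipAroundClose(): (Boolean, Boolean) = {
    // Load the native library before creating any native-backed object.
    RocksDB.loadLibrary()
    // 8 MB capacity, mirroring the default block cache size mentioned above.
    val cache = new LRUCache(8L * 1024 * 1024)
    val before = cache.isOwningHandle()
    // Explicit close releases the wrapper's native handle without waiting for GC.
    cache.close()
    (before, cache.isOwningHandle())
  }
}
```

A test along these lines would assert that the first element is true and the second is false.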

I affirm that the contribution is my original work and that I license the work to the project under the project's open source license.

…ry mode

In unbounded memory mode (the default, boundedMemoryUsage=false),
RocksDBMemoryManager creates a new LRUCache per RocksDB instance but
RocksDB.close() never calls lruCache.close(). The Java LRUCache wrapper
holds a C++ shared_ptr<Cache>, so the native object is only freed when
the JVM GC finalizes the wrapper -- which rarely happens under low heap
pressure. Closing explicitly ensures native memory is reclaimed
deterministically when the instance is released.

In bounded mode the cache is a shared singleton managed by
RocksDBMemoryManager and must not be closed per instance.

Add a test that verifies the native handle is released after close()
in unbounded mode via LRUCache.isOwningHandle().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
kete1987 force-pushed the SPARK-57183-rocksdb-lrucache-leak branch from d4b935b to 02fd273 on May 31, 2026 14:35
Contributor

HeartSaVioR commented Jun 1, 2026

I'll ask other folks to review the change (I've been a bit away from recent improvements to the RocksDB state store provider), but I'd like to give a general suggestion.

Please do not remove sections from the PR template, and treat filling in each section as a requirement.

> I affirm that the contribution is my original work and that I license the work to the project under the project's open source license.

This statement doesn't replace the PR template's requirement to disclose the use of generative AI tooling and to identify the model. The template explains how to describe the model; please check it again.

https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE

### Was this patch authored or co-authored using generative AI tooling?
<!--
If generative AI tooling has been used in the process of authoring this patch, please include the
phrase: 'Generated-by: ' followed by the name of the tool and its version.
If no, write 'No'.
Please refer to the [ASF Generative Tooling Guidance](https://www.apache.org/legal/generative-tooling.html) for details.
-->

@HeartSaVioR
Contributor

cc. @anishshri-db @micheal-o
