Include torch build/ABI in torch C-DLPack addon cache key by tugot17 · Pull Request #644 · apache/tvm-ffi

tugot17 · 2026-06-25T15:54:23Z

Problem

The prebuilt torch C-DLPack addon is cached under a filename derived only from torch major.minor + a coarse device string:

major, minor = torch.__version__.split(".")[:2]
device = _torch_extension_device(torch)          # "cuda" / "rocm" / "cpu"
libname = f"libtorch_c_dlpack_addon_torch{major}{minor}-{device}{suffix}"
lib_path = cache_dir / libname                   # cache_dir defaults to ~/.cache/tvm-ffi
if not lib_path.exists():
    ...build...                                  # otherwise reuse whatever is there

This key omits:

the torch patch version (2.9.0 vs 2.9.1),
the build local-version tag carried in torch.__version__ (+cu121 vs +cu124, +cpu, …),
the C++ ABI flag (torch._C._GLIBCXX_USE_CXX11_ABI).

Since the addon is a compiled extension linking libtorch's C++ ABI, two torch installs that share major.minor + device but differ in patch / CUDA toolkit / ABI resolve to the same cached .so. The addon built against the first torch is then silently reused by the second — an ABI mismatch in the DLPack bridge that surfaces as crashes, memory faults, or silently wrong tensor data, not a clean error.

This is easy to hit whenever ~/.cache/tvm-ffi is shared across environments — a shared/NFS home, or container images that mount the host home and see the same cache under different torch builds.

Reproduce

# env A
pip install torch==2.9.0+cu121 --index-url https://download.pytorch.org/whl/cu121
python -c "import tvm_ffi"     # builds ~/.cache/tvm-ffi/libtorch_c_dlpack_addon_torch29-cuda.so

# env B: same major.minor, different build/ABI, same cache dir
pip install torch==2.9.1+cu124 --index-url https://download.pytorch.org/whl/cu124
python -c "import tvm_ffi"     # REUSES the cu121-built torch29-cuda.so -> ABI mismatch

The same applies to any two builds that share torch{major}{minor}-{device}, including CPU and ROCm builds.

Fix

Fold the full torch build identity (torch.__version__, which already carries patch + +cuXXX/+rocmX.Y/+cpu) and the C++ ABI flag into the cached addon name via a short hash. Incompatible builds now get distinct cache entries, while same-build reuse is unchanged. The build subprocess receives libname from this call site, so it stays consistent automatically.

gemini-code-assist

Code Review

This pull request improves the caching mechanism for the JIT-compiled torch extension by incorporating a hash of the PyTorch version and C++11 ABI flag into the cached library filename, preventing cache collisions between ABI-incompatible builds. The reviewer suggested a clean-up to use the public torch.compiled_with_cxx11_abi() function instead of accessing the private torch._C module directly.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-25T15:55:59Z

+        # major.minor + device resolve to the same cached ``.so``, and a shared cache
+        # directory (NFS home, reused container images) silently loads a mismatched
+        # addon -> crashes or wrong tensor data instead of a clean rebuild.
+        abi_id = f"{torch.__version__}|cxx11abi={int(getattr(torch._C, '_GLIBCXX_USE_CXX11_ABI', True))}"


Using the public torch.compiled_with_cxx11_abi() function is preferred over accessing the private torch._C module directly. This is also consistent with how the C++11 ABI check is performed in _build_optional_torch_c_dlpack.py.

Suggested change

abi_id = f"{torch.__version__}|cxx11abi={int(getattr(torch._C, '_GLIBCXX_USE_CXX11_ABI', True))}"

abi_id = f"{torch.__version__}|cxx11abi={int(torch.compiled_with_cxx11_abi())}"

The prebuilt torch C-DLPack addon is cached under a filename derived only from torch major.minor + device (cuda/rocm/cpu). This omits the torch patch version, the build local-version tag in torch.__version__ (+cuXXX / +rocmX.Y / +cpu), and the C++ ABI flag. Two ABI-incompatible torch builds that share major.minor + device therefore resolve to the same cached .so, so a shared cache directory (NFS home, reused container images) silently loads a mismatched addon -> crashes or wrong tensor data instead of a clean rebuild. Fold the full torch build identity and C++ ABI flag into the cached addon name so incompatible builds get distinct entries while same-build reuse still works. Signed-off-by: Piotr Mazurek <27293258+tugot17@users.noreply.github.com>

tugot17 · 2026-06-25T16:14:11Z

Thanks! Switched to the public torch.compiled_with_cxx11_abi() instead of touching torch._C directly — same value, cleaner. Updated.

tqchen · 2026-06-26T12:10:40Z

thanks, is this mainly for custom torch build? since official torch moved onto cxx11 abi already

tqchen · 2026-06-26T12:12:57Z

cc @cyx-6 please take a look, we might want to consider backward compact wrt to previous behavior, although not as urgent since pytorch already have builtin ones

tugot17 · 2026-06-26T12:37:12Z

Yeah, fair on cxx11 — for official wheels it's basically always true now, so that part's really just a guard for custom builds. Happy to drop it and key purely on torch.__version__ if you'd rather keep this minimal.

The thing that actually bit us wasn't the abi flag though, it was the version getting dropped from the name. The key is only torch{major}{minor}-{device}, so everything past major.minor disappears — the patch version and the +cu124 / +rocm7.2 build tag. We run sglang in containers on ROCm with the host ~/.cache/tvm-ffi mounted in, and when we moved between two images (different rocm/torch builds, both torch 2.9) they both resolved to the same libtorch_c_dlpack_addon_torch29-rocm.so. So the addon compiled against the old image got loaded into the new container, the ABI didn't match, and instead of erroring it just produced garbage tokens (plus the odd HSA fault) at decode. Took us way too long to trace it back to the cache. Sticking torch.__version__ in the name makes the stale entry miss and recompile, which fixes it.

On backward compat: the only effect is that one rebuild of the addon the first time after an upgrade, which is kind of the point here since the whole problem is a stale build being silently reused. Nothing breaks at runtime, the old files just sit there orphaned. Happy to add a small cleanup of the old …torch{maj}{min}-{device}.so name too if you want to avoid the clutter.

tqchen · 2026-06-26T13:29:28Z

Get it, this sounds good. For local jit being able to narrow to device version helps.

Main thing was the aot wheel for cuda also depends on the particular case, and in case of cuda it is generally api compatible(we build for cuda 13 and it will work generally), so we need to cross check and make sure that works. Good news is newer version of torch already ships with the default C dlpack extension, so we don't need to rely on jit anymore for those cases

gemini-code-assist Bot reviewed Jun 25, 2026

View reviewed changes

tugot17 force-pushed the fix/dlpack-addon-abi-cache-key branch from 82e7f04 to a47d9a9 Compare June 25, 2026 16:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Include torch build/ABI in torch C-DLPack addon cache key#644

Include torch build/ABI in torch C-DLPack addon cache key#644
tugot17 wants to merge 1 commit into
apache:mainfrom
tugot17:fix/dlpack-addon-abi-cache-key

tugot17 commented Jun 25, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 25, 2026

Uh oh!

tugot17 commented Jun 25, 2026

Uh oh!

tqchen commented Jun 26, 2026 •

edited

Loading

Uh oh!

tqchen commented Jun 26, 2026

Uh oh!

tugot17 commented Jun 26, 2026

Uh oh!

tqchen commented Jun 26, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	abi_id = f"{torch.__version__}\|cxx11abi={int(getattr(torch._C, '_GLIBCXX_USE_CXX11_ABI', True))}"
	abi_id = f"{torch.__version__}\|cxx11abi={int(torch.compiled_with_cxx11_abi())}"

Uh oh!

Conversation

tugot17 commented Jun 25, 2026

Problem

Reproduce

Fix

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

tugot17 commented Jun 25, 2026

Uh oh!

tqchen commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tqchen commented Jun 26, 2026

Uh oh!

tugot17 commented Jun 26, 2026

Uh oh!

tqchen commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tqchen commented Jun 26, 2026 •

edited

Loading

tqchen commented Jun 26, 2026 •

edited

Loading