Skip to content

Include torch build/ABI in torch C-DLPack addon cache key#644

Open
tugot17 wants to merge 1 commit into
apache:mainfrom
tugot17:fix/dlpack-addon-abi-cache-key
Open

Include torch build/ABI in torch C-DLPack addon cache key#644
tugot17 wants to merge 1 commit into
apache:mainfrom
tugot17:fix/dlpack-addon-abi-cache-key

Conversation

@tugot17

@tugot17 tugot17 commented Jun 25, 2026

Copy link
Copy Markdown

Problem

The prebuilt torch C-DLPack addon is cached under a filename derived only from torch major.minor + a coarse device string:

major, minor = torch.__version__.split(".")[:2]
device = _torch_extension_device(torch)          # "cuda" / "rocm" / "cpu"
libname = f"libtorch_c_dlpack_addon_torch{major}{minor}-{device}{suffix}"
lib_path = cache_dir / libname                   # cache_dir defaults to ~/.cache/tvm-ffi
if not lib_path.exists():
    ...build...                                  # otherwise reuse whatever is there

This key omits:

  • the torch patch version (2.9.0 vs 2.9.1),
  • the build local-version tag carried in torch.__version__ (+cu121 vs +cu124, +cpu, …),
  • the C++ ABI flag (torch._C._GLIBCXX_USE_CXX11_ABI).

Since the addon is a compiled extension linking libtorch's C++ ABI, two torch installs that share major.minor + device but differ in patch / CUDA toolkit / ABI resolve to the same cached .so. The addon built against the first torch is then silently reused by the second — an ABI mismatch in the DLPack bridge that surfaces as crashes, memory faults, or silently wrong tensor data, not a clean error.

This is easy to hit whenever ~/.cache/tvm-ffi is shared across environments — a shared/NFS home, or container images that mount the host home and see the same cache under different torch builds.

Reproduce

# env A
pip install torch==2.9.0+cu121 --index-url https://download.pytorch.org/whl/cu121
python -c "import tvm_ffi"     # builds ~/.cache/tvm-ffi/libtorch_c_dlpack_addon_torch29-cuda.so

# env B: same major.minor, different build/ABI, same cache dir
pip install torch==2.9.1+cu124 --index-url https://download.pytorch.org/whl/cu124
python -c "import tvm_ffi"     # REUSES the cu121-built torch29-cuda.so -> ABI mismatch

The same applies to any two builds that share torch{major}{minor}-{device}, including CPU and ROCm builds.

Fix

Fold the full torch build identity (torch.__version__, which already carries patch + +cuXXX/+rocmX.Y/+cpu) and the C++ ABI flag into the cached addon name via a short hash. Incompatible builds now get distinct cache entries, while same-build reuse is unchanged. The build subprocess receives libname from this call site, so it stays consistent automatically.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request improves the caching mechanism for the JIT-compiled torch extension by incorporating a hash of the PyTorch version and C++11 ABI flag into the cached library filename, preventing cache collisions between ABI-incompatible builds. The reviewer suggested a clean-up to use the public torch.compiled_with_cxx11_abi() function instead of accessing the private torch._C module directly.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

# major.minor + device resolve to the same cached ``.so``, and a shared cache
# directory (NFS home, reused container images) silently loads a mismatched
# addon -> crashes or wrong tensor data instead of a clean rebuild.
abi_id = f"{torch.__version__}|cxx11abi={int(getattr(torch._C, '_GLIBCXX_USE_CXX11_ABI', True))}"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using the public torch.compiled_with_cxx11_abi() function is preferred over accessing the private torch._C module directly. This is also consistent with how the C++11 ABI check is performed in _build_optional_torch_c_dlpack.py.

Suggested change
abi_id = f"{torch.__version__}|cxx11abi={int(getattr(torch._C, '_GLIBCXX_USE_CXX11_ABI', True))}"
abi_id = f"{torch.__version__}|cxx11abi={int(torch.compiled_with_cxx11_abi())}"

The prebuilt torch C-DLPack addon is cached under a filename derived only
from torch major.minor + device (cuda/rocm/cpu). This omits the torch patch
version, the build local-version tag in torch.__version__ (+cuXXX / +rocmX.Y
/ +cpu), and the C++ ABI flag. Two ABI-incompatible torch builds that share
major.minor + device therefore resolve to the same cached .so, so a shared
cache directory (NFS home, reused container images) silently loads a
mismatched addon -> crashes or wrong tensor data instead of a clean rebuild.

Fold the full torch build identity and C++ ABI flag into the cached addon
name so incompatible builds get distinct entries while same-build reuse
still works.

Signed-off-by: Piotr Mazurek <27293258+tugot17@users.noreply.github.com>
@tugot17

tugot17 commented Jun 25, 2026

Copy link
Copy Markdown
Author

Thanks! Switched to the public torch.compiled_with_cxx11_abi() instead of touching torch._C directly — same value, cleaner. Updated.

@tugot17 tugot17 force-pushed the fix/dlpack-addon-abi-cache-key branch from 82e7f04 to a47d9a9 Compare June 25, 2026 16:31
@tqchen

tqchen commented Jun 26, 2026

Copy link
Copy Markdown
Member

thanks, is this mainly for custom torch build? since official torch moved onto cxx11 abi already

@tqchen

tqchen commented Jun 26, 2026

Copy link
Copy Markdown
Member

cc @cyx-6 please take a look, we might want to consider backward compact wrt to previous behavior, although not as urgent since pytorch already have builtin ones

@tugot17

tugot17 commented Jun 26, 2026

Copy link
Copy Markdown
Author

Yeah, fair on cxx11 — for official wheels it's basically always true now, so that part's really just a guard for custom builds. Happy to drop it and key purely on torch.__version__ if you'd rather keep this minimal.

The thing that actually bit us wasn't the abi flag though, it was the version getting dropped from the name. The key is only torch{major}{minor}-{device}, so everything past major.minor disappears — the patch version and the +cu124 / +rocm7.2 build tag. We run sglang in containers on ROCm with the host ~/.cache/tvm-ffi mounted in, and when we moved between two images (different rocm/torch builds, both torch 2.9) they both resolved to the same libtorch_c_dlpack_addon_torch29-rocm.so. So the addon compiled against the old image got loaded into the new container, the ABI didn't match, and instead of erroring it just produced garbage tokens (plus the odd HSA fault) at decode. Took us way too long to trace it back to the cache. Sticking torch.__version__ in the name makes the stale entry miss and recompile, which fixes it.

On backward compat: the only effect is that one rebuild of the addon the first time after an upgrade, which is kind of the point here since the whole problem is a stale build being silently reused. Nothing breaks at runtime, the old files just sit there orphaned. Happy to add a small cleanup of the old …torch{maj}{min}-{device}.so name too if you want to avoid the clutter.

@tqchen

tqchen commented Jun 26, 2026

Copy link
Copy Markdown
Member

Get it, this sounds good. For local jit being able to narrow to device version helps.

Main thing was the aot wheel for cuda also depends on the particular case, and in case of cuda it is generally api compatible(we build for cuda 13 and it will work generally), so we need to cross check and make sure that works. Good news is newer version of torch already ships with the default C dlpack extension, so we don't need to rely on jit anymore for those cases

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants