Include torch build/ABI in torch C-DLPack addon cache key #644
Code Review
This pull request improves the caching mechanism for the JIT-compiled torch extension by incorporating a hash of the PyTorch version and C++11 ABI flag into the cached library filename, preventing cache collisions between ABI-incompatible builds. The reviewer suggested a clean-up to use the public torch.compiled_with_cxx11_abi() function instead of accessing the private torch._C module directly.
# major.minor + device resolve to the same cached ``.so``, and a shared cache
# directory (NFS home, reused container images) silently loads a mismatched
# addon -> crashes or wrong tensor data instead of a clean rebuild.
abi_id = f"{torch.__version__}|cxx11abi={int(getattr(torch._C, '_GLIBCXX_USE_CXX11_ABI', True))}"
Using the public torch.compiled_with_cxx11_abi() function is preferred over accessing the private torch._C module directly. This is also consistent with how the C++11 ABI check is performed in _build_optional_torch_c_dlpack.py.
- abi_id = f"{torch.__version__}|cxx11abi={int(getattr(torch._C, '_GLIBCXX_USE_CXX11_ABI', True))}"
+ abi_id = f"{torch.__version__}|cxx11abi={int(torch.compiled_with_cxx11_abi())}"
The prebuilt torch C-DLPack addon is cached under a filename derived only from torch major.minor + device (cuda/rocm/cpu). This omits the torch patch version, the build local-version tag in torch.__version__ (+cuXXX / +rocmX.Y / +cpu), and the C++ ABI flag. Two ABI-incompatible torch builds that share major.minor + device therefore resolve to the same cached .so, so a shared cache directory (NFS home, reused container images) silently loads a mismatched addon -> crashes or wrong tensor data instead of a clean rebuild.

Fold the full torch build identity and C++ ABI flag into the cached addon name so incompatible builds get distinct entries while same-build reuse still works.

Signed-off-by: Piotr Mazurek <27293258+tugot17@users.noreply.github.com>
|
Thanks! Switched to the public |
Force-pushed from 82e7f04 to a47d9a9
|
Thanks. Is this mainly for custom torch builds, since official torch has already moved to the cxx11 ABI?
|
cc @cyx-6, please take a look. We might want to consider backward compatibility with the previous behavior, although this is not as urgent since PyTorch already has built-in ones.
|
Yeah, fair on cxx11: for official wheels it's basically always true now, so that part's really just a guard for custom builds. Happy to drop it and key purely on

The thing that actually bit us wasn't the ABI flag though, it was the version getting dropped from the name. The key is only

On backward compat: the only effect is one rebuild of the addon the first time after an upgrade, which is kind of the point here since the whole problem is a stale build being silently reused. Nothing breaks at runtime, the old files just sit there orphaned. Happy to add a small cleanup of the old
|
Got it, this sounds good. For local JIT, being able to narrow to the device version helps. The main thing was that the AOT wheel for CUDA also depends on the particular case, and in the case of CUDA it is generally API compatible (we build for CUDA 13 and it will generally work), so we need to cross-check and make sure that works. The good news is that newer versions of torch already ship with the default C DLPack extension, so we don't need to rely on JIT anymore for those cases.
Problem
The prebuilt torch C-DLPack addon is cached under a filename derived only from torch major.minor + a coarse device string (cuda/rocm/cpu), of the form torch{major}{minor}-{device}.

This key omits:

- the torch patch version (2.9.0 vs 2.9.1),
- the build local-version tag in torch.__version__ (+cu121 vs +cu124, +cpu, …),
- the C++ ABI flag (torch._C._GLIBCXX_USE_CXX11_ABI).

Since the addon is a compiled extension linking libtorch's C++ ABI, two torch installs that share major.minor + device but differ in patch / CUDA toolkit / ABI resolve to the same cached .so. The addon built against the first torch is then silently reused by the second: an ABI mismatch in the DLPack bridge that surfaces as crashes, memory faults, or silently wrong tensor data, not a clean error.

This is easy to hit whenever ~/.cache/tvm-ffi is shared across environments: a shared/NFS home, or container images that mount the host home and see the same cache under different torch builds.

Reproduce
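A minimal sketch of the collision, assuming the old key has the shape `torch{major}{minor}-{device}` described in this PR. The helper `old_cache_key` and the two version strings are hypothetical and only illustrate the idea; this is not the actual tvm-ffi code, and it does not need torch installed.

```python
# Hypothetical reconstruction of the old key shape; the function name and
# signature are illustrative, not the project's actual code.
def old_cache_key(torch_version: str, device: str) -> str:
    # Drop any local-version tag (e.g. "+cu121"), then keep only major.minor.
    public = torch_version.split("+", 1)[0]
    major, minor = public.split(".")[:2]
    return f"torch{major}{minor}-{device}"


# Two illustrative builds that differ in patch version and CUDA toolkit tag.
builds = ["2.9.0+cu121", "2.9.1+cu124"]
keys = {version: old_cache_key(version, "cuda") for version in builds}

for version, key in keys.items():
    print(f"{version} -> {key}")

# True when every build maps to the same cache entry.
collide = len(set(keys.values())) == 1
print("collision:", collide)
```

Both builds map to `torch29-cuda`, so whichever one populates the shared cache first decides which binary the other loads.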
The same applies to any two builds that share torch{major}{minor}-{device}, including CPU and ROCm builds.

Fix
Fold the full torch build identity (torch.__version__, which already carries patch + +cuXXX / +rocmX.Y / +cpu) and the C++ ABI flag into the cached addon name via a short hash. Incompatible builds now get distinct cache entries, while same-build reuse is unchanged. The build subprocess receives libname from this call site, so it stays consistent automatically.
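The approach can be sketched roughly as follows. The `abi_id` string mirrors the line in the diff; everything else is an assumption made for illustration: the helper names, the use of SHA-256, the 8-character digest, and the way the digest is appended to the base name. In real use the two inputs would come from `torch.__version__` and `torch.compiled_with_cxx11_abi()`; they are plain parameters here so the sketch runs without torch.

```python
import hashlib


def build_abi_id(torch_version: str, cxx11_abi: bool) -> str:
    # Same string shape as the abi_id line in the diff under review.
    return f"{torch_version}|cxx11abi={int(cxx11_abi)}"


def hashed_cache_key(base: str, torch_version: str, cxx11_abi: bool,
                     digest_len: int = 8) -> str:
    # Assumption: SHA-256 truncated to a short hex digest; the PR only
    # says "a short hash" and does not name the algorithm or length.
    abi_id = build_abi_id(torch_version, cxx11_abi)
    digest = hashlib.sha256(abi_id.encode("utf-8")).hexdigest()[:digest_len]
    return f"{base}-{digest}"


# Illustrative builds sharing the same coarse base name.
a = hashed_cache_key("torch29-cuda", "2.9.0+cu121", True)
b = hashed_cache_key("torch29-cuda", "2.9.1+cu124", True)
c = hashed_cache_key("torch29-cuda", "2.9.0+cu121", False)
again = hashed_cache_key("torch29-cuda", "2.9.0+cu121", True)

print(a, b, c, sep="\n")
print("same build reuses its entry:", a == again)
```

Because the hash is a pure function of the build identity, the same torch install keeps hitting its own entry, while a different patch version, toolkit tag, or ABI flag lands on a different name and triggers one rebuild.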