Vulkan: elided copy ops (`lift_fresh_copy`/`clone`) corrupt sibling outputs (memory-planner aliasing)

### 🐛 Describe the bug

In a multi-output graph delegated to the **Vulkan backend**, a value-preserving copy op
(`aten.lift_fresh_copy` or `aten.clone`) is elided into a buffer **alias**, but the delegate's
memory planner still models it as an independent copy. The planner therefore treats the source
tensor as dead after the copy and **reuses its buffer**, while the source is in fact still live
(as the alias). The result: an **unrelated sibling output is silently overwritten with the wrong
data**. In the original fuzzer graph, the overwriting data came from the input buffer.

The individual kernels are all correct; the bug is purely in the delegate's copy-elision /
memory-planning interaction. Replacing the copy op with a real compute op that materializes a
fresh buffer makes the divergence disappear.

## Environment

- ExecuTorch: built from source (`backends/vulkan`), `1.4.0a0`
- Device: ASUS ROG Phone 6, Snapdragon 8+ Gen 1, Adreno 730 (Vulkan)
- Lowering: `to_edge_transform_and_lower(..., partitioner=[VulkanPartitioner()])`
- Control backend: portable CPU and eager — both correct.

## Reproducer

```python
import torch
from executorch.exir import to_edge_transform_and_lower, EdgeCompileConfig
from executorch.backends.vulkan.partitioner.vulkan_partitioner import VulkanPartitioner


# out[0] is a normal computation. out[1] is a NaN branch fed through a COPY of n0.
# The two outputs are otherwise independent.
class Bug(torch.nn.Module):
    def forward(self, x):
        n0 = torch.ops.aten.sigmoid.default(x)
        n1 = torch.ops.aten.lift_fresh_copy.default(n0)          # copy -> elided to an alias of n0
        out0 = torch.ops.aten.prod.default(
                   torch.ops.aten.acos.default(n0).to(torch.float16))
        out1 = torch.ops.aten.acosh.default(n1).to(torch.float16)  # acosh of (0,1) -> NaN
        return out0, out1


# Same graph, but the copy is a real op that the delegate cannot elide (fresh buffer):
class Control(torch.nn.Module):
    def forward(self, x):
        n0 = torch.ops.aten.sigmoid.default(x)
        n1 = torch.ops.aten.add.Tensor(n0, 0.0)                  # forces a real fresh buffer
        out0 = torch.ops.aten.prod.default(
                   torch.ops.aten.acos.default(n0).to(torch.float16))
        out1 = torch.ops.aten.acosh.default(n1).to(torch.float16)
        return out0, out1


x = torch.tensor([-0.4824913740158081, 1.1103534698486328])

for name, M in [("Bug(lift_fresh_copy)", Bug), ("Control(add 0.0)", Control)]:
    eager = M()(x)
    ep = torch.export.export(M(), (x,))
    prog = to_edge_transform_and_lower(
        ep, partitioner=[VulkanPartitioner()],
        compile_config=EdgeCompileConfig(_check_ir_validity=False),
    ).to_executorch()
    with open(f"{name}.pte", "wb") as f:             # close the file deterministically
        f.write(prog.buffer)
    print(name, "eager out[0] =", float(eager[0]))   # both 0.848
    # Then run each .pte on a Vulkan device and read out[0].
```

## Expected vs actual

`out[0] = prod(acos(sigmoid(x)))` does not depend on `out[1]`. Eager and portable CPU both give
`0.848`.

| variant | how `n1` is made | out[0] on Vulkan |
|---|---|---|
| **Bug** | `lift_fresh_copy(n0)` | **`2.467`  (WRONG)** |
| **Bug** | `clone(n0)` (substitute for `lift_fresh_copy`) | **`2.467`  (WRONG)** |
| **Control** | `add(n0, 0.0)` | `0.848`  (correct) |

Eager / portable CPU: `out[0] = 0.848` in all variants.

The wrong value `2.467` arises because `out[0]`'s buffer is reused for another tensor; in the original
fuzzer graph the corrupted value is exactly `x[1] = 1.1103534698486328` (the input), i.e. `out[0]`'s
slot was reused for the input buffer.
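
For reference, the expected value can be rechecked without torch. The snippet below is only a float64 sanity check using the standard library, so it ignores the `float16` cast in the repro. The last lines record a purely numerical observation: `2.467` matches `acos(0)**2 = (pi/2)**2` to three decimals. What that implies about the buffer contents is an assumption and is not verified on device.

```python
import math

x = [-0.4824913740158081, 1.1103534698486328]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


n0 = [sigmoid(v) for v in x]

# out[0] = prod(acos(sigmoid(x))), computed in float64
expected_out0 = math.prod(math.acos(v) for v in n0)
print(f"expected out[0] = {expected_out0:.3f}")   # 0.848

# Every element of n0 lies in (0, 1), outside acosh's domain [1, inf),
# which is why out[1] is NaN in torch (math.acosh would raise ValueError).
assert all(0.0 < v < 1.0 for v in n0)

# Observation only: the wrong value coincides with acos(0)**2.
zero_case = math.acos(0.0) ** 2
print(f"acos(0)^2 = {zero_case:.3f}")             # 2.467
```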

## Root cause

Two inconsistent views of the same op inside the Vulkan delegate:

- **Lowering** elides `lift_fresh_copy` / `clone` into a buffer **alias** — `n1` *is* `n0`'s buffer.
- **Memory planning** still models them as real **copies** — it assumes the value was copied into an
  independent `n1`, so `n0` is dead after the copy and its buffer may be reused / does not overlap a
  later tensor.

So the planner frees the source early and re-packs its slot, but the elided copy means the source is
still live (consumed later via `n1`), and a sibling output gets overwritten.
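
The mismatch described above can be illustrated with a toy planner. This is a minimal sketch, not the Vulkan delegate's actual planner: `Op`, `assign_slots`, `find_hazards` and the op ordering are all hypothetical, and the allocator is a simple greedy slot reuser. Which tensor ends up wrong depends on op order and allocation policy, so the toy need not match the on-device symptom. It only shows that a slot can be handed out while an alias of it is still read later.

```python
from typing import NamedTuple, Optional


class Op(NamedTuple):
    out: str                        # tensor produced by this op
    ins: tuple                      # tensors consumed by this op
    alias_of: Optional[str] = None  # set when lowering elides the op into an alias


# Hypothetical linearisation of the Bug graph; the real execution order may differ.
OPS = [
    Op("n0", ("x",)),                  # sigmoid
    Op("n1", ("n0",), alias_of="n0"),  # lift_fresh_copy, elided into an alias
    Op("a", ("n0",)),                  # acos
    Op("out0", ("a",)),                # prod
    Op("out1", ("n1",)),               # acosh
]


def naive_last_use(ops):
    """Last consumer index per tensor name, ignoring aliasing (the planner's view)."""
    last = {}
    for i, op in enumerate(ops):
        for name in op.ins:
            last[name] = i
    return last


def assign_slots(ops):
    """Greedy slot reuse driven by the alias-unaware last-use table."""
    last = naive_last_use(ops)
    slot_of, free, next_slot = {}, [], 0
    for i, op in enumerate(ops):
        if op.alias_of is not None:
            slot_of[op.out] = slot_of[op.alias_of]  # at runtime the alias shares storage
        elif free:
            slot_of[op.out] = free.pop(0)
        else:
            slot_of[op.out] = next_slot
            next_slot += 1
        for name in set(op.ins):
            if name in slot_of and last[name] == i:
                free.append(slot_of[name])          # planner believes it is dead
    return slot_of


def find_hazards(ops, slot_of):
    """Report reads that see a slot last written by a different tensor."""
    source = {op.out: op.alias_of for op in ops if op.alias_of}
    writer, found = {}, []
    for op in ops:
        for name in op.ins:
            if name not in slot_of:
                continue                            # graph input, not planned here
            owner = name
            while owner in source:                  # resolve alias to storage owner
                owner = source[owner]
            if writer[slot_of[name]] != owner:
                found.append((op.out, name, writer[slot_of[name]]))
        if op.alias_of is None:
            writer[slot_of[op.out]] = op.out
    return found


slots = assign_slots(OPS)
print(slots)                     # {'n0': 0, 'n1': 0, 'a': 1, 'out0': 0, 'out1': 1}
print(find_hazards(OPS, slots))  # [('out1', 'n1', 'out0')]
```

In this toy ordering, `n0` is considered dead after `acos`, its slot is given to `out0`, and the later read through `n1` sees `out0`'s data.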

This is confirmed by the control: forcing a genuine fresh buffer (`add(n0, 0.0)`, which is not elided)
removes the divergence, while both elided-copy ops (`lift_fresh_copy`, `clone`) reproduce it.

Likely also affects `aten.alias_copy` / `aten.detach_copy` / `aten.view_copy` (same elision class) —
these show up in our fuzzing as both silent mismatches and native-abort crashes.

## Suggested fix

- **Memory planner:** account for the aliasing introduced by elided copy ops (extend the source's
  live range through every consumer of the alias) so its slot is never reused while the alias is live; **or**
- **Lowering (workaround):** do not elide `lift_fresh_copy` / `clone` — materialize a real buffer.
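
The first option can be sketched as a post-processing step over live ranges. This is an illustration under assumed data structures (`ranges`, `alias_of` and `extend_live_ranges` are hypothetical names, not ExecuTorch APIs): each alias's range is folded into the tensor that owns the storage, so the owner stays live until the last consumer of any alias.

```python
def extend_live_ranges(ranges, alias_of):
    """Fold each alias's live range into the tensor that owns the storage.

    ranges:   {tensor: (first_index, last_index)} as an alias-unaware planner sees them
    alias_of: {alias: source} for copies that lowering elided
    """
    merged = {}
    for name, (start, end) in ranges.items():
        owner = name
        while owner in alias_of:          # follow chains such as n2 -> n1 -> n0
            owner = alias_of[owner]
        if owner in merged:
            old_start, old_end = merged[owner]
            merged[owner] = (min(old_start, start), max(old_end, end))
        else:
            merged[owner] = (start, end)
    return merged


# Alias-unaware ranges for a hypothetical ordering of the Bug graph:
# n0 looks dead after index 2 even though n1 (its alias) is read at index 4.
naive = {"n0": (0, 2), "n1": (1, 4), "a": (2, 3)}
fixed = extend_live_ranges(naive, {"n1": "n0"})
print(fixed)  # {'n0': (0, 4), 'a': (2, 3)}
```

With the merged ranges, a slot allocator would not consider `n0`'s slot free before index 4.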

## Notes

This bug was found by differential fuzzing (eager vs. ExecuTorch Vulkan on-device). The portable CPU
backend and eager agree on every case above; only the Vulkan delegate diverges, and only when the copy
is elided.

We are a research team from the University of Auckland and recently built a tool for testing edge
machine learning frameworks. We are happy to contribute more bug reports found by our tool.


### Versions

Collecting environment information...
PyTorch version: 2.12.0+cu130
Is debug build: False
CUDA used to build PyTorch: 13.0
ROCM used to build PyTorch: N/A

OS: Red Hat Enterprise Linux release 8.10 (Ootpa) (x86_64)
GCC version: (GCC) 8.5.0 20210514 (Red Hat 8.5.0-28)
Clang version: 20.1.8 ( 20.1.8-2.module+el8.10.0+23372+3f2ea6fa)
CMake version: version 3.26.5
Libc version: glibc-2.28

Python version: 3.12.13 (main, Apr 16 2026, 22:51:04) [GCC 8.5.0 20210514 (Red Hat 8.5.0-28)] (64-bit runtime)
Python platform: Linux-4.18.0-553.120.1.el8_10.x86_64-x86_64-with-glibc2.28
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: 
GPU models and configuration: 
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB

Nvidia driver version: 595.71.05
cuDNN version: Could not collect
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              256
On-line CPU(s) list: 0-255
Thread(s) per core:  2
Core(s) per socket:  64
Socket(s):           2
NUMA node(s):        2
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               49
Model name:          AMD EPYC 7742 64-Core Processor
Stepping:            0
CPU MHz:             2250.000
CPU max MHz:         2250.0000
CPU min MHz:         1500.0000
BogoMIPS:            4499.98
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            16384K
NUMA node0 CPU(s):   0-63,128-191
NUMA node1 CPU(s):   64-127,192-255
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es

Versions of relevant libraries:
[pip3] executorch==1.3.1
[pip3] numpy==2.4.6
[pip3] nvidia-cublas==13.1.1.3
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cudnn-cu13==9.20.0.48
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparselt-cu13==0.8.1
[pip3] nvidia-nccl-cu13==2.29.7
[pip3] nvidia-nvjitlink==13.0.88
[pip3] nvidia-nvtx==13.0.85
[pip3] pytorch_tokenizers==1.3.0
[pip3] torch==2.12.0
[pip3] torchao==0.17.0
[pip3] triton==3.7.0
[conda] Could not collect

cc @SS-JIA @manuelcandales @digantdesai @cbilgin

Vulkan: elided copy ops (`lift_fresh_copy`/`clone`) corrupt sibling outputs (memory-planner aliasing) #20257
