Add Apple Metal backend support #103
Add a `--metal` flag to run GPULlama3 with TornadoVM's Metal backend on macOS. This requires TornadoVM 4.0+, which ships the Metal driver (`tornado.drivers.metal`). Tested on Apple M1 Pro with the TornadoVM 4.0.0-jdk21 Metal SDK.
Hey @mikepapadim - sharing my numbers and analysis 😁 I wanted to test it for the new JVM Weekly; here are my results. I will also update my repo with the new TornadoVM: https://github.com/ArturSkowronski/conference-jvm-in-age-ai-2026
Hello @ArturSkowronski, thank you for your contribution! That's great, actually. Can you let me know which models you tested with the Metal backend? Also, can you please sign the CLA?
@mikepapadim - Here you will find the whole "benchmark" I use 😊: ArturSkowronski/conference-jvm-in-age-ai-2026#13. Model under test on my side: `Llama-3.2-1B-Instruct-f16.gguf`
Hi @ArturSkowronski, it seems that the I am testing with: `./llama-tornado --gpu --metal --model /opt/models/Llama-3.2-1B-Instruct-F16.gguf --prompt "Tell me a joke"`. Besides that, please sync with the latest Then I think your changes are very good, and I confirm that they work with the Q8 models. So, we can merge the PR.
Add Apple Metal backend support to the `llama-tornado` launcher, enabling GPULlama3 to run on macOS with TornadoVM's native Metal driver (shipped in TornadoVM 4.0+).

Changes (launcher only, no Java code changes):
- Add `METAL` variant to the `Backend` enum
- Add `--metal` CLI flag for backend selection
- Add the Metal driver module (`tornado.drivers.metal`) and export list (`metal-exports`)

The TornadoVM API is backend-agnostic, so the Java inference code works without modification - only the launcher needed updating.
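The launcher changes listed above could look roughly like the following sketch. It is not the actual `llama-tornado` code: `BACKEND_CONFIG` and `parse_backend` are hypothetical names, the OpenCL and PTX module and export-list entries are assumed by analogy with the Metal ones named in this PR, and defaulting to OpenCL when no flag is given is also an assumption.

```python
import argparse
from enum import Enum


class Backend(Enum):
    OPENCL = "opencl"
    PTX = "ptx"
    METAL = "metal"  # variant added by this PR


# Hypothetical lookup table: backend -> (driver module, export list).
# Only the METAL entry is named in the PR; the other two are assumed by analogy.
BACKEND_CONFIG = {
    Backend.OPENCL: ("tornado.drivers.opencl", "opencl-exports"),
    Backend.PTX: ("tornado.drivers.ptx", "ptx-exports"),
    Backend.METAL: ("tornado.drivers.metal", "metal-exports"),
}


def parse_backend(argv):
    """Return the Backend selected on the command line (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="llama-tornado", allow_abbrev=False)
    group = parser.add_mutually_exclusive_group()
    for backend in Backend:
        # Each flag (--opencl, --ptx, --metal) stores its enum value in `backend`.
        group.add_argument(
            f"--{backend.value}",
            dest="backend",
            action="store_const",
            const=backend,
        )
    # Ignore the launcher's other options (--gpu, --model, --prompt, ...).
    args, _unknown = parser.parse_known_args(argv)
    # Assumption: fall back to OpenCL when no backend flag is passed.
    return args.backend or Backend.OPENCL


if __name__ == "__main__":
    chosen = parse_backend(["--gpu", "--metal", "--prompt", "Tell me a joke"])
    module, exports = BACKEND_CONFIG[chosen]
    print(chosen.name, module, exports)
```

Keeping the per-backend details in one table means a new backend only needs an enum variant and a table entry, which is consistent with the PR touching the launcher alone.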
Motivation
TornadoVM 4.0 shipped a Metal backend (PR #796), but `llama-tornado` only supported `--opencl` and `--ptx`. The GPULlama3 README already notes:

TornadoVM 4.0 has now added it - this PR enables GPULlama3 to use it 😊
Benchmark Results
Tested on Apple M1 Pro (macOS, ARM64) with Llama-3.2-1B-Instruct-f16.gguf.
LLM Inference (GPULlama3)
VectorAdd (10M elements)
Analysis
VectorAdd: Metal is competitive with OpenCL (~46 GB/s), slightly faster on simple parallel kernels. This matches expectations - the Metal backend handles straightforward array operations well.
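As a rough sanity check on the ~46 GB/s figure, the sketch below converts between kernel time and effective bandwidth for a 10M-element vector add. It assumes 4-byte floats and three arrays of memory traffic (two reads, one write). The PR does not state how the bandwidth was computed, so the function names and defaults here are illustrative only.

```python
def effective_bandwidth_gb_s(n_elements, seconds, bytes_per_element=4, arrays_touched=3):
    """Effective bandwidth in GB/s for a kernel that streams `arrays_touched` arrays."""
    total_bytes = n_elements * bytes_per_element * arrays_touched
    return total_bytes / seconds / 1e9


def kernel_time_ms(n_elements, gb_per_s, bytes_per_element=4, arrays_touched=3):
    """Kernel time in milliseconds implied by a given effective bandwidth."""
    total_bytes = n_elements * bytes_per_element * arrays_touched
    return total_bytes / (gb_per_s * 1e9) * 1e3


if __name__ == "__main__":
    n = 10_000_000  # VectorAdd size from the benchmark heading
    t_ms = kernel_time_ms(n, 46.0)
    print(f"~{t_ms:.2f} ms per kernel at 46 GB/s")
    print(f"{effective_bandwidth_gb_s(n, t_ms / 1e3):.1f} GB/s round trip")
```

Under these assumptions the kernel moves 120 MB per launch, so ~46 GB/s corresponds to roughly 2.6 ms per VectorAdd.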
LLM inference: Metal is ~28x slower than OpenCL (0.23 vs 6.48 tok/s). This is consistent with the known state of the Metal backend: