Refactor/tornadovm planning by orionpapadakis · Pull Request #117 · beehive-lab/GPULlama3.java

orionpapadakis · 2026-05-28T12:59:06Z

This PR reorganizes TornadoVM execution planning around three variant axes:

model family
quantization
forward execution mode

The previous structure was mainly shaped around two axes: model family and quantization. With prefill-decode and batch-prefill-decode, execution mode becomes a third axis, which greatly increases the number of combinations each model/quantization pair may need to support.

This refactor introduces forward plans, task-graph layouts, and model/quantization component providers so single-token, prefill-decode, and batch-prefill-decode paths can share one cleaner planning structure instead of growing separate master-plan dispatch logic.

More specifically, the GPU inference path is now organized around four collaborating abstractions:

Layouts (*ForwardTaskGraphLayout) encode the index arithmetic for a given graph topology, for example which integer index corresponds to the activation graph, the N layer graphs, or the logits graph. They eliminate magic numbers and make index-dependent code self-documenting.
Components (*ForwardPlanComponents) are model-family + quantization-specific factories. Each implementation constructs the concrete TornadoVM TaskGraph objects for its model (e.g., LlamaFP16PlanComponents produces LlamaFP16FFNLayers, LogitsFP16Layer, etc.). The three component interfaces form a capability hierarchy — SingleTokenForwardPlanComponents → PrefillDecodeForwardPlanComponents → BatchPrefillDecodeForwardPlanComponents — so that Llama, which supports all three execution modes, implements one object that satisfies all three contracts.
ForwardPlans (Single/PrefillDecode/BatchPrefillDecodeForwardPlan) assemble components into an ordered ImmutableTaskGraph list and a GridScheduler. Each plan encodes the graph topology for one execution mode: N+2 graphs for single-token, N+2 for prefill-decode, 2N+3 for batch-prefill-decode. ForwardPlanFactory selects the right combination of components and plan based on quantization type, model family, and execution mode.
MasterPlans (TornadoVMMasterPlan*) own the TornadoVM execution lifecycle: they create the TornadoExecutionPlan from the ForwardPlan's graph list, handle warmup and CUDA-graph configuration, and expose the forward-pass entry points (tornadoVMForwardDecode, tornadoVMForwardPrefill, tornadoVMForwardBatchPrefill) used by the inference core. They are model-agnostic: all model-specific knowledge lives in the components layer below them.
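
To make the layout and capability-hierarchy ideas above concrete, here is a minimal sketch. All type and method names are hypothetical stand-ins, not the PR's actual classes. It assumes the single-token ordering is activation first, then the N layer graphs, then logits (consistent with the N+2 count), and it assumes the most capable components interface extends the less capable ones, which is one way a single Llama object could satisfy all three contracts. Strings stand in for TornadoVM TaskGraph objects.

```java
// Hypothetical sketch: names and graph ordering are assumptions, not the PR's code.

/** Index arithmetic for an assumed single-token topology: [activation, layer 0..N-1, logits]. */
record SingleTokenLayoutSketch(int numLayers) {
    int activationIndex() { return 0; }

    int layerIndex(int layer) {
        if (layer < 0 || layer >= numLayers) {
            throw new IndexOutOfBoundsException("layer " + layer);
        }
        return 1 + layer; // layer graphs follow the activation graph
    }

    int logitsIndex() { return numLayers + 1; }

    int graphCount() { return numLayers + 2; } // the N+2 from the description
}

/** Assumed capability hierarchy: each interface adds one execution mode. */
interface SingleTokenComponentsSketch { String singleTokenGraphs(); }

interface PrefillDecodeComponentsSketch extends SingleTokenComponentsSketch {
    String prefillDecodeGraphs();
}

interface BatchPrefillDecodeComponentsSketch extends PrefillDecodeComponentsSketch {
    String batchPrefillDecodeGraphs();
}

/** One object that satisfies all three contracts, as described for Llama. */
final class LlamaComponentsSketch implements BatchPrefillDecodeComponentsSketch {
    public String singleTokenGraphs() { return "single-token graphs"; }
    public String prefillDecodeGraphs() { return "prefill-decode graphs"; }
    public String batchPrefillDecodeGraphs() { return "batch-prefill-decode graphs"; }
}
```

With this shape, a plan that only needs single-token capability can accept a SingleTokenComponentsSketch, and the Llama object can be passed to it unchanged.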

Notes

Adds Llama Q8_0 prefill-decode support, which also demonstrates why this refactor is needed.
Renames task-graph abstractions for clearer roles.
Moves scheduling helpers into a dedicated TornadoVM scheduling package.
Keeps graph topology and execution behavior unchanged outside the new prefill-decode path.

Verification

Use Java 21 or 25.
Set up TornadoVM.
mvn clean install
llama fp16 (single-token):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048
llama fp16 (prefill-decode):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode
llama fp16 (batch-prefill-decode):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32
llama fp16 (batch-prefill-decode-CUDA_GRAPHS):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32 --cuda-graphs
llama q8_0 (single-token):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "$LONG_PROMPT" --max-tokens 2048
llama q8_0 (prefill-decode):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode
llama q8_0 (batch-prefill-decode):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32
llama q8_0 (batch-prefill-decode-CUDA_GRAPHS):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32 --cuda-graphs

Any other model (Mistral, Qwen3, etc.) should also pass with the single-token config but should fail for any prefill-decode config with a message like the following:

WARNING: Using incubator modules: jdk.incubator.vector
Exception in thread "main" java.lang.UnsupportedOperationException: BATCH_PREFILL_DECODE not yet supported for QWEN_3 + F16
  at org.beehive.gpullama3.tornadovm.plan.ForwardPlanFactory.createQwen3FP16Plan(ForwardPlanFactory.java:174)
  at org.beehive.gpullama3.tornadovm.plan.ForwardPlanFactory.createFP16Plan(ForwardPlanFactory.java:90)
  at org.beehive.gpullama3.tornadovm.plan.ForwardPlanFactory.create(ForwardPlanFactory.java:74)
  at org.beehive.gpullama3.tornadovm.plan.ForwardPlanFactory.createBatchPrefillDecode(ForwardPlanFactory.java:65)
  at org.beehive.gpullama3.tornadovm.TornadoVMMasterPlanBatchPrefillDecode.createExecutionPlan(TornadoVMMasterPlanBatchPrefillDecode.java:70)
  at org.beehive.gpullama3.tornadovm.TornadoVMMasterPlanBatchPrefillDecode.<init>(TornadoVMMasterPlanBatchPrefillDecode.java:51)
  at org.beehive.gpullama3.tornadovm.TornadoVMMasterPlan.initializeTornadoVMPlan(TornadoVMMasterPlan.java:59)
  at org.beehive.gpullama3.model.Model.runInstructOnce(Model.java:205)
  at org.beehive.gpullama3.LlamaApp.runSingleInstruction(LlamaApp.java:18)
  at org.beehive.gpullama3.LlamaApp.main(LlamaApp.java:44)
Error: Command failed with return code 1
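
The failure above follows from dispatching on three axes and rejecting unsupported combinations. A minimal sketch of that dispatch, with hypothetical type names and a String standing in for the real plan object, might look like the following. The support rule (Llama supports every mode, other families only single-token) and the message format are taken from this description; everything else is an assumption.

```java
// Hypothetical sketch of three-axis dispatch; not the PR's ForwardPlanFactory.
enum ModelFamilySketch { LLAMA, MISTRAL, QWEN_3 }

enum QuantizationSketch { F16, Q8_0 }

enum ExecutionModeSketch { SINGLE_TOKEN, PREFILL_DECODE, BATCH_PREFILL_DECODE }

final class ForwardPlanFactorySketch {
    private ForwardPlanFactorySketch() { }

    /** Returns a label for the selected plan, or throws for unsupported combinations. */
    static String create(ModelFamilySketch family, QuantizationSketch quant, ExecutionModeSketch mode) {
        // Per the description: only Llama currently supports the prefill-decode modes.
        boolean supported = mode == ExecutionModeSketch.SINGLE_TOKEN
                || family == ModelFamilySketch.LLAMA;
        if (!supported) {
            throw new UnsupportedOperationException(
                    mode + " not yet supported for " + family + " + " + quant);
        }
        return family + "/" + quant + "/" + mode;
    }
}
```

Failing fast in the factory keeps the master plans model-agnostic: they never need to know which combinations exist.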

mikepapadim · 2026-05-30T10:42:52Z

+                MemorySegment tokenEmbeddings = weights.getTokenEmbeddingTable().asByteArray().getSegment();
+                int blocksPerToken = (configuration.dim() + 31) / 32;
+                long bytesPerToken = (long) blocksPerToken * 34;
+                MemorySegment.copy(tokenEmbeddings, (long) token * bytesPerToken,
+                        state.embeddingX.getSegment(), 0, bytesPerToken);
+            }


Maybe this should be a method of its own. Same for the above.

I agree, but IMHO this should be part of another PR in which the embeddings copy is refactored into a distinct component that cleanly dispatches across quantizations and plan types (single-token, prefill-decode, batch-prefill-decode).
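
For reference, the arithmetic in the snippet under review can be isolated in a small helper. This is only a sketch of what such an extracted method might look like, not the refactoring the authors plan. It relies on the Q8_0 block layout (32 int8 values plus a 2-byte FP16 scale, 34 bytes per block) and uses plain byte arrays with System.arraycopy as a stand-in for the MemorySegment.copy call in the PR; the class and method names are hypothetical.

```java
// Hypothetical helper; byte[] stands in for the MemorySegment used in the PR.
final class Q8_0EmbeddingCopySketch {
    static final int BLOCK_SIZE = 32;  // values per Q8_0 block
    static final int BLOCK_BYTES = 34; // 2-byte FP16 scale + 32 int8 values

    private Q8_0EmbeddingCopySketch() { }

    /** Bytes occupied by one token's embedding row, rounding dim up to whole blocks. */
    static long bytesPerToken(int dim) {
        int blocksPerToken = (dim + BLOCK_SIZE - 1) / BLOCK_SIZE;
        return (long) blocksPerToken * BLOCK_BYTES;
    }

    /** Copies the quantized row for the given token from the table into dest[0..]. */
    static void copyTokenRow(byte[] table, int token, int dim, byte[] dest) {
        long rowBytes = bytesPerToken(dim);
        long offset = (long) token * rowBytes;
        System.arraycopy(table, Math.toIntExact(offset), dest, 0, Math.toIntExact(rowBytes));
    }
}
```

For example, with dim = 2048 a row spans 64 blocks, so bytesPerToken returns 64 * 34 = 2176.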

mikepapadim · 2026-05-30T10:43:51Z

    }
+
+    // ── Q8_0 Batch Kernels ───────────────────────────────────────────────────
+


Formatting is odd. Wrap the block in @formatter:off / @formatter:on and run the autoformatter.
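
As an illustration of the markers mentioned here, a hand-aligned block can be fenced with comment markers so the autoformatter leaves it alone. This is a generic sketch, not code from the PR; the class and constant are hypothetical, and it assumes the IDE or formatter in use is configured to honor formatter markers in comments.

```java
// Hypothetical example of protecting hand-formatted code from the autoformatter.
final class FormatterMarkerSketch {
    private FormatterMarkerSketch() { }

    // @formatter:off
    // Hand-aligned layout that an autoformatting pass would otherwise flatten.
    static final int[][] IDENTITY = {
        { 1, 0 },
        { 0, 1 },
    };
    // @formatter:on

    static int trace() {
        return IDENTITY[0][0] + IDENTITY[1][1];
    }
}
```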

mikepapadim · 2026-05-30T10:45:51Z

+    }
+
+    @Override
+    protected String predecessorGraphName(int layerIndex) {


Again, formatter: use the annotations, otherwise the first autoformatting pass will flatten it.

mikepapadim · 2026-05-30T10:48:33Z

+    }
+
+    @Override public ActivationTaskGraph standardActivation() {
+        return new Activation("activationUpdate", state, weights, config);


Maybe the 'activationUpdate' and 'logits' strings should live in an enum or record that is reused, instead of having these Strings all over the place.
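
One possible shape for this suggestion is an enum that owns the graph-name strings. This is a sketch only: the enum name and accessor are hypothetical, and only the two string values come from the review comment.

```java
// Hypothetical enum centralizing task-graph names; not part of the PR.
enum TaskGraphNameSketch {
    ACTIVATION_UPDATE("activationUpdate"),
    LOGITS("logits");

    private final String id;

    TaskGraphNameSketch(String id) {
        this.id = id;
    }

    /** The string passed wherever a task-graph name is required. */
    String id() {
        return id;
    }
}
```

A call site would then pass TaskGraphNameSketch.ACTIVATION_UPDATE.id() instead of repeating the literal, so a rename touches one place.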

mikepapadim

LGTM, some minor changes needed.

orionpapadakis added 6 commits May 28, 2026 15:36

[prf/dec] Implement prefill-decode for Llama Q8_0

45204f1

Reorganize TornadoVM execution planning and improve naming conventions

8ebf91f

Reorganize TornadoVM execution planning around forward modes, model families, and quantization-specific components.

Update naming from ActivationGraph to ActivationTaskGraph across TornadoVM components

4e4478a

Rename AbstractFFNLayers to AbstractTransformerLayerTaskGraphs and `AbstractLogitsLayer` to `AbstractLogitsTaskGraph`, updating all references to improve clarity and align with naming conventions.

ea478f8

Refactor FFN layer comments to transformer-layer task graphs, aligning with updated naming conventions.

e20ebc5

[ci] Add workflows for Llama-3.2-1B-Instruct Q8_0 inference with prefill-decode and CUDA-graph variants

c7522d1

orionpapadakis requested review from mairooni, mikepapadim and stratika May 28, 2026 12:59

orionpapadakis added the enhancement, refactoring, and prefill-decode labels May 28, 2026

mikepapadim reviewed May 30, 2026

Comment thread src/main/java/org/beehive/gpullama3/tornadovm/plan/components/fp16/LlamaFP16PlanComponents.java Outdated

mikepapadim reviewed May 30, 2026

orionpapadakis added 10 commits June 5, 2026 16:14

[prf/dec] Move embedding copy to InferenceCore, in alignment to single-token plan

d830429

[prf/dec] Update TornadoVM method naming to tornadoVMForwardDecode for consistency

26afbd0

[prf/dec] Drop redundant EmbeddingPreparer

4ae2e8c

[prf/dec] Make batch-state reset model-agnostic

34cee3a

[prf/dec] Make batch-state reset model-agnostic

61c08ae

# Conflicts: # src/main/java/org/beehive/gpullama3/inference/state/State.java # src/main/java/org/beehive/gpullama3/tornadovm/TornadoVMMasterPlanBatchPrefillDecode.java

[prf/dec] Consolidate batch-prefill state management into the base `State` class, removing redundancies in model-specific implementations

ce57287

[prf/dec] Replace model-specific reset methods with direct field manipulations for batch-prefill and single-token state initialization.

bd98e6e

[prf/dec] Remove redundant buffer reset methods from State class

e215ceb

[prf/dec] Update TornadoVM method and interface naming to reflect single-token, prefill-decode, and batch-prefill inference plans

b90fc9c

[prf/dec] Update comments

7f1864b

orionpapadakis wants to merge 16 commits into main from refactor/tornadovm-planning

2 participants