Refactor/tornadovm planning #117
Reorganize TornadoVM execution planning around forward modes, model families, and quantization-specific components.
…TornadoVM components
…d `AbstractLogitsLayer` to `AbstractLogitsTaskGraph`, updating all references to improve clarity and align with naming conventions.
…ing with updated naming conventions.
…ill-decode and CUDA-graph variants
    MemorySegment tokenEmbeddings = weights.getTokenEmbeddingTable().asByteArray().getSegment();
    int blocksPerToken = (configuration.dim() + 31) / 32;
    long bytesPerToken = (long) blocksPerToken * 34;
    MemorySegment.copy(tokenEmbeddings, (long) token * bytesPerToken,
            state.embeddingX.getSegment(), 0, bytesPerToken);
    }
Maybe this should be a method of its own. Same for the above.
I agree, but IMHO this should be part of another PR, where the embeddings copy is refactored into a distinct component that cleanly handles dispatch across quantizations and plan types (single-token, prefill-decode, batch-prefill-decode).
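As a rough illustration of what such a standalone method could look like, here is a minimal sketch of the Q8_0 row arithmetic from the diff above. The class name `Q8_0RowSpanSketch`, the `RowSpan` record, and the method names are hypothetical and not part of the PR; the constants rely on the Q8_0 block layout (32 int8 values plus a 2-byte fp16 scale, 34 bytes per block), which matches the `32` and `34` literals in the diff.

```java
/** Hypothetical helper, for illustration only; names are not from the PR. */
final class Q8_0RowSpanSketch {
    static final int VALUES_PER_BLOCK = 32; // int8 values in one Q8_0 block
    static final int BYTES_PER_BLOCK = 34;  // 2-byte fp16 scale + 32 int8 values

    /** Byte offset and length of one token's embedding row. */
    record RowSpan(long offset, long length) { }

    /** Bytes occupied by one embedding row of width dim. */
    static long bytesPerToken(int dim) {
        // Ceiling division: a partially filled block still occupies a full block.
        int blocksPerToken = (dim + VALUES_PER_BLOCK - 1) / VALUES_PER_BLOCK;
        return (long) blocksPerToken * BYTES_PER_BLOCK;
    }

    /** Location of the row for the given token id inside the embedding table. */
    static RowSpan rowSpan(int token, int dim) {
        long length = bytesPerToken(dim);
        return new RowSpan((long) token * length, length);
    }
}
```

The returned offset and length would then feed the same `MemorySegment.copy(...)` call shown in the diff, so each plan type could share the arithmetic without repeating the literals.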
    }

    // ── Q8_0 Batch Kernels ───────────────────────────────────────────────────
The formatting is odd. Wrap the block in `@formatter:off` / `@formatter:on` and run the autoformatter.
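For reference, a minimal sketch of the marker comments mentioned above. It assumes the IDE formatter has marker comments enabled (both IntelliJ IDEA and Eclipse treat `@formatter:off` / `@formatter:on` as an opt-in setting); the class and the table are hypothetical and only stand in for a hand-aligned block.

```java
/** Hypothetical example class; only the marker comments matter here. */
final class FormatterMarkersSketch {

    // @formatter:off
    // Hand-aligned layout that an autoformatter pass would otherwise flatten.
    static final int[][] WORKGROUP_SIZES = {
        {  32,   1,   1 },
        { 128,   1,   1 },
        { 256,   1,   1 },
    };
    // @formatter:on

    static int firstDimension(int row) {
        return WORKGROUP_SIZES[row][0];
    }
}
```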
    }

    @Override
    protected String predecessorGraphName(int layerIndex) {
Formatter again: use the annotations, otherwise the first autoformatting pass will flatten it.
    }

    @Override public ActivationTaskGraph standardActivation() {
        return new Activation("activationUpdate", state, weights, config);
Maybe the 'activationUpdate' and 'logits' strings should live in an enum or record that is reused, instead of having these Strings all over the place.
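One possible shape for that suggestion, as a sketch only: an enum that owns the two graph-name strings quoted in the comment. The enum name `TaskGraphNameSketch` and its `id()` accessor are hypothetical; only the string values `activationUpdate` and `logits` come from the review.

```java
/** Hypothetical enum; only the two string values are taken from the review comment. */
enum TaskGraphNameSketch {
    ACTIVATION_UPDATE("activationUpdate"),
    LOGITS("logits");

    private final String id;

    TaskGraphNameSketch(String id) {
        this.id = id;
    }

    /** The string used as the task-graph name. */
    String id() {
        return id;
    }
}
```

A call site such as the one in the diff could then pass `TaskGraphNameSketch.ACTIVATION_UPDATE.id()` instead of the literal, so a rename touches one place.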
mikepapadim
left a comment
LGTM, some minor changes needed.
# Conflicts:
#	src/main/java/org/beehive/gpullama3/inference/state/State.java
#	src/main/java/org/beehive/gpullama3/tornadovm/TornadoVMMasterPlanBatchPrefillDecode.java
…tate` class, removing redundancies in model-specific implementations
…pulations for batch-prefill and single-token state initialization.
…gle-token, prefill-decode, and batch-prefill inference plans
This PR reorganizes TornadoVM execution planning around three variant axes: model family, quantization, and execution mode.
The previous structure was mainly shaped around two axes: model family and quantization. With prefill-decode and batch-prefill-decode, execution mode becomes a third axis, which greatly increases the number of combinations each model/quantization pair may need to support.
This refactor introduces forward plans, task-graph layouts, and model/quantization component providers so single-token, prefill-decode, and batch-prefill-decode paths can share one cleaner planning structure instead of growing separate master-plan dispatch logic.
More specifically, the GPU inference path is now organized around four collaborating abstractions:

- Layouts (`*ForwardTaskGraphLayout`) encode the index arithmetic for a given graph topology, for example which integer index corresponds to the activation graph, the N layer graphs, or the logits graph. They eliminate magic numbers and make index-dependent code self-documenting.
- Components (`*ForwardPlanComponents`) are model-family + quantization-specific factories. Each implementation constructs the concrete TornadoVM TaskGraph objects for its model (e.g., `LlamaFP16PlanComponents` produces `LlamaFP16FFNLayers`, `LogitsFP16Layer`, etc.). The three component interfaces form a capability hierarchy, `SingleTokenForwardPlanComponents` → `PrefillDecodeForwardPlanComponents` → `BatchPrefillDecodeForwardPlanComponents`, so that Llama, which supports all three execution modes, implements one object that satisfies all three contracts.
- ForwardPlans (`Single/PrefillDecode/BatchPrefillDecodeForwardPlan`) assemble components into an ordered `ImmutableTaskGraph` list and a `GridScheduler`. Each plan encodes the graph topology for one execution mode: `N+2` graphs for single-token, `N+2` for prefill-decode, `2N+3` for batch-prefill/decode. `ForwardPlanFactory` selects the right combination of components and plan based on quantization type, model family, and execution mode.
- MasterPlans (`TornadoVMMasterPlan*`) own the TornadoVM execution lifecycle: they create the `TornadoExecutionPlan` from the ForwardPlan's graph list, handle warmup and CUDA-graph configuration, and expose the forward-pass entry points (`tornadoVMForwardDecode`, `tornadoVMForwardPrefill`, `TornadoVMForwardBatchPrefill`) used by the inference core. They are model-agnostic: all model-specific knowledge lives in the components layer below them.

Notes
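To make the layout idea above concrete, here is a minimal sketch of index arithmetic for the single-token topology, plus the per-mode graph counts quoted in the description. All type and method names are hypothetical, and the ordering (activation graph first, then the N layer graphs, then logits) is an assumption for illustration; the PR's actual `*ForwardTaskGraphLayout` classes may order or name things differently.

```java
import java.util.Objects;

/** Hypothetical single-token layout; the graph ordering is assumed, not taken from the PR. */
record SingleTokenLayoutSketch(int numLayers) {

    SingleTokenLayoutSketch {
        if (numLayers < 1) {
            throw new IllegalArgumentException("numLayers must be >= 1");
        }
    }

    /** Assumed position of the activation graph. */
    int activationIndex() {
        return 0;
    }

    /** Assumed position of the graph for a zero-based transformer layer. */
    int layerIndex(int layer) {
        return 1 + Objects.checkIndex(layer, numLayers);
    }

    /** Assumed position of the logits graph, after all layer graphs. */
    int logitsIndex() {
        return numLayers + 1;
    }

    /** N + 2 graphs in total, as stated in the description. */
    int graphCount() {
        return numLayers + 2;
    }
}

/** Graph counts per execution mode, using the formulas from the description. */
final class GraphCountSketch {
    static int singleToken(int n)        { return n + 2; }
    static int prefillDecode(int n)      { return n + 2; }
    static int batchPrefillDecode(int n) { return 2 * n + 3; }
}
```

With a layout object like this, a caller asks for `layout.logitsIndex()` rather than writing `numLayers + 1` inline, which is the kind of magic number the layouts are meant to remove.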
Verification

- Use Java 21 or 25.
- Set up TornadoVM.
- Run `mvn clean install`.

llama fp16 (single-token):

    ./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048

llama fp16 (prefill-decode):

    ./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode

llama fp16 (batch-prefill-decode):

    ./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32

llama fp16 (batch-prefill-decode-CUDA_GRAPHS):

    ./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32 --cuda-graphs

llama q8_0 (single-token):

    ./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "$LONG_PROMPT" --max-tokens 2048

llama q8_0 (prefill-decode):

    ./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode

llama q8_0 (batch-prefill-decode):

    ./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32

llama q8_0 (batch-prefill-decode-CUDA_GRAPHS):

    ./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32 --cuda-graphs

Any other model (Mistral, Qwen3, etc.) should also pass with the single-token config but should fail for any prefill-decode config with the following message: