
Refactor/tornadovm planning#117

Open
orionpapadakis wants to merge 16 commits into main from refactor/tornadovm-planning

Collaborator

@orionpapadakis orionpapadakis commented May 28, 2026

This PR reorganizes TornadoVM execution planning around three variant axes:

  • model family
  • quantization
  • forward execution mode

The previous structure was mainly shaped around two axes: model family and quantization. With prefill-decode and batch-prefill-decode, execution mode becomes a third axis, which greatly increases the number of
combinations each model/quantization pair may need to support.

This refactor introduces forward plans, task-graph layouts, and model/quantization component providers so single-token, prefill-decode, and batch-prefill-decode paths can share one cleaner planning structure
instead of growing separate master-plan dispatch logic.

More specifically, the GPU inference path is now organized around four collaborating abstractions:

  • Layouts (*ForwardTaskGraphLayout) encode the index arithmetic for a given graph topology — for example, which integer index corresponds to the activation graph, the N layer graphs, or the logits graph. They
    eliminate magic numbers and make index-dependent code self-documenting.

  • Components (*ForwardPlanComponents) are model-family + quantization-specific factories. Each implementation constructs the concrete TornadoVM TaskGraph objects for its model (e.g., LlamaFP16PlanComponents produces LlamaFP16FFNLayers, LogitsFP16Layer, etc.). The three component interfaces (SingleTokenForwardPlanComponents, PrefillDecodeForwardPlanComponents, BatchPrefillDecodeForwardPlanComponents) form a capability hierarchy, so that Llama, which supports all three execution modes, implements one object that satisfies all three contracts.

  • ForwardPlans (Single/PrefillDecode/BatchPrefillDecodeForwardPlan) assemble components into an ordered ImmutableTaskGraph list and a GridScheduler. Each plan encodes the graph topology for one execution mode: N+2 graphs for single-token, N+2 for prefill-decode, 2N+3 for batch-prefill-decode. ForwardPlanFactory selects the right combination of components and plan based on quantization type, model family, and execution mode.

  • MasterPlans (TornadoVMMasterPlan*) own the TornadoVM execution lifecycle: they create the TornadoExecutionPlan from the ForwardPlan's graph list, handle warmup and CUDA-graph configuration, and expose the forward-pass entry points (tornadoVMForwardDecode, tornadoVMForwardPrefill, TornadoVMForwardBatchPrefill) used by the inference core. They are model-agnostic — all model-specific knowledge lives in the components layer below them.
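To make the layout idea concrete, here is a minimal sketch of the index arithmetic for the single-token topology (N+2 graphs). The record name, its methods, and the ordering (activation first, logits last) are assumptions made for illustration and are not taken from the PR's code.

```java
// Hypothetical layout for the single-token topology: 1 activation graph,
// N layer graphs, 1 logits graph (N + 2 graphs in total).
// ASSUMPTION: activation comes first and logits last; the real
// *ForwardTaskGraphLayout classes may order or name things differently.
record SingleTokenLayoutSketch(int numLayers) {

    SingleTokenLayoutSketch {
        if (numLayers <= 0) {
            throw new IllegalArgumentException("numLayers must be positive");
        }
    }

    // Index of the activation graph in the ordered graph list.
    int activationIndex() {
        return 0;
    }

    // Index of the graph for transformer layer `layer` (0-based).
    int layerIndex(int layer) {
        if (layer < 0 || layer >= numLayers) {
            throw new IndexOutOfBoundsException("layer " + layer);
        }
        return 1 + layer;
    }

    // Index of the logits graph, placed after all layer graphs.
    int logitsIndex() {
        return numLayers + 1;
    }

    // Total number of graphs in this topology: N + 2.
    int graphCount() {
        return numLayers + 2;
    }
}
```

With a layout object like this, callers ask for `layout.logitsIndex()` instead of hard-coding `numLayers + 1` at each call site.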

Notes

  • Adds Llama Q8_0 prefill-decode support, which also demonstrates why this refactor is needed.
  • Renames task-graph abstractions for clearer roles.
  • Moves scheduling helpers into a dedicated TornadoVM scheduling package.
  • Keeps graph topology and execution behavior unchanged outside the new prefill-decode path.

Verification

  • Use Java 21 or 25

  • Set up TornadoVM

  • mvn clean install

  • llama fp16 (single-token):
    ./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048

  • llama fp16 (prefill-decode):
    ./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode

  • llama fp16 (batch-prefill-decode):
    ./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32

  • llama fp16 (batch-prefill-decode-CUDA_GRAPHS):
    ./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32 --cuda-graphs

  • llama q8_0 (single-token):
    ./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "$LONG_PROMPT" --max-tokens 2048

  • llama q8_0 (prefill-decode):
    ./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode

  • llama q8_0 (batch-prefill-decode):
    ./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32

  • llama q8_0 (batch-prefill-decode-CUDA_GRAPHS):
    ./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32 --cuda-graphs

Any other model (Mistral, Qwen3, etc.) should still pass with the single-token configuration, but should fail for any prefill-decode configuration with a message like the following:

WARNING: Using incubator modules: jdk.incubator.vector
Exception in thread "main" java.lang.UnsupportedOperationException: BATCH_PREFILL_DECODE not yet supported for QWEN_3 + F16
  at org.beehive.gpullama3.tornadovm.plan.ForwardPlanFactory.createQwen3FP16Plan(ForwardPlanFactory.java:174)
  at org.beehive.gpullama3.tornadovm.plan.ForwardPlanFactory.createFP16Plan(ForwardPlanFactory.java:90)
  at org.beehive.gpullama3.tornadovm.plan.ForwardPlanFactory.create(ForwardPlanFactory.java:74)
  at org.beehive.gpullama3.tornadovm.plan.ForwardPlanFactory.createBatchPrefillDecode(ForwardPlanFactory.java:65)
  at org.beehive.gpullama3.tornadovm.TornadoVMMasterPlanBatchPrefillDecode.createExecutionPlan(TornadoVMMasterPlanBatchPrefillDecode.java:70)
  at org.beehive.gpullama3.tornadovm.TornadoVMMasterPlanBatchPrefillDecode.<init>(TornadoVMMasterPlanBatchPrefillDecode.java:51)
  at org.beehive.gpullama3.tornadovm.TornadoVMMasterPlan.initializeTornadoVMPlan(TornadoVMMasterPlan.java:59)
  at org.beehive.gpullama3.model.Model.runInstructOnce(Model.java:205)
  at org.beehive.gpullama3.LlamaApp.runSingleInstruction(LlamaApp.java:18)
  at org.beehive.gpullama3.LlamaApp.main(LlamaApp.java:44)
Error: Command failed with return code 1
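The failure above follows from dispatching on all three axes and rejecting combinations that have no components yet. A rough sketch of that dispatch is shown below; the enums, the class, and `describePlan` are hypothetical stand-ins rather than the PR's ForwardPlanFactory, and only the message format is copied from the trace.

```java
// Hypothetical stand-ins for the three variant axes.
enum ExecutionModeSketch { SINGLE_TOKEN, PREFILL_DECODE, BATCH_PREFILL_DECODE }
enum ModelFamilySketch { LLAMA, QWEN_3 }
enum QuantizationSketch { F16, Q8_0 }

final class ForwardPlanFactorySketch {
    private ForwardPlanFactorySketch() {}

    // Returns a label for the selected plan, or throws when the
    // (family, quantization, mode) combination is not supported.
    // ASSUMPTION: per the PR description, only Llama supports the
    // prefill-decode modes; every family supports single-token.
    static String describePlan(ModelFamilySketch family,
                               QuantizationSketch quant,
                               ExecutionModeSketch mode) {
        boolean supported = mode == ExecutionModeSketch.SINGLE_TOKEN
                || family == ModelFamilySketch.LLAMA;
        if (!supported) {
            // Same message shape as in the stack trace above.
            throw new UnsupportedOperationException(
                    mode + " not yet supported for " + family + " + " + quant);
        }
        return family + "/" + quant + "/" + mode;
    }
}
```

Failing fast at plan-creation time keeps the unsupported combination from reaching graph construction or warmup.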

Reorganize TornadoVM execution planning around forward modes, model families, and quantization-specific components.
…d `AbstractLogitsLayer` to `AbstractLogitsTaskGraph`, updating all references to improve clarity and align with naming conventions.
Comment on lines +157 to +162
MemorySegment tokenEmbeddings = weights.getTokenEmbeddingTable().asByteArray().getSegment();
int blocksPerToken = (configuration.dim() + 31) / 32;
long bytesPerToken = (long) blocksPerToken * 34;
MemorySegment.copy(tokenEmbeddings, (long) token * bytesPerToken,
state.embeddingX.getSegment(), 0, bytesPerToken);
}
Member

Maybe this should be a method of its own. Same for the code above.

Collaborator Author

I agree, but IMHO this should be part of another PR, where the embeddings copy is refactored into a distinct component that cleanly handles dispatch across quantizations and plan types (single-token, prefill-decode, batch-prefill-decode).
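For reference, the extraction being discussed could look roughly like the sketch below. It mirrors the arithmetic in the snippet under review (32 values per Q8_0 block, 34 bytes per block: a 2-byte fp16 scale plus 32 int8 values), but the class and method names are hypothetical, and plain byte arrays stand in for the MemorySegment objects used in the PR.

```java
// Sketch of the Q8_0 embedding-row copy as a standalone helper.
// ASSUMPTION: names are illustrative; byte[] replaces MemorySegment
// so the example stays self-contained.
final class Q8_0EmbeddingCopySketch {
    static final int BLOCK_SIZE = 32;   // quantized values per Q8_0 block
    static final int BLOCK_BYTES = 34;  // 2-byte fp16 scale + 32 int8 values

    private Q8_0EmbeddingCopySketch() {}

    // Bytes occupied by one token's embedding row of width `dim`.
    static long bytesPerToken(int dim) {
        int blocksPerToken = (dim + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
        return (long) blocksPerToken * BLOCK_BYTES;
    }

    // Copies the quantized row for `token` from the table to the start of `dst`.
    static void copyTokenRow(byte[] table, int token, int dim, byte[] dst) {
        int rowBytes = Math.toIntExact(bytesPerToken(dim));
        int srcOffset = Math.toIntExact((long) token * rowBytes);
        System.arraycopy(table, srcOffset, dst, 0, rowBytes);
    }
}
```

For example, `bytesPerToken(2048)` is 64 blocks × 34 bytes = 2176 bytes, which is the per-token stride used as the source offset multiplier.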

}

// ── Q8_0 Batch Kernels ───────────────────────────────────────────────────

Member

The formatting is odd. Wrap the block in @formatter:off / @formatter:on markers and run the autoformatter.

}

@Override
protected String predecessorGraphName(int layerIndex) {
Member

Again, the formatter: use the annotations, otherwise the first autoformatting pass will flatten it.
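As an illustration of the markers the reviewer refers to, the sketch below wraps a hand-aligned chain in `// @formatter:off` and `// @formatter:on` comments. IntelliJ IDEA and Eclipse honor these only when the corresponding formatter option is enabled. The class, method, and `layer_` prefix are hypothetical and unrelated to the PR's code.

```java
// Sketch: protecting hand-formatted code from the autoformatter.
// ASSUMPTION: the IDE's "formatter on/off markers" option is enabled.
final class FormatterMarkerSketch {
    private FormatterMarkerSketch() {}

    static String buildGraphName(int layerIndex) {
        // @formatter:off
        return new StringBuilder()
                .append("layer_")      // hypothetical name prefix
                .append(layerIndex)    // index of the layer graph
                .toString();
        // @formatter:on
    }
}
```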

}

@Override public ActivationTaskGraph standardActivation() {
return new Activation("activationUpdate", state, weights, config);
Member

Maybe the 'activationUpdate' and 'logits' strings should live in an enum or record that is reused, instead of having these Strings all over the place.
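One possible shape for that suggestion is sketched below. The enum and accessor names are hypothetical; only the two string values come from the review comment.

```java
// Sketch: a single home for task-graph names instead of scattered literals.
// ASSUMPTION: enum and method names are illustrative only.
enum TaskGraphNameSketch {
    ACTIVATION_UPDATE("activationUpdate"),
    LOGITS("logits");

    private final String id;

    TaskGraphNameSketch(String id) {
        this.id = id;
    }

    // The string passed wherever a graph name is required.
    String id() {
        return id;
    }
}
```

A call site would then pass `TaskGraphNameSketch.ACTIVATION_UPDATE.id()` rather than the literal "activationUpdate", so a rename happens in one place.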

Member

@mikepapadim mikepapadim left a comment

LGTM, some minor changes needed.

# Conflicts:
#	src/main/java/org/beehive/gpullama3/inference/state/State.java
#	src/main/java/org/beehive/gpullama3/tornadovm/TornadoVMMasterPlanBatchPrefillDecode.java
…tate` class, removing redundancies in model-specific implementations
…pulations for batch-prefill and single-token state initialization.
…gle-token, prefill-decode, and batch-prefill inference plans