guide/token-prediction #625

2026-07-04T13:29:58Z

giscus[bot]
Bot Jul 4, 2026

guide/token-prediction

Using token predictors to speed up the generation process in node-llama-cpp

https://node-llama-cpp.withcat.ai/guide/token-prediction

temandeveloper · 2026-07-04T13:30:01Z

temandeveloper
Jul 4, 2026 — with giscus

Hello there

I'm trying to use Multi-Token Prediction (MTP) with Gemma 4, using the models from your repo:

Models used:

Base Model: gemma-4-E2B-it.Q6_K.gguf
Draft Model: gemma-4-E2B-it.mtp.Q8_0.gguf

My code:

// Load main model
try {
  modelStart = await appState.llmEngine.loadModel(modelOptions)
  mainModelLoaded = true
  appState.setModelStart(modelStart)
  appState.setModelPath(data.path)
} catch (mainModelError) {
  console.error('[Backend] Failed to load main model:', mainModelError)
  throw new Error(`Failed to load main model: ${mainModelError.message}`)
}

// Load draft model
try {
  draftModelStart = await appState.llmEngine.loadModel({
    modelPath: "C:/Users/YamatoLab/Downloads/gemma-4-E2B-it.mtp.Q8_0.gguf"
  })
  draftModelLoaded = true
  appState.setDraftModelStart(draftModelStart)
} catch (draftModelError) {
  console.error('[Backend] Failed to load draft model:', draftModelError)
  throw new Error(`Failed to load draft model: ${draftModelError.message}`)
}

// Build LlamaContextOptions for each node
const LlamaContextOptions = {
  sequences: agentNodes.length, // total sessions needed
}

if (inferenceConfig.context_length !== undefined && inferenceConfig.context_length >= 512) {
  LlamaContextOptions.contextSize = {
    min: 512,
    max: inferenceConfig.context_length
  }
}

if (inferenceConfig.batch_size && inferenceConfig.batch_size > 0) {
  LlamaContextOptions.batchSize = inferenceConfig.batch_size
}

if (inferenceConfig.cpu_threads !== undefined && inferenceConfig.cpu_threads >= 0) {
  LlamaContextOptions.threads = inferenceConfig.cpu_threads
}

const context = await appState.modelStart.createContext(LlamaContextOptions)
// Create draft context
const draftContext = await appState.draftModelStart.createContext({
  contextSize: {
    max: 2048
  }
})
const draftContextSequence = draftContext.getSequence()

// Create session
const sessionChat = new LlamaChatSession({
  contextSequence: context.getSequence({
    tokenPredictor: new DraftSequenceTokenPredictor(draftContextSequence, {
      minTokens: 0,
      minConfidence: 0.6
    })
  }),
  autoDisposeSequence: true,
  chatWrapper: agentWrapper,
  systemPrompt
})

I get the following log:

19:39:37.347 > [node-llama-cpp] load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
19:39:37.349 > [node-llama-cpp] load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
19:39:37.443 > [node-llama-cpp] load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
19:39:45.330 > [node-llama-cpp] load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
19:39:48.199 > [node-llama-cpp] llama_init_from_model: failed to initialize the context: Gemma4Assistant requires ctx_other to be set (this warning is normal during memory fitting)
19:39:48.200 > [node-llama-cpp] llama_init_from_model: failed to initialize the context: Gemma4Assistant requires ctx_other to be set (this warning is normal during memory fitting)
19:39:48.201 > [node-llama-cpp] llama_init_from_model: failed to initialize the context: Gemma4Assistant requires ctx_other to be set (this warning is normal during memory fitting)
19:39:48.202 > [node-llama-cpp] llama_init_from_model: failed to initialize the context: Gemma4Assistant requires ctx_other to be set (this warning is normal during memory fitting)
19:39:48.202 > [node-llama-cpp] llama_init_from_model: failed to initialize the context: Gemma4Assistant requires ctx_other to be set (this warning is normal during memory fitting)
19:39:49.226 > [node-llama-cpp] llama_init_from_model: failed to initialize the context: Gemma4Assistant requires ctx_other to be set (this warning is normal during memory fitting)
19:39:49.226 > [node-llama-cpp] llama_init_from_model: failed to initialize the context: Gemma4Assistant requires ctx_other to be set (this warning is normal during memory fitting)
19:39:50.440 > [node-llama-cpp] load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
19:39:50.441 > [node-llama-cpp] load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
19:39:50.520 > [node-llama-cpp] load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
19:39:51.758 > [Backend] Initializing draft model context...
19:39:53.149 > [node-llama-cpp] Failed simulating context resource usage. Falling back to estimation heuristic. Error: Failed to create context  
19:39:53.150 > [node-llama-cpp] llama_init_from_model: failed to initialize the context: Gemma4Assistant requires ctx_other to be set (this warning is normal during memory fitting)
19:39:54.574 > [node-llama-cpp] Failed simulating context resource usage. Falling back to estimation heuristic. Error: Failed to create context  
19:39:54.575 > [node-llama-cpp] llama_init_from_model: failed to initialize the context: Gemma4Assistant requires ctx_other to be set (this warning is normal during memory fitting)
19:39:54.575 > [node-llama-cpp] llama_init_from_model: failed to initialize the context: Gemma4Assistant requires ctx_other to be set (this warning is normal during memory fitting)
19:39:54.576 > [node-llama-cpp] llama_init_from_model: failed to initialize the context: Gemma4Assistant requires ctx_other to be set (this warning is normal during memory fitting)
19:39:54.576 > [node-llama-cpp] llama_init_from_model: failed to initialize the context: Gemma4Assistant requires ctx_other to be set (this warning is normal during memory fitting)
19:39:54.577 > [Backend] Failed to create draft model context: Error: Failed to create context
    at createContext (file:///C:/Users/YamatoLab/sapientia/node_modules/node-llama-cpp/dist/evaluator/LlamaContext/LlamaContext.js:710:27)       
    at async LlamaContext._create (file:///C:/Users/YamatoLab/sapientia/node_modules/node-llama-cpp/dist/evaluator/LlamaContext/LlamaContext.js:752:24)
    at async file:///C:/Users/YamatoLab/sapientia/node_modules/node-llama-cpp/dist/evaluator/LlamaModel/LlamaModel.js:342:24
    at async withLock (file:///C:/Users/YamatoLab/sapientia/node_modules/lifecycle-utils/dist/withLock.js:23:16)
    at async LlamaModel.createContext (file:///C:/Users/YamatoLab/sapientia/node_modules/node-llama-cpp/dist/evaluator/LlamaModel/LlamaModel.js:339:16)
    at async NodeWorkflowManager.nodeAgentsContextBuilder (file:///C:/Users/YamatoLab/sapientia/out/main/index.js:1190:27)
    at async ChatManager.openContextModel (file:///C:/Users/YamatoLab/sapientia/out/main/index.js:1508:25)
    at async ChatManager.initializeChat (file:///C:/Users/YamatoLab/sapientia/out/main/index.js:1448:9)
    at async file:///C:/Users/YamatoLab/sapientia/out/main/index.js:1800:25
    at async Session.<anonymous> (node:electron/js2c/browser_init:2:113091)

Do you have a solution for this issue?

1 reply

giladgd Jul 5, 2026
Maintainer

llama.cpp recently added new kinds of token predictors (such as MTP, EAGLE3, and DFlash) that share some internal state between the draft context and the main one, so they require deeper coordination between the two.
At the moment, node-llama-cpp doesn't support those kinds of specialized draft contexts.
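
For contrast, the general draft-and-verify idea behind independent draft-model prediction, where the draft model shares nothing with the main one, can be illustrated with a toy sketch. Everything here is hypothetical: `speculativeGenerate`, `countUp`, and `sloppyDraft` are stand-ins written for this example and are not node-llama-cpp or llama.cpp APIs. A real implementation would also have the main model check all proposed tokens in a single batch rather than one call at a time.

```javascript
// Toy greedy "models": each maps a token prefix to the next token.
// Both are hypothetical stand-ins, not node-llama-cpp objects.
const countUp = (prefix) => (prefix[prefix.length - 1] + 1) % 5
const sloppyDraft = (prefix) => {
  const last = prefix[prefix.length - 1]
  return last === 2 ? 0 : (last + 1) % 5 // deliberately wrong after token 2
}

function speculativeGenerate(mainNext, draftNext, prompt, maxNewTokens, draftLength) {
  const tokens = [...prompt]
  const target = prompt.length + maxNewTokens
  let accepted = 0

  while (tokens.length < target) {
    // 1. The draft model proposes a short run of tokens on its own.
    const proposal = []
    for (let i = 0; i < Math.max(1, draftLength); i++) {
      proposal.push(draftNext([...tokens, ...proposal]))
    }

    // 2. The main model checks the proposal position by position.
    //    Its own token is always the one that is kept.
    for (const guess of proposal) {
      if (tokens.length >= target) break
      const actual = mainNext(tokens)
      tokens.push(actual)
      if (actual !== guess) break // mismatch: discard the rest of the proposal
      accepted++
    }
  }

  return { tokens, accepted }
}

const demo = speculativeGenerate(countUp, sloppyDraft, [0], 6, 3)
console.log(demo.tokens.join(' '), '| accepted draft tokens:', demo.accepted)
```

In this toy, the output always matches what the main model would produce alone, and the draft only affects how many tokens get accepted per round. The predictors described above differ in that the draft side depends on state from the main context, which is why the draft context cannot simply be created on its own.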

Because these draft models are becoming more standardized (models on Hugging Face now include the draft models in the same repo, like the ones I made), I plan to add full support for them. That includes automatically downloading the relevant draft model when you use resolveModelFile, determining whether using it would be optimal on the current system given its available resources, and automatically using and configuring it unless you opt out. You won't have to configure anything special to benefit from the speedup, or manually experiment and tune the configuration to find the best balance on your machine.
I didn't add support for this in v3.19.0 because it requires infrastructural changes that will take some time to implement, and I didn't want to postpone releasing Gemma 4 support any longer.
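
Until that support lands, one way to keep an app like the one in the question running is to treat the draft context as optional and continue without token prediction when it cannot be created. The following is a minimal sketch, not an official recommendation: `tryCreateDraftSequence` is a hypothetical helper, and it assumes only the `createContext` and `getSequence` calls already used in the question's code.

```javascript
// Hypothetical helper: tries to create a draft sequence and returns
// undefined when the draft context cannot be created (as happens with
// the MTP draft model in the log above).
async function tryCreateDraftSequence(draftModel, contextOptions = {}) {
  try {
    const draftContext = await draftModel.createContext(contextOptions)
    return draftContext.getSequence()
  } catch (error) {
    console.warn(
      '[Backend] Draft context unavailable, continuing without token prediction:',
      error.message
    )
    return undefined
  }
}
```

The caller would then create the `DraftSequenceTokenPredictor` only when a sequence was returned, and otherwise call `context.getSequence()` without a `tokenPredictor`, so generation still works, just without the speedup.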
