Add the EAGLE-3 speculative-decoding runner (CUDA) by digantdesai · Pull Request #20156 · pytorch/executorch

digantdesai · 2026-06-09T16:03:59Z

A C++ runner that drives the speculator .pte with the shifted (vLLM-EAGLE)
scheme: the draft pairs the target hidden state at position t with token t+1.
Each round therefore runs a single target forward (target_verify) and reseeds
the next draft chain from the hidden states that verify already produced, so no
standalone target decode is needed. Greedy verification keeps the output
identical to greedy target decoding. target_verify runs on stable input buffers
and can be captured as a CUDA graph.
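
To illustrate why greedy verification leaves the output unchanged, here is a minimal sketch of the generic greedy acceptance rule used in speculative decoding. It is not taken from main.cpp: the helper name `greedy_accept` and the window layout (K draft tokens checked against K+1 target argmax predictions from one verify forward) are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not the PR's code: greedy acceptance for one round.
// draft:         K tokens proposed by the draft chain.
// target_argmax: K+1 greedy target predictions from a single verify forward.
//                target_argmax[i] is the target's choice for the slot that
//                draft[i] fills; target_argmax[K] follows a fully accepted
//                draft. This layout is an assumption.
// Returns the tokens to emit this round: the matching draft prefix plus one
// target token (the correction at the first mismatch, or the extra token when
// all K draft tokens matched).
std::vector<int64_t> greedy_accept(
    const std::vector<int64_t>& draft,
    const std::vector<int64_t>& target_argmax) {
  assert(target_argmax.size() == draft.size() + 1);
  std::vector<int64_t> out;
  std::size_t i = 0;
  // Accept draft tokens while they equal what the target would have picked.
  while (i < draft.size() && draft[i] == target_argmax[i]) {
    out.push_back(draft[i]);
    ++i;
  }
  // Always emit one token chosen by the target itself.
  out.push_back(target_argmax[i]);
  return out;
}
```

Every emitted token equals the target's argmax given the tokens accepted before it, which is why the sequence matches plain greedy decoding regardless of how good the draft is. For example, with draft `{5, 7, 9}` and target predictions `{5, 7, 2, 4}`, the round emits `{5, 7, 2}`.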

The runner requires the .pte metadata (failing loudly if it is absent) and
enforces the exported prefill range [get_min_prefill_chunk,
get_max_prefill_chunk] without chunking. The prefill bonus token is always
emitted. The speculative loop runs only when more tokens are requested, the
bonus was not EOS, and a K-token verify window fits within get_max_seq_len, so a
one-token or near-context request returns without seeding the draft. The chat
template and stop tokens are flags that default to Gemma 4 IT
(--chat_prefix/--chat_suffix/--stop_ids/--stop_token, with --bos_id -1 to skip
BOS), so other target/tokenizer pairs run without code changes. Device-to-host
reads are error-checked, and the printed tau excludes the free prefill token.
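
The gating conditions above can be sketched as a few small predicates. This is an illustration under stated assumptions, not the runner's actual code: the `ExportLimits` struct, the function names, the single `eos_id` (the runner accepts a list via --stop_ids), the exact off-by-one in the window check, and the reading of tau as mean tokens emitted per verify round are all hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the values that the description says come from
// the .pte metadata getters (get_min_prefill_chunk, get_max_prefill_chunk,
// get_max_seq_len).
struct ExportLimits {
  int64_t min_prefill_chunk;
  int64_t max_prefill_chunk;
  int64_t max_seq_len;
};

// The prompt must fit the exported prefill range; it is never chunked.
bool prompt_fits_prefill(const ExportLimits& m, int64_t prompt_len) {
  return prompt_len >= m.min_prefill_chunk && prompt_len <= m.max_prefill_chunk;
}

// Decide whether to enter the speculative loop after prefill.
// pos: number of context positions already occupied (assumed convention).
// k:   size of the verify window.
bool should_speculate(const ExportLimits& m, int64_t pos, int64_t k,
                      int64_t max_new_tokens, int64_t bonus_token,
                      int64_t eos_id) {
  assert(k > 0);
  const bool wants_more = max_new_tokens > 1;         // bonus token is the first
  const bool not_eos = bonus_token != eos_id;         // bonus did not end the reply
  const bool window_fits = pos + k <= m.max_seq_len;  // room for K-token verify
  return wants_more && not_eos && window_fits;
}

// Assumed definition of tau: tokens emitted per verify round, with the free
// prefill bonus token left out of the numerator.
double mean_tokens_per_round(int64_t emitted_tokens, int64_t rounds) {
  if (rounds <= 0) {
    return 0.0;
  }
  return static_cast<double>(emitted_tokens - 1) / static_cast<double>(rounds);
}
```

Keeping the three checks as named booleans makes it easy to see which condition caused an early return for a one-token or near-context request.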

Authored with assistance from Claude Code.

[ghstack-poisoned]

digantdesai · 2026-06-09T16:04:00Z

Stack from ghstack (oldest at bottom):

pytorch-bot · 2026-06-09T16:04:03Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20156

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit 5725fee with merge base dc55469:

NEW FAILURES - The following jobs have failed:

- Lint / lintrunner (gh)
  `>>> Lint for examples/models/eagle3/main.cpp:`
- pull / test-arm-backend-no-driver (test_pytest_ops_tosa) / linux-job (gh)
  `RuntimeError: Command docker exec -t 95db882def0ce3fdf7aaa1c2671763e165d0cc0c94347dc2517bf8b130ad0386 /exec failed with exit code 1`

This comment was automatically generated by Dr. CI and updates every 15 minutes.


meta-cla bot added the CLA Signed label on Jun 9, 2026.

digantdesai wants to merge 1 commit into gh/digantdesai/60/head from gh/digantdesai/61/head
