
Add Prefill–Decode Separation with Batched Prompt Ingestion and Logits Skipping #102

Open

orionpapadakis wants to merge 23 commits into beehive-lab:main from orionpapadakis:feat/prefill-decode

Conversation

@orionpapadakis
Collaborator

@orionpapadakis orionpapadakis commented Apr 2, 2026

This PR implements the prefill-decode concept in GPULlama3.java:

  1. Prefill (or prompt ingestion) is the inference pass over the prompt (input) tokens. Currently they are ingested sequentially, one by one. However, all prompt tokens are known up front and are independent of each other, so sequential ingestion is sub-optimal. A well-known practice is to ingest prompt tokens in batches (e.g., of 32) instead of one by one. In addition, logits computation can be skipped for every prompt token except the last, since those logits are never used. The logits of the last prompt token produce the first generated token, which in turn feeds the decode phase.

  2. Decode (new token generation) is the inference pass for each generated token. In contrast to prefill, this phase remains token-sequential (as it is now) because each generated token depends on the previous one.
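The batching and logits-skipping idea above can be sketched in plain Java. This is a minimal illustration, not the PR's code: `forwardBatchNoLogits` and `forwardWithLogits` are hypothetical stand-ins for the real `InferenceCore*` forward passes, and the call counters exist only to make the control flow visible.

```java
public class PrefillSketch {
    static final int PREFILL_BATCH_SIZE = 32;
    static int batchCalls = 0;   // how many batched forward passes ran
    static int logitsCalls = 0;  // how many logits computations ran

    // Hypothetical stand-in for a batched forward pass over prompt
    // positions [from, to): fills the KV cache, computes no logits.
    static void forwardBatchNoLogits(int[] tokens, int from, int to) {
        batchCalls++;
    }

    // Hypothetical stand-in for a forward pass that also produces the
    // logits used to sample the first generated token.
    static float[] forwardWithLogits(int token, int position) {
        logitsCalls++;
        return new float[0];
    }

    static float[] prefill(int[] prompt) {
        int last = prompt.length - 1;
        int pos = 0;
        // Ingest every prompt token except the last in batches;
        // their logits are never used, so they are skipped entirely.
        while (pos < last) {
            int end = Math.min(pos + PREFILL_BATCH_SIZE, last);
            forwardBatchNoLogits(prompt, pos, end);
            pos = end;
        }
        // Only the last prompt token needs logits: they seed the first
        // generated token, which then drives the sequential decode phase.
        return forwardWithLogits(prompt[last], last);
    }
}
```

For a 100-token prompt with B=32, this runs four batched passes (32+32+32+3 tokens) and exactly one logits computation, versus 100 logits computations in the sequential path.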

Based on the above, this PR breaks the prefill-decode implementation down into four discrete phases:

  • Phase 1: CPU prefill/decode split — sequential, skip logits during prefill

  • Phase 2: GPU prefill/decode split — sequential, skip logits during prefill

  • Phase 3: CPU batched prefill (batch size B, default 32)

  • Phase 4: GPU batched prefill (batch size B, default 32)


Implementation

Top-level dispatch

Llama#generateTokens / #generateTokensGPU perform a three-way dispatch:

  • PREFILL_BATCH_SIZE > 1 → InferenceEngineWithBatchPrefillDecode (Phase 3/4)
  • PREFILL_BATCH_SIZE == 1 → InferenceEngineWithPrefillDecode (Phase 1/2)
  • (default) → InferenceEngine (standard)
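The three-way dispatch can be sketched as follows. The class, field, and method names here are illustrative only; in the PR the real entry points are `Llama#generateTokens` / `#generateTokensGPU`, and the assumption that an unset batch size reads as 0 is mine.

```java
public class DispatchSketch {
    // Hypothetical config value; in the PR this comes from the
    // --batch-prefill-size CLI flag (PREFILL_BATCH_SIZE).
    // Assumed: 0 means the prefill-decode flag was not set at all.
    static int prefillBatchSize = 0;

    static String selectEngine() {
        if (prefillBatchSize > 1) {
            return "InferenceEngineWithBatchPrefillDecode"; // Phase 3/4
        } else if (prefillBatchSize == 1) {
            return "InferenceEngineWithPrefillDecode";      // Phase 1/2
        }
        return "InferenceEngine";                           // standard path
    }
}
```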

Per-phase entry points

| Phase | Engine | Core primitive | Execution plan |
| --- | --- | --- | --- |
| 1 — CPU sequential | InferenceEngineWithPrefillDecode#generateTokensLlama | InferenceCoreWithPrefillDecode#forwardJavaPrefill | (none) |
| 2 — GPU sequential | InferenceEngineWithPrefillDecode#generateTokensGPULlama | InferenceCoreWithPrefillDecode#forwardTornadoVMPrefill | TornadoVMMasterPlanWithPrefillDecode (N+2 graphs) |
| 3 — CPU batched | InferenceEngineWithBatchPrefillDecode#generateTokensLlama | InferenceCoreBatchPrefillDecode#batchForwardJavaPrefill | (none) |
| 4 — GPU batched | InferenceEngineWithBatchPrefillDecode#generateTokensGPULlama | InferenceCoreBatchPrefillDecode#batchForwardTornadoVMPrefill | TornadoVMMasterPlanWithBatchPrefillDecode (2N+3 graphs) |

Phase 2 note: LlamaFP16FFNLayersPrefillDecode fixes no-CUDA-graph mode — layer 0 allocates the
KV cache via FIRST_EXECUTION, layers 1..N use named consumeFromDevice.

Phase 4 note: KV cache flows from batch prefill into decode via persistOnDevice/consumeFromDevice
across the prefill→decode graph boundary, all within a single TornadoExecutionPlan.

Quantization

Both prefill/decode plans switch on GGMLType in createExecutionPlan(): F16 proceeds, Q8_0
and others throw UnsupportedOperationException at plan-construction time, mirroring the
QuantizationPlannerFactory pattern in the standard plan.
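The fail-fast guard might look like the sketch below. The enum subset, method shape, and return value are hypothetical; only the switch-on-type pattern with an `UnsupportedOperationException` at plan-construction time mirrors what the PR describes.

```java
public class PlanGuard {
    // Hypothetical subset of the GGUF tensor types the planner can see.
    enum GGMLType { F16, Q8_0, Q4_0 }

    static String createExecutionPlan(GGMLType type) {
        switch (type) {
            case F16:
                return "fp16-prefill-decode-plan"; // supported today
            default:
                // Q8_0 and others fail fast at plan-construction time,
                // mirroring the QuantizationPlannerFactory pattern.
                throw new UnsupportedOperationException(
                        "Prefill-decode plan not implemented for " + type);
        }
    }
}
```

Failing at plan construction rather than mid-inference means an unsupported quantization is reported before any GPU buffers are allocated.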


Functional Status

All four phases are fully implemented and verified for LLaMA FP16, with and without CUDA graphs.

Remaining work: model coverage (Mistral, Qwen2, Qwen3, DeepSeek, Phi-3, Granite) and Q8_0
(extension points already in place in both execution plans).


Performance (RTX 5090 ROG Laptop, TornadoVM PTX, LLaMA-3.2-1B FP16, B=32)

| Configuration | Short prompt (tok/s) | Speedup vs GPU std | Long prompt (tok/s) | Speedup vs GPU std |
| --- | --- | --- | --- | --- |
| CPU standard | 22.87 | 0.35× | 26.24 | 0.45× |
| CPU prefill-decode (Phase 1) | 24.43 | 0.38× | 29.18 | 0.50× |
| CPU batched prefill-decode (Phase 3) | 28.14 | 0.44× | 42.62 | 0.73× |
| GPU standard | 64.71 | 1.00× | 58.15 | 1.00× |
| GPU prefill-decode, CUDA graphs (Phase 2) | 66.55 | 1.03× | 60.86 | 1.05× |
| GPU prefill-decode, no CUDA graphs (Phase 2) | 55.60 | 0.86× | 54.14 | 0.93× |
| GPU batched prefill-decode, CUDA graphs (Phase 4) | 72.67 | 1.12× | 90.93 | 1.56× |
| GPU batched prefill-decode, no CUDA graphs (Phase 4) | 62.81 | 0.97× | 81.18 | 1.40× |
  • Long prompts benefit most (+56% GPU, +62% CPU over standard) — O(N²) prefill attention amortised over batches
  • Short prompt GPU gains are modest (3–12%) — single-token GeMV already saturates the GPU
  • CUDA graphs add ~10–12% over interpreter mode
  • Peak: GPU batched + CUDA graphs at 90.93 tok/s on long prompts

How to run

export SHORT_PROMPT="Tell me a joke."
export LONG_PROMPT="The history of artificial intelligence is a fascinating journey through decades of human ingenuity, theoretical breakthroughs, engineering milestones, and philosophical debates about the nature of mind and machine. It began in earnest in the mid-twentieth century, when mathematicians and engineers first dared to ask whether machines could be made to think. Alan Turing proposed his famous imitation game, a test of whether a machine could produce responses indistinguishable from those of a human. John McCarthy coined the term artificial intelligence at a seminal workshop at Dartmouth College in 1956, gathering luminaries who believed that every aspect of learning and intelligence could in principle be precisely described and simulated by a machine. The early decades were marked by optimism, as programs like the General Problem Solver and ELIZA demonstrated the potential for symbolic reasoning and natural language processing. Yet progress was uneven, and two so-called AI winters saw funding and enthusiasm collapse as ambitious promises outpaced practical results. The rise of expert systems in the 1980s briefly rekindled hope, encoding domain knowledge into rule-based engines that could assist doctors, engineers, and financial analysts. The shift to statistical and machine-learning approaches in the 1990s and 2000s proved more durable, culminating in deep learning's triumphant return with AlexNet in 2012. Since then the pace of progress has been extraordinary. Please summarise this history briefly."
# prefill-decode without CUDA graphs
./llama-tornado --gpu --ptx --model /path/to/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --no-cuda-graphs
# prefill-decode with CUDA graphs
./llama-tornado --gpu --ptx --model /path/to/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode
# prefill-decode with batching without cuda-graphs
./llama-tornado --gpu --ptx --model /path/to/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32 --no-cuda-graphs
# prefill-decode with batching with cuda-graphs
./llama-tornado --gpu --ptx --model /path/to/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32

…ith InferenceCoreWithPrefillDecode and InferenceEngineWithPrefillDecode
… Implements `InferenceEngineWithPrefillDecode` and `TornadoVMMasterPlanWithPrefillDecode` for batched token generation. Refactor `Llama` to support the batched prefill flag.
@orionpapadakis orionpapadakis added the enhancement New feature or request label Apr 2, 2026
…ts to dedicated classes and packages

 Move `LlamaFP16BatchPrefillLayers` to `tornadovm.layers.type.fp16.prefll` and `LlamaFP16FFNLayersForUnifiedDecode` to `tornadovm.layers.type.fp16.decode`
…d refactor task graph consumption logic

 Introduce `LogitsFP16LayerDecode` with KV-cache pass-through. Override `consumeFromDevice` and `persistOnDevice` in LlamaFFN layers to fix cross-graph propagation for both CUDA and interpreter modes.
…nd `batched-prefill-decode` execution paths for both CPU and GPU
@mikepapadim mikepapadim marked this pull request as ready for review April 16, 2026 17:17
@mikepapadim
Member

/rerun all

@github-actions
Contributor

🚀 Workflow rerun started

Mode: all
Triggered by: @mikepapadim


@github-actions
Contributor

Workflow rerun success


…PrefillDecode`

This fixes GPU prefill-decode without batching without CUDA Graphs
…`, CPU/GPU) for standard, prefill-decode and prefill-decode with batching
…code and batched-prefill-decode paths in `Mistral`, `Phi3`, `Qwen2`, and `Qwen3` models.
…U prefill-decode and batched-prefill-decode paths
@orionpapadakis orionpapadakis changed the title [WIP] Feat/prefill decode Feat/prefill decode Apr 17, 2026
@mikepapadim mikepapadim changed the title Feat/prefill decode Add Prefill–Decode Separation with Batched Prompt Ingestion and Logits Skipping Apr 17, 2026