
Add Prefill–Decode Separation with Batched Prompt Ingestion and Logits Skipping #102

Open

orionpapadakis wants to merge 23 commits into beehive-lab:main from orionpapadakis:feat/prefill-decode

Conversation

@orionpapadakis
Collaborator

@orionpapadakis orionpapadakis commented Apr 2, 2026

This PR implements the prefill-decode concept in GPULlama3.java:

  1. Prefill (or prompt ingestion) is the inference pass over the prompt (input) tokens. Currently they are ingested sequentially, one by one. However, all prompt tokens are known up front and are independent of each other, so sequential ingestion is sub-optimal. A well-known practice is to ingest prompt tokens in batches (e.g., of 32) instead of one by one. In addition, logits computation can be skipped for every prompt token except the last, since those logits are never used. The logits of the last prompt token produce the first generated token, which in turn feeds the decode phase.

  2. Decode (new token generation) is the inference pass for each generated token. In contrast to prefill, this phase remains token-sequential (as it is now) because each generated token depends on the previous one.
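The batching and logits-skipping idea above can be sketched in plain Java. This is a minimal illustration, not the PR's code: `forwardBatchNoLogits` and `forwardWithLogits` are hypothetical stand-ins for the real `InferenceCore*` forward passes, and the call counters exist only to make the control flow visible.

```java
public class PrefillSketch {
    static final int PREFILL_BATCH_SIZE = 32;
    static int batchCalls = 0;   // how many batched forward passes ran
    static int logitsCalls = 0;  // how many logits computations ran

    // Hypothetical stand-in for a batched forward pass over prompt
    // positions [from, to): fills the KV cache, computes no logits.
    static void forwardBatchNoLogits(int[] tokens, int from, int to) {
        batchCalls++;
    }

    // Hypothetical stand-in for a forward pass that also produces the
    // logits used to sample the first generated token.
    static float[] forwardWithLogits(int token, int position) {
        logitsCalls++;
        return new float[0];
    }

    static float[] prefill(int[] prompt) {
        int last = prompt.length - 1;
        int pos = 0;
        // Ingest every prompt token except the last in batches;
        // their logits are never used, so they are skipped entirely.
        while (pos < last) {
            int end = Math.min(pos + PREFILL_BATCH_SIZE, last);
            forwardBatchNoLogits(prompt, pos, end);
            pos = end;
        }
        // Only the last prompt token needs logits: they seed the first
        // generated token, which then drives the sequential decode phase.
        return forwardWithLogits(prompt[last], last);
    }
}
```

For a 100-token prompt with B=32, this runs four batched passes (32+32+32+3 tokens) and exactly one logits computation, versus 100 logits computations in the sequential path.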

Based on the above, this PR breaks the prefill-decode implementation down into four discrete phases:

  • Phase 1: CPU prefill/decode split — sequential, skip logits during prefill

  • Phase 2: GPU prefill/decode split — sequential, skip logits during prefill

  • Phase 3: CPU batched prefill (batch size B, default 32)

  • Phase 4: GPU batched prefill (batch size B, default 32)


Implementation

Top-level dispatch

Llama#generateTokens / #generateTokensGPU perform a three-way dispatch:

  • PREFILL_BATCH_SIZE > 1 → InferenceEngineWithBatchPrefillDecode (Phase 3/4)
  • PREFILL_BATCH_SIZE == 1 → InferenceEngineWithPrefillDecode (Phase 1/2)
  • (default) → InferenceEngine (standard)
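The three-way dispatch can be sketched as follows. The class, field, and method names here are illustrative only; in the PR the real entry points are `Llama#generateTokens` / `#generateTokensGPU`, and the assumption that an unset batch size reads as 0 is mine.

```java
public class DispatchSketch {
    // Hypothetical config value; in the PR this comes from the
    // --batch-prefill-size CLI flag (PREFILL_BATCH_SIZE).
    // Assumed: 0 means the prefill-decode flag was not set at all.
    static int prefillBatchSize = 0;

    static String selectEngine() {
        if (prefillBatchSize > 1) {
            return "InferenceEngineWithBatchPrefillDecode"; // Phase 3/4
        } else if (prefillBatchSize == 1) {
            return "InferenceEngineWithPrefillDecode";      // Phase 1/2
        }
        return "InferenceEngine";                           // standard path
    }
}
```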

Per-phase entry points

| Phase | Engine | Core primitive | Execution plan |
| --- | --- | --- | --- |
| 1 — CPU sequential | InferenceEngineWithPrefillDecode#generateTokensLlama | InferenceCoreWithPrefillDecode#forwardJavaPrefill | (none) |
| 2 — GPU sequential | InferenceEngineWithPrefillDecode#generateTokensGPULlama | InferenceCoreWithPrefillDecode#forwardTornadoVMPrefill | TornadoVMMasterPlanWithPrefillDecode (N+2 graphs) |
| 3 — CPU batched | InferenceEngineWithBatchPrefillDecode#generateTokensLlama | InferenceCoreBatchPrefillDecode#batchForwardJavaPrefill | (none) |
| 4 — GPU batched | InferenceEngineWithBatchPrefillDecode#generateTokensGPULlama | InferenceCoreBatchPrefillDecode#batchForwardTornadoVMPrefill | TornadoVMMasterPlanWithBatchPrefillDecode (2N+3 graphs) |

Phase 2 note: LlamaFP16FFNLayersPrefillDecode fixes no-CUDA-graph mode — layer 0 allocates the
KV cache via FIRST_EXECUTION, layers 1..N use named consumeFromDevice.

Phase 4 note: KV cache flows from batch prefill into decode via persistOnDevice/consumeFromDevice
across the prefill→decode graph boundary, all within a single TornadoExecutionPlan.

Quantization

Both prefill/decode plans switch on GGMLType in createExecutionPlan(): F16 proceeds, Q8_0
and others throw UnsupportedOperationException at plan-construction time, mirroring the
QuantizationPlannerFactory pattern in the standard plan.
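The fail-fast guard might look like the sketch below. The enum subset, method shape, and return value are hypothetical; only the switch-on-type pattern with an `UnsupportedOperationException` at plan-construction time mirrors what the PR describes.

```java
public class PlanGuard {
    // Hypothetical subset of the GGUF tensor types the planner can see.
    enum GGMLType { F16, Q8_0, Q4_0 }

    static String createExecutionPlan(GGMLType type) {
        switch (type) {
            case F16:
                return "fp16-prefill-decode-plan"; // supported today
            default:
                // Q8_0 and others fail fast at plan-construction time,
                // mirroring the QuantizationPlannerFactory pattern.
                throw new UnsupportedOperationException(
                        "Prefill-decode plan not implemented for " + type);
        }
    }
}
```

Failing at plan construction rather than mid-inference means an unsupported quantization is reported before any GPU buffers are allocated.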


Functional Status

All four phases are fully implemented and verified for LLaMA FP16, with and without CUDA graphs.

Remaining work: model coverage (Mistral, Qwen2, Qwen3, DeepSeek, Phi-3, Granite) and Q8_0
(extension points already in place in both execution plans).


Performance (RTX 5090 ROG Laptop, TornadoVM PTX, LLaMA-3.2-1B FP16, B=32)

| Configuration | Short prompt (tok/s) | Speedup vs GPU std | Long prompt (tok/s) | Speedup vs GPU std |
| --- | --- | --- | --- | --- |
| CPU standard | 22.87 | 0.35× | 26.24 | 0.45× |
| CPU prefill-decode (Phase 1) | 24.43 | 0.38× | 29.18 | 0.50× |
| CPU batched prefill-decode (Phase 3) | 28.14 | 0.44× | 42.62 | 0.73× |
| GPU standard | 64.71 | 1.00× | 58.15 | 1.00× |
| GPU prefill-decode, CUDA graphs (Phase 2) | 66.55 | 1.03× | 60.86 | 1.05× |
| GPU prefill-decode, no CUDA graphs (Phase 2) | 55.60 | 0.86× | 54.14 | 0.93× |
| GPU batched prefill-decode, CUDA graphs (Phase 4) | 72.67 | 1.12× | 90.93 | 1.56× |
| GPU batched prefill-decode, no CUDA graphs (Phase 4) | 62.81 | 0.97× | 81.18 | 1.40× |
  • Long prompts benefit most (+56% GPU, +62% CPU over standard) — O(N²) prefill attention amortised over batches
  • Short prompt GPU gains are modest (3–12%) — single-token GeMV already saturates the GPU
  • CUDA graphs add ~10–12% over interpreter mode
  • Peak: GPU batched + CUDA graphs at 90.93 tok/s on long prompts

How to run

export SHORT_PROMPT="Tell me a joke."
export LONG_PROMPT="The history of artificial intelligence is a fascinating journey through decades of human ingenuity, theoretical breakthroughs, engineering milestones, and philosophical debates about the nature of mind and machine. It began in earnest in the mid-twentieth century, when mathematicians and engineers first dared to ask whether machines could be made to think. Alan Turing proposed his famous imitation game, a test of whether a machine could produce responses indistinguishable from those of a human. John McCarthy coined the term artificial intelligence at a seminal workshop at Dartmouth College in 1956, gathering luminaries who believed that every aspect of learning and intelligence could in principle be precisely described and simulated by a machine. The early decades were marked by optimism, as programs like the General Problem Solver and ELIZA demonstrated the potential for symbolic reasoning and natural language processing. Yet progress was uneven, and two so-called AI winters saw funding and enthusiasm collapse as ambitious promises outpaced practical results. The rise of expert systems in the 1980s briefly rekindled hope, encoding domain knowledge into rule-based engines that could assist doctors, engineers, and financial analysts. The shift to statistical and machine-learning approaches in the 1990s and 2000s proved more durable, culminating in deep learning's triumphant return with AlexNet in 2012. Since then the pace of progress has been extraordinary. Please summarise this history briefly."
# prefill-decode without CUDA graphs
./llama-tornado --gpu --ptx --model /path/to/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --no-cuda-graphs
# prefill-decode with CUDA graphs
./llama-tornado --gpu --ptx --model /path/to/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode
# prefill-decode with batching without cuda-graphs
./llama-tornado --gpu --ptx --model /path/to/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32 --no-cuda-graphs
# prefill-decode with batching with cuda-graphs
./llama-tornado --gpu --ptx --model /path/to/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32

…ith InferenceCoreWithPrefillDecode and InferenceEngineWithPrefillDecode
… Implements `InferenceEngineWithPrefillDecode` and `TornadoVMMasterPlanWithPrefillDecode` for batched token generation. Refactor `Llama` to support the batched prefill flag.
@orionpapadakis orionpapadakis added the enhancement New feature or request label Apr 2, 2026
…ts to dedicated classes and packages

 Move `LlamaFP16BatchPrefillLayers` to `tornadovm.layers.type.fp16.prefll` and `LlamaFP16FFNLayersForUnifiedDecode` to `tornadovm.layers.type.fp16.decode`
…d refactor task graph consumption logic

 Introduce `LogitsFP16LayerDecode` with KV-cache pass-through. Override `consumeFromDevice` and `persistOnDevice` in LlamaFFN layers to fix cross-graph propagation for both CUDA and interpreter modes.
…nd `batched-prefill-decode` execution paths for both CPU and GPU
@mikepapadim mikepapadim marked this pull request as ready for review April 16, 2026 17:17
@mikepapadim
Member

/rerun all

@github-actions
Contributor

🚀 Workflow rerun started

Mode: all
Triggered by: @mikepapadim


@github-actions
Contributor

Workflow rerun success


…PrefillDecode`

This fixes GPU prefill-decode without batching without CUDA Graphs
…`, CPU/GPU) for standard, prefill-decode and prefill-decode with batching
…code and batched-prefill-decode paths in `Mistral`, `Phi3`, `Qwen2`, and `Qwen3` models.
…U prefill-decode and batched-prefill-decode paths
@orionpapadakis orionpapadakis changed the title [WIP] Feat/prefill decode Feat/prefill decode Apr 17, 2026
@mikepapadim mikepapadim changed the title Feat/prefill decode Add Prefill–Decode Separation with Batched Prompt Ingestion and Logits Skipping Apr 17, 2026