A high-performance LLM inference engine and image generation pipeline written in C# 14 / .NET 10. Runs GGUF models on CPU (AVX2/AVX-512 SIMD) and GPU (Vulkan compute shaders or CUDA cuBLAS). Includes an OpenAI- and Anthropic-compatible API server and native pipelines for Z-Image-Turbo and FLUX.1.
Requirements: .NET 10 SDK, x86-64 CPU with AVX2.
Optional: Vulkan-capable GPU (drivers), CUDA Toolkit 11.x/12.x for NVIDIA paths,
OpenBLAS in tools/openblas/ for faster batched GEMM. Build with dotnet build -c Release.
Supported architectures: llama, llama4, olmoe, qwen3, qwen3moe, qwen35moe
(hybrid Gated-DeltaNet + attention + MoE), gemma4 (per-layer head_dim, SWA + global, PLE).
Benchmarked on AMD Zen 4 (12c/24t) + RTX 4070 Ti (12 GB), Q4_K_M, --temp 0. Prefill t/s is the
warm-cache rate at a ~1K-token prompt; decode t/s is near-zero-ctx (forward-pass iterations / time, so
thinking tokens count). Outputs verified coherent; Qwen3-8B is byte-identical to llama.cpp b8585 (60-token
greedy decode).
Tables are grouped per model, ordered fastest decode first; within each model the rows also run fastest
decode first. Each model lists a ready-to-run command with that model's recommended sampling (matching the
upstream / llama.cpp defaults). Commands use the published sharpi-cli binary — from source, swap it for
dotnet run --project src/SharpInference.Cli -c Release --.
SmolLM2 1.7B Instruct — HuggingFaceTB · 1 GB
sharpi-cli -m models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf -g -1 \
--temp 0.7 --top-p 0.95 --top-k 40 -p "Write a haiku about autumn"| Backend | Prefill t/s | Decode t/s | Notes |
|---|---|---|---|
CUDA -g -1 |
230.6 | 268.0 | NVRTC __dp4a + Q8_1 |
Vulkan -g -1 |
83.9 | 105.8 | GLSL subgroupAdd reduce |
| CPU | 39.6 | 42.9 | AVX2 fused dequant-matvec |
OLMoE 1B-7B Instruct (MoE) — allenai · 4 GB
# greedy is unstable on OLMoE across backends — sample instead
sharpi-cli -m models/OLMoE-1B-7B-0924-Instruct-Q4_K_M.gguf -g -1 \
--temp 0.6 --top-p 0.95 -p "Explain mixture-of-experts routing in two sentences"| Backend | Prefill t/s | Decode t/s | Notes |
|---|---|---|---|
CUDA -g -1 |
112.8 | 126.0 | greedy varies, sampling coherent |
Vulkan -g -1 |
74.8 | 91.1 | greedy unstable across backends — use --temp 0.6 --top-p 0.95 |
| CPU | 51.7 | 60.5 | 64 experts / 8 active; per-channel QK-norm; norm_topk_prob=false |
Gemma 4 E4B-it QAT q4_0 — google · 5 GB
# Gemma 4 is not a reasoning model — the CLI auto-sets --no-thinking
sharpi-cli -m models/gemma-4-E4B_q4_0-it.gguf -g -1 -c 2048 \
--temp 1.0 --top-k 64 --top-p 0.95 --min-p 0 -p "Summarize the water cycle"| Backend | Prefill t/s | Decode t/s | Notes |
|---|---|---|---|
CUDA -g -1 -c 2048 |
3283 | 100.4 | QAT q4_0 (5.15 GB): ~1.4× decode vs Q8 at near-identical quality, frees ~3 GB for wider KV/context. Same gemma4 path; the shared-KV tail layers omit attn_k/attn_v/attn_k_norm (loaded conditionally). download-model.ps1 -Model gemma4-e4b-qat |
Qwen3 8B — Qwen · 5 GB
# Qwen3 reasoning defaults; add --no-thinking for the faster non-reasoning rows
sharpi-cli -m models/Qwen3-8B-Q4_K_M.gguf -g -1 \
--temp 0.6 --top-p 0.95 --top-k 20 -p "Write a quicksort in Python"| Backend | Prefill t/s | Decode t/s | Notes |
|---|---|---|---|
CUDA -g -1 --no-thinking |
2314 | 74.8 | reasoning suppressed; same path |
CUDA -g -1 |
2297 | 74.5 | int8 tensor-core MMQ prefill (weight read once as int8) + dp4a/Q8_1 decode matvec + CUDA-graph replay; all argmax-stable. (llama.cpp b8585 pp1008 5764 — gap is its cp.async MMQ + flash attn) |
CUDA -g -1 --tq --no-thinking |
60.6 | 70.4 | as --tq, reasoning suppressed |
CUDA -g -1 --tq |
60.6 | 70.2 | 3-bit KV → 40 960 ctx; 17 t/s @ 8K, 10 @ 16K |
Vulkan -g -1 |
28.0 | 30.9 | 11.4K auto-ctx |
Vulkan -g -1 --tq |
25.5 | 30.7 | 3-bit KV → 40 960 ctx |
| CPU | 10.0 | 13.2 | dense, no KV compression |
CPU --tq |
9.4 | 13.2 | 3-bit KV → 40 960 ctx; FastScan K+V (#34) keeps long-ctx decode ~flat (10.2 @ 3K, 9.4 @ 6K) |
Gemma 4 E4B-it Q8 — unsloth · 8 GB
sharpi-cli -m models/gemma-4-E4B-it-Q8_0.gguf -g -1 -c 2048 \
--temp 1.0 --top-k 64 --top-p 0.95 --min-p 0 -p "Summarize the water cycle"| Backend | Prefill t/s | Decode t/s | Notes |
|---|---|---|---|
CUDA -g -1 -c 2048 |
3444 | 70.5 | all 42 layers fit at -c 2048; KV-share alias + per-layer SWA/global split. Prefill: int8 tensor-core MMQ + tensor-core flash attention + SoA Q8_0 weight repack. Prompts >4096 use a real SWA KV ring on the chunked-flash path (fixes correctness past the 512 window). Decode: dp4a/Q8_1 + CUDA-graph replay; argmax-stable. (llama.cpp ~8475 prefill / ~78 decode) |
CUDA -g 22 -c 2048 (hybrid) |
6.7 | 7.0 | 22 GPU + 20 CPU layers. -g ≤ 22 required so the CPU shared-KV tail can read its own-KV source layers; CPU dense-FFN dominates decode (bandwidth-bound). SHARPI_CUDA_PROFILE=1 for per-phase breakdown |
| CPU | 5.0 | 5.1 | dense 42-layer gemma4: per-layer head_dim (256 SWA / 512 global), dual-RoPE, KV-share tail (18 layers), 5:1 SWA:global, logit softcap 30, PLE-256 injection (~4.2 GB mmap-resident) |
Gemma 4 12B-it QAT Q4_0 — google · 7 GB
sharpi-cli -m models/gemma-4-12b-it-qat-q4_0.gguf -g -1 -c 2048 \
--temp 1.0 --top-k 64 --top-p 0.95 --min-p 0 -p "Describe a sunset in three sentences"| Backend | Prefill t/s | Decode t/s | Notes |
|---|---|---|---|
CUDA -g -1 -c 2048 |
1742 | 54.1 | dense 48-layer gemma4, all bulk weights Q4_0 + tied Q6_K embedding; fits 12 GB full-offload. attention_k_eq_v (8 global MQA layers reuse K as V), per-layer KV heads (8 GQA / 1 MQA), pure V-norm. Prefill: Q4_0 int8 tensor-core MMQ over a SoA repack; decode: Q4_0 dp4a/Q8_1, argmax-stable, within ~6% of llama.cpp (57 t/s). Long context: --kv-type bf16 halves / q8_0 quarters the K/V store (fp32 kernel math, argmax-stable) + a bookkeeping-only host KV cache → -c 131072 (128K) within 12 GB (fp32 cudaMalloc-fails at 64K). At 128K, q8_0 keeps the tied embed table resident (~53 t/s) where bf16 spills it (~19 t/s). Default fp32; opt-in --kv-type bf16|q8_0 |
Qwen3-Coder 30B-A3B (MoE) — Qwen · 17 GB
sharpi-cli -m models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf -g -1 \
--temp 0.7 --top-p 0.8 --top-k 20 --repeat-penalty 1.05 -p "Implement a binary search tree in C#"| Backend | Prefill t/s | Decode t/s | Notes |
|---|---|---|---|
CUDA -g -1 (hybrid) |
29.0 | 28.0 | 29 GPU + 19 CPU layers; routed experts stream through CudaExpertSlotManager SLRU (#72/#77). Batched-trunk prefill (#123, bit-identical; SHARPI_BATCHED_PREFILL=0 to bisect); SHARPI_EXPERT_STATS=path for hit rates |
CPU --tq |
19.6 | 22.6 | 3-bit KV; FastScan (#34) → 15.5 t/s decode @ 3.2K ctx |
| CPU | 19.8 | 22.4 | 128 experts / 8 active |
Vulkan -g -1 (hybrid) |
1.1 | 5.3 | 29 GPU + 19 CPU layers, SLRU expert cache + predictive prefetch (--no-moe-predict-prefetch to disable). Vulkan-hybrid errors on the ~1K prompt, so these are the prior short-ctx values |
Carnice (Qwen3.6-35B-A3B-MTP finetune) — mudler · 17 GB
# greedy + --no-thinking engages MTP self-speculative decoding
sharpi-cli -m models/Carnice-Qwen3.6-MoE-35B-A3B-APEX-MTP-I-Compact.gguf -g -1 \
--temp 0 --no-thinking -p "List three uses for a paperclip"| Backend | Prefill t/s | Decode t/s | Notes |
|---|---|---|---|
CUDA -g -1 --no-thinking (hybrid) |
139.2 | 26.5 | agentic finetune; 80% acceptance (bench-carnice.ps1). APEX mixed-precision (Q3_K + Q8_0 experts); Q8_KS per-32 int dots auto-enable |
Qwen3.6-35B-A3B (GDN+MoE) — unsloth · 22 GB
sharpi-cli -m models/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf -g -1 \
--temp 0.6 --top-p 0.95 --top-k 20 -p "Prove that the square root of 2 is irrational"| Backend | Prefill t/s | Decode t/s | Notes |
|---|---|---|---|
CUDA -g -1 (hybrid) |
55.1 | 23.7 | 10 attn + 30 GDN on GPU; routed MoE on CPU, shared expert GPU-overlapped. Fused GDN scan + batched SDPA, bit-identical. SHARPI_CPU_MOE=0 forces on-GPU experts |
| CPU | 11.3 | 9.3 | hybrid GDN/attn, 256 experts / 8 active. Chunk-parallel (FlashQLA) GDN prefill default-on (1.35× over the 8.4 t/s per-token scan; SHARPI_GDN_CHUNKED_PREFILL=0 to disable). MTP variants keep the byte-exact scan (FP-reorder flips the thinking-boundary token) |
Qwen3.6-35B-A3B-MTP (GDN+MoE) — unsloth · 22 GB
# SHARPI_CPU_MOE=1 keeps GDN/attn + shared expert on GPU, routed experts on CPU
SHARPI_CPU_MOE=1 sharpi-cli -m models/Qwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf -g -1 \
--temp 0 --no-thinking -p "Write a binary search in Rust"| Backend | Prefill t/s | Decode t/s | Notes |
|---|---|---|---|
CUDA -g -1 --no-thinking (hybrid) |
55.7 | 21.4 | needs SHARPI_CPU_MOE=1: 30 GDN + 10 attn + shared expert on GPU, routed experts CPU mmap. 100% acceptance |
CPU --no-thinking |
9.1 | 8.5 | GDN/attn + 256-expert MoE + MTP head (#44). 100% acceptance; MoE-MTP batched verify (#45) — routed experts sequential per token, so ~MTP-off parity |
Qwen3.6-27B-MTP (GDN) — unsloth · 16 GB (Q4_K_M) / 19 GB (Q5_K_M)
sharpi-cli -m models/Qwen3.6-27B-MTP-Q4_K_M.gguf -g -1 \
--temp 0 --no-thinking -p "Explain the CAP theorem"| Backend | Size | Prefill t/s | Decode t/s | Notes |
|---|---|---|---|---|
CUDA -g -1 --no-thinking (hybrid) |
16 GB | 7.3 | 10.4 | 22/64 dense FFN on GPU + GDN/attn KV resident, rest CPU mmap. 90% acceptance; folded k-token batched verify + GDN snapshot ring — 1.68× over MTP-off (6.4) |
CUDA -g -1 --no-thinking Q5_K_M (hybrid) |
19 GB | 5.9 | 5.5 | 13/64 FFN on GPU, 51/64 CPU mmap. 98% acceptance; batched trunk (#119) bit-identical |
CPU --no-thinking |
16 GB | 3.0 | 3.6 | dense 27B GDN/attn + native MTP head; auto MTP self-spec (#25) at greedy + --no-thinking. 90% draft acceptance; folded k-token batched verify (#30/#207) — 1.2× over MTP-off (3.0) |
CPU --no-thinking Q5_K_M |
19 GB | 2.8 | 3.5 | ~10% slower than Q4_K_M; 100% acceptance |
Llama-4 Scout 17B-16E (MoE) — meta-llama · 61 GB
# split GGUF — point -m at shard 1; CPU-only wins on a 12 GB card (drop -g)
sharpi-cli -m models/Llama-4-Scout-17B-16E-Instruct-Q4_K_M-00001-of-00002.gguf \
--temp 0.6 --top-p 0.9 -p "Summarize the plot of Hamlet"| Backend | Prefill t/s | Decode t/s | Notes |
|---|---|---|---|
| CPU | 2.1 | 4.3 | 48 layers, 17B active; split GGUF (not on bench machine) |
CUDA -g -1 (hybrid) |
1.2 | 2.6 | 7 GPU + 41 CPU layers — model dwarfs the 12 GB card so CPU-only wins; per-expert SLRU streaming (#72/#77) still lifts both (not on bench machine) |
Re-measured 2026-06 from warm sweeps (prefill ~1K ctx, decode near-zero ctx), each after a discarded warm-up. Vulkan rows are ~35% below their prior numbers — an unexplained regression (CUDA improved on the same box). Llama-4 Scout and Qwen3-Coder Vulkan-hybrid keep prior values (not re-runnable here).
Long-context decode uses flash-decoding (split-KV) on all CUDA paths (dense + MoE/GDN hybrids): the
per-token KV read parallelizes across SMs, so decode no longer collapses with context (Gemma 4 E4B q8
11→45 t/s @32K, up to ~2× on the hybrids). SHARPI_SPLIT_DECODE=0 reverts.
Sampling for Gemma 4 E4B-it: --temp 1.0 --top-k 64 --top-p 0.95 --min-p 0 (Gemma 3/4 defaults).
It is not a reasoning model — the CLI auto-sets enable_thinking=false, else the template renders an
empty <think> block and output degenerates. Greedy is not recommended; use the values above.
--backend auto (default) picks CUDA when available, sizing the GPU/CPU split from VRAM via TierPlanner,
else Vulkan. For hybrid qwen35moe the CUDA backend keeps attention KV, GDN layers, and the shared expert
in VRAM; routed experts auto-select SLRU GPU cache vs CPU mmap by fit (SHARPI_CPU_MOE=0|1). On Ampere+ it
uses bf16 cuBLAS GEMM (SHARPI_CUDA_PRECISION to bisect; NVRTC kernels keep fp32 accumulators). GDN paths
store bf16 KV by default (SHARPI_KV_DTYPE=fp32).
MoE expert-cache knobs (--moe-warmpin, --moe-warmpin-after, --no-moe-predict-prefetch,
--expert-stats) are CLI-only; the server reads the equivalent SHARPI_MOE_WARMPIN*,
SHARPI_MOE_PREDICT_PREFETCH=0, SHARPI_EXPERT_STATS=<path> env vars.
On CPU, CUDA (dense + hybrid GDN), and Vulkan. After prefill it scores each prompt position by softmaxed
attention from the last W queries, keeps top-K + a recency window, and compacts the K/V ring in place;
decode is unchanged (LogicalLength stays at the prompt length so RoPE on new tokens lands correctly). GPU
paths auto-enable above ~256 MiB KV (budget min(maxSeqLen/4, 4096) floored at 1024); SHARPI_SNAPKV_BUDGET=N
forces it (=0 off), _WINDOW/_RECENCY tune the probe/keep zone (CPU is opt-in via the budget). Composes
with TurboQuant on CPU (~16× total KV). Eval: benchmarks/SnapKvEval (needle-in-haystack sweep).
CPU/Vulkan/CUDA; requires headDim ∈ {128, 256}. K-scoring and V-aggregation use a FastScan AVX2 kernel:
KV packs into 32-position tiles with 4-bit codes, a per-query i8 LUT reduces each step to a vpshufb, and
the IWHT defers to one call per kv-head. Combined K+V hot-path cost vs the prior per-block path (Ryzen 9 7900X):
| TQ positions | per-block K+V | FastScan K+V | speedup |
|---|---|---|---|
| 1 024 | 479 µs | 26 µs | 18× |
| 4 096 | 1 931 µs | 98 µs | 20× |
| 8 192 | 3 936 µs | 193 µs | 20× |
| 16 384 | 8 216 µs | 390 µs | 21× |
End-to-end gain tracks the K+V share of token cost — small at short ctx, growing with length. Qwen3-8B CPU
--tq decode drops only ~22% from 30→6050 ctx (12.0→9.4 t/s) — ~1.9× the per-block path at 6K.
Models with native MTP heads (Qwen3.6-27B-MTP, Qwen3.5/3.6 A3B-MTP, DeepSeek V3/R1) get self-speculative
decoding with no separate draft model. Engages automatically with greedy (--temp 0) + --no-thinking; the
CLI prints MTP accept: N%. Default is a folded k-token batched verify: the certain token plus a chained
draft run through one batched trunk pass, rejections rolled back via a per-token GDN snapshot ring. CLI
mirrors llama.cpp: --spec-type, --spec-draft-n-max <N> (default 1; deeper chains need
SHARPI_MTP_BATCH_MAX≥N+1, ~150 MiB VRAM/slot), --spec-draft-p-min. Off: SHARPI_DISABLE_MTP=1.
Multi-turn requests reuse the prior turn's state instead of re-prefilling. GDN-hybrid passes snapshot
recurrent state at the history boundary and restore on a prefix match; MTP runs also snapshot the MTP KV +
hidden-history, so agentic tool loops skip the per-round re-prefill. /metrics exposes
sharpi_prefill_tokens_reused_total.
Models that emit <think>...</think> are detected from their special tokens (no flag); the CLI dims the
stream. --no-thinking disables it at the template level, --hide-thinking keeps it hidden,
--max-thinking-tokens N force-closes runaway reasoning. Greedy often loops — the CLI recommends
--temp 0.6 --top-p 0.95 --top-k 20. The server emits reasoning per protocol (Anthropic thinking,
OpenAI reasoning_content).
# CPU, single-turn, greedy
dotnet run --project src/SharpInference.Cli -c Release -- \
-m models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf -p "What is 2+2?" --temp 0
# Full GPU offload (auto-picks CUDA)
dotnet run --project src/SharpInference.Cli -c Release -- \
-m models/Qwen3-8B-Q4_K_M.gguf -p "Write a quicksort in Python" --temp 0 -g -1
# MoE on CPU with 3-bit KV compression
dotnet run --project src/SharpInference.Cli -c Release -- \
-m models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf --tq -p "Implement a BST in C#" --temp 0
# Speculative decoding (~2× faster at temp 0)
dotnet run --project src/SharpInference.Cli -c Release -- \
-m models/Qwen3-8B-Q4_K_M.gguf --draft-model models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf \
-p "Write a binary search in Rust" --temp 0
# API server (OpenAI /v1/chat/completions + Anthropic /v1/messages, port 5000)
# Multi-user: SHARPI_MAX_BATCH=8 enables continuous batching (CPU or CUDA backend). On CUDA,
# batched decode uses int8 tensor-core matmuls by default at N>=5 (Qwen3-8B Q4_K_M @ 4070 Ti:
# +11% to +28% aggregate t/s from N=5 to N=8); SHARPI_BATCH_DECODE_MMQ=0 forces the bit-exact
# path. Long-prompt admission prefills in SHARPI_PREFILL_CHUNK-token slices (default 256,
# 0 = blocking) interleaved with decode, packed across prompts; SHARPI_KV_BUDGET_MB caps total
# KV memory for admission backpressure (default: half of RAM). See issues #183 / #206.
SHARPI_MODEL=models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf \
dotnet run --project src/SharpInference.Server.Host -c ReleaseGPU flags mirror llama.cpp: -g/--ngl/--n-gpu-layers are interchangeable, and --device <0|CUDA0|none>
pins a single GPU (no multi-GPU split).
Two pipelines, auto-detected from model filename. Benchmarked on AMD Zen 4 + RTX 4070 Ti (CUDA, 4 steps, 512×512). The CLI is one-shot, so each run pays the full load + encoder warmup; the "cached" column is the steady-state cost when encoder weights stay resident (server or interactive loop after the first prompt).
| Pipeline | Components (repo • file • size) | Per-run | Cached prompt | Notes |
|---|---|---|---|---|
| Z-Image-Turbo | DiT: jayn7/Z-Image-Turbo-GGUF z_image_turbo-Q5_K_M.gguf 5.5 GBEncoder: BennyDaBall/...-AbliteratedV1 Z-Image-AbliteratedV1.Q5_K_M.gguf 2.9 GBVAE + tokenizer: Tongyi-MAI/Z-Image-Turbo vae/ tokenizer/ |
~108 s | ~30 s | Most per-run cost is text-encoder warmup (~90 s); DiT ~4 s, VAE ~18 s once hot |
| FLUX.1-schnell | DiT: city96/FLUX.1-schnell-gguf flux1-schnell-Q4_K_S.gguf ~7 GBEncoders + VAE: comfyanonymous/flux_text_encoders clip_l.safetensors + t5xxl_fp16.safetensors + ae.safetensors |
— | — | 4-step distilled; not on this bench machine |
Optional 4× upscale via Real-ESRGAN (RealESRGAN_x4plus.safetensors): runs on CUDA when available,
falls back to bicubic.
# Z-Image-Turbo (auto-detects pipeline from filename containing "z_image")
dotnet run --project src/SharpInference.Cli -c Release -- image \
-m models/z_image_turbo-Q5_K_M.gguf \
--vae models/z-image-turbo/vae \
--qwen-encoder models/Z-Image-AbliteratedV1.Q5_K_M.gguf \
--qwen-tokenizer models/z-image-turbo/tokenizer/tokenizer.json \
-p "a serene mountain lake at sunrise" -W 1024 -H 1024 --steps 4 -o landscape.png
# With 4× Real-ESRGAN upscale + blend
dotnet run --project src/SharpInference.Cli -c Release -- image \
-m models/z_image_turbo-Q5_K_M.gguf \
--vae models/z-image-turbo/vae \
--qwen-encoder models/Z-Image-AbliteratedV1.Q5_K_M.gguf \
--qwen-tokenizer models/z-image-turbo/tokenizer/tokenizer.json \
--upscaler models/RealESRGAN_x4plus.safetensors --upscale-blend 0.8 \
-p "a fox in autumn forest" -W 512 -H 512 --steps 4 -o fox.png- Architecture & algorithms: docs/SharpInference-Design.md
- All CLI flags:
sharpi-cli --help,sharpi-cli image --help - Model downloads:
scripts/download-model.ps1 -Model <smollm2|qwen3-8b|qwen3-coder-30b-a3b|llama4-scout|z-image-turbo|realesrgan-x4|…> - Tests:
dotnet test - NativeAOT publish:
dotnet publish src/SharpInference.Cli -c Release -r win-x64
Released under the MIT License.