Context
The CPU int8 dot kernels are the decode bottleneck on the target 12 GB / 64 GB machine: CPU-MoE is the auto-selected MoE config there (#100, CudaHybridGdnForwardPass.cs:886-909), and the Q8_KS routed-expert dots are what the 4.9× Carnice win (SHARPI_Q3K_Q8K) runs on. More models land on this path with #215.
Problem
SimdKernels.cs implements the whole int8 dot family (DotQ3K_Q8K(S), DotQ8_0_Q8K, DotQ6K_Q8K, the _2In variants, …) on AVX2 256-bit Avx2.MultiplyAddAdjacent chains (vpmaddubsw → vpmaddwd → vpaddd, e.g. SimdKernels.cs:1956-1981). Two generations of ISA are left unused:
VNNI is absent entirely — zero hits for AvxVnni in the repo. vpdpbusd does u8×s8→s32 multiply-accumulate in one instruction, replacing the 2-3 instruction vpmaddubsw/vpmaddwd chain.
AVX-512 coverage is thin: 7 Avx512F.IsSupported gates (Q4_K/Q5_K float-path dots) vs 273 Vector256 uses; none of the int8 Q8K/Q8KS kernels have a 512-bit path. Zen 5 executes 512-bit natively; Zen 4 double-pumps but still wins on instruction count.
llama.cpp's ggml carries exactly these paths (VNNI int8 dots), so the CPU-MoE comparison against it gives away ISA for free on modern CPUs.
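To make the instruction-count argument concrete, here is a minimal sketch of the same u8×s8 dot product written three ways: scalar, the AVX2 vpmaddubsw → vpmaddwd → vpaddd chain, and a single vpdpbusd per 32 bytes. The class and method names are hypothetical and are not taken from SimdKernels.cs; the sketch assumes a net9.0+ target and inputs whose length is a multiple of 32. One caveat that matters for parity: vpmaddubsw saturates each pair sum to 16 bits, while vpdpbusd accumulates in 32 bits without intermediate saturation, so the two paths are bit-identical only while that pair sum stays within the s16 range.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper for illustration; not the repo's kernel code.
public static class Int8DotSketch
{
    // Scalar reference: exact 32-bit sum of u8 * s8 products.
    public static int DotScalar(ReadOnlySpan<byte> a, ReadOnlySpan<sbyte> b)
    {
        int sum = 0;
        for (int i = 0; i < a.Length; i++) sum += a[i] * b[i];
        return sum;
    }

    // AVX2 chain: vpmaddubsw (u8*s8, adjacent pairs -> saturating s16),
    // vpmaddwd against ones (adjacent s16 pairs -> s32), then vpaddd.
    public static int DotAvx2(ReadOnlySpan<byte> a, ReadOnlySpan<sbyte> b)
    {
        Vector256<int> acc = Vector256<int>.Zero;
        Vector256<short> ones = Vector256.Create((short)1);
        for (int i = 0; i + 32 <= a.Length; i += 32)
        {
            Vector256<byte> va = Vector256.Create(a.Slice(i, 32));
            Vector256<sbyte> vb = Vector256.Create(b.Slice(i, 32));
            Vector256<short> pairs = Avx2.MultiplyAddAdjacent(va, vb);
            acc = Avx2.Add(acc, Avx2.MultiplyAddAdjacent(pairs, ones));
        }
        return Vector256.Sum(acc);
    }

    // VNNI: one vpdpbusd multiplies four u8*s8 pairs per lane and
    // accumulates them straight into the s32 accumulator.
    public static int DotVnni(ReadOnlySpan<byte> a, ReadOnlySpan<sbyte> b)
    {
        Vector256<int> acc = Vector256<int>.Zero;
        for (int i = 0; i + 32 <= a.Length; i += 32)
        {
            Vector256<byte> va = Vector256.Create(a.Slice(i, 32));
            Vector256<sbyte> vb = Vector256.Create(b.Slice(i, 32));
            acc = AvxVnni.MultiplyWideningAndAdd(acc, va, vb);
        }
        return Vector256.Sum(acc);
    }

    // Runtime dispatch: VNNI when the VEX AVX-VNNI flag is present,
    // otherwise the AVX2 chain, otherwise scalar.
    public static int Dot(ReadOnlySpan<byte> a, ReadOnlySpan<sbyte> b)
    {
        if (a.Length != b.Length || a.Length % 32 != 0)
            throw new ArgumentException("Lengths must match and be a multiple of 32.");
        if (AvxVnni.IsSupported) return DotVnni(a, b);
        if (Avx2.IsSupported) return DotAvx2(a, b);
        return DotScalar(a, b);
    }
}
```

Because the integer sums are exact, whichever path Dot picks should match DotScalar for inputs that stay clear of the s16 saturation limit. The real kernels also apply block scales in floating point, which this sketch leaves out.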
What .NET actually exposes (verified 2026-06, corrected from the original issue text)
AvxVnni (VEX, 128/256-bit) — available and stable on net10.0 (preview attribute was lifted in .NET 9). MultiplyWideningAndAdd = vpdpbusd, MultiplyWideningAndAddSaturate = vpdpbusds. Hardware gate is the VEX AVX-VNNI CPUID flag: Intel Alder Lake+ (incl. E-cores) and Zen 5. Zen 4 advertises only EVEX AVX512-VNNI, so expect AvxVnni.IsSupported == false there — verify with a runtime probe on a 7000-series.
512-bit VNNI is NOT in .NET 10. Avx512Vnni does not exist. The only 512-bit VNNI surface today is Avx10v1.V512 (.NET 9+), which lights up only on Intel AVX10 hardware (Granite-Rapids-class), not AMD.
AvxVnni.V512 (512-bit VPDPBUSD/VPDPWSSD/VPDPBUSDS): the API proposal "Add support for AVX-512 VNNI hardware instructions" (dotnet/runtime#86849) is api-approved, and the implementation PR "Add AvxVnni.V512 hardware intrinsics" (dotnet/runtime#128365) is approved-pending-checks as of June 2026. On track for .NET 11, not guaranteed until merged.
AvxVnniInt8 (vpdpbssd, s8×s8, which would drop the unsigned-bias trick entirely) exists in net10.0 but its hardware (Sierra Forest / Arrow Lake / Lunar Lake) is rare in the target audience; note only.
Zen 4 escape hatch (optional, only if profiling justifies)
If Zen 4 matters and AvxVnni.IsSupported is indeed false there, the EVEX vpdpbusd can be reached via a tiny native helper (a C file compiled with -mavx512vnni, P/Invoked at MatVec/row-sweep granularity so the ~ns call cost amortizes). The repo already P/Invokes OpenBLAS/cuBLAS/NVRTC, and NativeAOT can even statically link it (DirectPInvoke). Likely NOT worth it: single-stream decode dots are DRAM-bound (see below), and Zen 4 still has the AVX2 path.
Honest expectations
Single-token decode dots are largely DRAM-bandwidth-bound (expert weights stream from RAM once per token), so VNNI may move little there. Measure before claiming.
[…] _2In/_4In kernels (MTP/batch-verify); (c) freeing SMT/core pressure so the Parallel.For sweeps scale. Expect ~1.5-2× kernel-level on those, prefill-visible per the #110 split ("perf(engine): GDN-hybrid prefill processes the prompt token-by-token (no batched prompt processing), ~30 tok/s dominates agentic latency"; routed MoE was ~36% of prefill wall).
Proposed work
Add VNNI variants of the Q8K/Q8KS dot inner loops behind AvxVnni.IsSupported (256-bit: Alder Lake+/Zen 5), keeping current AVX2 as fallback; same accumulation order or explicitly argmax-stable-gated, per the repo's parity discipline.
512-bit int8 paths where they exist today: Avx512BW widening variants of the same kernels gated on Vector512.IsHardwareAccelerated (helps Zen 4/5 without VNNI semantics), and optionally Avx10v1.V512 VNNI for AVX10 hardware. Revisit with AvxVnni.V512 when the project moves to .NET 11.
[…] SHARPI_PREFILL_PROFILE=1), e2e Carnice + 35B-A3B decode/prefill (bench-carnice.ps1, warm-cache protocol). Record which level the win survives to.
Parity: clone the existing Q8KS validation tests (cf. bench-q8k-validation.ps1 / SimdKernelsQ8KSTests) for each new path; 0 mismatches vs scalar reference.
Acceptance
VNNI path auto-selected on capable CPUs, kill switch env var for bisection.
Documented runtime-probe result for AvxVnni.IsSupported on Zen 4 (decides whether the native-helper question stays open).
Kernel-level speedup quantified; e2e A/B on Carnice CPU-MoE prefill + decode reported (even if decode is bandwidth-bound-null, document it).
Bit-parity/argmax-stability tests green; no regression on AVX2-only hardware.
References
[…] AvxVnni.V512 implementation PR, open as of 2026-06).
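The runtime probe and kill switch asked for under Acceptance could be prototyped roughly as follows. This is a sketch under assumptions: the class name, the path labels, and the environment variable name SHARPI_DISABLE_VNNI are all hypothetical (only the SHARPI_ prefix mirrors variables named in this issue), and it assumes a net9.0+ target where AvxVnni is no longer a preview API.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical probe for illustration; names are not from the repo.
public static class IsaProbe
{
    // Assumed kill-switch variable name, for bisection only.
    public const string KillSwitchEnvVar = "SHARPI_DISABLE_VNNI";

    // Picks the int8 dot path label; the AVX2 chain stays the fallback.
    public static string SelectInt8DotPath(bool vnniDisabled)
    {
        if (!vnniDisabled && AvxVnni.IsSupported) return "avx-vnni-256";
        if (Avx2.IsSupported) return "avx2";
        return "scalar";
    }

    // Reads the kill switch from the environment ("1" disables VNNI).
    public static bool IsVnniDisabled() =>
        Environment.GetEnvironmentVariable(KillSwitchEnvVar) == "1";

    // One-line report suitable for pasting into the issue, e.g. from a Zen 4 box.
    public static string Report() =>
        $"Avx2={Avx2.IsSupported} AvxVnni={AvxVnni.IsSupported} " +
        $"Avx512BW={Avx512BW.IsSupported} " +
        $"Vector512.IsHardwareAccelerated={Vector512.IsHardwareAccelerated} " +
        $"selected={SelectInt8DotPath(IsVnniDisabled())}";
}
```

Printing IsaProbe.Report() on a 7000-series part would answer whether AvxVnni.IsSupported is false on Zen 4; running it again with the kill switch set shows the AVX2 fallback being selected.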