Skip to content

perf(cuda): llm_matvec_q3_k NVRTC + UploadRaw Q3_K/Q8_0 + TierPlanner byte accounting #100

@pekkah

Description

@pekkah

Context

Sibling of rank-1 / rank-2 issues. Targets the GPU side of Carnice mixed-precision routed-expert path.

CUDA kernels for Q3_K and Q8_0 routed-expert matvec are missing:

  • CudaBackend.cs:1187-1193 throws NotSupportedException
  • CudaTextKernels.cs only ships Q4_K / Q5_K / Q6_K / F32
  • CudaExpertSlotManager.cs:273-281 and CudaHybridGdnForwardPass.cs:2755-2762 silently dequant-to-F32 -> ~25% VRAM waste

TierPlanner.cs:197-203 also has stale Q5_K accounting and zero Q3_K/Q8_0 accounting.

Honest framing — zero Carnice lift on 12 GB cards

Adversarial verification showed: per-expert footprint is ~1.43-1.51 MB (not 880 KB) -> SLRU lands at 39-43% slot capacity on RTX 4070 Ti 12 GB -> below the 50% auto-router threshold at CudaHybridGdnForwardPass.cs:654 -> squarely in the documented regression band (:636-637: 6.1 t/s GPU SLRU vs 11.8 t/s CPU MoE on structurally identical qwen3.6-A3B).

Auto-router correctly keeps Carnice on CPU MoE on a 4070 Ti. This work is valuable for two reasons:

  1. VRAM reclaim — removes ~25% dequant-to-F32 waste on every machine.
  2. Larger-card unblock — RTX 4090 / 5090 16+ GB cards cross the 50% threshold; estimated 7-11 t/s on Carnice there.

Scope (v1)

  • New NVRTC kernel llm_matvec_q3_k in CudaTextKernels.cs modeled on llm_matvec_q5_k (Q3_K block: 32 B hmask + 64 B qs + 12 B 6-bit scales + 1 fp16 d = 110 B / 256 elem)
  • Extend UploadRaw to Q3_K and Q8_0 (today only Q4_K / Q5_K go raw; rest dequant to F32)
  • Fix TierPlanner byte accounting for Q5_K (already shipped), Q3_K, Q8_0
  • Auto-router stays on — Carnice on 4070 Ti continues to pick CPU MoE (correct decision)
  • Parity test cloned from CudaMatVecQ5KTests for the new Q3_K kernel

Out of scope (v2+)

  • CUDA llm_matvec_q8_0 kernel — Q8_0 is MTP-head-only; blocked by MTP-MoE GPU dispatch gap
  • MTP-MoE guard relaxation at CudaHybridGdnForwardPass.cs:915-918MtpForward.cs:1899-1906 has no GPU MoE branch; relaxing the throw without implementing GpuMtpMoeFfn would crash on first MTP step
  • Default-on switching for Carnice on 4070 Ti (would be a regression per :636-637)

Risks

  • TierPlanner Q5_K accounting fix retroactively changes the qwen3.6-35B-A3B-MTP (22.9 t/s today) auto-decision — must regression-bench before merging.
  • PCIe Gen4 x16 streaming at ~58% miss rate x 8 experts x 41 layers x 1.43 MB ~= 270 MB/token ~= 11 ms PCIe alone — tight but not a hard cap. Validate with a single Carnice decode under SHARPI_CPU_MOE=0 before declaring default-on for larger cards.
  • feedback_qwen36_perf_attempts warns GPU shexp-add destabilised host-stream pacing — Q3_K matvec adds adjacent stream pressure; validate jitter run-to-run.

First step

Add llm_matvec_q3_k NVRTC kernel string to CudaTextKernels.cs, modeled on llm_matvec_q5_k at :875-975. Reference SimdKernels.DotQ3K_Scalar for unpack arithmetic. Land a clone of CudaMatVecQ5KTests for Q3_K that runs the new kernel against SimdKernels.DotQ3K_Scalar on a real Carnice ffn_gate_exps tensor slice. Stop there — no dispatcher wiring, no UploadRaw change, no TierPlanner change until per-kernel parity is green.

Rank 3 of 3 from CUDA-hybrid Carnice optimization workflow.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions