Context
Sibling of rank-1 / rank-2 issues. Targets the GPU side of Carnice mixed-precision routed-expert path.
CUDA kernels for Q3_K and Q8_0 routed-expert matvec are missing:
CudaBackend.cs:1187-1193 throws NotSupportedException
CudaTextKernels.cs only ships Q4_K / Q5_K / Q6_K / F32
CudaExpertSlotManager.cs:273-281 and CudaHybridGdnForwardPass.cs:2755-2762 silently dequant-to-F32 -> ~25% VRAM waste
TierPlanner.cs:197-203 also has stale Q5_K accounting and zero Q3_K/Q8_0 accounting.
Honest framing — zero Carnice lift on 12 GB cards
Adversarial verification showed: per-expert footprint is ~1.43-1.51 MB (not 880 KB) -> SLRU lands at 39-43% slot capacity on RTX 4070 Ti 12 GB -> below the 50% auto-router threshold at CudaHybridGdnForwardPass.cs:654 -> squarely in the documented regression band (:636-637: 6.1 t/s GPU SLRU vs 11.8 t/s CPU MoE on structurally identical qwen3.6-A3B).
Auto-router correctly keeps Carnice on CPU MoE on a 4070 Ti. This work is valuable for two reasons:
- VRAM reclaim — removes ~25% dequant-to-F32 waste on every machine.
- Larger-card unblock — RTX 4090 / 5090 16+ GB cards cross the 50% threshold; estimated 7-11 t/s on Carnice there.
Scope (v1)
- New NVRTC kernel
llm_matvec_q3_k in CudaTextKernels.cs modeled on llm_matvec_q5_k (Q3_K block: 32 B hmask + 64 B qs + 12 B 6-bit scales + 1 fp16 d = 110 B / 256 elem)
- Extend
UploadRaw to Q3_K and Q8_0 (today only Q4_K / Q5_K go raw; rest dequant to F32)
- Fix
TierPlanner byte accounting for Q5_K (already shipped), Q3_K, Q8_0
- Auto-router stays on — Carnice on 4070 Ti continues to pick CPU MoE (correct decision)
- Parity test cloned from
CudaMatVecQ5KTests for the new Q3_K kernel
Out of scope (v2+)
- CUDA
llm_matvec_q8_0 kernel — Q8_0 is MTP-head-only; blocked by MTP-MoE GPU dispatch gap
- MTP-MoE guard relaxation at
CudaHybridGdnForwardPass.cs:915-918 — MtpForward.cs:1899-1906 has no GPU MoE branch; relaxing the throw without implementing GpuMtpMoeFfn would crash on first MTP step
- Default-on switching for Carnice on 4070 Ti (would be a regression per :636-637)
Risks
TierPlanner Q5_K accounting fix retroactively changes the qwen3.6-35B-A3B-MTP (22.9 t/s today) auto-decision — must regression-bench before merging.
- PCIe Gen4 x16 streaming at ~58% miss rate x 8 experts x 41 layers x 1.43 MB ~= 270 MB/token ~= 11 ms PCIe alone — tight but not a hard cap. Validate with a single Carnice decode under
SHARPI_CPU_MOE=0 before declaring default-on for larger cards.
feedback_qwen36_perf_attempts warns GPU shexp-add destabilised host-stream pacing — Q3_K matvec adds adjacent stream pressure; validate jitter run-to-run.
First step
Add llm_matvec_q3_k NVRTC kernel string to CudaTextKernels.cs, modeled on llm_matvec_q5_k at :875-975. Reference SimdKernels.DotQ3K_Scalar for unpack arithmetic. Land a clone of CudaMatVecQ5KTests for Q3_K that runs the new kernel against SimdKernels.DotQ3K_Scalar on a real Carnice ffn_gate_exps tensor slice. Stop there — no dispatcher wiring, no UploadRaw change, no TierPlanner change until per-kernel parity is green.
Rank 3 of 3 from CUDA-hybrid Carnice optimization workflow.
Context
Sibling of rank-1 / rank-2 issues. Targets the GPU side of Carnice mixed-precision routed-expert path.
CUDA kernels for Q3_K and Q8_0 routed-expert matvec are missing:
CudaBackend.cs:1187-1193throwsNotSupportedExceptionCudaTextKernels.csonly ships Q4_K / Q5_K / Q6_K / F32CudaExpertSlotManager.cs:273-281andCudaHybridGdnForwardPass.cs:2755-2762silently dequant-to-F32 -> ~25% VRAM wasteTierPlanner.cs:197-203also has stale Q5_K accounting and zero Q3_K/Q8_0 accounting.Honest framing — zero Carnice lift on 12 GB cards
Adversarial verification showed: per-expert footprint is ~1.43-1.51 MB (not 880 KB) -> SLRU lands at 39-43% slot capacity on RTX 4070 Ti 12 GB -> below the 50% auto-router threshold at
CudaHybridGdnForwardPass.cs:654-> squarely in the documented regression band (:636-637: 6.1 t/s GPU SLRU vs 11.8 t/s CPU MoE on structurally identical qwen3.6-A3B).Auto-router correctly keeps Carnice on CPU MoE on a 4070 Ti. This work is valuable for two reasons:
Scope (v1)
llm_matvec_q3_kinCudaTextKernels.csmodeled onllm_matvec_q5_k(Q3_K block: 32 B hmask + 64 B qs + 12 B 6-bit scales + 1 fp16 d = 110 B / 256 elem)UploadRawto Q3_K and Q8_0 (today only Q4_K / Q5_K go raw; rest dequant to F32)TierPlannerbyte accounting for Q5_K (already shipped), Q3_K, Q8_0CudaMatVecQ5KTestsfor the new Q3_K kernelOut of scope (v2+)
llm_matvec_q8_0kernel — Q8_0 is MTP-head-only; blocked by MTP-MoE GPU dispatch gapCudaHybridGdnForwardPass.cs:915-918—MtpForward.cs:1899-1906has no GPU MoE branch; relaxing the throw without implementingGpuMtpMoeFfnwould crash on first MTP stepRisks
TierPlannerQ5_K accounting fix retroactively changes the qwen3.6-35B-A3B-MTP (22.9 t/s today) auto-decision — must regression-bench before merging.SHARPI_CPU_MOE=0before declaring default-on for larger cards.feedback_qwen36_perf_attemptswarns GPU shexp-add destabilised host-stream pacing — Q3_K matvec adds adjacent stream pressure; validate jitter run-to-run.First step
Add
llm_matvec_q3_kNVRTC kernel string toCudaTextKernels.cs, modeled onllm_matvec_q5_kat :875-975. ReferenceSimdKernels.DotQ3K_Scalarfor unpack arithmetic. Land a clone ofCudaMatVecQ5KTestsfor Q3_K that runs the new kernel againstSimdKernels.DotQ3K_Scalaron a real Carniceffn_gate_expstensor slice. Stop there — no dispatcher wiring, noUploadRawchange, noTierPlannerchange until per-kernel parity is green.Rank 3 of 3 from CUDA-hybrid Carnice optimization workflow.