perf(cuda): llm_matvec_q3_k NVRTC + UploadRaw Q3_K/Q8_0 + TierPlanner byte accounting

## Context
Sibling of rank-1 / rank-2 issues. Targets the **GPU side** of Carnice mixed-precision routed-expert path.

CUDA kernels for Q3_K and Q8_0 routed-expert matvec are **missing**:
- `CudaBackend.cs:1187-1193` throws `NotSupportedException`
- `CudaTextKernels.cs` only ships Q4_K / Q5_K / Q6_K / F32
- `CudaExpertSlotManager.cs:273-281` and `CudaHybridGdnForwardPass.cs:2755-2762` silently dequant-to-F32 -> ~25% VRAM waste

`TierPlanner.cs:197-203` also has stale Q5_K accounting and zero Q3_K/Q8_0 accounting.

## Honest framing — zero Carnice lift on 12 GB cards
Adversarial verification showed: per-expert footprint is ~1.43-1.51 MB (not 880 KB) -> SLRU lands at **39-43% slot capacity** on RTX 4070 Ti 12 GB -> below the 50% auto-router threshold at `CudaHybridGdnForwardPass.cs:654` -> squarely in the documented regression band (`:636-637`: 6.1 t/s GPU SLRU vs 11.8 t/s CPU MoE on structurally identical qwen3.6-A3B).

**Auto-router correctly keeps Carnice on CPU MoE on a 4070 Ti.** This work is valuable for two reasons:
1. **VRAM reclaim** — removes ~25% dequant-to-F32 waste on every machine.
2. **Larger-card unblock** — RTX 4090 / 5090 16+ GB cards cross the 50% threshold; estimated 7-11 t/s on Carnice there.

## Scope (v1)
- New NVRTC kernel `llm_matvec_q3_k` in `CudaTextKernels.cs` modeled on `llm_matvec_q5_k` (Q3_K block: 32 B hmask + 64 B qs + 12 B 6-bit scales + 1 fp16 d = 110 B / 256 elem)
- Extend `UploadRaw` to Q3_K and Q8_0 (today only Q4_K / Q5_K go raw; rest dequant to F32)
- Fix `TierPlanner` byte accounting for Q5_K (already shipped), Q3_K, Q8_0
- **Auto-router stays on** — Carnice on 4070 Ti continues to pick CPU MoE (correct decision)
- Parity test cloned from `CudaMatVecQ5KTests` for the new Q3_K kernel

## Out of scope (v2+)
- CUDA `llm_matvec_q8_0` kernel — Q8_0 is MTP-head-only; blocked by MTP-MoE GPU dispatch gap
- MTP-MoE guard relaxation at `CudaHybridGdnForwardPass.cs:915-918` — `MtpForward.cs:1899-1906` has no GPU MoE branch; relaxing the throw without implementing `GpuMtpMoeFfn` would crash on first MTP step
- Default-on switching for Carnice on 4070 Ti (would be a regression per :636-637)

## Risks
- `TierPlanner` Q5_K accounting fix retroactively changes the qwen3.6-35B-A3B-MTP (22.9 t/s today) auto-decision — must regression-bench before merging.
- PCIe Gen4 x16 streaming at ~58% miss rate x 8 experts x 41 layers x 1.43 MB ~= 270 MB/token ~= 11 ms PCIe alone — tight but not a hard cap. Validate with a single Carnice decode under `SHARPI_CPU_MOE=0` before declaring default-on for larger cards.
- `feedback_qwen36_perf_attempts` warns GPU shexp-add destabilised host-stream pacing — Q3_K matvec adds adjacent stream pressure; validate jitter run-to-run.

## First step
Add `llm_matvec_q3_k` NVRTC kernel string to `CudaTextKernels.cs`, modeled on `llm_matvec_q5_k` at :875-975. Reference `SimdKernels.DotQ3K_Scalar` for unpack arithmetic. Land a clone of `CudaMatVecQ5KTests` for Q3_K that runs the new kernel against `SimdKernels.DotQ3K_Scalar` on a real Carnice `ffn_gate_exps` tensor slice. Stop there — no dispatcher wiring, no `UploadRaw` change, no `TierPlanner` change until per-kernel parity is green.

Rank 3 of 3 from CUDA-hybrid Carnice optimization workflow.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(cuda): llm_matvec_q3_k NVRTC + UploadRaw Q3_K/Q8_0 + TierPlanner byte accounting #100

Context

Honest framing — zero Carnice lift on 12 GB cards

Scope (v1)

Out of scope (v2+)

Risks

First step

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

perf(cuda): llm_matvec_q3_k NVRTC + UploadRaw Q3_K/Q8_0 + TierPlanner byte accounting #100

Description

Context

Honest framing — zero Carnice lift on 12 GB cards

Scope (v1)

Out of scope (v2+)

Risks

First step

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions