Vulkan: add MatVecQ5K shader so Q5_K experts can be cached raw (currently dequantized to F32)

## Background

Cached experts should stay in their native quant format. `CudaExpertSlotManager` already does this for Q4_K / Q5_K / Q6_K via `UploadRaw` (`CudaExpertSlotManager.cs:162`). But two other paths **dequantize Q5_K to F32** on upload:

- **Vulkan `ExpertSlotManager`** — `ExpertSlotManager.cs:146` keeps Q4_K/Q6_K raw, but Q5_K falls through to the `Dequantize.ToFloat32` branch (`:156`), so it's cached at **4 bytes/element**.
- **Non-GDN `CudaHybridForwardPass`** resident upload — same pattern: Q4_K/Q6_K raw (`:1418`), Q5_K → F32 (`:1428`).

This matters concretely for Q5_K_M-quantized MoE models (and for qwen35moe's `ffn_down_exps`, though that runs on the CUDA-GDN path, not Vulkan): every cached Q5_K expert takes about 5.8× its on-disk size (1024 B instead of 176 B per 256-element block), cutting SLRU residency.
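The footprint difference can be worked out directly from the block sizes. The sketch below is only illustrative arithmetic: `expert_bytes` and `experts_per_budget` are hypothetical helper names, and the 1 Mi-element expert tensor and 1 GiB budget in the usage note are made-up numbers, not measurements from this codebase.

```python
BLOCK_ELEMS = 256          # elements per Q5_K super-block
Q5_K_BLOCK_BYTES = 176     # bytes per Q5_K super-block
F32_BYTES_PER_ELEM = 4     # bytes per element after dequantizing to F32


def expert_bytes(num_elements: int, raw_q5k: bool) -> int:
    """Cached size of one expert tensor, raw Q5_K or expanded to F32."""
    if num_elements % BLOCK_ELEMS != 0:
        raise ValueError("Q5_K tensors must be a multiple of 256 elements")
    if raw_q5k:
        return (num_elements // BLOCK_ELEMS) * Q5_K_BLOCK_BYTES
    return num_elements * F32_BYTES_PER_ELEM


def experts_per_budget(budget_bytes: int, num_elements: int, raw_q5k: bool) -> int:
    """How many whole expert tensors fit in a fixed cache budget."""
    return budget_bytes // expert_bytes(num_elements, raw_q5k)


# Per-block expansion factor when Q5_K is cached as F32.
EXPANSION = (BLOCK_ELEMS * F32_BYTES_PER_ELEM) / Q5_K_BLOCK_BYTES
```

For a hypothetical 1 Mi-element tensor and a 1 GiB budget, `experts_per_budget(1 << 30, 1 << 20, raw_q5k=False)` gives 256 slots, while the raw variant fits several times more, in line with the per-block expansion factor of 1024 / 176.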

## ⚠️ Correction (after source audit)

This is **not the one-line fix** the original issue implied. The Q5_K→F32 dequant on Vulkan is **deliberately correct today**, because:

- `VulkanBackend.MatMul(output, matrix, vector, weightDType)` only has cases for **`Float32`, `Q6_K`, and default→`Q4_K`** (`VulkanBackend.cs:1178`). **There is no Q5_K matvec pipeline.**
- There is a standalone Q5_K *dequant* shader (`Shaders.DequantQ5KM`, `:1935`) but **no fused Q5_K matvec** equivalent to `MatVecQ6K`.
- So if we uploaded Q5_K raw and tagged it `Q5_K` in `_gpuWeightDTypes`, `MatMul` would hit the `default` branch and decode it as **Q4_K**. The block layouts differ (Q4_K is 144 B per 256-element block, Q5_K is 176 B), so the output would be garbage.
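To make the layout mismatch concrete, here is a small sketch of which Q5_K bytes a Q4_K decoder would end up reading. The field lists follow the ggml K-quant structs as I understand them (Q4_K: `d`, `dmin`, `scales`, `qs`; Q5_K adds a 32-byte `qh` plane before `qs`); that this project uses the same layout is an assumption, and the helper names are hypothetical.

```python
# (field name, size in bytes) per 256-element super-block.
Q4_K_FIELDS = [("d", 2), ("dmin", 2), ("scales", 12), ("qs", 128)]
Q5_K_FIELDS = [("d", 2), ("dmin", 2), ("scales", 12), ("qh", 32), ("qs", 128)]


def field_offsets(fields):
    """Return ({name: (start, end)}, total_block_bytes) for a field list."""
    offsets, pos = {}, 0
    for name, size in fields:
        offsets[name] = (pos, pos + size)
        pos += size
    return offsets, pos


def field_at(fields, byte_offset):
    """Name of the field that a byte offset into a buffer of such blocks hits."""
    offsets, total = field_offsets(fields)
    within = byte_offset % total
    for name, (start, end) in offsets.items():
        if start <= within < end:
            return name
    raise AssertionError("unreachable: offsets cover the whole block")


Q4_K_BLOCK_BYTES = field_offsets(Q4_K_FIELDS)[1]   # 144
Q5_K_BLOCK_BYTES = field_offsets(Q5_K_FIELDS)[1]   # 176

# Where a Q4_K decoder looks for its quants in block 0, and where it thinks
# block 1 begins, expressed as fields of the actual Q5_K data.
q4_qs_start = field_offsets(Q4_K_FIELDS)[0]["qs"][0]
misread_qs = field_at(Q5_K_FIELDS, q4_qs_start)
misread_block1_start = field_at(Q5_K_FIELDS, Q4_K_BLOCK_BYTES)
```

Under these assumptions the Q4_K path would treat the Q5_K high-bit plane as low nibbles even in block 0, and from block 1 onward it would read its scales out of the middle of the previous block's quants.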

## Real scope

This is now a **Vulkan shader work item**, not a slot-manager tweak:

1. **Write a `MatVecQ5K` GLSL compute shader** mirroring `Shaders.MatVecQ6K` (per-block Q5_K dequant-dot; reuse the block-layout logic already in `DequantQ5KM`).
2. Add `_matVecQ5KPipeline` + a `case DType.Q5_K` in `VulkanBackend.MatMul` (`:1178`); dispose it alongside the others (`:1686`).
3. *Then* add `DType.Q5_K` to the raw-upload branch in `ExpertSlotManager.UploadExpertWeight:146` (and the non-GDN `CudaHybridForwardPass` resident path, or let #72 subsume that).
4. Validate Q5_K_M MoE output parity vs the current F32-dequant path on a Vulkan GPU.
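As a starting point for steps 1 and 4, a CPU reference for the per-block dequant-dot might look like the sketch below. It is not the shader and not code from this repository: it follows the ggml `block_q5_K` layout as I understand it (fp16 `d` and `dmin`, 12 bytes of packed 6-bit scales/mins, 32 bytes of high bits, 128 bytes of low nibbles), and it assumes `DequantQ5KM` uses that same layout. `scale_min`, `dequant_block_q5k` and `matvec_q5k` are hypothetical names.

```python
import struct

QK_K = 256          # elements per super-block
BLOCK_BYTES = 176   # 2 (d) + 2 (dmin) + 12 (scales) + 32 (qh) + 128 (qs)


def scale_min(j, scales):
    """Unpack the j-th 6-bit (scale, min) pair from the 12 packed bytes."""
    if j < 4:
        return scales[j] & 63, scales[j + 4] & 63
    sc = (scales[j + 4] & 0x0F) | ((scales[j - 4] >> 6) << 4)
    mn = (scales[j + 4] >> 4) | ((scales[j] >> 6) << 4)
    return sc, mn


def dequant_block_q5k(block):
    """Expand one 176-byte Q5_K block into 256 Python floats."""
    if len(block) != BLOCK_BYTES:
        raise ValueError("expected a 176-byte Q5_K block")
    d, dmin = struct.unpack_from("<ee", block, 0)   # two little-endian fp16
    scales, qh, qs = block[4:16], block[16:48], block[48:176]
    out = []
    for chunk in range(4):                          # 4 chunks of 64 elements
        ql = qs[32 * chunk: 32 * chunk + 32]
        sc1, m1 = scale_min(2 * chunk, scales)
        sc2, m2 = scale_min(2 * chunk + 1, scales)
        bit1, bit2 = 1 << (2 * chunk), 2 << (2 * chunk)
        # Low nibbles form the first 32 values, high nibbles the next 32;
        # the matching qh bit contributes the fifth bit (+16).
        out += [d * sc1 * ((ql[l] & 0x0F) + (16 if qh[l] & bit1 else 0)) - dmin * m1
                for l in range(32)]
        out += [d * sc2 * ((ql[l] >> 4) + (16 if qh[l] & bit2 else 0)) - dmin * m2
                for l in range(32)]
    return out


def matvec_q5k(matrix, rows, cols, vector):
    """Row-major Q5_K matrix (raw bytes) times a float vector of length cols."""
    if cols % QK_K != 0 or len(vector) != cols:
        raise ValueError("cols must be a multiple of 256 and match the vector")
    blocks_per_row = cols // QK_K
    result = []
    for r in range(rows):
        acc = 0.0
        for b in range(blocks_per_row):
            start = (r * blocks_per_row + b) * BLOCK_BYTES
            weights = dequant_block_q5k(matrix[start:start + BLOCK_BYTES])
            x = vector[b * QK_K:(b + 1) * QK_K]
            acc += sum(w * v for w, v in zip(weights, x))
        result.append(acc)
    return result
```

A fused shader would keep the same indexing but accumulate the dot product per block instead of materializing the 256 floats. The reference is deliberately slow and simple so it can serve as ground truth for a parity test.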

## Acceptance

- [ ] `MatVecQ5K` shader produces output matching the CPU Q5_K matvec within quant tolerance.
- [ ] Q5_K experts cached raw on Vulkan (≈5.8× more experts per fixed VRAM budget: 176 B instead of 1024 B per 256-element block), with end-to-end output matching the current F32-dequant path.
- [ ] No regression on Q4_K/Q6_K-only models.
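One way to phrase the "within quant tolerance" check is a relative-error comparison between the reference output and the shader output. This is a sketch only: `max_rel_error` and `outputs_match` are hypothetical helpers, and the default `tol=1e-3` is a placeholder assumption rather than a threshold taken from this project.

```python
def max_rel_error(reference, candidate, eps=1e-6):
    """Largest absolute difference, normalized by the reference's peak magnitude."""
    if len(reference) != len(candidate):
        raise ValueError("output vectors differ in length")
    if not reference:
        return 0.0
    scale = max(max(abs(r) for r in reference), eps)
    return max(abs(r - c) for r, c in zip(reference, candidate)) / scale


def outputs_match(reference, candidate, tol=1e-3):
    """True when the candidate stays within tol of the reference (placeholder tol)."""
    return max_rel_error(reference, candidate) <= tol
```

Normalizing by the peak magnitude rather than per element avoids spurious failures on outputs that are close to zero, at the cost of being more lenient on small entries.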

## Note on the CUDA side

`CudaExpertSlotManager` already keeps Q5_K raw and has a working Q5_K matvec path, so **the CUDA-GDN path already does the right thing**. Only the non-GDN `CudaHybridForwardPass` resident path expands Q5_K → F32, and that path is being reworked in #72 anyway.

## Related

- `Shaders.MatVecQ6K`, `Shaders.DequantQ5KM`, `VulkanBackend.MatMul` (`:1178`).
- `CudaExpertSlotManager.cs:162` — the CUDA side already does this correctly.
- #72 (non-GDN CUDA resident path rework).
- `docs/moe-expert-offloading-research.md` §1.2, §5 (P1).
