Vulkan: add MatVecQ5K shader so Q5_K experts can be cached raw (currently dequantized to F32) #73

@pekkah

Background

Cached experts should stay in their native quant format. CudaExpertSlotManager already does this for Q4_K / Q5_K / Q6_K via UploadRaw (CudaExpertSlotManager.cs:162). But two other paths dequantize Q5_K to F32 on upload:

  • Vulkan ExpertSlotManager: ExpertSlotManager.cs:146 keeps Q4_K/Q6_K raw, but Q5_K falls through to the Dequantize.ToFloat32 branch (:156), so it is cached at 4 bytes/element.
  • Non-GDN CudaHybridForwardPass resident upload — same pattern: Q4_K/Q6_K raw (:1418), Q5_K → F32 (:1428).

This matters concretely for Q5_K_M-quantized MoE models (and qwen35moe's ffn_down_exps, though that runs on the CUDA-GDN path, not Vulkan): every cached Q5_K expert occupies about 5.8× its on-disk size (1024 B as F32 vs 176 B raw per 256-element block), which cuts SLRU residency by the same factor.
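
The footprint arithmetic can be sketched as follows. This is a minimal illustration, not code from the repository: the block sizes are the ones quoted in this issue, while the helper names, the 1 GiB budget, and the expert size used in the usage note are hypothetical.

```python
QK_K = 256  # elements per K-quant block

# Bytes per 256-element block. 144 B and 176 B are the figures quoted in
# this issue; F32 is simply 4 bytes/element.
BLOCK_BYTES = {"Q4_K": 144, "Q5_K": 176, "F32": QK_K * 4}


def bytes_per_element(dtype: str) -> float:
    """Average storage cost of one weight in the given format."""
    return BLOCK_BYTES[dtype] / QK_K


def expansion_vs_f32(dtype: str) -> float:
    """How many times larger the F32 copy is than the raw quantized copy."""
    return BLOCK_BYTES["F32"] / BLOCK_BYTES[dtype]


def experts_per_budget(budget_bytes: int, elements_per_expert: int, dtype: str) -> int:
    """Whole experts that fit in a budget (hypothetical helper).

    Assumes elements_per_expert is a multiple of QK_K.
    """
    blocks = elements_per_expert // QK_K
    return budget_bytes // (blocks * BLOCK_BYTES[dtype])
```

For example, `experts_per_budget(1 << 30, 1 << 20, "F32")` versus the same call with `"Q5_K"` shows how many more hypothetical 1M-element experts fit in 1 GiB once the weights stay raw.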

⚠️ Correction (after source audit)

This is not the one-line fix the original issue implied. The Q5_K→F32 dequant on Vulkan is deliberately correct today, because:

  • VulkanBackend.MatMul(output, matrix, vector, weightDType) only has cases for Float32, Q6_K, and default→Q4_K (VulkanBackend.cs:1178). There is no Q5_K matvec pipeline.
  • There is a standalone Q5_K dequant shader (Shaders.DequantQ5KM, :1935) but no fused Q5_K matvec equivalent to MatVecQ6K.
  • So if we uploaded Q5_K raw and tagged it Q5_K in _gpuWeightDTypes, MatMul would hit the default branch and decode it as Q4_K. The block layouts differ (144 B for Q4_K vs 176 B for Q5_K per 256-element block), so the output would be garbage.
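
To make the layout mismatch concrete, here is a small sketch of where a Q4_K-strided reader lands inside Q5_K data. The block totals match the figures above; the per-field split follows the ggml K-quant layout and is an assumption about this codebase, and all function names are hypothetical.

```python
# (field name, size in bytes) per 256-element block.
# Assumption: field order and sizes follow the ggml K-quant layout.
Q4_K_FIELDS = [("d", 2), ("dmin", 2), ("scales", 12), ("qs", 128)]
Q5_K_FIELDS = [("d", 2), ("dmin", 2), ("scales", 12), ("qh", 32), ("qs", 128)]


def block_size(fields):
    """Total bytes in one block."""
    return sum(size for _, size in fields)


def field_at(fields, offset):
    """Name of the field that owns byte `offset` within one block."""
    pos = 0
    for name, size in fields:
        if pos <= offset < pos + size:
            return name
        pos += size
    raise IndexError(offset)


def q4k_view_of_q5k(n_blocks):
    """For each block a Q4_K decoder thinks it starts, report the Q5_K block
    index and field that its first byte actually falls in."""
    stride4 = block_size(Q4_K_FIELDS)
    stride5 = block_size(Q5_K_FIELDS)
    view = []
    for i in range(n_blocks):
        byte = i * stride4
        view.append((i, byte // stride5, field_at(Q5_K_FIELDS, byte % stride5)))
    return view
```

Under these assumptions only block 0 starts on a real block boundary, and even there byte 16 is `qh` in Q5_K but would be read as `qs` by a Q4_K decoder. Every later block starts in the middle of another block's quant data.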

Real scope

This is now a Vulkan shader work item, not a slot-manager tweak:

  1. Write a MatVecQ5K GLSL compute shader mirroring Shaders.MatVecQ6K (per-block Q5_K dequant-dot; reuse the block-layout logic already in DequantQ5KM).
  2. Add _matVecQ5KPipeline + a case DType.Q5_K in VulkanBackend.MatMul (:1178); dispose it alongside the others (:1686).
  3. Then add DType.Q5_K to the raw-upload branch in ExpertSlotManager.UploadExpertWeight (:146), and to the non-GDN CudaHybridForwardPass resident path, unless #72 (per-expert SLRU parity for the non-GDN CUDA MoE hybrid path) subsumes that.
  4. Validate Q5_K_M MoE output parity vs the current F32-dequant path on a Vulkan GPU.
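
For steps 1 and 4, a CPU reference for the per-block dequant-dot is useful to have next to the shader. The sketch below is not the repository's implementation: it assumes the ggml Q5_K block layout (fp16 d, fp16 dmin, 12 packed scale bytes, 32 qh bytes, 128 qs bytes, 176 B total) and uses hypothetical function names. A MatVecQ5K shader would compute the same sum per output row without materializing the F32 weights.

```python
import struct

QK_K = 256
BLOCK_BYTES = 176  # 2 (d) + 2 (dmin) + 12 (scales) + 32 (qh) + 128 (qs)


def scale_min(j, scales):
    """Unpack the j-th 6-bit (scale, min) pair from the 12 packed bytes."""
    if j < 4:
        return scales[j] & 63, scales[j + 4] & 63
    sc = (scales[j + 4] & 0x0F) | ((scales[j - 4] >> 6) << 4)
    mn = (scales[j + 4] >> 4) | ((scales[j] >> 6) << 4)
    return sc, mn


def dequant_block(block):
    """Decode one 176-byte Q5_K block into 256 floats."""
    d, dmin = struct.unpack_from("<ee", block, 0)  # two little-endian fp16
    scales, qh, qs = block[4:16], block[16:48], block[48:176]
    out = []
    for g in range(4):  # four groups of 64 elements
        ql = qs[32 * g:32 * g + 32]
        sc1, m1 = scale_min(2 * g, scales)
        sc2, m2 = scale_min(2 * g + 1, scales)
        hi1, hi2 = 1 << (2 * g), 2 << (2 * g)  # qh bit masks for this group
        # Low nibbles give the first 32 values, high nibbles the next 32;
        # the matching qh bit supplies the fifth bit (+16).
        out += [d * sc1 * ((ql[l] & 0x0F) + (16 if qh[l] & hi1 else 0)) - dmin * m1
                for l in range(32)]
        out += [d * sc2 * ((ql[l] >> 4) + (16 if qh[l] & hi2 else 0)) - dmin * m2
                for l in range(32)]
    return out


def matvec_row(row_bytes, vec):
    """Dot product of one Q5_K-encoded row with an F32 vector."""
    acc = 0.0
    for b in range(len(row_bytes) // BLOCK_BYTES):
        w = dequant_block(row_bytes[b * BLOCK_BYTES:(b + 1) * BLOCK_BYTES])
        x = vec[b * QK_K:(b + 1) * QK_K]
        acc += sum(wi * xi for wi, xi in zip(w, x))
    return acc
```

Calling `matvec_row` once per matrix row gives a reference output vector to compare against both the shader and the existing F32-dequant path.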

Acceptance

  • MatVecQ5K shader produces output matching the CPU Q5_K matvec within quant tolerance.
  • Q5_K experts are cached raw on Vulkan (≈5.8× as many experts per fixed VRAM budget, since a 256-element block shrinks from 1024 B to 176 B), with identical end-to-end output.
  • No regression on Q4_K/Q6_K-only models.
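
One way to phrase the "within quant tolerance" check is a max-error comparison normalized by the reference magnitude. This is only a sketch: the function names and the default `rel_tol` are hypothetical, and the actual threshold would need to be chosen from measurements on a real model.

```python
def max_rel_error(reference, candidate, eps=1e-6):
    """Largest absolute difference, scaled by the largest reference magnitude."""
    if len(reference) != len(candidate):
        raise ValueError("length mismatch")
    if not reference:
        return 0.0
    scale = max(max(abs(r) for r in reference), eps)
    return max(abs(r - c) for r, c in zip(reference, candidate)) / scale


def within_tolerance(reference, candidate, rel_tol=1e-3):
    """True if the candidate matches the reference within rel_tol (hypothetical default)."""
    return max_rel_error(reference, candidate) <= rel_tol
```

Here `reference` would be the CPU Q5_K matvec output and `candidate` the values read back from the GPU.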

Note on the CUDA side

CudaExpertSlotManager already keeps Q5_K raw and has a working Q5_K matvec path, so the CUDA-GDN path already does the right thing. Only the non-GDN CudaHybridForwardPass resident path expands Q5_K → F32, and that path is being reworked in #72 anyway.
