Cached experts should stay in their native quant format. CudaExpertSlotManager already does this for Q4_K / Q5_K / Q6_K via UploadRaw (CudaExpertSlotManager.cs:162). But two other paths dequantize Q5_K to F32 on upload:
Vulkan ExpertSlotManager — ExpertSlotManager.cs:146 keeps Q4_K/Q6_K raw, but Q5_K falls through to the Dequantize.ToFloat32 branch (:156), so it's cached at 4 bytes/element.
Non-GDN CudaHybridForwardPass resident upload — same pattern: Q4_K/Q6_K raw (:1418), Q5_K → F32 (:1428).
This matters concretely for Q5_K_M-quantized MoE models (and qwen35moe's ffn_down_exps, though that runs on the CUDA-GDN path, not Vulkan): every cached Q5_K expert takes 4 bytes/element (1024 B per 256-element block) instead of 176 B per block, about 5.8× its on-disk size, cutting SLRU residency.
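As a quick check of the footprint claim, the sketch below computes the cached size per format from the block sizes quoted in this issue. The Q6_K figure (210 B) is the usual GGUF K-quant block size and is an assumption about this codebase; the helper names are hypothetical.

```python
# Bytes per 256-element super-block for each cached format.
# Q4_K (144 B) and Q5_K (176 B) are the sizes quoted in this issue.
# Q6_K (210 B) is assumed from the standard GGUF K-quant layout.
BLOCK_ELEMS = 256
BLOCK_BYTES = {"Q4_K": 144, "Q5_K": 176, "Q6_K": 210, "F32": 4 * BLOCK_ELEMS}

def cached_bytes(n_elements, dtype):
    """Size of a tensor with n_elements weights when cached as dtype."""
    assert n_elements % BLOCK_ELEMS == 0, "K-quants store whole 256-element blocks"
    return (n_elements // BLOCK_ELEMS) * BLOCK_BYTES[dtype]

def expansion_vs_raw(dtype):
    """How much larger the F32 copy is than the raw quantized copy."""
    return BLOCK_BYTES["F32"] / BLOCK_BYTES[dtype]

def experts_in_budget(budget_bytes, elems_per_expert, dtype):
    """Whole experts of one dtype that fit in a fixed cache budget."""
    return budget_bytes // cached_bytes(elems_per_expert, dtype)
```

With these numbers, `expansion_vs_raw("Q5_K")` is 1024 / 176, so an F32-cached Q5_K expert is a little under six times the size of the raw one.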
⚠️ Correction (after source audit)
This is not the one-line fix the original issue implied. The Q5_K→F32 dequant on Vulkan is deliberately correct today, because:
VulkanBackend.MatMul(output, matrix, vector, weightDType) only has cases for Float32, Q6_K, and default→Q4_K (VulkanBackend.cs:1178). There is no Q5_K matvec pipeline.
There is a standalone Q5_K dequant shader (Shaders.DequantQ5KM, :1935) but no fused Q5_K matvec equivalent to MatVecQ6K.
So if we uploaded Q5_K raw and tagged it Q5_K in _gpuWeightDTypes, MatMul would hit the default branch and decode it as Q4_K — wrong block layout (144 B vs 176 B per 256-elem block) → garbage output.
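The fall-through can be illustrated with a small model of the dispatch. This is a Python sketch of the behaviour described above, not the actual C# in VulkanBackend.MatMul; the function names are hypothetical and the Q6_K block size (210 B) is assumed from the standard GGUF layout.

```python
BLOCK_BYTES = {"Q4_K": 144, "Q5_K": 176, "Q6_K": 210}

def select_matvec_kernel(weight_dtype):
    """Mimics the dispatch shape described above: F32, Q6_K, otherwise Q4_K."""
    if weight_dtype == "F32":
        return "F32"
    if weight_dtype == "Q6_K":
        return "Q6_K"
    return "Q4_K"  # default branch: every other dtype is decoded as Q4_K

def block_offsets(dtype, n_blocks):
    """Byte offset at which each block of the given dtype starts."""
    return [i * BLOCK_BYTES[dtype] for i in range(n_blocks)]

def misaligned_blocks(stored_dtype, n_blocks):
    """Count blocks the selected kernel would start reading at the wrong offset.

    This only captures the stride mismatch. Block 0 starts at the right byte
    but would still decode incorrectly, because the field layouts differ.
    """
    kernel = select_matvec_kernel(stored_dtype)
    actual = block_offsets(stored_dtype, n_blocks)
    assumed = block_offsets(kernel, n_blocks)
    return sum(a != b for a, b in zip(actual, assumed))
```

For a raw Q5_K buffer, every block after the first is read from the wrong offset, which is why the F32 fallback is the safe choice until a Q5_K kernel exists.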
Real scope
This is now a Vulkan shader work item, not a slot-manager tweak:
Write a MatVecQ5K GLSL compute shader mirroring Shaders.MatVecQ6K (per-block Q5_K dequant-dot; reuse the block-layout logic already in DequantQ5KM).
Add _matVecQ5KPipeline + a case DType.Q5_K in VulkanBackend.MatMul (:1178); dispose it alongside the others (:1686).
Validate Q5_K_M MoE output parity vs the current F32-dequant path on a Vulkan GPU.
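For the parity step, a CPU reference of the per-block dequant-dot is useful to compare the shader against. The sketch below assumes the standard ggml/GGUF Q5_K block layout (fp16 d and dmin, 12 bytes of packed 6-bit scales and mins, 32 bytes of high bits, 128 bytes of low nibbles); it should be checked against DequantQ5KM before being trusted, and the function names are hypothetical.

```python
import struct

QK_K = 256          # elements per super-block
BLOCK_BYTES = 176   # 2 (d) + 2 (dmin) + 12 (scales) + 32 (qh) + 128 (qs)

def scale_min(j, scales):
    """Unpack the 6-bit scale and min of sub-block j (0..7) from 12 packed bytes."""
    if j < 4:
        return scales[j] & 63, scales[j + 4] & 63
    sc = (scales[j + 4] & 0x0F) | ((scales[j - 4] >> 6) << 4)
    mn = (scales[j + 4] >> 4) | ((scales[j] >> 6) << 4)
    return sc, mn

def dequant_block(block):
    """Dequantize one 176-byte Q5_K block to 256 Python floats."""
    assert len(block) == BLOCK_BYTES
    d, dmin = struct.unpack_from("<ee", block, 0)  # two little-endian fp16 values
    scales, qh, qs = block[4:16], block[16:48], block[48:176]
    out = []
    for chunk in range(4):                 # 64 elements per chunk
        ql = qs[32 * chunk:32 * chunk + 32]
        for half in range(2):              # low nibbles first, then high nibbles
            sc, mn = scale_min(2 * chunk + half, scales)
            bit = 1 << (2 * chunk + half)  # which bit of qh holds the 5th bit
            for l in range(32):
                lo = (ql[l] >> 4) if half else (ql[l] & 0x0F)
                hi = 16 if (qh[l] & bit) else 0
                out.append(d * sc * (lo + hi) - dmin * mn)
    return out

def matvec_row_q5k(row_bytes, x):
    """Dot product of one Q5_K weight row with x, one block at a time."""
    assert len(x) % QK_K == 0
    acc = 0.0
    for b in range(len(x) // QK_K):
        w = dequant_block(row_bytes[b * BLOCK_BYTES:(b + 1) * BLOCK_BYTES])
        acc += sum(wi * xi for wi, xi in zip(w, x[b * QK_K:(b + 1) * QK_K]))
    return acc
```

A fused shader would compute the same per-block sum without materialising the 256 floats, so its output should agree with `matvec_row_q5k` up to floating-point accumulation order.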
Acceptance
MatVecQ5K shader produces output matching the CPU Q5_K matvec within quant tolerance.
Q5_K experts cached raw on Vulkan (up to ≈5.8× more Q5_K experts per fixed VRAM budget: 176 B instead of 1024 B per 256-element block), identical end-to-end output.
No regression on Q4_K/Q6_K-only models.
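The "within quant tolerance" criterion could be checked with a helper along these lines. This is a minimal sketch: the function names are hypothetical and the default tolerance of 1e-3 is a placeholder, not a value taken from this project.

```python
import math

def max_rel_error(candidate, reference, eps=1e-6):
    """Largest absolute difference, normalised by the reference's peak magnitude."""
    assert len(candidate) == len(reference)
    scale = max(max(abs(r) for r in reference), eps)
    return max(abs(c - r) for c, r in zip(candidate, reference)) / scale

def outputs_match(candidate, reference, rel_tol=1e-3):
    """True if every candidate value is finite and the outputs agree within rel_tol."""
    if not all(math.isfinite(c) for c in candidate):
        return False  # NaN or inf usually means a misdecoded block
    return max_rel_error(candidate, reference) <= rel_tol
```

The same check can be run on Q4_K and Q6_K outputs before and after the change to cover the regression criterion.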
Note on the CUDA side
CudaExpertSlotManager already keeps Q5_K raw and has a working Q5_K matvec path, so the CUDA-GDN path already does the right thing. Only the non-GDN CudaHybridForwardPass resident path expands Q5_K → F32, and that path is being reworked in #72 anyway.
Remaining slot-manager step, once the Q5_K matvec pipeline exists: add DType.Q5_K to the raw-upload branch in ExpertSlotManager.UploadExpertWeight (:146), and to the non-GDN CudaHybridForwardPass resident path, or let #72 (Non-GDN CUDA MoE hybrid does whole-layer offload; bring to per-expert SLRU parity with the GDN path) subsume that.
Related
Shaders.MatVecQ6K, Shaders.DequantQ5KM, VulkanBackend.MatMul (:1178).
CudaExpertSlotManager.cs:162: the CUDA side already does this correctly.
docs/moe-expert-offloading-research.md §1.2, §5 (P1).