Skip to content

Research: mixed-precision expert cache (HOBBIT) — down-quantize cold experts below native #76

@pekkah

Description

@pekkah

Background

#73 fixes an accidental expansion (Q5_K → F32). This issue is the inverse, SOTA direction: deliberately store cold / low-importance experts at lower-than-native precision to fit more of them in the SLRU cache and cut PCIe transfer bytes.

HOBBIT (arXiv:2411.01433, built on llama.cpp) shows that for MoE inference the less important experts (low router weight, or rarely activated) tolerate aggressive low-bit quantization with negligible quality loss, while critical experts keep full precision. It combines this with token-/layer-/sequence-level prefetch + caching for significant speedups on edge devices.

For SharpInference on 12 GB this is attractive because the SLRU slot count is the binding constraint on big non-GDN MoEs: if cold experts cost (say) 2 bits instead of 4–5, we roughly double the resident expert count for the same VRAM.

Why a research spike (not direct implementation)

  1. Quality. "Negligible loss" is reported at fp16 baselines; behavior on top of our already-Q4_K_M/Q5_K GGUF weights (i.e. re-quantizing an already-quantized expert downward) is uncharacterized and could compound error.
  2. Importance signal. HOBBIT uses router-weight magnitude as the importance proxy. We have ExpertAccessProfiler (frequency) and the live router weights — need to decide which signal drives the precision tier and validate it.
  3. Kernel cost. A mixed-precision cache means the MoE matmul must dispatch per-expert by dtype (we already key matmul by dtype via _gpuWeightDTypes, so the plumbing is partly there) and we'd need a sub-Q4 dequant kernel on each backend.
  4. Interaction with Vulkan: add MatVecQ5K shader so Q5_K experts can be cached raw (currently dequantized to F32) #73 / the SLRU-parity work (Non-GDN CUDA MoE hybrid does whole-layer offload — bring to per-expert SLRU parity with the GDN path #72) / warm-pinning (Activation-aware expert caching: feed ExpertAccessProfiler into SLRU eviction + warm-pin hot experts #74). Hot experts → native quant (or pinned); cold experts → low-bit. Composition needs design.

Scope (research spike)

  1. Pick one non-GDN MoE (Mixtral 8x7B or Qwen3-30B-A3B Q4_K_M) and measure quality (HumanEval / GSM8K / perplexity) when the bottom-X% of experts (by router weight and by access frequency) are re-quantized to 2–3 bits.
  2. Decide the importance signal: router-weight vs profiler-frequency vs hybrid.
  3. Prototype a 2-bit expert dequant path on one backend (CUDA, reusing the dtype-keyed matmul dispatch).
  4. Measure resident-expert-count gain and decode-throughput vs the native-quant SLRU baseline.

Acceptance

  • Decision doc: is mixed-precision expert caching worth implementing for our models? Quality numbers + VRAM/throughput gain on ≥1 model/backend, or a documented rejection.

References

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions