You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
#73 fixes an accidental expansion (Q5_K → F32). This issue is the inverse, SOTA direction: deliberately store cold / low-importance experts at lower-than-native precision to fit more of them in the SLRU cache and cut PCIe transfer bytes.
HOBBIT (arXiv:2411.01433, built on llama.cpp) shows that for MoE inference the less important experts (low router weight, or rarely activated) tolerate aggressive low-bit quantization with negligible quality loss, while critical experts keep full precision. It combines this with token-/layer-/sequence-level prefetch + caching for significant speedups on edge devices.
For SharpInference on 12 GB this is attractive because the SLRU slot count is the binding constraint on big non-GDN MoEs: if cold experts cost (say) 2 bits instead of 4–5, we roughly double the resident expert count for the same VRAM.
Why a research spike (not direct implementation)
Quality. "Negligible loss" is reported at fp16 baselines; behavior on top of our already-Q4_K_M/Q5_K GGUF weights (i.e. re-quantizing an already-quantized expert downward) is uncharacterized and could compound error.
Importance signal. HOBBIT uses router-weight magnitude as the importance proxy. We have ExpertAccessProfiler (frequency) and the live router weights — need to decide which signal drives the precision tier and validate it.
Kernel cost. A mixed-precision cache means the MoE matmul must dispatch per-expert by dtype (we already key matmul by dtype via _gpuWeightDTypes, so the plumbing is partly there) and we'd need a sub-Q4 dequant kernel on each backend.
Pick one non-GDN MoE (Mixtral 8x7B or Qwen3-30B-A3B Q4_K_M) and measure quality (HumanEval / GSM8K / perplexity) when the bottom-X% of experts (by router weight and by access frequency) are re-quantized to 2–3 bits.
Decide the importance signal: router-weight vs profiler-frequency vs hybrid.
Prototype a 2-bit expert dequant path on one backend (CUDA, reusing the dtype-keyed matmul dispatch).
Measure resident-expert-count gain and decode-throughput vs the native-quant SLRU baseline.
Acceptance
Decision doc: is mixed-precision expert caching worth implementing for our models? Quality numbers + VRAM/throughput gain on ≥1 model/backend, or a documented rejection.
Background
#73 fixes an accidental expansion (Q5_K → F32). This issue is the inverse, SOTA direction: deliberately store cold / low-importance experts at lower-than-native precision to fit more of them in the SLRU cache and cut PCIe transfer bytes.
HOBBIT (arXiv:2411.01433, built on llama.cpp) shows that for MoE inference the less important experts (low router weight, or rarely activated) tolerate aggressive low-bit quantization with negligible quality loss, while critical experts keep full precision. It combines this with token-/layer-/sequence-level prefetch + caching for significant speedups on edge devices.
For SharpInference on 12 GB this is attractive because the SLRU slot count is the binding constraint on big non-GDN MoEs: if cold experts cost (say) 2 bits instead of 4–5, we roughly double the resident expert count for the same VRAM.
Why a research spike (not direct implementation)
ExpertAccessProfiler(frequency) and the live router weights — need to decide which signal drives the precision tier and validate it._gpuWeightDTypes, so the plumbing is partly there) and we'd need a sub-Q4 dequant kernel on each backend.Scope (research spike)
Acceptance
References
Related
docs/moe-expert-offloading-research.md§3 Axis E, §5 (P0/HOBBIT note).