Research: mixed-precision expert cache (HOBBIT) — down-quantize cold experts below native #76

## Background

#73 fixes an accidental *expansion* (Q5_K → F32). This issue goes the opposite way, following current state-of-the-art work: deliberately store **cold / low-importance experts at lower-than-native precision** so that more of them fit in the SLRU cache and fewer bytes cross PCIe.

**HOBBIT** (arXiv:2411.01433, built on llama.cpp) shows that for MoE inference the *less important* experts (low router weight, or rarely activated) tolerate aggressive low-bit quantization with negligible quality loss, while critical experts keep full precision. It combines this with token-/layer-/sequence-level prefetch + caching for significant speedups on edge devices.

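For SharpInference with 12 GB of VRAM this is attractive because the SLRU slot count is the binding constraint on big non-GDN MoEs: if cold experts cost (say) 2 bits per weight instead of 4–5, each cold expert takes roughly half the space, so the cache can hold up to about twice as many experts in the same VRAM. The gain approaches 2× only as the share of cold experts in the cache approaches 100%.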
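
The slot-count arithmetic can be sketched as follows. This is a back-of-the-envelope estimate, not a model of the real SLRU: `resident_experts`, the 6 GiB cache budget, and the bits-per-weight values are all illustrative assumptions, and the expert size is chosen to be roughly one Mixtral-8x7B-sized expert (three 4096 × 14336 matrices). It ignores per-block metadata, alignment, and anything else sharing the cache.

```python
def resident_experts(cache_bytes, params_per_expert, hot_bpw, cold_bpw, cold_fraction):
    """Estimate how many experts fit in `cache_bytes`.

    hot_bpw / cold_bpw are effective bits per weight for the native and
    low-bit tiers; cold_fraction is the share of cached experts stored low-bit.
    All inputs are hypothetical knobs for this sketch.
    """
    hot_bytes = params_per_expert * hot_bpw / 8
    cold_bytes = params_per_expert * cold_bpw / 8
    avg_bytes = (1 - cold_fraction) * hot_bytes + cold_fraction * cold_bytes
    return int(cache_bytes // avg_bytes)


CACHE_BYTES = 6 * 1024**3            # assumed cache budget (6 GiB of the 12 GB)
PARAMS_PER_EXPERT = 3 * 4096 * 14336  # assumed expert size

for frac in (0.0, 0.5, 1.0):
    n = resident_experts(CACHE_BYTES, PARAMS_PER_EXPERT, 4.5, 2.5, frac)
    print(f"cold_fraction={frac:.1f} -> {n} resident experts")
```

Under these assumed numbers the all-native cache holds 65 experts, a half-cold cache 83, and an all-cold cache 117, so the realistic gain depends heavily on how many experts can safely be demoted.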

## Why a research spike (not direct implementation)

1. **Quality.** "Negligible loss" is reported against fp16 baselines; behavior on top of our already-Q4_K_M/Q5_K GGUF weights (i.e. *re*-quantizing an already-quantized expert downward) is uncharacterized, and the two rounds of quantization error could compound.
2. **Importance signal.** HOBBIT uses router-weight magnitude as the importance proxy. We have `ExpertAccessProfiler` (frequency) and the live router weights — need to decide which signal drives the precision tier and validate it.
3. **Kernel cost.** A mixed-precision cache means the MoE matmul must dispatch per expert by dtype. We already key matmul by dtype via `_gpuWeightDTypes`, so the plumbing is partly there, but we would still need a sub-Q4 dequant kernel on each backend.
4. **Interaction with #73 / the SLRU-parity work (#72) / warm-pinning (#74).** Hot experts → native quant (or pinned); cold experts → low-bit. Composition needs design.
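
To make the re-quantization concern in point 1 concrete, here is a toy sketch that compares quantizing to a low bit width directly from the original weights against quantizing from an already 4-bit copy. Everything in it is an assumption for illustration: `fake_quantize` is a plain per-block min/max round-to-nearest quantizer (not a K-quant), the weights are synthetic Gaussians, and weight-space MSE is only a crude stand-in for the model-level quality metrics the spike would actually measure.

```python
import random


def fake_quantize(block, bits):
    """Quantize then dequantize one block with min/max round-to-nearest."""
    lo, hi = min(block), max(block)
    if hi == lo:
        return list(block)
    steps = (1 << bits) - 1
    scale = (hi - lo) / steps
    return [lo + round((w - lo) / scale) * scale for w in block]


def fake_quantize_tensor(weights, bits, block_size=32):
    """Apply fake_quantize independently to consecutive blocks."""
    out = []
    for i in range(0, len(weights), block_size):
        out.extend(fake_quantize(weights[i:i + block_size], bits))
    return out


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def requant_report(weights, native_bits=4, cold_bits=3):
    """Error vs the original weights for three storage choices."""
    native = fake_quantize_tensor(weights, native_bits)
    direct = fake_quantize_tensor(weights, cold_bits)   # original -> cold
    requant = fake_quantize_tensor(native, cold_bits)   # native -> cold
    return {
        "native": mse(weights, native),
        "direct": mse(weights, direct),
        "requant": mse(weights, requant),
    }


rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]
for name, err in requant_report(weights).items():
    print(f"{name:8s} mse = {err:.3e}")
```

The gap between `direct` and `requant` is the part that published fp16-baseline results do not cover. Whether it matters for real K-quant formats and real experts is exactly what the spike has to measure.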

## Scope (research spike)

1. Pick one non-GDN MoE (Mixtral 8x7B or Qwen3-30B-A3B Q4_K_M) and measure quality (HumanEval / GSM8K / perplexity) when the bottom-X% of experts (by router weight and by access frequency) are re-quantized to 2–3 bits.
2. Decide the importance signal: router-weight vs profiler-frequency vs hybrid.
3. Prototype a 2-bit expert dequant path on **one** backend (CUDA, reusing the dtype-keyed matmul dispatch).
4. Measure resident-expert-count gain and decode-throughput vs the native-quant SLRU baseline.
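
For steps 1 and 2, one way to compare the candidate importance signals is a single blended score with a mixing knob. The sketch below is a hypothetical offline tiering helper, not HOBBIT's method and not existing SharpInference code: `assign_precision_tiers`, `alpha`, `cold_fraction`, and the sample numbers are all made up, and the inputs stand in for per-expert mean router weight and the access counts `ExpertAccessProfiler` would provide.

```python
def assign_precision_tiers(router_weight, access_freq, cold_fraction=0.25, alpha=0.5):
    """Map expert_id -> "native" or "low_bit".

    router_weight, access_freq: dicts of expert_id -> non-negative number.
    alpha = 1.0 ranks by router weight only, 0.0 by access frequency only;
    values in between give a hybrid score.
    """
    def normalize(values):
        top = max(values.values()) or 1.0  # avoid dividing by zero
        return {k: v / top for k, v in values.items()}

    rw = normalize(router_weight)
    af = normalize(access_freq)
    score = {e: alpha * rw[e] + (1 - alpha) * af[e] for e in rw}

    # Lowest-scoring experts go to the low-bit tier; ties break by expert id.
    ranked = sorted(score, key=lambda e: (score[e], e))
    cold = set(ranked[:int(len(ranked) * cold_fraction)])
    return {e: ("low_bit" if e in cold else "native") for e in score}


# Made-up statistics for one layer with 8 experts.
router_weight = {0: 0.30, 1: 0.05, 2: 0.20, 3: 0.02, 4: 0.15, 5: 0.10, 6: 0.08, 7: 0.10}
access_freq = {0: 900, 1: 40, 2: 500, 3: 300, 4: 450, 5: 30, 6: 200, 7: 350}

for alpha in (1.0, 0.0, 0.5):
    tiers = assign_precision_tiers(router_weight, access_freq, alpha=alpha)
    cold = sorted(e for e, t in tiers.items() if t == "low_bit")
    print(f"alpha={alpha}: low-bit experts {cold}")
```

With this sample data the router-weight ranking demotes experts 1 and 3, while the frequency and hybrid rankings demote 1 and 5. Disagreements like that are the cases worth checking in the quality measurements.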

## Acceptance

- [ ] Decision doc: is mixed-precision expert caching worth implementing for our models? Quality numbers + VRAM/throughput gain on ≥1 model/backend, or a documented rejection.

## References

- HOBBIT: https://hf.co/papers/2411.01433
- AdapMoE (sensitivity-based expert management): https://hf.co/papers/2408.10284
- PreMoe (expert pruning/retrieval, constrained memory): https://hf.co/papers/2505.17639

## Related

- #72 (per-expert SLRU on the non-GDN CUDA path — prerequisite for this to matter there)
- #73 (stop Q5_K→F32 expansion — the opposite-direction quick fix)
- #74 (activation-aware caching — shares the importance signal)
- `docs/moe-expert-offloading-research.md` §3 Axis E, §5 (P0/HOBBIT note).
