Predictive expert prefetch for MoE (pre-gated routing / PreScope-style) #50

## Background

The SLRU expert cache + async prefetcher in `Pipeline` is **reactive**: a cache miss triggers a fetch, and the decode thread stalls on the NVMe / pinned-RAM read for the missing expert. On 12 GB cards running Mixtral / Qwen3-30B-A3B / Qwen3.6-35B-A3B-MTP, the SLRU can hold only a fraction of the routed-expert set, so miss-driven stalls dominate the MoE FFN tail at long context.

Two recent lines of work hide this stall by **predicting** which experts the next MoE layer will activate, so the prefetcher can start the transfer during the **current** layer's compute (a misprediction still falls back to the reactive fetch):

1. **Pre-gated MoE** (Microsoft, ISCA 2024) — adds a tiny pre-gate head trained jointly with the model. The N-th pre-gate predicts layer (N+1)'s expert selection. Requires fine-tuning the model.
2. **PreScope** (arXiv:2509.23638, Sep 2025) — training-free auxiliary predictor running alongside inference. No model modification.

Both fit cleanly into our existing infra: predictor output → `_expertSlotManager` reservation hint → `Pipeline` async prefetch.
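
The flow above (predictor output → hint → async prefetch) can be sketched roughly as follows. This is a minimal illustration, not the real `Pipeline` or `_expertSlotManager` API: `HintedPrefetcher`, `hint`, `get`, and the `load_expert` callable are all hypothetical names, and a thread pool stands in for the NVMe / pinned-RAM read path.

```python
from concurrent.futures import ThreadPoolExecutor


class HintedPrefetcher:
    """Hypothetical stand-in for the async prefetcher (not the real Pipeline API)."""

    def __init__(self, load_expert, max_workers=2):
        self._load = load_expert  # assumed callable: (layer, expert) -> weights
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._inflight = {}       # (layer, expert) -> Future

    def hint(self, layer, experts):
        """Start reads for predicted experts without blocking the caller."""
        for expert in experts:
            key = (layer, expert)
            if key not in self._inflight:
                self._inflight[key] = self._pool.submit(self._load, layer, expert)

    def get(self, layer, expert):
        """Return weights; blocks only if the read was not hinted or is unfinished."""
        future = self._inflight.pop((layer, expert), None)
        if future is None:
            # Misprediction: fall back to the reactive, miss-driven fetch.
            return self._load(layer, expert)
        return future.result()

    def close(self):
        self._pool.shutdown(wait=True)
```

In this sketch the decode thread would call `hint(N + 1, predicted)` while layer N computes, then `get(N + 1, e)` for each expert the real router picks. A correct prediction means the read is already done or in flight, and a wrong one costs the same as today's reactive path.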

## Scope

Phase 1 — **PreScope-style predictor (training-free, no fine-tune required)**:

1. Add a lightweight predictor module per MoE layer. Inputs: hidden state at layer N. Output: top-K expert indices for layer N+1.
2. Plumb predictions into the SLRU expert cache as **prefetch hints** (don't evict anything to make room until the actual gating fires; just start the read early).
3. Track prediction accuracy and prefetch hit rate via `SHARPI_TRACE_MOE=1`.
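
As a rough illustration of step 1, one simple training-free heuristic is to score layer N's hidden state against layer N+1's existing router weights and take the top-K. This is an assumption chosen for illustration and is not confirmed to be what PreScope does. `predict_next_experts` and `next_router_w` are hypothetical names, and plain Python lists stand in for tensors.

```python
import heapq


def predict_next_experts(hidden, next_router_w, k):
    """Guess layer N+1's top-k experts from layer N's hidden state.

    hidden:        list of floats, hidden state at layer N
    next_router_w: list of rows, one per expert of layer N+1 (assumed layout)
    k:             number of experts the router activates
    """
    # One dot product per expert: the same scoring the real router would do,
    # but applied one layer early to a slightly stale hidden state.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in next_router_w]
    # Indices of the k largest logits, highest score first.
    return heapq.nlargest(k, range(len(logits)), key=logits.__getitem__)
```

The returned indices are only hints: per step 2 they would start reads early, while eviction decisions still wait for the actual gating result.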

Phase 2 — **Pre-gated head (training-required, larger lift)**:

4. Add support in `ModelGraph` for an optional `pre_gate_w[layer]` weight tensor.
5. Train pre-gate heads for Qwen3.6-35B-A3B-MTP and Mixtral 8x7B (or use community-published ones if available).
6. Skip the predictor entirely when the model ships pre-gate weights.
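
Step 6 amounts to a per-layer dispatch: use the shipped pre-gate weights when they exist, otherwise fall back to the training-free predictor. A minimal sketch, assuming `pre_gate_w` is exposed as a mapping from layer index to a weight matrix (the real `ModelGraph` representation may differ) and using the hypothetical names `predict_layer` and `fallback`:

```python
import heapq


def top_k(scores, k):
    """Indices of the k largest scores, highest first."""
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)


def matvec(w, x):
    """Plain-Python matrix-vector product; rows of w are per-expert weights."""
    return [sum(a * b for a, b in zip(row, x)) for row in w]


def predict_layer(hidden, layer, pre_gate_w, fallback, k):
    """Predict layer+1's experts from the hidden state at `layer`.

    pre_gate_w: dict of layer index -> weight matrix (assumed layout);
                a missing entry means the model ships no pre-gate there.
    fallback:   callable (hidden, layer, k) -> expert indices, e.g. the
                training-free Phase 1 predictor.
    """
    w = pre_gate_w.get(layer)
    if w is not None:
        return top_k(matvec(w, hidden), k)
    return fallback(hidden, layer, k)
```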

## Acceptance

- [ ] On 35B-A3B-MTP at 12 GB VRAM, decode throughput (tokens/s) improves by ≥ 1.3× over the current SLRU-only path at 8K context.
- [ ] Prediction accuracy is logged; target for the PreScope-style predictor is ≥ 70 % top-K agreement with the actual router selection.
- [ ] No regression when MoE expert set fits entirely in cache (Qwen3-30B-A3B on 24 GB).
- [ ] Compatible with `SHARPI_CPU_MOE=1` fallback path.
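
The top-K agreement target above needs a concrete definition. One plausible choice, assumed here and not fixed by this issue, is the fraction of actually routed experts that the predictor also named, averaged over layers and tokens. `topk_agreement` and `AgreementTracker` are hypothetical names.

```python
def topk_agreement(predicted, actual):
    """Fraction of actually selected experts that were also predicted."""
    actual_set = set(actual)
    if not actual_set:
        return 0.0
    return len(set(predicted) & actual_set) / len(actual_set)


class AgreementTracker:
    """Running mean of per-layer agreement, suitable for trace logging."""

    def __init__(self):
        self._total = 0.0
        self._count = 0

    def record(self, predicted, actual):
        self._total += topk_agreement(predicted, actual)
        self._count += 1

    def mean(self):
        return self._total / self._count if self._count else 0.0
```

Under this definition a tracker whose `mean()` stays at or above 0.70 would satisfy the target.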

## References

- Pre-gated MoE (ISCA 2024): https://www.microsoft.com/en-us/research/wp-content/uploads/2024/05/isca24_pregated_moe_camera_ready.pdf
- PreScope (arXiv:2509.23638): https://arxiv.org/html/2509.23638v1
- Survey of MoE caching/prefetch: arXiv:2511.05814

## Related

- #46 (full-GPU MoE for MTP head — pre-requisite for the on-GPU prefetch path)
- #45 (MoE batched verify — independent; both lift the MoE MTP decode)
- `Pipeline` async prefetcher; `_expertSlotManager` in `CudaHybridGdnForwardPass`

