Background
The SLRU expert cache + async prefetcher in Pipeline is reactive: a cache miss triggers a fetch, and the decode thread stalls on the NVMe / pinned-RAM read for the missing expert. On 12 GB cards running Mixtral / Qwen3-30B-A3B / Qwen3.6-35B-A3B-MTP, the SLRU can hold only a fraction of the routed-expert set, so miss-driven stalls dominate the MoE FFN tail at long context.
Two lines of recent work hide this stall by predicting which experts the next MoE layer will activate, so the prefetcher can start the transfer during the current layer's compute (a misprediction still falls back to the reactive path):
- Pre-gated MoE (Microsoft, ISCA 2024) — adds a tiny pre-gate head trained jointly with the model. The N-th pre-gate predicts layer (N+1)'s expert selection. Requires fine-tuning the model.
- PreScope (arXiv:2509.23638, Sep 2025) — training-free auxiliary predictor running alongside inference. No model modification.
Both fit cleanly into our existing infra: predictor output → _expertSlotManager reservation hint → Pipeline async prefetch.
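The hint flow above can be pictured with a toy cache in Python. This is a minimal sketch, not the project's code: `HintedExpertCache`, `hint`, `get`, and the `load` callback are hypothetical names, a plain LRU stands in for the SLRU, and the read is synchronous where the real prefetcher would be asynchronous. It shows one property only: a hint stages data early, while eviction waits until the real gate asks for the expert.

```python
from collections import OrderedDict


class HintedExpertCache:
    """Toy expert cache: hints stage data early but never evict residents."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()  # (layer, expert) -> weights, in LRU order
        self.staged = {}               # prefetched, not yet admitted to the cache

    def hint(self, key, load):
        """Predictor output: start the read early, leave resident entries alone."""
        if key not in self.resident and key not in self.staged:
            self.staged[key] = load(key)

    def get(self, key, load):
        """Called when the real gate fires. Returns (weights, stalled)."""
        if key in self.resident:
            self.resident.move_to_end(key)
            return self.resident[key], False
        stalled = key not in self.staged
        weights = load(key) if stalled else self.staged.pop(key)
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)  # eviction happens only here
        self.resident[key] = weights
        return weights, stalled
```

In this sketch a correct hint turns a would-be stall into a staged hit, and a wrong hint costs only the staging space, since nothing resident was evicted to make room for it.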
Scope
Phase 1 — PreScope-style predictor (training-free, no fine-tune required):
- Add a lightweight predictor module per MoE layer. Inputs: hidden state at layer N. Output: top-K expert indices for layer N+1.
- Plumb predictions into the SLRU expert cache as prefetch hints (don't evict anything to make room until the actual gating fires; just start the read early).
- Track prediction accuracy and prefetch hit rate via SHARPI_TRACE_MOE=1.
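The Phase 1 items can be sketched as follows. The sketch assumes one simple training-free heuristic: score layer N's hidden state with layer N+1's router weights and take the top-K. That heuristic is an illustrative stand-in and may differ from what PreScope actually does. `predict_next_experts`, `PredictionStats`, and `next_router_w` are hypothetical names, and plain Python lists stand in for tensors.

```python
def top_k(scores, k):
    """Indices of the k largest scores, highest first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def predict_next_experts(hidden, next_router_w, k):
    """Guess layer N+1's experts from layer N's hidden state.

    Assumed heuristic: reuse layer N+1's router rows as the scoring matrix.
    """
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in next_router_w]
    return top_k(scores, k)


class PredictionStats:
    """Counts how many actually routed experts were predicted in advance."""

    def __init__(self):
        self.routed = 0   # experts the real gate selected
        self.covered = 0  # of those, how many the predictor had named

    def record(self, predicted, actual):
        self.routed += len(actual)
        self.covered += len(set(predicted) & set(actual))

    @property
    def recall(self):
        return self.covered / self.routed if self.routed else 0.0
```

Recall is the useful number here because every routed expert the predictor missed is a potential stall; these counters are the kind of figure that could be reported when SHARPI_TRACE_MOE=1 is set.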
Phase 2 — Pre-gated head (training-required, larger lift):
- Add support in ModelGraph for an optional pre_gate_w[layer] weight tensor.
- Train pre-gate heads for Qwen3.6-35B-A3B-MTP and Mixtral 8x7B (or use community-published ones if available).
- Skip the predictor entirely when the model ships pre-gate weights.
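The "skip the predictor when pre-gate weights exist" rule amounts to a per-layer selection, sketched below. Everything here is hypothetical: `make_predictor` and `linear_top_k` are illustrative names, a dict keyed by layer stands in for the optional pre_gate_w[layer] tensor, and a pre-gate head is modeled as a single linear layer followed by top-K.

```python
def linear_top_k(weights, hidden, k):
    """Apply a linear scoring head and return the k best expert indices."""
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def make_predictor(layer, pre_gate_w, fallback):
    """Pick the expert predictor for one layer.

    pre_gate_w: dict mapping layer index -> weight matrix (assumed layout).
    fallback:   the training-free predictor, called as fallback(hidden, k).
    """
    weights = pre_gate_w.get(layer)
    if weights is None:
        return fallback  # model ships no pre-gate head for this layer
    return lambda hidden, k: linear_top_k(weights, hidden, k)
```

Resolving the choice once per layer at load time keeps the decode loop free of per-token branching on whether a pre-gate tensor exists.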
Acceptance
- SHARPI_CPU_MOE=1 fallback path.
References
Related
- Pipeline async prefetcher; _expertSlotManager in CudaHybridGdnForwardPass