Predictive expert prefetch for MoE (pre-gated routing / PreScope-style) #50

@pekkah

Background

The SLRU expert cache + async prefetcher in Pipeline is reactive: a cache miss triggers a fetch, and the decode thread stalls on the NVMe / pinned-RAM read for the missing expert. On 12 GB cards running Mixtral / Qwen3-30B-A3B / Qwen3.6-35B-A3B-MTP, the SLRU can hold only a fraction of the routed-expert set, so miss-driven stalls dominate the MoE FFN tail at long context.

Two lines of recent work eliminate this stall by predicting which experts the next MoE layer will activate, so the prefetcher can start the transfer during the current layer's compute:

  1. Pre-gated MoE (Microsoft, ISCA 2024) — adds a tiny pre-gate head trained jointly with the model. The N-th pre-gate predicts layer (N+1)'s expert selection. Requires fine-tuning the model.
  2. PreScope (arXiv:2509.23638, Sep 2025) — training-free auxiliary predictor running alongside inference. No model modification.

Both fit cleanly into our existing infra: predictor output → _expertSlotManager reservation hint → Pipeline async prefetch.
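
To make the expected benefit concrete, here is a rough back-of-the-envelope model of the per-layer stall. It is a sketch under simplifying assumptions, not a measurement: a correctly predicted miss overlaps its fetch with exactly one layer of compute, a mispredicted miss pays the full fetch as in the reactive path, and all numbers in the usage lines are hypothetical.

```python
def expected_stall_ms(miss_rate, fetch_ms, compute_ms, accuracy=0.0):
    """Expected per-layer decode stall under a simplified overlap model.

    miss_rate  : fraction of expert lookups that miss the cache
    fetch_ms   : time to read one missing expert (NVMe or pinned RAM)
    compute_ms : compute time of the current layer, available for overlap
    accuracy   : fraction of misses the predictor anticipated (0.0 = reactive)
    """
    # Part of the fetch that is still exposed after overlapping with compute.
    exposed_ms = max(0.0, fetch_ms - compute_ms)
    per_miss = accuracy * exposed_ms + (1.0 - accuracy) * fetch_ms
    return miss_rate * per_miss


# Hypothetical numbers, chosen only to illustrate the shape of the trade-off.
reactive = expected_stall_ms(miss_rate=0.4, fetch_ms=2.0, compute_ms=1.5)
predictive = expected_stall_ms(miss_rate=0.4, fetch_ms=2.0, compute_ms=1.5, accuracy=0.7)
```

In this model the stall only disappears entirely when the fetch fits inside one layer's compute and the predictor is always right; otherwise the remaining stall is split between the exposed tail of each fetch and the mispredicted misses.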

Scope

Phase 1 — PreScope-style predictor (training-free, no fine-tune required):

  1. Add a lightweight predictor module per MoE layer. Inputs: hidden state at layer N. Output: top-K expert indices for layer N+1.
  2. Plumb predictions into the SLRU expert cache as prefetch hints (don't evict anything to make room until the actual gating fires; just start the read early).
  3. Track prediction accuracy and prefetch hit rate via SHARPI_TRACE_MOE=1.
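
The three Phase 1 steps could be sketched roughly as follows. This is illustrative Python, not the project's code: `predict_next_experts` and `PrefetchHints` are hypothetical names, and scoring layer N's hidden state against layer N+1's router weights is just one simple training-free heuristic, not necessarily the predictor PreScope uses. The sketch models only the bookkeeping (what gets started early, what still blocks, which prefetches were useful); actual reads and cache promotion are left out.

```python
import heapq


def predict_next_experts(hidden, next_gate_w, k):
    """Score every expert of layer N+1 against layer N's hidden state
    and return the indices of the k highest-scoring experts."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in next_gate_w]
    return heapq.nlargest(k, range(len(logits)), key=logits.__getitem__)


class PrefetchHints:
    """Tracks reads started early. Never evicts: residency is left alone
    until the real gating result arrives."""

    def __init__(self, resident, max_in_flight):
        self.resident = set(resident)   # (layer, expert) pairs already cached
        self.max_in_flight = max_in_flight
        self.in_flight = []             # (layer, expert) reads started early
        self.useful = 0                 # prefetched and actually routed
        self.wasted = 0                 # prefetched but not routed

    def hint(self, layer, experts):
        """Start reads for predicted experts that are not resident."""
        started = []
        for e in experts:
            key = (layer, e)
            if key in self.resident or key in self.in_flight:
                continue
            if len(self.in_flight) >= self.max_in_flight:
                break                   # bounded staging; drop remaining hints
            self.in_flight.append(key)
            started.append(e)
        return started

    def on_gating(self, layer, actual):
        """Real router fired: update counters and return the experts that
        still need a blocking fetch."""
        prefetched = {e for (l, e) in self.in_flight if l == layer}
        self.in_flight = [(l, e) for (l, e) in self.in_flight if l != layer]
        routed = set(actual)
        self.useful += len(prefetched & routed)
        self.wasted += len(prefetched - routed)
        return [e for e in actual
                if (layer, e) not in self.resident and e not in prefetched]


# Toy example: 4 experts, 2-dimensional hidden state, top-2 routing.
gate_w = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
predicted = predict_next_experts([0.9, 0.2], gate_w, k=2)

hints = PrefetchHints(resident={(1, 0)}, max_in_flight=4)
started = hints.hint(1, predicted)
blocking = hints.on_gating(1, [2, 1])
```

The `useful` and `wasted` counters are the kind of values a trace mode could report as prefetch hit rate; how they would be surfaced through `SHARPI_TRACE_MOE=1` is not specified here.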

Phase 2 — Pre-gated head (training-required, larger lift):

  1. Add support in ModelGraph for an optional pre_gate_w[layer] weight tensor.
  2. Train pre-gate heads for Qwen3.6-35B-A3B-MTP and Mixtral 8x7B (or use community-published ones if available).
  3. Skip the predictor entirely when the model ships pre-gate weights.
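
Step 3 amounts to a per-layer choice made at load time. A minimal sketch, assuming the loaded checkpoint is exposed as a name-to-tensor mapping and that pre-gate tensors are keyed exactly as `pre_gate_w[layer]`; both the mapping and the key format are assumptions for illustration, and `choose_predictor` is a hypothetical helper:

```python
def pre_gate_key(layer):
    """Hypothetical tensor name for the optional pre-gate weight of a layer."""
    return f"pre_gate_w[{layer}]"


def choose_predictor(tensors, layer):
    """Return "pre_gate" when the checkpoint ships a pre-gate tensor for
    this layer, otherwise fall back to the auxiliary predictor."""
    return "pre_gate" if pre_gate_key(layer) in tensors else "auxiliary"


# Toy checkpoint: only layer 3 ships a pre-gate weight.
tensors = {"pre_gate_w[3]": [[0.1, 0.2], [0.3, 0.4]]}
choices = [choose_predictor(tensors, layer) for layer in (2, 3)]
```

Deciding per layer rather than per model keeps partially converted checkpoints usable, at the cost of running two prediction paths side by side.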

Acceptance

  • On Qwen3.6-35B-A3B-MTP at 12 GB VRAM, decode throughput (tokens/s) improves by ≥ 1.3× over the current SLRU-only path at 8K context.
  • Prediction accuracy is logged; the PreScope-style predictor targets ≥ 70 % top-K agreement with the experts the router actually selects.
  • No regression when MoE expert set fits entirely in cache (Qwen3-30B-A3B on 24 GB).
  • Compatible with SHARPI_CPU_MOE=1 fallback path.
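
The first two criteria can be checked mechanically from trace data. The issue does not pin down how top-K agreement is computed, so the sketch below assumes one plausible definition: for each (token, layer) sample, the fraction of actually routed experts that were also predicted, averaged over samples. `topk_agreement` and `meets_acceptance` are hypothetical helpers and the sample values are made up.

```python
def topk_agreement(predicted, actual):
    """Mean per-sample overlap between predicted and actually routed experts.

    predicted, actual: equal-length lists of expert-index lists,
    one entry per (token, layer) sample.
    """
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need equal-length, non-empty sample lists")
    total = 0.0
    for p, a in zip(predicted, actual):
        if not a:
            raise ValueError("each sample needs at least one routed expert")
        total += len(set(p) & set(a)) / len(a)
    return total / len(actual)


def meets_acceptance(baseline_tps, new_tps, agreement,
                     min_speedup=1.3, min_agreement=0.70):
    """Check the speedup and agreement thresholds from the list above."""
    return new_tps / baseline_tps >= min_speedup and agreement >= min_agreement


# Made-up samples: full match, half match, no match.
agreement = topk_agreement([[0, 2], [1, 3], [4, 5]],
                           [[0, 2], [1, 5], [6, 7]])
passed = meets_acceptance(baseline_tps=10.0, new_tps=13.5, agreement=0.75)
```

The no-regression and `SHARPI_CPU_MOE=1` criteria need separate benchmark runs and are not covered by this check.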
