
Activation-aware expert caching: feed ExpertAccessProfiler into SLRU eviction + warm-pin hot experts #74

@pekkah

Background

Our expert cache evicts by recency only (SlruCache probationary/protected segments). ExpertAccessProfiler tracks per-(layer, expert) hit/miss counts and exposes GetTopExperts, but today it is diagnostic-only: CudaHybridGdnForwardPass prints its stats on dispose and nothing else consumes it.

State-of-the-art offloading systems (MoE-Infinity, HybriMoE) show that activation-frequency-aware caching beats pure recency for MoE, because routing is skewed: a minority of experts stay hot across the whole sequence, and recency churn can evict a hot expert just before it is used again.

We already built the profiler. Two cheap ways to actually use it:

  1. Warm-pin hot experts at load. Profile the first N decode tokens (or load an offline profile), then pin the top-K experts per layer into the protected SLRU segment so they never get evicted. This is the KTransformers / MoE-Infinity "hot experts on GPU" idea, using infra we have.
  2. Frequency-biased eviction. When choosing a probationary victim, break ties (or weight the decision) by ExpertAccessProfiler frequency so high-frequency experts resist eviction even if not most-recently-used (LFU/LRU hybrid).
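
To make option 2 concrete, here is a minimal, language-neutral sketch in Python (not the real SlruCache). It assumes the frequency hint is a plain callable from key to access count, and that eviction compares only a small window of the least-recently-used probationary entries. `SlruSketch`, `freq_hint`, and `window` are hypothetical names for illustration.

```python
from collections import OrderedDict
from itertools import islice
from typing import Callable, Hashable, Optional


class SlruSketch:
    """Two-segment LRU with an optional frequency hint for victim selection."""

    def __init__(self, probation_cap: int, protected_cap: int,
                 freq_hint: Optional[Callable[[Hashable], int]] = None,
                 window: int = 4):
        self.probation = OrderedDict()   # oldest entry first
        self.protected = OrderedDict()   # oldest entry first
        self.probation_cap = probation_cap
        self.protected_cap = protected_cap
        self.freq_hint = freq_hint       # assumption: key -> access count
        self.window = window             # how many LRU candidates to compare

    def _pick_victim(self):
        candidates = list(islice(self.probation, self.window))
        if self.freq_hint is None:
            return candidates[0]         # pure recency: evict the oldest
        # min() returns the first minimum, so ties fall back to the oldest key.
        return min(candidates, key=self.freq_hint)

    def _admit(self, key):
        if len(self.probation) >= self.probation_cap:
            del self.probation[self._pick_victim()]
        self.probation[key] = True

    def access(self, key) -> bool:
        """Touch a key. Returns True on a hit; on a miss the key is admitted."""
        if key in self.protected:
            self.protected.move_to_end(key)
            return True
        if key in self.probation:
            # A second touch promotes the entry to the protected segment.
            del self.probation[key]
            self.protected[key] = True
            if len(self.protected) > self.protected_cap:
                demoted, _ = self.protected.popitem(last=False)
                self._admit(demoted)
            return True
        self._admit(key)
        return False


freq = {"hot": 100, "a": 1, "b": 1}
cache = SlruSketch(probation_cap=2, protected_cap=1,
                   freq_hint=lambda k: freq.get(k, 0), window=2)
for key in ["hot", "a", "b"]:
    cache.access(key)
print(list(cache.probation))  # "a" was evicted instead of the older but hotter "hot"
```

Restricting the comparison to a small window keeps the policy close to LRU: an entry still has to age toward the tail before frequency can save it, so a stale profile cannot keep cold experts resident forever.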

Scope

  1. Add an optional warm-pin pass: after the first N decode tokens, query GetTopExperts(layer, K) and Preload + mark-protected the top K experts per MoE layer. Gate behind a flag (SHARPI_MOE_WARMPIN=1, or enable automatically when the cache holds fewer entries than there are experts).
  2. Extend SlruCache eviction to consult a frequency hint (optional Func<TKey,long> accessor) so eviction is recency+frequency, not pure recency.
  3. Expose effectiveness via the existing profiler dump (hit rate before/after).
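
The warm-pin selection and its gating could look roughly like the Python sketch below. `ProfilerSketch` is a hypothetical stand-in for ExpertAccessProfiler (only the top-experts query is modeled), and `warm_pin_enabled` / `build_warm_pin_set` are illustrative names. The flag name and the "auto when cache < expert count" rule come from the scope item above; everything else is an assumption.

```python
import os
from collections import Counter, defaultdict


class ProfilerSketch:
    """Hypothetical stand-in for ExpertAccessProfiler: per-layer access counts."""

    def __init__(self):
        self.counts = defaultdict(Counter)   # layer -> Counter(expert -> accesses)

    def record(self, layer, expert):
        self.counts[layer][expert] += 1

    def get_top_experts(self, layer, n):
        return [expert for expert, _ in self.counts[layer].most_common(n)]


def warm_pin_enabled(cache_slots, expert_count, env=os.environ):
    # Explicit opt-in via the flag, or automatic when the cache cannot hold every expert.
    return env.get("SHARPI_MOE_WARMPIN") == "1" or cache_slots < expert_count


def build_warm_pin_set(profiler, layers, per_layer, cache_slots, expert_count,
                       env=os.environ):
    """Return the (layer, expert) keys to preload and mark protected."""
    if not warm_pin_enabled(cache_slots, expert_count, env):
        return set()
    pins = set()
    for layer in layers:
        for expert in profiler.get_top_experts(layer, per_layer):
            pins.add((layer, expert))
    return pins


profiler = ProfilerSketch()
for expert in [3, 3, 3, 1, 1, 7]:
    profiler.record(0, expert)
for expert in [2, 2, 5]:
    profiler.record(1, expert)

pins = build_warm_pin_set(profiler, layers=[0, 1], per_layer=1,
                          cache_slots=4, expert_count=8, env={})
print(sorted(pins))
```

The sketch does not cap the pin set. A real pass would probably need to keep the number of pinned experts below the protected-segment capacity so the unpinned working set still has room.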

Acceptance

  • On a skewed-routing workload (e.g. multilingual prompt on Mixtral, or long-context qwen35moe), overall expert cache hit rate improves measurably vs recency-only SLRU at the same cache size.
  • No regression when the full expert set fits in cache.
  • SHARPI_TRACE_MOE/profiler dump shows the warm-pin set and post-warm hit rate.
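
One way to check the first two criteria offline is to replay a recorded access trace through both policies at the same cache size. The Python sketch below is a simplified model, not our cache: it uses a single-segment LRU, treats pinned keys as preloaded (so their first access counts as a hit), and runs on a synthetic trace rather than real routing data. `replay_hit_rate` is a hypothetical helper.

```python
from collections import OrderedDict


def replay_hit_rate(trace, capacity, pinned=frozenset()):
    """Replay accesses through an LRU cache; pinned keys are preloaded and never evicted."""
    pinned = set(pinned)
    if len(pinned) > capacity:
        raise ValueError("pinned set does not fit in the cache")
    lru = OrderedDict()                      # unpinned residents, oldest first
    free_slots = capacity - len(pinned)      # slots left for unpinned keys
    hits = 0
    for key in trace:
        if key in pinned:
            hits += 1
        elif key in lru:
            hits += 1
            lru.move_to_end(key)
        elif free_slots > 0:
            if len(lru) >= free_slots:
                lru.popitem(last=False)      # evict the least recently used key
            lru[key] = True
    return hits / len(trace) if trace else 0.0


# Synthetic skewed trace for layer 0: expert 0 is hot, every other expert is touched once.
trace = []
for i in range(100):
    trace += [(0, 0), (0, 2 * i + 1), (0, 2 * i + 2)]

baseline = replay_hit_rate(trace, capacity=2)
warm_pinned = replay_hit_rate(trace, capacity=2, pinned={(0, 0)})
print(f"recency-only: {baseline:.3f}  warm-pinned: {warm_pinned:.3f}")
```

On this deliberately adversarial trace, recency-only never hits, because two cold experts always push the hot one out before it recurs. Pinning expert 0 turns every third access into a hit. Real traces will be less extreme, so the number to report is the measured delta on the workloads named above.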
