Background
Our expert cache evicts by recency only (SlruCache probationary/protected segments). ExpertAccessProfiler tracks per-(layer, expert) hit/miss counts and GetTopExperts, but today it's diagnostic-only — CudaHybridGdnForwardPass prints its stats on dispose and nothing else consumes it.
The SOTA offloading systems (MoE-Infinity, HybriMoE) show that activation-frequency-aware caching beats pure recency for MoE, because routing is skewed: a minority of experts are hot across the whole sequence, and recency churn can evict a hot expert that's about to be used again.
We already built the profiler. Two cheap ways to actually use it:
- Warm-pin hot experts at load. Profile the first N decode tokens (or load an offline profile), then pin the top-K experts per layer into the protected SLRU segment so they never get evicted. This is the KTransformers / MoE-Infinity "hot experts on GPU" idea, using infra we have.
- Frequency-biased eviction. When choosing a probationary victim, break ties (or weight the decision) by ExpertAccessProfiler frequency so that high-frequency experts resist eviction even when they are not the most recently used (an LFU/LRU hybrid).
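As a rough illustration of the frequency-biased option, the sketch below picks a victim from the few least-recently-used probationary keys, choosing the one with the lowest access count. It is not the real SlruCache code: `FrequencyBiasedEviction`, `PickVictim`, and the `window` parameter are hypothetical names introduced here, and the only thing taken from this issue is the idea of an optional `Func<TKey, long>` frequency hint.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical helper, not the real SlruCache: recency narrows the candidates,
// frequency decides among them.
public static class FrequencyBiasedEviction
{
    // lruOrder:  probationary keys, least-recently-used first.
    // window:    how many of the coldest-by-recency keys to consider (assumed tunable).
    // frequency: optional hint, e.g. backed by profiler access counts.
    public static TKey PickVictim<TKey>(
        IReadOnlyList<TKey> lruOrder, int window, Func<TKey, long>? frequency)
    {
        if (lruOrder.Count == 0)
            throw new InvalidOperationException("No eviction candidates.");

        // Without a hint (or with a window of one) this is plain LRU.
        if (frequency is null || window <= 1)
            return lruOrder[0];

        int limit = Math.Min(window, lruOrder.Count);
        TKey victim = lruOrder[0];
        long lowest = frequency(victim);

        // Strict '<' keeps the least-recently-used key when frequencies tie.
        for (int i = 1; i < limit; i++)
        {
            long f = frequency(lruOrder[i]);
            if (f < lowest)
            {
                lowest = f;
                victim = lruOrder[i];
            }
        }
        return victim;
    }
}
```

For example, with recency order (0,7), (0,2), (0,5) and access counts 40, 3, and 12, plain LRU would evict (0,7), while this sketch evicts (0,2) when `window` is 3. Keeping the window small bounds how far the choice can drift from recency order.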
Scope
- Add an optional warm-pin pass: after K tokens, query GetTopExperts(layer, n) and Preload + mark-protected the top experts per MoE layer. Gate behind a flag (SHARPI_MOE_WARMPIN=1, or automatically when the cache holds fewer entries than there are experts).
- Extend SlruCache eviction to consult a frequency hint (an optional Func<TKey,long> accessor) so eviction is recency+frequency, not pure recency.
- Expose effectiveness via the existing profiler dump (hit rate before/after).
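The warm-pin pass could be planned roughly as follows. This is a sketch under assumptions, not the existing API: `WarmPinPlanner`, `IsEnabled`, and `Plan` are hypothetical, the input dictionary stands in for whatever ExpertAccessProfiler exposes, and the access count is assumed to be hits plus misses. Only the SHARPI_MOE_WARMPIN=1 gate and the "cache smaller than expert count" auto rule come from the scope above.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical warm-pin planner; the real profiler and cache APIs may differ.
public static class WarmPinPlanner
{
    // Gate: explicit opt-in via SHARPI_MOE_WARMPIN=1, or automatic when the
    // cache cannot hold every expert.
    public static bool IsEnabled(string? flagValue, int cacheSlots, int expertCount)
        => flagValue == "1" || cacheSlots < expertCount;

    // accesses: per-(layer, expert) access counts gathered over the warm-up tokens.
    // Returns, for each layer, up to perLayer expert ids ordered hottest first.
    public static Dictionary<int, List<int>> Plan(
        IReadOnlyDictionary<(int Layer, int Expert), long> accesses, int perLayer)
    {
        return accesses
            .GroupBy(kv => kv.Key.Layer)
            .ToDictionary(
                g => g.Key,
                g => g.OrderByDescending(kv => kv.Value)
                      .ThenBy(kv => kv.Key.Expert)   // deterministic tie-break
                      .Take(perLayer)
                      .Select(kv => kv.Key.Expert)
                      .ToList());
    }
}
```

A caller would read the flag with `Environment.GetEnvironmentVariable("SHARPI_MOE_WARMPIN")`, build the plan once the warm-up tokens have been decoded, then preload and protect each listed expert. With counts of 50, 9, and 31 for layer 0 experts 1, 4, and 6, and `perLayer` set to 2, the plan for layer 0 is experts 1 and 6.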
Acceptance
- SHARPI_TRACE_MOE/profiler dump shows the warm-pin set and post-warm hit rate.
Related
- ExpertAccessProfiler, SlruCache, ExpertCache in SharpInference.Pipeline.
- docs/moe-expert-offloading-research.md §3 Axis B, §5 (P1).