Activation-aware expert caching: feed ExpertAccessProfiler into SLRU eviction + warm-pin hot experts #74

## Background

Our expert cache evicts by **recency only** (the probationary/protected segments of `SlruCache`). `ExpertAccessProfiler` tracks per-`(layer, expert)` hit/miss counts and exposes `GetTopExperts`, but today it is **diagnostic-only**: `CudaHybridGdnForwardPass` prints its stats on dispose and nothing else consumes it.

State-of-the-art offloading systems (MoE-Infinity, HybriMoE) show that **activation-frequency-aware** caching beats pure recency for MoE, because routing is skewed: a minority of experts stay hot across the whole sequence, and recency churn can evict a hot expert just before it is used again.

We already built the profiler. Two cheap ways to actually use it:

1. **Warm-pin hot experts at load.** Profile the first N decode tokens (or load an offline profile), then pin the top-K experts per layer into the protected SLRU segment so they never get evicted. This is the KTransformers / MoE-Infinity "hot experts on GPU" idea, using infra we have.
2. **Frequency-biased eviction.** When choosing a probationary victim, break ties (or weight the decision) by `ExpertAccessProfiler` frequency so high-frequency experts resist eviction even if not most-recently-used (LFU/LRU hybrid).
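
The warm-pin selection in option 1 can be sketched as below. This is a minimal illustration, not the existing profiler API: `AccessCounts`, `TopExperts`, and `WarmPin.SelectPinSet` are hypothetical names standing in for `ExpertAccessProfiler` and its `GetTopExperts`, whose real signature may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the profiler's per-(layer, expert) counters.
public sealed class AccessCounts
{
    private readonly Dictionary<(int Layer, int Expert), long> _counts = new();

    // Called once per expert activation during the profiling window.
    public void Record(int layer, int expert)
    {
        var key = (layer, expert);
        _counts.TryGetValue(key, out long n);   // n stays 0 when the key is new
        _counts[key] = n + 1;
    }

    // Top-k experts for one layer, most frequently activated first.
    // Ties are broken by lower expert id so the result is deterministic.
    public IReadOnlyList<int> TopExperts(int layer, int k) =>
        _counts.Where(kv => kv.Key.Layer == layer)
               .OrderByDescending(kv => kv.Value)
               .ThenBy(kv => kv.Key.Expert)
               .Take(k)
               .Select(kv => kv.Key.Expert)
               .ToList();
}

public static class WarmPin
{
    // Builds the set of (layer, expert) keys to pin once profiling is done.
    public static HashSet<(int Layer, int Expert)> SelectPinSet(
        AccessCounts counts, int layerCount, int k)
    {
        var pinned = new HashSet<(int Layer, int Expert)>();
        for (int layer = 0; layer < layerCount; layer++)
            foreach (int expert in counts.TopExperts(layer, k))
                pinned.Add((layer, expert));
        return pinned;
    }
}
```

The resulting set is what a warm-pin pass would hand to the cache for preloading. Layers with fewer than `k` observed experts simply contribute fewer entries.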

## Scope

1. Add an optional warm-pin pass: after the first N decode tokens, query `GetTopExperts(layer, n)` with n = K, then `Preload` the returned experts and mark them protected, for each MoE layer. Gate it behind a flag (`SHARPI_MOE_WARMPIN=1`, or enable it automatically when the cache capacity is smaller than the expert count).
2. Extend `SlruCache` eviction to consult a frequency hint (optional `Func<TKey,long>` accessor) so eviction is recency+frequency, not pure recency.
3. Expose effectiveness via the existing profiler dump (hit rate before and after the warm-pin pass).
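
One possible shape for the frequency hint in item 2 is sketched below. `FreqAwareSlru` is a hypothetical, keys-only toy, not the real `SlruCache`; the `scanWidth` parameter and the "lowest frequency among the few least-recently-used probationary entries" rule are assumptions made for illustration. With no hint supplied it falls back to plain LRU eviction from the probationary segment.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical minimal SLRU (keys only, no payloads) with an optional frequency hint.
public sealed class FreqAwareSlru<TKey> where TKey : notnull
{
    private readonly LinkedList<TKey> _probation = new();   // First = MRU, Last = LRU
    private readonly LinkedList<TKey> _protected = new();
    private readonly Dictionary<TKey, (LinkedListNode<TKey> Node, bool Protected)> _index = new();
    private readonly int _probationCap, _protectedCap, _scanWidth;
    private readonly Func<TKey, long>? _frequency;

    public FreqAwareSlru(int probationCap, int protectedCap,
                         Func<TKey, long>? frequency = null, int scanWidth = 4)
    {
        if (probationCap < 1 || protectedCap < 1 || scanWidth < 1)
            throw new ArgumentOutOfRangeException(nameof(probationCap), "capacities and scan width must be >= 1");
        (_probationCap, _protectedCap, _scanWidth, _frequency) =
            (probationCap, protectedCap, scanWidth, frequency);
    }

    public bool Contains(TKey key) => _index.ContainsKey(key);

    // Returns true on a hit. Any hit moves the key to the MRU end of the protected segment.
    public bool Access(TKey key)
    {
        if (_index.TryGetValue(key, out var entry))
        {
            (entry.Protected ? _protected : _probation).Remove(entry.Node);
            _protected.AddFirst(entry.Node);
            _index[key] = (entry.Node, true);
            if (_protected.Count > _protectedCap) DemoteProtectedLru();
            return true;
        }
        if (_probation.Count >= _probationCap) EvictFromProbation();
        _index[key] = (_probation.AddFirst(key), false);
        return false;
    }

    // Protected overflow: the protected LRU entry drops back to probation.
    private void DemoteProtectedLru()
    {
        var node = _protected.Last!;
        _protected.RemoveLast();
        if (_probation.Count >= _probationCap) EvictFromProbation();
        _probation.AddFirst(node);
        _index[node.Value] = (node, false);
    }

    // Victim = lowest-frequency key among the scanWidth least-recently-used
    // probationary entries; ties keep the older entry. No hint means pure LRU.
    private void EvictFromProbation()
    {
        var victim = _probation.Last!;
        if (_frequency != null)
        {
            long lowest = _frequency(victim.Value);
            var node = victim.Previous;
            for (int i = 1; i < _scanWidth && node != null; i++, node = node.Previous)
            {
                long f = _frequency(node.Value);
                if (f < lowest) { lowest = f; victim = node; }
            }
        }
        _probation.Remove(victim);
        _index.Remove(victim.Value);
    }
}
```

Bounding the scan keeps eviction cost constant and prevents a stale but historically hot key from sitting in probation forever; whether a bounded scan, a weighted score, or a pure tie-break fits `SlruCache` best is an open design choice.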

## Acceptance

- [ ] On a skewed-routing workload (e.g. multilingual prompt on Mixtral, or long-context qwen35moe), overall expert cache hit rate improves measurably vs recency-only SLRU at the same cache size.
- [ ] No regression when the full expert set fits in cache.
- [ ] `SHARPI_TRACE_MOE`/profiler dump shows the warm-pin set and post-warm hit rate.
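
To make the before/after comparison concrete, the dump could split its counters at the moment the warm-pin pass runs. The sketch below assumes a hypothetical `HitRateWindow` helper and an invented dump format; neither exists in the current profiler.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: tracks cache hit rate separately before and after warm-pin.
public sealed class HitRateWindow
{
    private long _preHits, _preTotal, _postHits, _postTotal;
    private bool _warm;

    // Call once, right after the warm-pin pass has pinned its experts.
    public void MarkWarm() => _warm = true;

    // Call for every expert lookup with the cache outcome.
    public void Record(bool hit)
    {
        if (_warm) { _postTotal++; if (hit) _postHits++; }
        else       { _preTotal++;  if (hit) _preHits++; }
    }

    // Rates are reported as 0 when no lookups fell into the window.
    public double PreWarmHitRate  => _preTotal  == 0 ? 0.0 : (double)_preHits  / _preTotal;
    public double PostWarmHitRate => _postTotal == 0 ? 0.0 : (double)_postHits / _postTotal;

    // Illustrative dump line listing the pinned set and both hit rates.
    public string Dump(IEnumerable<(int Layer, int Expert)> pinSet)
    {
        string pins = string.Join(",", pinSet.OrderBy(p => p.Layer).ThenBy(p => p.Expert)
                                             .Select(p => $"L{p.Layer}:E{p.Expert}"));
        return FormattableString.Invariant(
            $"warm-pin=[{pins}] pre-warm hit rate={PreWarmHitRate:F3} post-warm hit rate={PostWarmHitRate:F3}");
    }
}
```

Comparing the post-warm rate against a recency-only run at the same cache size is what the first acceptance item asks for; the pre-warm rate alone is not a fair baseline because the cache is still filling.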

## Related

- `ExpertAccessProfiler`, `SlruCache`, `ExpertCache` in `SharpInference.Pipeline`.
- #50 (predictive prefetch — complementary: prediction handles cold misses, hotness handles steady-state residency).
- `docs/moe-expert-offloading-research.md` §3 Axis B, §5 (P1).
