Server: surface SHARPI_CPU_MOE as a SharpInferenceServerOptions field (mirror #80) #93

@pekkah

Background

After the server refactor (PR landing this session), most CLI MoE-cache knobs reach
the engine as options that InferenceEngineLoader.ApplyMoeEnvironment translates into
env vars: SHARPI_MOE_WARMPIN, SHARPI_MOE_WARMPIN_AFTER, SHARPI_MOE_PREDICT_PREFETCH,
SHARPI_EXPERT_STATS. One placement knob is conspicuously missing:
SHARPI_CPU_MOE, the all-or-nothing override that forces routed experts to
the CPU side.

It matters in practice: per the README perf table, Qwen3.6-35B-A3B-MTP only lands
its 22.9 t/s on the CUDA hybrid path when SHARPI_CPU_MOE=1. An operator running
the server has to know to export the env var before dotnet run, which defeats the
purpose of the options surface that mirrors the CLI for everything else.

Cause

SharpInferenceServerOptions has no field for this; ApplyMoeEnvironment never
writes SHARPI_CPU_MOE. The engine's CudaHybridGdnForwardPass reads the env var
directly at construction time, so the option has to be translated before model load.
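
The ordering constraint can be illustrated with a minimal sketch. ForwardPassSketch is a hypothetical stand-in, not the real CudaHybridGdnForwardPass, and the assumption that the value "1" means "force routed experts to CPU" is inferred from the SHARPI_CPU_MOE=1 usage above; the engine's actual parsing is not shown in this issue.

```csharp
using System;

// Hypothetical stand-in for an engine component that samples the variable
// exactly once, in its constructor.
public sealed class ForwardPassSketch
{
    public bool ForceCpuExperts { get; }

    public ForwardPassSketch()
    {
        // Assumption: "1" means "force routed experts to the CPU side".
        ForceCpuExperts = Environment.GetEnvironmentVariable("SHARPI_CPU_MOE") == "1";
    }
}

public static class OrderingDemo
{
    public static (bool ConstructedBeforeWrite, bool ConstructedAfterWrite) Run()
    {
        // Start from a clean state: passing null deletes the variable.
        Environment.SetEnvironmentVariable("SHARPI_CPU_MOE", null);

        var before = new ForwardPassSketch();   // built before the write: misses it
        Environment.SetEnvironmentVariable("SHARPI_CPU_MOE", "1");
        var after = new ForwardPassSketch();    // built after the write: sees it

        return (before.ForceCpuExperts, after.ForceCpuExperts);
    }
}
```

An instance built before the write never observes the override, which is why the option has to be applied ahead of model load rather than at some later configuration step.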

Scope

  1. Add bool? CpuMoe { get; set; } to SharpInferenceServerOptions. It is nullable so
    that the default behaviour (the engine auto-selects from SLRU sizing) is preserved
    when unset. The XML doc cross-references the CLI's --cpu-moe (issue #80, "Add
    llama.cpp-named MoE placement flags (--cpu-moe / --n-cpu-moe) wrapping SHARPI_CPU_MOE").
  2. In InferenceEngineLoader.ApplyMoeEnvironment, when the option is set, write
    SHARPI_CPU_MOE=1 for true or SHARPI_CPU_MOE=0 for false before the model is loaded;
    when it is null, leave the variable untouched.
  3. One unit test that verifies the option round-trips into the env var ahead of
    model load (matches the existing MoE-knob coverage pattern).
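
The three steps above might look roughly like the following sketch. Only the CpuMoe property and the SHARPI_CPU_MOE name come from this issue; MoeEnvironmentSketch, ApplyCpuMoe, and RoundTrip are hypothetical names, and the real ApplyMoeEnvironment signature and the other members of SharpInferenceServerOptions are not shown here.

```csharp
#nullable enable
using System;

// Sketch of the proposed field only; the real class has other members.
public sealed class SharpInferenceServerOptions
{
    /// <summary>
    /// Mirrors the CLI's --cpu-moe (issue #80). null leaves SHARPI_CPU_MOE
    /// untouched so the engine keeps auto-selecting placement.
    /// </summary>
    public bool? CpuMoe { get; set; }
}

public static class MoeEnvironmentSketch
{
    public const string CpuMoeVariable = "SHARPI_CPU_MOE";

    // Hypothetical helper: stands in for the part of ApplyMoeEnvironment
    // that would handle this one knob.
    public static void ApplyCpuMoe(SharpInferenceServerOptions options)
    {
        if (options.CpuMoe is bool force)
        {
            Environment.SetEnvironmentVariable(CpuMoeVariable, force ? "1" : "0");
        }
        // null: write nothing, preserving whatever the operator exported.
    }

    // Round-trip helper for a unit test: clear the variable, apply the
    // option, and read back what a later model load would observe.
    public static string? RoundTrip(bool? value)
    {
        Environment.SetEnvironmentVariable(CpuMoeVariable, null);
        ApplyCpuMoe(new SharpInferenceServerOptions { CpuMoe = value });
        return Environment.GetEnvironmentVariable(CpuMoeVariable);
    }
}
```

A unit test in whatever framework the existing MoE-knob tests use could assert that RoundTrip(true) yields "1", RoundTrip(false) yields "0", and RoundTrip(null) leaves the variable unset. Because the variable is process-wide state, such a test should restore the previous value afterwards and avoid running in parallel with other tests that touch it.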

Acceptance

Related

Labels

enhancement (New feature or request)