Background
After the server refactor (PR landing this session), most CLI MoE-cache knobs reach
the engine via options → env vars in InferenceEngineLoader.ApplyMoeEnvironment:
SHARPI_MOE_WARMPIN, SHARPI_MOE_WARMPIN_AFTER, SHARPI_MOE_PREDICT_PREFETCH,
SHARPI_EXPERT_STATS. One placement knob is conspicuously missing: SHARPI_CPU_MOE,
the all-or-nothing override that forces routed experts to the CPU side.

It matters in practice: per the README perf table, Qwen3.6-35B-A3B-MTP only lands
its 22.9 t/s on the CUDA hybrid path when SHARPI_CPU_MOE=1. An operator running
the server has to know to export the env var before dotnet run, which defeats the
purpose of the options surface that mirrors the CLI for everything else.
Cause
SharpInferenceServerOptions has no field for this; ApplyMoeEnvironment never
writes SHARPI_CPU_MOE. The engine's CudaHybridGdnForwardPass reads the env var
directly at construction time, so the option has to be translated before model load.
Scope
- Add bool? CpuMoe { get; set; } to SharpInferenceServerOptions (nullable, so that the default behaviour, where the engine auto-selects from SLRU sizing, is preserved when unset). XML doc cross-references the CLI's --cpu-moe (issue #80, "Add llama.cpp-named MoE placement flags (--cpu-moe / --n-cpu-moe) wrapping SHARPI_CPU_MOE").
- InferenceEngineLoader.ApplyMoeEnvironment: when the option is set, write SHARPI_CPU_MOE=0|1 early, before model load (matches the existing MoE-knob coverage pattern).
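
The two Scope items could look roughly like the sketch below. It is a sketch under assumptions, not the actual implementation: this issue does not show the real SharpInferenceServerOptions class or the ApplyMoeEnvironment signature, so the class bodies and the method's parameter list here are hypothetical stand-ins. Only the names CpuMoe, ApplyMoeEnvironment and SHARPI_CPU_MOE come from the issue.

```csharp
using System;

// Usage: translate the option before the engine is constructed, because
// CudaHybridGdnForwardPass reads SHARPI_CPU_MOE at construction time.
var options = new SharpInferenceServerOptions { CpuMoe = true };
InferenceEngineLoader.ApplyMoeEnvironment(options);
Console.WriteLine(Environment.GetEnvironmentVariable("SHARPI_CPU_MOE")); // prints "1"

// Hypothetical, trimmed-down stand-in for the real options class.
public sealed class SharpInferenceServerOptions
{
    /// <summary>
    /// Server-side counterpart of the CLI's --cpu-moe flag (issue #80).
    /// true writes SHARPI_CPU_MOE=1, false writes SHARPI_CPU_MOE=0,
    /// null writes nothing so the engine's own selection is preserved.
    /// </summary>
    public bool? CpuMoe { get; set; }
}

// Hypothetical stand-in; the real loader also translates the other MoE knobs.
public static class InferenceEngineLoader
{
    // Assumed signature: takes the bound options and mutates process env vars.
    public static void ApplyMoeEnvironment(SharpInferenceServerOptions options)
    {
        if (options.CpuMoe is bool cpuMoe)
        {
            Environment.SetEnvironmentVariable("SHARPI_CPU_MOE", cpuMoe ? "1" : "0");
        }
        // When CpuMoe is null, any value the operator exported is left untouched.
    }
}
```

Leaving the variable alone when the option is null keeps today's behaviour for operators who already export SHARPI_CPU_MOE themselves.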
Acceptance
- SharpInference:CpuMoe=true in appsettings.json produces the same forward-pass routing as SHARPI_CPU_MOE=1.
- --cpu-moe (issue #80, "Add llama.cpp-named MoE placement flags (--cpu-moe / --n-cpu-moe) wrapping SHARPI_CPU_MOE") and the server option resolve to the same engine behaviour.
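
For reference, the first acceptance criterion corresponds to a configuration fragment along these lines. It assumes the server binds SharpInferenceServerOptions from a top-level SharpInference section, which is what the SharpInference:CpuMoe key path suggests; the issue does not show the binding code.

```json
{
  "SharpInference": {
    "CpuMoe": true
  }
}
```

With this in appsettings.json, the expectation is the same routing an operator gets today by exporting SHARPI_CPU_MOE=1 before dotnet run.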
Related
- --cpu-moe flag wrapping the same env var
- src/SharpInference.Server/InferenceEngineLoader.cs: ApplyMoeEnvironment
- src/SharpInference.Server/SharpInferenceServerOptions.cs: MoE knobs section
- CudaHybridGdnForwardPass (engine): reads SHARPI_CPU_MOE at construction