Background
Surfaced while adding the MoE CLI flags in PR #77.
SharpInference already supports forcing MoE experts onto the CPU via the SHARPI_CPU_MOE env override (read in CudaHybridGdnForwardPass), but there is no CLI flag for it. llama.cpp exposes this as first-class, well-known flags:
- --cpu-moe / -cmoe: keep all routed-expert weights on CPU.
- --n-cpu-moe N / -ncmoe N: keep the routed experts of the first N layers on CPU (llama.cpp counts up from the lowest-numbered layer).
- -ot / --override-tensor: regex tensor placement (more general; likely out of scope).
Per the project convention of matching llama.cpp arg names where an equivalent exists, we should expose --cpu-moe (and ideally --n-cpu-moe) mapping onto the existing override.
Scope
- Add [CommandOption("--cpu-moe|-cmoe")] bool CpuMoe to RunCommand.Settings; when set, force CPU MoE (set SHARPI_CPU_MOE=1 early in Execute, same plumbing pattern as the other MoE flags).
- Add --n-cpu-moe|-ncmoe <N> if/when the engine supports a per-layer CPU/GPU expert split count (today the override is all-or-nothing; scope --n-cpu-moe to whatever the engine can honor, or defer it with a clear note).
- Match llama.cpp semantics where feasible (document any deviation, e.g. layer-counting direction).
- Help text + README mention.
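A minimal sketch of the two pieces of plumbing in this scope, assuming nothing about the existing code beyond what this issue states. `CpuMoeOptions`, `Apply`, and `SelectLayers` are hypothetical names, not existing SharpInference APIs. The counting direction for --n-cpu-moe is an explicit parameter, so the chosen convention is visible and easy to document.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: names are illustrative, not taken from the SharpInference codebase.
public static class CpuMoeOptions
{
    // Env override named in this issue (read in CudaHybridGdnForwardPass).
    public const string EnvVar = "SHARPI_CPU_MOE";

    // Intended to be called early in Execute, before the forward pass is built.
    // When the flag is not passed, any value already in the environment is left as-is.
    public static void Apply(bool cpuMoe)
    {
        if (cpuMoe)
        {
            Environment.SetEnvironmentVariable(EnvVar, "1");
        }
    }

    // Returns the layer indices whose routed experts would stay on the CPU
    // for --n-cpu-moe N. N is clamped to [0, layerCount].
    public static IReadOnlyList<int> SelectLayers(int n, int layerCount, bool fromFirstLayer)
    {
        if (layerCount < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(layerCount));
        }

        int count = Math.Clamp(n, 0, layerCount);
        int start = fromFirstLayer ? 0 : layerCount - count;

        var layers = new List<int>(count);
        for (int i = 0; i < count; i++)
        {
            layers.Add(start + i);
        }
        return layers;
    }
}
```

For example, `CpuMoeOptions.SelectLayers(2, 5, fromFirstLayer: true)` yields layers 0 and 1, while `fromFirstLayer: false` yields layers 3 and 4. Until the engine can honor a per-layer split, only `Apply` would be wired up.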
Acceptance
- --cpu-moe forces the CPU MoE path on a supported model, equivalent to SHARPI_CPU_MOE=1.
- --n-cpu-moe is either implemented to llama.cpp semantics or explicitly deferred with rationale.
Related
- SHARPI_CPU_MOE handling in CudaHybridGdnForwardPass
- llama.cpp --cpu-moe / --n-cpu-moe / -ot