Add llama.cpp-named MoE placement flags (--cpu-moe / --n-cpu-moe) wrapping SHARPI_CPU_MOE #80

@pekkah

Background

Surfaced while adding the MoE CLI flags in PR #77.

SharpInference already supports forcing MoE experts onto the CPU via the SHARPI_CPU_MOE env override (read in CudaHybridGdnForwardPass), but there is no CLI flag for it. llama.cpp exposes this as first-class, well-known flags:

  • --cpu-moe / -cmoe — keep all routed-expert weights on CPU.
  • --n-cpu-moe N / -ncmoe N — keep the routed experts of the first N layers on CPU (llama.cpp counts up from layer 0, i.e. the lowest-numbered layers).
  • -ot / --override-tensor — regex tensor placement (more general; likely out of scope).
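
To make the --n-cpu-moe layer selection concrete, here is a minimal sketch of how N could map to layer indices. It assumes layers are numbered 0..layerCount-1 and that the first N are selected, which is how llama.cpp's help text describes the flag. `MoePlacementSketch` and `CpuExpertLayers` are hypothetical names for illustration, not existing SharpInference APIs.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of SharpInference.
public static class MoePlacementSketch
{
    // Returns the indices of the layers whose routed experts stay on the CPU.
    // Assumption: the first N layers (0..N-1) are selected, and N is clamped
    // to the layer count, so N >= layerCount behaves like --cpu-moe.
    public static IReadOnlyList<int> CpuExpertLayers(int layerCount, int nCpuMoe)
    {
        if (layerCount < 0) throw new ArgumentOutOfRangeException(nameof(layerCount));
        if (nCpuMoe < 0) throw new ArgumentOutOfRangeException(nameof(nCpuMoe));

        int count = Math.Min(nCpuMoe, layerCount);
        return Enumerable.Range(0, count).ToList();
    }
}
```

Under these assumptions, `CpuExpertLayers(48, 10)` selects layers 0 through 9, and any N at or above the layer count selects every layer.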

Per the project convention of matching llama.cpp arg names where an equivalent exists, we should expose --cpu-moe (and ideally --n-cpu-moe) mapping onto the existing override.

Scope

  1. Add [CommandOption("--cpu-moe|-cmoe")] bool CpuMoe to RunCommand.Settings; when set, force CPU MoE (set SHARPI_CPU_MOE=1 early in Execute, same plumbing pattern as the other MoE flags).
  2. Add --n-cpu-moe|-ncmoe <N> if/when the engine supports a per-layer CPU/GPU expert split count (today the override is all-or-nothing; scope --n-cpu-moe to whatever the engine can honor, or defer it with a clear note).
  3. Match llama.cpp semantics where feasible (document any deviation, e.g. layer-counting direction).
  4. Help text + README mention.
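
Item 1 could be wired up roughly as follows. This is a dependency-free sketch under stated assumptions: `CpuMoeFlagSketch` and `Apply` are hypothetical names, and the real change would live in `RunCommand.Settings` and `Execute` using whatever CLI library the project already uses. Only the `SHARPI_CPU_MOE` name and the value `1` come from this issue.

```csharp
using System;

// Hypothetical helper, not part of SharpInference.
public static class CpuMoeFlagSketch
{
    public const string EnvName = "SHARPI_CPU_MOE";

    // Intended to be called early in Execute, before the forward pass
    // reads the override. When the flag is absent, any value the user
    // already exported is left untouched (an assumed design choice).
    public static void Apply(bool cpuMoe)
    {
        if (cpuMoe)
        {
            // Sets the variable for the current process only.
            Environment.SetEnvironmentVariable(EnvName, "1");
        }
    }
}
```

Leaving the variable alone when the flag is not passed keeps existing `SHARPI_CPU_MOE=1` setups working unchanged. Whether the flag should also be able to override an exported value is a decision for the implementation.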

Acceptance

  • --cpu-moe forces the CPU MoE path on a supported model, equivalent to SHARPI_CPU_MOE=1.
  • Flag name/alias matches llama.cpp; behavior documented.
  • --n-cpu-moe either implemented to llama.cpp semantics or explicitly deferred with rationale.

Related

Labels: enhancement (New feature or request)
