feat(server): multimodal image input endpoint (Gemma 4 gemma4uv content blocks) #253

@pekkah

Follow-up to #250 / PR #251. The CLI supports Gemma 4 12B image input (--image); the API server does not.

Scope

Accept image content blocks and route them through the existing vision splice:

  • /v1/messages (Anthropic): image content blocks (base64 source).
  • /v1/chat/completions (OpenAI): image_url parts (data URL / base64; URL fetch optional).
  • Decode → ImagePreprocessor → GemmaUvVisionEmbedder → splice via ForwardEmbedding at the <|image|> placeholder positions (reuse the CLI's RunImagePrompt logic).
  • Multi-image per message (markers already supported in the CLI path).
  • Smoke tests in Tests.Server.
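
For reference, a minimal sketch of the two request shapes and how they could be normalized into raw image bytes before preprocessing. The block layouts follow the public Anthropic (`image` block with a base64 `source`) and OpenAI (`image_url` part with a data URL) formats; the helper names `decode_data_url` and `extract_images` are hypothetical and not part of SharpInference, and remote URL fetch is deliberately left out.

```python
import base64


def decode_data_url(url: str) -> tuple[str, bytes]:
    """Split a base64 data URL into (media_type, raw_bytes)."""
    if not url.startswith("data:"):
        # Remote URL fetch is optional in the issue; this sketch rejects it.
        raise ValueError("only data: URLs are handled in this sketch")
    header, sep, payload = url[len("data:"):].partition(",")
    if not sep or not header.endswith(";base64"):
        raise ValueError("expected a base64-encoded data URL")
    media_type = header[: -len(";base64")]
    return media_type, base64.b64decode(payload, validate=True)


def extract_images(content: list[dict]) -> list[tuple[str, bytes]]:
    """Collect (media_type, bytes) for every image part, in message order."""
    images = []
    for part in content:
        kind = part.get("type")
        if kind == "image":  # Anthropic-style content block
            source = part["source"]
            if source.get("type") != "base64":
                raise ValueError("only base64 image sources are handled")
            raw = base64.b64decode(source["data"], validate=True)
            images.append((source["media_type"], raw))
        elif kind == "image_url":  # OpenAI-style content part
            images.append(decode_data_url(part["image_url"]["url"]))
    return images
```

Keeping message order matters for the multi-image case: the n-th decoded image would correspond to the n-th <|image|> placeholder in the prompt.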

Dependencies / notes

  • SharpInference.Server would need a reference to SharpInference.Vision and an mmproj-configured engine (new option, e.g. SHARPI_MMPROJ).
  • The server currently runs the engine on its configured backend; image input needs a forward pass with SupportsEmbeddingInput (CPU or full CUDA today; see the hybrid/Vulkan follow-up).
  • Only gemma4uv (12B) is wired; reject image content for other models with a clear error.
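
The checks in these notes could be gated up front, before any image is decoded. The following is a rough sketch only: `EngineInfo`, its fields, and `image_input_error` are hypothetical names chosen for illustration, and the error wording is a placeholder rather than the server's actual messages.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EngineInfo:
    """Hypothetical snapshot of what the server knows about its engine."""
    architecture: str               # e.g. "gemma4uv"
    mmproj_loaded: bool             # True when an mmproj was configured (e.g. via SHARPI_MMPROJ)
    supports_embedding_input: bool  # mirrors the backend's SupportsEmbeddingInput


def image_input_error(engine: EngineInfo, image_count: int) -> Optional[str]:
    """Return a client-facing error message, or None if the request may proceed."""
    if image_count == 0:
        return None  # text-only requests are unaffected
    if engine.architecture != "gemma4uv":
        return f"image input is not supported for model architecture '{engine.architecture}'"
    if not engine.mmproj_loaded:
        return "image input requires an mmproj file; none is configured"
    if not engine.supports_embedding_input:
        return "the configured backend does not support embedding input"
    return None
```

Returning a message instead of raising lets each endpoint wrap it in its own error envelope, since the Anthropic and OpenAI formats differ.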

Labels: enhancement