Skip to content

feat(server): multimodal image input endpoint (Gemma 4 gemma4uv) (#253)#259

Merged
pekkah merged 4 commits into
masterfrom
feat/253-server-image-input
Jun 15, 2026
Merged

feat(server): multimodal image input endpoint (Gemma 4 gemma4uv) (#253)#259
pekkah merged 4 commits into
masterfrom
feat/253-server-image-input

Conversation

@pekkah

@pekkah pekkah commented Jun 15, 2026

Copy link
Copy Markdown
Owner

Closes #253.

Wires encoder-free Gemma 4 vision (the CLI's --image from #250) through the API
server's Anthropic (/v1/messages) and OpenAI (/v1/chat/completions) endpoints.

Design — engine owns vision, server passes bytes

  • InferenceEngine.EnableImageInput(...), SupportsImageInput, and
    GenerateImageChunksAsync(...). SplicePrefillImages projects each image
    (decode → preprocess → embed) and splices open + soft tokens + close at every
    <|image|> placeholder during a fresh prefill (mirrors the CLI's RunImagePrompt).
  • IInferenceEngine gains default interface members so existing implementers inherit
    a "no image support" stub.
  • InferenceEngineLoader opens the mmproj VisionModel when configured and rejects
    unsupported configs (non-embedding pass, batched, or a non-Gemma text model) with a clear error.

Server surface

  • New ImageContent helper: base64 / data-URL decode + full PNG decode-validation at parse
    time
    (unsupported / corrupt / decompression-bomb payloads → HTTP 400 before any response).
  • Anthropic image content blocks (source.type=base64) and OpenAI image_url /
    input_image parts (data URL). Each image is collected and replaced inline by the
    model's <|image|> placeholder, multi-image per message in document order (capped at 16/request).
  • MmprojPath option settable via SHARPI_MMPROJ env, the SharpInference:MmprojPath
    config key, or Configure(o => o.MmprojPath = ...).

Gemma 4 thinking default

Gemma 4 is not a reasoning model. The server defaults thinking off for it
(ChatTemplateRenderer.ModelDefaultsThinkingOff), matching the CLI and applied uniformly
across /v1/messages, /v1/chat/completions, and /v1/responses. An explicit request
flag still wins; SHARPI_NO_THINKING still forces off.

Review cycle

A multi-agent adversarial review of the initial diff found — and this PR fixes — two real
bugs that single-image smoke tests masked:

  • Decode position: content tokens were forwarded at tokens.Length + i instead of
    startPos + i, so multi-token image responses decoded over the spliced soft-token KV with
    wrong RoPE positions (the repetition/garbling). Fixed → output is now clean and coherent.
  • Cross-request KV reuse: the image path didn't invalidate _prevTokens, letting a later
    text request reuse image-soft-token KV. Fixed.
  • Decompression-bomb DoS: ImageIO inflated the PNG IDAT unbounded; now bounded to the
    exact decoded size.
  • Robustness: bad images / malformed JSON → clean 400 (not 500 / aborted stream); engine
    errors during generation are surfaced gracefully.

Tests / verification

  • 9 wire-format image tests + MmprojPath configurability + ImageIO bomb-cap tests.
  • Full solution builds clean in Release (TreatWarningsAsErrors); Tests.Server green
    (the 3 WithTools failures are a local-appsettings artifact, unrelated — they fail on
    master too).
  • End-to-end on a 4070 Ti with the 12B gemma4uv model + its mmproj: both endpoints
    return clean, coherent descriptions of a real image (e.g. "A serene, misty landscape
    features a calm lake reflecting a mountain range under a clear blue sky."
    ).

Deferred (noted, out of scope)

  • /v1/responses does not yet accept image content (text-only; pre-existing).
  • The Gemma 4 image token-id fallbacks are duplicated with the CLI (intentional, by the
    byte-identical-copy design).

🤖 Generated with Claude Code

Wire encoder-free Gemma 4 vision (issue #250's CLI feature) through the API
server's Anthropic and OpenAI chat endpoints.

Engine owns the vision pipeline; the server only decodes content blocks to
bytes and hands them over:
- InferenceEngine.EnableImageInput / SupportsImageInput /
  GenerateImageChunksAsync; SplicePrefillImages projects each image and splices
  open + soft tokens + close at every <|image|> placeholder (mirrors the CLI's
  RunImagePrompt loop).
- IInferenceEngine grows default interface members so other implementers
  inherit a "no image support" stub.
- InferenceEngineLoader opens the mmproj VisionModel when configured and rejects
  unsupported passes (non-embedding / batched) with a clear error.

Server surface:
- New ImageContent helper: base64 / data-URL decode + PNG-signature validation.
- Anthropic image content blocks (source.type=base64) and OpenAI image_url /
  input_image parts (data URL); each image is collected and replaced inline by
  the model's <|image|> placeholder.
- MmprojPath option settable via SHARPI_MMPROJ env, the SharpInference:MmprojPath
  config key, or Configure(...).

Gemma 4 is not a reasoning model: the server now defaults thinking off for it
(ChatTemplateRenderer.ModelDefaultsThinkingOff), matching the CLI. Leaving
thinking on made the model ramble in the <|channel> thought channel and go
out-of-distribution on image input. An explicit request flag still wins.

Tests: 7 wire-format image tests + an MmprojPath configurability test. Verified
end-to-end on a 4070 Ti with the 12B gemma4uv model + its mmproj: both
endpoints correctly describe a real image.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements multimodal image input support (issue #253) for Gemma 4 models, allowing the engine to project images into soft-token blocks and splice them during prompt prefill. It updates the Anthropic and OpenAI endpoints to parse and validate base64-encoded PNGs or data URLs, routing them to the new image-aware generation path. The feedback focuses on enhancing robustness and defensive programming, specifically recommending validation checks on JSON elements during payload parsing in ExtractTextAndImages, ParseAnthropicImageBlock, and FlattenContent to prevent unhandled exceptions (HTTP 500 errors). Additionally, it is suggested to verify that image byte arrays are not null or empty before loading them into memory streams.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +576 to +599
private static string ExtractTextAndImages(JsonElement? contentEl, List<byte[]> images)
{
if (contentEl is null) return "";
var el = contentEl.Value;
if (el.ValueKind == JsonValueKind.String) return el.GetString() ?? "";
if (el.ValueKind != JsonValueKind.Array) return "";

var sb = new StringBuilder();
foreach (var block in el.EnumerateArray())
{
var type = block.TryGetProperty("type", out var t) ? t.GetString() : null;
if (type == "text")
{
if (block.TryGetProperty("text", out var tv))
sb.Append(tv.GetString());
}
else if (type == "image")
{
images.Add(ParseAnthropicImageBlock(block));
sb.Append(ImageContent.Placeholder);
}
}
return sb.ToString();
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Robustness and Defensive Programming: The ExtractTextAndImages method iterates over el.EnumerateArray(). If the client sends a malformed JSON payload where the array contains non-object elements (e.g., strings or numbers), or if the type or text properties are not strings, calling block.TryGetProperty or t.GetString() will throw an unhandled InvalidOperationException. This will result in an HTTP 500 Internal Server Error instead of a clean HTTP 400 Bad Request. We should defensively check the ValueKind of the elements and properties before accessing them.

    private static string ExtractTextAndImages(JsonElement? contentEl, List<byte[]> images)
    {
        if (contentEl is null) return "";
        var el = contentEl.Value;
        if (el.ValueKind == JsonValueKind.String) return el.GetString() ?? "";
        if (el.ValueKind != JsonValueKind.Array) return "";

        var sb = new StringBuilder();
        foreach (var block in el.EnumerateArray())
        {
            if (block.ValueKind != JsonValueKind.Object)
                continue;
            var type = block.TryGetProperty("type", out var t) && t.ValueKind == JsonValueKind.String ? t.GetString() : null;
            if (type == "text")
            { 
                if (block.TryGetProperty("text", out var tv) && tv.ValueKind == JsonValueKind.String)
                    sb.Append(tv.GetString());
            }
            else if (type == "image")
            { 
                images.Add(ParseAnthropicImageBlock(block));
                sb.Append(ImageContent.Placeholder);
            }
        }
        return sb.ToString();
    }

Comment on lines +601 to +612
private static byte[] ParseAnthropicImageBlock(JsonElement block)
{
if (!block.TryGetProperty("source", out var source))
throw new ImageContentException("image content block is missing 'source'.");
var srcType = source.TryGetProperty("type", out var st) ? st.GetString() : null;
if (srcType != "base64")
throw new ImageContentException(
$"unsupported image source type '{srcType}'; only base64 image sources are supported (URL fetch is not).");
if (!source.TryGetProperty("data", out var data) || data.ValueKind != JsonValueKind.String)
throw new ImageContentException("image source is missing base64 'data'.");
return ImageContent.FromBase64(data.GetString() ?? "");
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Robustness and Defensive Programming: In ParseAnthropicImageBlock, if the source property is not a JSON object, or if the type property is not a string, calling source.TryGetProperty or st.GetString() will throw an unhandled InvalidOperationException, leading to an HTTP 500 error. We should defensively check the ValueKind of source and st to ensure they are of the expected types, throwing an ImageContentException (which is cleanly caught and mapped to HTTP 400) if they are not.

    private static byte[] ParseAnthropicImageBlock(JsonElement block)
    {
        if (!block.TryGetProperty("source", out var source) || source.ValueKind != JsonValueKind.Object)
            throw new ImageContentException("image content block is missing a valid 'source' object.");
        var srcType = source.TryGetProperty("type", out var st) && st.ValueKind == JsonValueKind.String ? st.GetString() : null;
        if (srcType != "base64")
            throw new ImageContentException(
                $"unsupported image source type '{srcType}'; only base64 image sources are supported (URL fetch is not).");
        if (!source.TryGetProperty("data", out var data) || data.ValueKind != JsonValueKind.String)
            throw new ImageContentException("image source is missing base64 'data'.");
        return ImageContent.FromBase64(data.GetString() ?? "");
    }

Comment on lines +458 to +489
private static string FlattenContent(JsonElement? contentEl, List<byte[]> images)
{
if (contentEl is null) return "";
var el = contentEl.Value;
if (el.ValueKind == JsonValueKind.String) return el.GetString() ?? "";
if (el.ValueKind != JsonValueKind.Array) return "";

var sb = new StringBuilder();
foreach (var part in el.EnumerateArray())
{
var type = part.TryGetProperty("type", out var t) ? t.GetString() : null;
if (type == "text")
{
if (part.TryGetProperty("text", out var tv))
sb.Append(tv.GetString());
}
else if (type is "image_url" or "input_image")
{
// image_url is either {url:"data:..."} (chat completions) or a bare string url.
string url = "";
if (part.TryGetProperty("image_url", out var iu))
url = iu.ValueKind == JsonValueKind.String
? iu.GetString() ?? ""
: iu.TryGetProperty("url", out var u) ? u.GetString() ?? "" : "";
if (string.IsNullOrEmpty(url))
throw new ImageContentException($"{type} content part is missing a base64 'image_url'.");
images.Add(ImageContent.FromDataUrl(url));
sb.Append(ImageContent.Placeholder);
}
}
return sb.ToString();
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Robustness and Defensive Programming: In FlattenContent, if the client sends a malformed JSON payload where the array contains non-object elements, or if the type, text, or image_url properties are not of the expected JSON types, calling part.TryGetProperty, t.GetString(), or iu.TryGetProperty will throw an unhandled InvalidOperationException. This will result in an HTTP 500 Internal Server Error instead of a clean HTTP 400 Bad Request. We should defensively check the ValueKind of the elements and properties before accessing them.

    private static string FlattenContent(JsonElement? contentEl, List<byte[]> images)
    {
        if (contentEl is null) return "";
        var el = contentEl.Value;
        if (el.ValueKind == JsonValueKind.String) return el.GetString() ?? "";
        if (el.ValueKind != JsonValueKind.Array) return "";

        var sb = new StringBuilder();
        foreach (var part in el.EnumerateArray())
        {
            if (part.ValueKind != JsonValueKind.Object)
                continue;
            var type = part.TryGetProperty("type", out var t) && t.ValueKind == JsonValueKind.String ? t.GetString() : null;
            if (type == "text")
            { 
                if (part.TryGetProperty("text", out var tv) && tv.ValueKind == JsonValueKind.String)
                    sb.Append(tv.GetString());
            }
            else if (type is "image_url" or "input_image")
            { 
                // image_url is either {url:"data:..."} (chat completions) or a bare string url.
                string url = "";
                if (part.TryGetProperty("image_url", out var iu))
                {
                    if (iu.ValueKind == JsonValueKind.String)
                    {
                        url = iu.GetString() ?? "";
                    }
                    else if (iu.ValueKind == JsonValueKind.Object)
                    {
                        url = iu.TryGetProperty("url", out var u) && u.ValueKind == JsonValueKind.String ? u.GetString() ?? "" : "";
                    }
                }
                if (string.IsNullOrEmpty(url))
                    throw new ImageContentException($"{type} content part is missing a base64 'image_url'.");
                images.Add(ImageContent.FromDataUrl(url));
                sb.Append(ImageContent.Placeholder);
            }
        }
        return sb.ToString();
    }

Comment on lines +195 to +201
var blocks = new (float[] Soft, int NTok)[imageBytes.Count];
for (int i = 0; i < imageBytes.Count; i++)
{
ct.ThrowIfCancellationRequested();
byte[] rgb;
int w, h;
using (var ms = new MemoryStream(imageBytes[i], writable: false))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Defensive Programming: Ensure that each element in imageBytes is not null or empty before attempting to load it into a MemoryStream. If a null or empty byte array is passed, it will throw an unhandled ArgumentNullException or cause issues in ImageIO.LoadRgb.

        var blocks = new (float[] Soft, int NTok)[imageBytes.Count];
        for (int i = 0; i < imageBytes.Count; i++)
        {
            ct.ThrowIfCancellationRequested();
            if (imageBytes[i] is null || imageBytes[i].Length == 0)
                throw new ArgumentException($"Image at index {i} is null or empty.", nameof(imageBytes));
            byte[] rgb;
            int w, h;
            using (var ms = new MemoryStream(imageBytes[i], writable: false))

pekkah and others added 3 commits June 15, 2026 22:18
…obustness

Multi-agent review of the #253 diff surfaced two real correctness bugs, a DoS
vector, and a cluster of untrusted-input robustness gaps. All fixed:

Critical correctness:
- Decode loop forwarded content tokens at `tokens.Length + i` instead of
  `startPos + i`. For images startPos is hundreds of positions larger (each
  placeholder expands to open + soft tokens + close), so every multi-token
  image response decoded on top of the spliced soft-token KV with wrong RoPE
  positions — the repetition/garbling seen in testing. Two of three forwards
  had been updated in the GenerateCore refactor; this fixes the third.
- The image path never invalidated _prevTokens, so a later text request could
  match the stale sequence in FindCacheablePrefix and TruncateTo into a cache
  holding image-soft-token KV (silent cross-request corruption). Null it before
  the fresh image prefill; also covers a mid-prefill throw.

Security:
- ImageIO inflated the PNG IDAT zlib stream unbounded before any size check —
  a tiny decompression bomb could OOM the process, now network-reachable via
  the image endpoints. Bound the read to the exact decoded size a non-interlaced
  PNG must produce; the remainder of the stream is never drained.

Robustness (untrusted input → clean 400, not 500 / aborted stream):
- ImageContent now fully decode-validates at parse time, so an unsupported /
  corrupt PNG (interlaced, 16-bit, truncated) is a clean 400 before any
  response bytes rather than a 500/stream-abort deep in prefill.
- Guard JsonElement.TryGetProperty against non-object content-array elements
  and non-object image sources (would otherwise throw → 500).
- Wrap the engine enumeration in both endpoints: non-streaming returns a
  structured error; streaming terminates the SSE stream cleanly.
- Cap images per request (16) as a DoS guard.

Consistency / altitude:
- /v1/responses now honors ModelDefaultsThinkingOff + DisableThinking like the
  chat endpoints (Gemma 4 was rendering thinking on there).
- Reject a non-Gemma text model + gemma4uv mmproj at load (CPU pass reports
  SupportsEmbeddingInput for every arch).
- Hoist the duplicated image-dispatch helper to ImageContent.Generate.

Tests: real decodable PNG fixture; undecodable-PNG → 400; too-many-images → 400;
ImageIO bomb-cap + short-IDAT tests. Re-verified end-to-end on the 12B gemma4uv
model: both endpoints now produce clean, non-repetitive image descriptions.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ture

The three AnthropicMessages_WithTools_* tests built a WebApplicationFactory<Program>
without pinning a ChatTemplateRenderer, so the default renderer was constructed from
the bound Architecture option. WebApplicationFactory loads the Host's (gitignored)
appsettings.Local.json, so a developer who set Architecture to a non-qwen value got a
tool-call adapter that can't parse <tool_call> — the tests passed in CI but failed
locally. Pin the qwen adapter via a shared ToolHostFactory helper so they are
deterministic regardless of local config.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…r mapping

Follow-up to the #259 review fixes; a second adversarial pass found two
fix-introduced regressions and an observability gap:

- The per-request image cap was checked AFTER BuildMessageList had already
  decoded every image, so it didn't bound the work it documents. Enforce it in
  the content-flatten loop (ImageContent.CheckCap) before each decode, so an
  over-cap request is rejected having decoded at most the cap.
- ImageContent.ValidatePng caught only InvalidDataException/NotSupportedException,
  but a crafted PNG (e.g. an IHDR chunk with length < 13) made ImageIO throw
  IndexOutOfRangeException, which escaped to a bare 500. Widen the catch to map
  every decode failure (except OOM/StackOverflow) to a clean 400, and harden
  ImageIO to reject a short IHDR chunk with InvalidDataException at the source.
- The streaming swallow-catches discarded the exception with no trace; log it so
  a genuine mid-stream failure is diagnosable (the client still sees a clean
  terminal frame, which is unavoidable once the SSE response is committed).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@pekkah pekkah merged commit a5f5d01 into master Jun 15, 2026
@pekkah pekkah deleted the feat/253-server-image-input branch June 15, 2026 19:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(server): multimodal image input endpoint (Gemma 4 gemma4uv content blocks)

1 participant