feat(server): multimodal image input endpoint (Gemma 4 gemma4uv) (#253)#259
Conversation
Wire encoder-free Gemma 4 vision (issue #250's CLI feature) through the API server's Anthropic and OpenAI chat endpoints. Engine owns the vision pipeline; the server only decodes content blocks to bytes and hands them over: - InferenceEngine.EnableImageInput / SupportsImageInput / GenerateImageChunksAsync; SplicePrefillImages projects each image and splices open + soft tokens + close at every <|image|> placeholder (mirrors the CLI's RunImagePrompt loop). - IInferenceEngine grows default interface members so other implementers inherit a "no image support" stub. - InferenceEngineLoader opens the mmproj VisionModel when configured and rejects unsupported passes (non-embedding / batched) with a clear error. Server surface: - New ImageContent helper: base64 / data-URL decode + PNG-signature validation. - Anthropic image content blocks (source.type=base64) and OpenAI image_url / input_image parts (data URL); each image is collected and replaced inline by the model's <|image|> placeholder. - MmprojPath option settable via SHARPI_MMPROJ env, the SharpInference:MmprojPath config key, or Configure(...). Gemma 4 is not a reasoning model: the server now defaults thinking off for it (ChatTemplateRenderer.ModelDefaultsThinkingOff), matching the CLI. Leaving thinking on made the model ramble in the <|channel> thought channel and go out-of-distribution on image input. An explicit request flag still wins. Tests: 7 wire-format image tests + an MmprojPath configurability test. Verified end-to-end on a 4070 Ti with the 12B gemma4uv model + its mmproj: both endpoints correctly describe a real image. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request implements multimodal image input support (issue #253) for Gemma 4 models, allowing the engine to project images into soft-token blocks and splice them during prompt prefill. It updates the Anthropic and OpenAI endpoints to parse and validate base64-encoded PNGs or data URLs, routing them to the new image-aware generation path. The feedback focuses on enhancing robustness and defensive programming, specifically recommending validation checks on JSON elements during payload parsing in ExtractTextAndImages, ParseAnthropicImageBlock, and FlattenContent to prevent unhandled exceptions (HTTP 500 errors). Additionally, it is suggested to verify that image byte arrays are not null or empty before loading them into memory streams.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| private static string ExtractTextAndImages(JsonElement? contentEl, List<byte[]> images) | ||
| { | ||
| if (contentEl is null) return ""; | ||
| var el = contentEl.Value; | ||
| if (el.ValueKind == JsonValueKind.String) return el.GetString() ?? ""; | ||
| if (el.ValueKind != JsonValueKind.Array) return ""; | ||
|
|
||
| var sb = new StringBuilder(); | ||
| foreach (var block in el.EnumerateArray()) | ||
| { | ||
| var type = block.TryGetProperty("type", out var t) ? t.GetString() : null; | ||
| if (type == "text") | ||
| { | ||
| if (block.TryGetProperty("text", out var tv)) | ||
| sb.Append(tv.GetString()); | ||
| } | ||
| else if (type == "image") | ||
| { | ||
| images.Add(ParseAnthropicImageBlock(block)); | ||
| sb.Append(ImageContent.Placeholder); | ||
| } | ||
| } | ||
| return sb.ToString(); | ||
| } |
There was a problem hiding this comment.
Robustness and Defensive Programming: The ExtractTextAndImages method iterates over el.EnumerateArray(). If the client sends a malformed JSON payload where the array contains non-object elements (e.g., strings or numbers), or if the type or text properties are not strings, calling block.TryGetProperty or t.GetString() will throw an unhandled InvalidOperationException. This will result in an HTTP 500 Internal Server Error instead of a clean HTTP 400 Bad Request. We should defensively check the ValueKind of the elements and properties before accessing them.
private static string ExtractTextAndImages(JsonElement? contentEl, List<byte[]> images)
{
if (contentEl is null) return "";
var el = contentEl.Value;
if (el.ValueKind == JsonValueKind.String) return el.GetString() ?? "";
if (el.ValueKind != JsonValueKind.Array) return "";
var sb = new StringBuilder();
foreach (var block in el.EnumerateArray())
{
if (block.ValueKind != JsonValueKind.Object)
continue;
var type = block.TryGetProperty("type", out var t) && t.ValueKind == JsonValueKind.String ? t.GetString() : null;
if (type == "text")
{
if (block.TryGetProperty("text", out var tv) && tv.ValueKind == JsonValueKind.String)
sb.Append(tv.GetString());
}
else if (type == "image")
{
images.Add(ParseAnthropicImageBlock(block));
sb.Append(ImageContent.Placeholder);
}
}
return sb.ToString();
}| private static byte[] ParseAnthropicImageBlock(JsonElement block) | ||
| { | ||
| if (!block.TryGetProperty("source", out var source)) | ||
| throw new ImageContentException("image content block is missing 'source'."); | ||
| var srcType = source.TryGetProperty("type", out var st) ? st.GetString() : null; | ||
| if (srcType != "base64") | ||
| throw new ImageContentException( | ||
| $"unsupported image source type '{srcType}'; only base64 image sources are supported (URL fetch is not)."); | ||
| if (!source.TryGetProperty("data", out var data) || data.ValueKind != JsonValueKind.String) | ||
| throw new ImageContentException("image source is missing base64 'data'."); | ||
| return ImageContent.FromBase64(data.GetString() ?? ""); | ||
| } |
There was a problem hiding this comment.
Robustness and Defensive Programming: In ParseAnthropicImageBlock, if the source property is not a JSON object, or if the type property is not a string, calling source.TryGetProperty or st.GetString() will throw an unhandled InvalidOperationException, leading to an HTTP 500 error. We should defensively check the ValueKind of source and st to ensure they are of the expected types, throwing an ImageContentException (which is cleanly caught and mapped to HTTP 400) if they are not.
private static byte[] ParseAnthropicImageBlock(JsonElement block)
{
if (!block.TryGetProperty("source", out var source) || source.ValueKind != JsonValueKind.Object)
throw new ImageContentException("image content block is missing a valid 'source' object.");
var srcType = source.TryGetProperty("type", out var st) && st.ValueKind == JsonValueKind.String ? st.GetString() : null;
if (srcType != "base64")
throw new ImageContentException(
$"unsupported image source type '{srcType}'; only base64 image sources are supported (URL fetch is not).");
if (!source.TryGetProperty("data", out var data) || data.ValueKind != JsonValueKind.String)
throw new ImageContentException("image source is missing base64 'data'.");
return ImageContent.FromBase64(data.GetString() ?? "");
}| private static string FlattenContent(JsonElement? contentEl, List<byte[]> images) | ||
| { | ||
| if (contentEl is null) return ""; | ||
| var el = contentEl.Value; | ||
| if (el.ValueKind == JsonValueKind.String) return el.GetString() ?? ""; | ||
| if (el.ValueKind != JsonValueKind.Array) return ""; | ||
|
|
||
| var sb = new StringBuilder(); | ||
| foreach (var part in el.EnumerateArray()) | ||
| { | ||
| var type = part.TryGetProperty("type", out var t) ? t.GetString() : null; | ||
| if (type == "text") | ||
| { | ||
| if (part.TryGetProperty("text", out var tv)) | ||
| sb.Append(tv.GetString()); | ||
| } | ||
| else if (type is "image_url" or "input_image") | ||
| { | ||
| // image_url is either {url:"data:..."} (chat completions) or a bare string url. | ||
| string url = ""; | ||
| if (part.TryGetProperty("image_url", out var iu)) | ||
| url = iu.ValueKind == JsonValueKind.String | ||
| ? iu.GetString() ?? "" | ||
| : iu.TryGetProperty("url", out var u) ? u.GetString() ?? "" : ""; | ||
| if (string.IsNullOrEmpty(url)) | ||
| throw new ImageContentException($"{type} content part is missing a base64 'image_url'."); | ||
| images.Add(ImageContent.FromDataUrl(url)); | ||
| sb.Append(ImageContent.Placeholder); | ||
| } | ||
| } | ||
| return sb.ToString(); | ||
| } |
There was a problem hiding this comment.
Robustness and Defensive Programming: In FlattenContent, if the client sends a malformed JSON payload where the array contains non-object elements, or if the type, text, or image_url properties are not of the expected JSON types, calling part.TryGetProperty, t.GetString(), or iu.TryGetProperty will throw an unhandled InvalidOperationException. This will result in an HTTP 500 Internal Server Error instead of a clean HTTP 400 Bad Request. We should defensively check the ValueKind of the elements and properties before accessing them.
private static string FlattenContent(JsonElement? contentEl, List<byte[]> images)
{
if (contentEl is null) return "";
var el = contentEl.Value;
if (el.ValueKind == JsonValueKind.String) return el.GetString() ?? "";
if (el.ValueKind != JsonValueKind.Array) return "";
var sb = new StringBuilder();
foreach (var part in el.EnumerateArray())
{
if (part.ValueKind != JsonValueKind.Object)
continue;
var type = part.TryGetProperty("type", out var t) && t.ValueKind == JsonValueKind.String ? t.GetString() : null;
if (type == "text")
{
if (part.TryGetProperty("text", out var tv) && tv.ValueKind == JsonValueKind.String)
sb.Append(tv.GetString());
}
else if (type is "image_url" or "input_image")
{
// image_url is either {url:"data:..."} (chat completions) or a bare string url.
string url = "";
if (part.TryGetProperty("image_url", out var iu))
{
if (iu.ValueKind == JsonValueKind.String)
{
url = iu.GetString() ?? "";
}
else if (iu.ValueKind == JsonValueKind.Object)
{
url = iu.TryGetProperty("url", out var u) && u.ValueKind == JsonValueKind.String ? u.GetString() ?? "" : "";
}
}
if (string.IsNullOrEmpty(url))
throw new ImageContentException($"{type} content part is missing a base64 'image_url'.");
images.Add(ImageContent.FromDataUrl(url));
sb.Append(ImageContent.Placeholder);
}
}
return sb.ToString();
}| var blocks = new (float[] Soft, int NTok)[imageBytes.Count]; | ||
| for (int i = 0; i < imageBytes.Count; i++) | ||
| { | ||
| ct.ThrowIfCancellationRequested(); | ||
| byte[] rgb; | ||
| int w, h; | ||
| using (var ms = new MemoryStream(imageBytes[i], writable: false)) |
There was a problem hiding this comment.
Defensive Programming: Ensure that each element in imageBytes is not null or empty before attempting to load it into a MemoryStream. If a null or empty byte array is passed, it will throw an unhandled ArgumentNullException or cause issues in ImageIO.LoadRgb.
var blocks = new (float[] Soft, int NTok)[imageBytes.Count];
for (int i = 0; i < imageBytes.Count; i++)
{
ct.ThrowIfCancellationRequested();
if (imageBytes[i] is null || imageBytes[i].Length == 0)
throw new ArgumentException($"Image at index {i} is null or empty.", nameof(imageBytes));
byte[] rgb;
int w, h;
using (var ms = new MemoryStream(imageBytes[i], writable: false))…obustness Multi-agent review of the #253 diff surfaced two real correctness bugs, a DoS vector, and a cluster of untrusted-input robustness gaps. All fixed: Critical correctness: - Decode loop forwarded content tokens at `tokens.Length + i` instead of `startPos + i`. For images startPos is hundreds of positions larger (each placeholder expands to open + soft tokens + close), so every multi-token image response decoded on top of the spliced soft-token KV with wrong RoPE positions — the repetition/garbling seen in testing. Two of three forwards had been updated in the GenerateCore refactor; this fixes the third. - The image path never invalidated _prevTokens, so a later text request could match the stale sequence in FindCacheablePrefix and TruncateTo into a cache holding image-soft-token KV (silent cross-request corruption). Null it before the fresh image prefill; also covers a mid-prefill throw. Security: - ImageIO inflated the PNG IDAT zlib stream unbounded before any size check — a tiny decompression bomb could OOM the process, now network-reachable via the image endpoints. Bound the read to the exact decoded size a non-interlaced PNG must produce; the remainder of the stream is never drained. Robustness (untrusted input → clean 400, not 500 / aborted stream): - ImageContent now fully decode-validates at parse time, so an unsupported / corrupt PNG (interlaced, 16-bit, truncated) is a clean 400 before any response bytes rather than a 500/stream-abort deep in prefill. - Guard JsonElement.TryGetProperty against non-object content-array elements and non-object image sources (would otherwise throw → 500). - Wrap the engine enumeration in both endpoints: non-streaming returns a structured error; streaming terminates the SSE stream cleanly. - Cap images per request (16) as a DoS guard. Consistency / altitude: - /v1/responses now honors ModelDefaultsThinkingOff + DisableThinking like the chat endpoints (Gemma 4 was rendering thinking on there). - Reject a non-Gemma text model + gemma4uv mmproj at load (CPU pass reports SupportsEmbeddingInput for every arch). - Hoist the duplicated image-dispatch helper to ImageContent.Generate. Tests: real decodable PNG fixture; undecodable-PNG → 400; too-many-images → 400; ImageIO bomb-cap + short-IDAT tests. Re-verified end-to-end on the 12B gemma4uv model: both endpoints now produce clean, non-repetitive image descriptions. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ture The three AnthropicMessages_WithTools_* tests built a WebApplicationFactory<Program> without pinning a ChatTemplateRenderer, so the default renderer was constructed from the bound Architecture option. WebApplicationFactory loads the Host's (gitignored) appsettings.Local.json, so a developer who set Architecture to a non-qwen value got a tool-call adapter that can't parse <tool_call> — the tests passed in CI but failed locally. Pin the qwen adapter via a shared ToolHostFactory helper so they are deterministic regardless of local config. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…r mapping Follow-up to the #259 review fixes; a second adversarial pass found two fix-introduced regressions and an observability gap: - The per-request image cap was checked AFTER BuildMessageList had already decoded every image, so it didn't bound the work it documents. Enforce it in the content-flatten loop (ImageContent.CheckCap) before each decode, so an over-cap request is rejected having decoded at most the cap. - ImageContent.ValidatePng caught only InvalidDataException/NotSupportedException, but a crafted PNG (e.g. an IHDR chunk with length < 13) made ImageIO throw IndexOutOfRangeException, which escaped to a bare 500. Widen the catch to map every decode failure (except OOM/StackOverflow) to a clean 400, and harden ImageIO to reject a short IHDR chunk with InvalidDataException at the source. - The streaming swallow-catches discarded the exception with no trace; log it so a genuine mid-stream failure is diagnosable (the client still sees a clean terminal frame, which is unavoidable once the SSE response is committed). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Closes #253.
Wires encoder-free Gemma 4 vision (the CLI's
--imagefrom #250) through the APIserver's Anthropic (
/v1/messages) and OpenAI (/v1/chat/completions) endpoints.Design — engine owns vision, server passes bytes
InferenceEngine.EnableImageInput(...),SupportsImageInput, andGenerateImageChunksAsync(...).SplicePrefillImagesprojects each image(decode → preprocess → embed) and splices
open + soft tokens + closeat every<|image|>placeholder during a fresh prefill (mirrors the CLI'sRunImagePrompt).IInferenceEnginegains default interface members so existing implementers inherita "no image support" stub.
InferenceEngineLoaderopens the mmprojVisionModelwhen configured and rejectsunsupported configs (non-embedding pass, batched, or a non-Gemma text model) with a clear error.
Server surface
ImageContenthelper: base64 / data-URL decode + full PNG decode-validation at parsetime (unsupported / corrupt / decompression-bomb payloads → HTTP 400 before any response).
imagecontent blocks (source.type=base64) and OpenAIimage_url/input_imageparts (data URL). Each image is collected and replaced inline by themodel's
<|image|>placeholder, multi-image per message in document order (capped at 16/request).MmprojPathoption settable viaSHARPI_MMPROJenv, theSharpInference:MmprojPathconfig key, or
Configure(o => o.MmprojPath = ...).Gemma 4 thinking default
Gemma 4 is not a reasoning model. The server defaults thinking off for it
(
ChatTemplateRenderer.ModelDefaultsThinkingOff), matching the CLI and applied uniformlyacross
/v1/messages,/v1/chat/completions, and/v1/responses. An explicit requestflag still wins;
SHARPI_NO_THINKINGstill forces off.Review cycle
A multi-agent adversarial review of the initial diff found — and this PR fixes — two real
bugs that single-image smoke tests masked:
tokens.Length + iinstead ofstartPos + i, so multi-token image responses decoded over the spliced soft-token KV withwrong RoPE positions (the repetition/garbling). Fixed → output is now clean and coherent.
_prevTokens, letting a latertext request reuse image-soft-token KV. Fixed.
ImageIOinflated the PNG IDAT unbounded; now bounded to theexact decoded size.
errors during generation are surfaced gracefully.
Tests / verification
MmprojPathconfigurability +ImageIObomb-cap tests.TreatWarningsAsErrors); Tests.Server green(the 3
WithToolsfailures are a local-appsettingsartifact, unrelated — they fail onmaster too).
gemma4uvmodel + its mmproj: both endpointsreturn clean, coherent descriptions of a real image (e.g. "A serene, misty landscape
features a calm lake reflecting a mountain range under a clear blue sky.").
Deferred (noted, out of scope)
/v1/responsesdoes not yet accept image content (text-only; pre-existing).byte-identical-copy design).
🤖 Generated with Claude Code