feat(vision): encoder-free Gemma 4 image input (CPU + CUDA) (#250) #251
Adds CPU image→text for Gemma 4 12B via the encoder-free "gemma4uv" projector: raw image patches projected straight into the LLM embedding space, no SigLIP ViT.

New SharpInference.Vision project:
- VisionModel: loads/validates the gemma4uv mmproj GGUF
- GemmaUvVisionEmbedder: im2col(48) -> LN×3 -> linear -> RMSNorm -> bf16 input_projection; parity vs a numpy oracle (cosine > 0.9995)
- ImageIO: minimal PNG decoder; ImagePreprocessor: calc_size_preserved_ratio + align-corners bilinear, mirroring mtmd-image.cpp

Decoder seam: ForwardPass.ForwardEmbedding injects a precomputed embedding (no sqrt(d) scale per gemma4.cpp; PLE via padding token 0), gated by IForwardPass.SupportsEmbeddingInput (default-throw elsewhere; CPU only).

CLI: --image (repeatable) + --mmproj. Images are referenced by <image> markers in -p (left-to-right) or prepended when omitted; the prompt is rendered through the model's own chat template and each <|image|> placeholder (258880) is expanded with its soft tokens wrapped in <|image>(255999) … <image|>(258882).

Verified end to end: a solid-red image decodes "Red" (logit 26 vs 19), then stops cleanly on <turn|>. The earlier garbage output was a prompt-format bug: Gemma 4 uses <|turn>/<turn|>, not Gemma 3's <start_of_turn>; causal attention is sufficient (no bidirectional image span needed).

download-model.ps1 also pulls the mmproj alongside the main GGUF. GPU (CUDA/Vulkan) ForwardEmbedding, audio, and the server path remain follow-ups.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
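For readers unfamiliar with the im2col(48) step, here is a minimal sketch of non-overlapping patch extraction from a planar CHW buffer. The class name, method signature, and the in-row element order (channel, then patch row, then patch column) are assumptions for illustration; the ordering the real gemma4uv weights expect is not stated in this thread.

```csharp
using System;

public static class PatchExtractSketch
{
    // Splits a planar CHW image (3 x height x width) into non-overlapping
    // patch x patch tiles and flattens each tile into one output row.
    // Tiles are emitted in row-major order. Assumed in-row ordering:
    // channel, then patch row, then patch column.
    public static float[] Im2Col(ReadOnlySpan<float> chw, int width, int height, int patch)
    {
        if (patch <= 0 || width % patch != 0 || height % patch != 0)
            throw new ArgumentException($"image dims ({width}x{height}) must be multiples of {patch}.");
        if (chw.Length < 3 * height * width)
            throw new ArgumentException($"chw length ({chw.Length}) is too small for dims {width}x{height}.");

        int cols = width / patch, rows = height / patch;
        int rowLen = 3 * patch * patch;
        var output = new float[rows * cols * rowLen];

        for (int py = 0; py < rows; py++)
        for (int px = 0; px < cols; px++)
        {
            int dst = (py * cols + px) * rowLen;
            for (int c = 0; c < 3; c++)
            for (int y = 0; y < patch; y++)
            {
                // Start of this patch row inside channel plane c.
                int src = (c * height + py * patch + y) * width + px * patch;
                chw.Slice(src, patch).CopyTo(output.AsSpan(dst, patch));
                dst += patch;
            }
        }
        return output;
    }
}
```

With patch = 48 each row holds 3 × 48 × 48 = 6912 values, which is the input the subsequent LayerNorm and linear stages would consume under this layout.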
Implements SupportsEmbeddingInput/ForwardEmbedding on CudaForwardPass —
the GPU mirror of the CPU seam. Uploads the precomputed vision soft token
into the device hidden buffer, skips the sqrt(d) embedding scale (raw
embeddings arrive final, gemma4.cpp:182), builds PLE from the padding
token (id 0), and runs the same device region (and CUDA graph) as text
decode. Gemma 4 only; partial-offload hybrids and Vulkan still throw.
CLI: --image now works under full CUDA offload (-g -1). RunImagePrompt
routes through IForwardPass (gpuFwd ?? cpu fwd); the CPU-only gate is
dropped and replaced by a backend-aware SupportsEmbeddingInput check with
an actionable message for unsupported paths.
Verified on a 4070 Ti: the red image decodes "Red" at logit 25.87 (CPU
was 26.08 — faithful), and a two-image marker prompt ("first <image> …
second <image>") answers "red and blue" in order, at ~50 t/s prefill.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
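The capability gate described in the two commits above can be sketched with C# default interface members (C# 8 or later). This is a hypothetical stand-in, not the project's IForwardPass: only the two member names come from the commit messages, and the ForwardEmbedding signature, the backend classes, and the error text are assumptions.

```csharp
using System;

// Hypothetical stand-in for IForwardPass; the ForwardEmbedding signature is assumed.
public interface IForwardPassSketch
{
    // Backends opt in by overriding this to return true.
    bool SupportsEmbeddingInput => false;

    // Default-throw: a backend that does not override this rejects embedding input.
    float[] ForwardEmbedding(ReadOnlySpan<float> embedding, int position) =>
        throw new NotSupportedException("This backend does not support embedding input.");
}

// A backend that inherits both defaults (for example, one that is text-only).
public sealed class TextOnlyBackend : IForwardPassSketch { }

// A backend that opts in. The body is a placeholder that echoes the input
// instead of running a decoder.
public sealed class EmbeddingBackend : IForwardPassSketch
{
    public bool SupportsEmbeddingInput => true;

    public float[] ForwardEmbedding(ReadOnlySpan<float> embedding, int position) =>
        embedding.ToArray();
}

public static class ImageGate
{
    // Backend-aware check a caller could run before splicing soft tokens.
    // The message text is illustrative only.
    public static void EnsureEmbeddingInput(IForwardPassSketch fwd)
    {
        if (!fwd.SupportsEmbeddingInput)
            throw new NotSupportedException(
                "--image requires a backend that supports embedding input.");
    }
}
```

Checking the flag up front lets the CLI fail with an actionable message before any image is decoded, rather than surfacing the default-throw from deep inside the prefill loop.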
Code Review
This pull request introduces multimodal vision support for Gemma 4 models by implementing the encoder-free gemma4uv image projector. It adds a new SharpInference.Vision project containing a custom PNG decoder, an image preprocessor, and the GemmaUvVisionEmbedder to project images into soft-token embeddings, while updating the CLI and forward pass backends to support splicing these embeddings. The review feedback highlights several improvement opportunities, including addressing potential integer overflow and validating image dimensions in the PNG decoder, optimizing bilinear interpolation in the preprocessor, validating the input buffer length in the vision embedder, catching more specific file I/O exceptions, and expanding special token compatibility for other Gemma versions.
    if (dataOff + clen + 4 > len)
        throw new InvalidDataException($"Truncated PNG chunk '{type}'.");
Potential integer overflow and out-of-bounds read/crash when validating chunk length clen in the PNG decoder. If clen is negative or extremely large, dataOff + clen + 4 can overflow and bypass the bounds check. Using clen > len - pos - 12 is safe from overflow.
    if (clen < 0 || clen > len - pos - 12)
        throw new InvalidDataException($"Truncated or invalid PNG chunk '{type}' with length {clen}.");

    case "IHDR":
        w = ReadUInt32BE(png, dataOff);
        h = ReadUInt32BE(png, dataOff + 4);
        bitDepth = png[dataOff + 8];
        colorType = png[dataOff + 9];
        interlace = png[dataOff + 12];
        seenIhdr = true;
Malformed PNG files with 0, negative, or extremely large dimensions can cause division-by-zero, NaN propagation, and negative array allocations in downstream preprocessing, or out-of-memory denial of service. Adding a validation check for dimensions prevents these issues.
    case "IHDR":
        w = ReadUInt32BE(png, dataOff);
        h = ReadUInt32BE(png, dataOff + 4);
        if (w <= 0 || h <= 0 || w > 16384 || h > 16384)
            throw new InvalidDataException($"Invalid or unsupported PNG dimensions: {w}x{h}.");
        bitDepth = png[dataOff + 8];
        colorType = png[dataOff + 9];
        interlace = png[dataOff + 12];
        seenIhdr = true;
        break;

    for (int x = 0; x < w; x++)
    {
        float px = x * xRatio;
        int x0 = Math.Min((int)px, srcWidth - 1);
        int x1 = Math.Min(x0 + 1, srcWidth - 1);
        float xf = px - x0;

        int i00 = (y0 * srcWidth + x0) * 3, i10 = (y0 * srcWidth + x1) * 3;
        int i01 = (y1 * srcWidth + x0) * 3, i11 = (y1 * srcWidth + x1) * 3;
Avoid redundant multiplications in the inner loop of bilinear interpolation. y0 * srcWidth and y1 * srcWidth can be calculated once in the outer loop to improve performance.
    int y0Offset = y0 * srcWidth;
    int y1Offset = y1 * srcWidth;
    for (int x = 0; x < w; x++)
    {
        float px = x * xRatio;
        int x0 = Math.Min((int)px, srcWidth - 1);
        int x1 = Math.Min(x0 + 1, srcWidth - 1);
        float xf = px - x0;
        int i00 = (y0Offset + x0) * 3, i10 = (y0Offset + x1) * 3;
        int i01 = (y1Offset + x0) * 3, i11 = (y1Offset + x1) * 3;

    if (width % _patch != 0 || height % _patch != 0)
        throw new ArgumentException($"image dims ({width}x{height}) must be multiples of {_patch}.");
Validate chw.Length before accessing it. If the input span is smaller than 3 * height * width, it will throw an unhandled IndexOutOfRangeException.
    if (width % _patch != 0 || height % _patch != 0)
        throw new ArgumentException($"image dims ({width}x{height}) must be multiples of {_patch}.");
    if (chw.Length < 3 * height * width)
        throw new ArgumentException($"chw length ({chw.Length}) is too small for dims {width}x{height}.");

    {
        rgb = ImageIO.LoadRgb(imagePaths[i], out srcW, out srcH);
    }
    catch (Exception ex) when (ex is IOException or NotSupportedException or InvalidDataException)
Catch specific exceptions such as UnauthorizedAccessException and SecurityException in addition to IOException to ensure precise error handling and graceful application exits when permission issues occur.
    catch (Exception ex) when (ex is IOException or NotSupportedException or InvalidDataException or UnauthorizedAccessException or System.Security.SecurityException)

References
- When performing file I/O operations, catch specific exceptions (such as IOException, UnauthorizedAccessException, SecurityException, and NotSupportedException) instead of a generic Exception to ensure precise error handling and graceful application exits.

    int imgOpen = tok.SpecialTokens.TryGetValue("<|image>", out var o) ? o : 255999;
    int imgClose = tok.SpecialTokens.TryGetValue("<image|>", out var c) ? c : 258882;
Support both Gemma 4's <|image|>/<image|> and Gemma 3's <img>/<end_of_image> special tokens to ensure compatibility across different Gemma multimodal model versions.
    int imgOpen = tok.SpecialTokens.TryGetValue("<|image|>", out var o) ? o :
                  tok.SpecialTokens.TryGetValue("<img>", out var o2) ? o2 : 255999;
    int imgClose = tok.SpecialTokens.TryGetValue("<image|>", out var c) ? c :
                   tok.SpecialTokens.TryGetValue("<end_of_image>", out var c2) ? c2 : 258882;

…ract (#250)

Addresses PR #251 review (cycle 1):
- ImageIO: reject negative chunk lengths and non-positive/absurd IHDR dimensions as InvalidDataException (signed ReadUInt32BE could wrap past the bounds check into an alloc crash that escaped the CLI's catch filter)
- VisionModel.Open: validate the two raw-pointer tensor dtypes (v.patch_embd.weight=F32, mm.input_projection.weight=BF16) so a divergent mmproj export fails loudly instead of misreading bytes into garbage
- IForwardPass.ForwardEmbedding: document the PLE-token-0 contract alongside the sqrt(d)-skip so future backend mirrors stay faithful
- tests: malformed-PNG regression (bad signature, zero dimensions)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
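A minimal sketch of the IHDR dimension guard this commit describes. The helper name and the exact pixel budget are assumptions: a later commit message in this thread mentions a "(long)w*h > 64M" guard, and 64 × 1024 × 1024 is used here only as one plausible reading of "64M".

```csharp
using System;
using System.IO;

public static class PngDimensionGuard
{
    // Assumed pixel budget; the thread only says "64M".
    public const long MaxPixels = 64L * 1024 * 1024;

    // Rejects non-positive and oversized dimensions before anything is allocated.
    // A width or height above int.MaxValue that was read through a signed helper
    // arrives here as a negative number and is caught by the first two tests.
    public static void Validate(int w, int h)
    {
        // The product is widened to long: 65536 * 65536 wraps to 0 in 32-bit
        // arithmetic and would slip past a plain int comparison.
        if (w <= 0 || h <= 0 || (long)w * h > MaxPixels)
            throw new InvalidDataException($"Invalid or unsupported PNG dimensions: {w}x{h}.");
    }
}
```

Throwing InvalidDataException keeps the failure inside the exception filter quoted in the review above, so a malformed file produces an error message instead of an allocation crash.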
…le 2)

Belt-and-suspenders: (long)dataOff + clen + 4 so a chunk length near int.MaxValue (only reachable with a ~2 GB in-memory PNG) can't wrap negative and defeat the truncation check. Closes the last review nit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
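To make the wrap-around concrete, the sketch below contrasts the 32-bit check with the widened one. Both helper names are hypothetical; the parameter names mirror the snippet quoted in the review, and the chunk layout assumed is the standard PNG one (data followed by a 4-byte CRC).

```csharp
using System;

public static class PngChunkBounds
{
    // 32-bit arithmetic: the sum can wrap negative and look "in bounds".
    // unchecked makes the wrap explicit even if the project compiles with
    // overflow checking enabled.
    public static bool FitsNaive(int dataOff, int clen, int len) =>
        unchecked(dataOff + clen + 4) <= len;

    // Widened arithmetic: the sum is computed in 64 bits, so it cannot wrap.
    // Negative lengths are rejected outright.
    public static bool Fits(int dataOff, int clen, int len)
    {
        if (clen < 0) return false;
        return (long)dataOff + clen + 4 <= len;
    }
}
```

For example, with dataOff = 16, clen = int.MaxValue - 8 and len = 100, FitsNaive wraps to a negative sum and reports the chunk as in bounds, while Fits correctly rejects it.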
#250/#126)

- docs/SharpInference-Design.md: new §8.3 documenting the shipped 12B encoder-free gemma4uv image path (projector forward, ForwardEmbedding splice, CLI prompt wiring), with an explicit note that E4B is a different, unsupported architecture.
- docs/gemma4-e4b-vision-plan.md: retire the verification debt. Dumped the real E4B mmproj header: it is NOT MobileNet-V5/gemma3n as the plan assumed. E4B ships a gemma4v transformer ViT vision encoder (16 blocks, 768-dim, 12 heads, QK-norm, GeGLU, conv patch-embed) + a gemma4a audio encoder. The plan's conclusion (encoder-full, separate from the 12B) holds; the specifics are corrected.
- download-model.ps1: gemma4-e4b-qat now also pulls the ~992 MB mmproj, with accurate labeling (gemma4v ViT + gemma4a, not the 12B gemma4uv).

Confirms Gemma 4 E4B and 12B use distinct vision architectures; #250's path covers only the 12B.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…oist (#250)

PR #251 Gemini code-assist findings:
- GemmaUvVisionEmbedder.Forward: validate chw.Length >= 3*h*w before reading (clear ArgumentException instead of IndexOutOfRange on a short span)
- ImagePreprocessor: hoist y0*srcWidth / y1*srcWidth out of the bilinear inner loop
- RunImagePrompt: broaden the image-read catch to UnauthorizedAccessException / SecurityException, matching the prompt-file read path

The two security-high PNG findings (chunk-length overflow, IHDR dimension bounds) were already fixed in the earlier review cycles ((long) chunk math + the (long)w*h > 64M dimension guard). The suggestion to add Gemma 3 <img>/<end_of_image> token fallbacks was declined: this path is Gemma 4-only (arch-gated) and Gemma 3 uses a different encoder-full vision architecture the splice can't serve.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
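For context on the hoist, here is a self-contained sketch of an align-corners bilinear resize over interleaved RGB with the row offsets computed once per output row. It is not the project's ImagePreprocessor: the method name, the byte input and float output types, and the ratio formula (src - 1) / (dst - 1) are assumptions chosen to match the loop fragment quoted in the review.

```csharp
using System;

public static class BilinearSketch
{
    // Resizes interleaved 8-bit RGB to w x h floats. Align-corners means the
    // first and last output samples land exactly on the first and last source
    // samples, hence the (src - 1) / (dst - 1) ratios.
    public static float[] Resize(ReadOnlySpan<byte> rgb, int srcWidth, int srcHeight, int w, int h)
    {
        if (srcWidth <= 0 || srcHeight <= 0 || w <= 0 || h <= 0)
            throw new ArgumentException("dimensions must be positive.");
        if (rgb.Length < 3 * srcWidth * srcHeight)
            throw new ArgumentException("rgb buffer is too small for the source dimensions.");

        float xRatio = w > 1 ? (srcWidth - 1) / (float)(w - 1) : 0f;
        float yRatio = h > 1 ? (srcHeight - 1) / (float)(h - 1) : 0f;
        var dst = new float[3 * w * h];

        for (int y = 0; y < h; y++)
        {
            float py = y * yRatio;
            int y0 = Math.Min((int)py, srcHeight - 1);
            int y1 = Math.Min(y0 + 1, srcHeight - 1);
            float yf = py - y0;
            // Hoisted: these depend only on y, not on x.
            int y0Offset = y0 * srcWidth, y1Offset = y1 * srcWidth;

            for (int x = 0; x < w; x++)
            {
                float px = x * xRatio;
                int x0 = Math.Min((int)px, srcWidth - 1);
                int x1 = Math.Min(x0 + 1, srcWidth - 1);
                float xf = px - x0;
                int i00 = (y0Offset + x0) * 3, i10 = (y0Offset + x1) * 3;
                int i01 = (y1Offset + x0) * 3, i11 = (y1Offset + x1) * 3;
                int o = (y * w + x) * 3;
                for (int c = 0; c < 3; c++)
                {
                    // Blend horizontally on both rows, then vertically.
                    float top = rgb[i00 + c] + (rgb[i10 + c] - rgb[i00 + c]) * xf;
                    float bottom = rgb[i01 + c] + (rgb[i11 + c] - rgb[i01 + c]) * xf;
                    dst[o + c] = top + (bottom - top) * yf;
                }
            }
        }
        return dst;
    }
}
```

The hoist saves two multiplications per output pixel; a JIT may already do this, so the change is best read as a readability cleanup with a possible small speedup.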
Summary
Adds image input for Gemma 4 12B via the encoder-free "gemma4uv" projector — raw image patches projected straight into the LLM embedding space, no SigLIP ViT. Closes #250 (CPU image→text scope; GPU full-offload also included).
This is image understanding only (image in → text out). It is unrelated to the existing sharpi-cli image text→image generation (Z-Image-Turbo, SharpInference.Diffusion).

What's included

New SharpInference.Vision project
- VisionModel: loads/validates the gemma4uv mmproj GGUF
- GemmaUvVisionEmbedder: im2col(48) → LayerNorm×3 → linear → RMSNorm → bf16 input_projection; parity vs a numpy oracle (cosine > 0.9995)
- ImageIO: minimal PNG decoder (no imaging dependency, AOT-safe)
- ImagePreprocessor: calc_size_preserved_ratio + align-corners bilinear, mirroring mtmd-image.cpp

Decoder seam (IForwardPass.ForwardEmbedding)
- ForwardPass (CPU) and CudaForwardPass (CUDA full offload); gated by SupportsEmbeddingInput (default-throw elsewhere)

CLI
- --image (repeatable) + --mmproj
- Reference each image with an <image> marker in -p (left-to-right), or omit markers to prepend them
- Prompt rendered through the model's own chat template; each <|image|> placeholder (258880) expanded with its soft tokens wrapped in <|image>(255999) … <image|>(258882)
- Works with -g 0 (CPU) and -g -1 (CUDA)

Tooling
- scripts/gemma4uv_ref.py numpy oracle (golden fixtures in tests/fixtures/gemma4uv/, .f32 gitignored)
- download-model.ps1 now pulls the mmproj alongside the main GGUF

Verification (4070 Ti)
- -g 0, red image
- -g -1, red image
- two-image marker prompt (first <image> … second <image>)

Notes / findings
- Gemma 4 uses <|turn>/<turn|> with a <|image|> placeholder, not Gemma 3's <start_of_turn>. Early garbage output was this format bug, not missing bidirectional attention; causal attention is sufficient.
- … 0xC0000409), so the numpy projector oracle + observed correctness is the validation backbone.

Follow-ups (out of scope)
- ForwardEmbedding on partial-offload hybrids and the Vulkan backend
- Audio (gemma4ua)
- … docs/ update pending)

🤖 Generated with Claude Code
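The placeholder expansion listed under CLI can be sketched as a pass over the token ids. The three ids are the ones quoted in this description; everything else is an assumption for illustration: the class and method names, the -1 sentinel standing in for a position that is later filled by a soft-token embedding, and the choice to replace the placeholder rather than keep it.

```csharp
using System;
using System.Collections.Generic;

public static class ImagePlaceholderSketch
{
    // Token ids quoted in the PR description.
    public const int Placeholder = 258880; // <|image|>
    public const int ImgOpen = 255999;     // <|image>
    public const int ImgClose = 258882;    // <image|>

    // Hypothetical sentinel: a position to be fed via ForwardEmbedding
    // rather than a vocabulary lookup.
    public const int SoftSlot = -1;

    // softTokenCounts[i] is the number of soft tokens for the i-th image,
    // matched to placeholders left-to-right.
    public static List<int> Expand(IReadOnlyList<int> tokens, IReadOnlyList<int> softTokenCounts)
    {
        var result = new List<int>(tokens.Count);
        int image = 0;
        foreach (int t in tokens)
        {
            if (t != Placeholder) { result.Add(t); continue; }
            if (image >= softTokenCounts.Count)
                throw new ArgumentException("More <|image|> placeholders than images.");

            result.Add(ImgOpen);
            for (int i = 0; i < softTokenCounts[image]; i++) result.Add(SoftSlot);
            result.Add(ImgClose);
            image++;
        }
        if (image != softTokenCounts.Count)
            throw new ArgumentException("Fewer <|image|> placeholders than images.");
        return result;
    }
}
```

A decode loop built on this would call the ordinary token forward for non-negative ids and the embedding forward for each SoftSlot, consuming the image's soft tokens in order.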