feat(vision): encoder-free Gemma 4 image input (CPU + CUDA) (#250) by pekkah · Pull Request #251 · pekkah/SharpInference

pekkah · 2026-06-15T07:33:34Z

Summary

Adds image input for Gemma 4 12B via the encoder-free "gemma4uv" projector — raw image patches projected straight into the LLM embedding space, no SigLIP ViT. Closes #250 (CPU image→text scope; GPU full-offload also included).

This is image understanding only (image in → text out). It is unrelated to the existing sharpi-cli image text→image generation (Z-Image-Turbo, SharpInference.Diffusion).

What's included

New SharpInference.Vision project

VisionModel — loads/validates the gemma4uv mmproj GGUF
GemmaUvVisionEmbedder — im2col(48) → LayerNorm×3 → linear → RMSNorm → bf16 input_projection; parity vs a numpy oracle (cosine > 0.9995)
ImageIO — minimal PNG decoder (no imaging dependency, AOT-safe)
ImagePreprocessor — calc_size_preserved_ratio + align-corners bilinear, mirroring mtmd-image.cpp
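
The `im2col(48)` step can be pictured as cutting a planar CHW image into non-overlapping 48×48 tiles and flattening each tile into one row. The following is a minimal pure-Python sketch, not the C# implementation: the function name `im2col_patches` is hypothetical, and the channel-major ordering inside each row is an assumption, since the PR does not state the layout.

```python
def im2col_patches(chw, height, width, patch=48, channels=3):
    """Flatten non-overlapping patch x patch tiles of a planar CHW image.

    chw is a flat sequence of length channels * height * width.
    Returns one row per tile, tiles visited top-to-bottom, left-to-right.
    The channel-major order inside a row is an illustrative assumption.
    """
    if height % patch != 0 or width % patch != 0:
        raise ValueError(f"image dims ({width}x{height}) must be multiples of {patch}.")
    if len(chw) < channels * height * width:
        raise ValueError(f"chw length ({len(chw)}) is too small for dims {width}x{height}.")

    rows = []
    for py in range(0, height, patch):
        for px in range(0, width, patch):
            row = []
            for c in range(channels):
                plane = c * height * width  # start of channel c's plane
                for y in range(py, py + patch):
                    start = plane + y * width + px
                    row.extend(chw[start:start + patch])
            rows.append(row)
    return rows
```

Under these assumptions a 96×48 RGB image yields two rows of 3 × 48 × 48 = 6912 values each, which would then feed the normalization and linear stages.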

Decoder seam (IForwardPass.ForwardEmbedding)

Injects a precomputed embedding: skips the sqrt(d) embedding scale (raw embeddings arrive final, gemma4.cpp:182) and builds PLE from the padding token (id 0)
Implemented on both ForwardPass (CPU) and CudaForwardPass (CUDA full offload); gated by SupportsEmbeddingInput (default-throw elsewhere)
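
To make the contrast between the two input paths concrete, here is a small Python sketch. It is an illustration only: `hidden_input` and its return shape are hypothetical, and the real seam lives in the C# `ForwardEmbedding` methods. A token id is looked up and scaled by sqrt(d); a precomputed embedding is passed through unscaled and paired with the padding token (id 0) for PLE.

```python
import math

PAD_TOKEN_ID = 0  # per the PR, PLE for injected embeddings uses token id 0


def hidden_input(entry, embedding_table):
    """Return (hidden_vector, ple_token_id) for one sequence position.

    entry is either an int token id (text path) or a list of floats
    (precomputed soft token, image path). Names are hypothetical.
    """
    if isinstance(entry, int):
        row = embedding_table[entry]
        scale = math.sqrt(len(row))  # sqrt(d) embedding scale, text path only
        return [v * scale for v in row], entry
    # Image path: the projector output is already final, so no scale.
    return list(entry), PAD_TOKEN_ID
```

With a 4-wide table row of `0.5` values, the text path returns `1.0` values (scale 2), while an injected vector comes back unchanged.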

CLI

--image (repeatable) + --mmproj
Reference each image with an <image> marker in -p (left-to-right), or omit markers to prepend them
Prompt rendered through the model's own chat template; each <|image|> placeholder (258880) expanded with its soft tokens wrapped in <|image>(255999) … <image|>(258882)
Works under -g 0 (CPU) and -g -1 (CUDA)
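
The placeholder expansion described above can be sketched as a single pass over the rendered token ids. This is a hedged Python illustration rather than the CLI's code: `splice_images` and the `("tok", id)` / `("emb", vector)` tagging are hypothetical, while the three token ids are the ones quoted in this PR.

```python
IMAGE_PLACEHOLDER = 258880  # <|image|>
IMAGE_OPEN = 255999         # <|image>
IMAGE_CLOSE = 258882        # <image|>


def splice_images(token_ids, soft_tokens_per_image):
    """Expand each placeholder into open + soft tokens + close, left to right.

    soft_tokens_per_image holds one list of embedding vectors per image.
    Returns a list of ("tok", id) and ("emb", vector) entries.
    """
    out = []
    images = iter(soft_tokens_per_image)
    for tid in token_ids:
        if tid != IMAGE_PLACEHOLDER:
            out.append(("tok", tid))
            continue
        try:
            soft = next(images)
        except StopIteration:
            raise ValueError("more <|image|> placeholders than images") from None
        out.append(("tok", IMAGE_OPEN))
        out.extend(("emb", vec) for vec in soft)
        out.append(("tok", IMAGE_CLOSE))
    if next(images, None) is not None:
        raise ValueError("more images than <|image|> placeholders")
    return out
```

Each `("emb", …)` entry would go through the embedding-input path and each `("tok", …)` entry through the normal token path, which is how the multi-image order is preserved.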

Tooling

scripts/gemma4uv_ref.py numpy oracle (golden fixtures in tests/fixtures/gemma4uv/, .f32 gitignored)
download-model.ps1 now pulls the mmproj alongside the main GGUF
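
The parity criterion quoted for the oracle (cosine > 0.9995) amounts to a cosine-similarity check between the embedder output and the golden vector. A minimal standard-library sketch follows; the function names are hypothetical, and the actual script uses numpy and may compare per soft token rather than per flat vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def passes_parity(actual, golden, threshold=0.9995):
    """True when the output is within the PR's quoted parity threshold."""
    return cosine_similarity(actual, golden) > threshold
```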

Verification (4070 Ti)

| Path | Result |
| --- | --- |
| CPU `-g 0`, red image | "Red", logit 26.08, clean `<turn` |
| CUDA `-g -1`, red image | "Red", logit 25.87 (faithful to CPU), ~50 t/s prefill |
| CUDA multi-image (`first <image> … second <image>`) | "red and blue", in the correct order |

Notes / findings

The decisive fix: Gemma 4's turn format is <|turn>/<turn|> with a <|image|> placeholder — not Gemma 3's <start_of_turn>. Early garbage output was this format bug, not missing bidirectional attention; causal attention is sufficient.
The current llama.cpp build cannot run this brand-new gemma4 mmproj (crashes 0xC0000409), so the numpy projector oracle + observed correctness is the validation backbone.

Follow-ups (out of scope)

ForwardEmbedding on partial-offload hybrids and the Vulkan backend
Audio (gemma4ua)
Server image endpoint (content-blocks → same splice)
Docs: design-doc gemma4uv section (this PR's docs/ update pending)

🤖 Generated with Claude Code

Adds CPU image→text for Gemma 4 12B via the encoder-free "gemma4uv" projector: raw image patches projected straight into the LLM embedding space, no SigLIP ViT.

New SharpInference.Vision project:
- VisionModel: loads/validates the gemma4uv mmproj GGUF
- GemmaUvVisionEmbedder: im2col(48) -> LN×3 -> linear -> RMSNorm -> bf16 input_projection; parity vs a numpy oracle (cosine > 0.9995)
- ImageIO: minimal PNG decoder; ImagePreprocessor: calc_size_preserved_ratio + align-corners bilinear, mirroring mtmd-image.cpp

Decoder seam: ForwardPass.ForwardEmbedding injects a precomputed embedding (no sqrt(d) scale per gemma4.cpp; PLE via padding token 0), gated by IForwardPass.SupportsEmbeddingInput (default-throw elsewhere; CPU only).

CLI: --image (repeatable) + --mmproj. Images are referenced by <image> markers in -p (left-to-right) or prepended when omitted; the prompt is rendered through the model's own chat template and each <|image|> placeholder (258880) is expanded with its soft tokens wrapped in <|image>(255999) … <image|>(258882).

Verified end to end: a solid-red image decodes "Red" (logit 26 vs 19), then stops cleanly on <turn|>. The earlier garbage output was a prompt-format bug: Gemma 4 uses <|turn>/<turn|>, not Gemma 3's <start_of_turn>; causal attention is sufficient (no bidirectional image span needed).

download-model.ps1 also pulls the mmproj alongside the main GGUF. GPU (CUDA/Vulkan) ForwardEmbedding, audio, and the server path remain follow-ups.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Implements SupportsEmbeddingInput/ForwardEmbedding on CudaForwardPass, the GPU mirror of the CPU seam. Uploads the precomputed vision soft token into the device hidden buffer, skips the sqrt(d) embedding scale (raw embeddings arrive final, gemma4.cpp:182), builds PLE from the padding token (id 0), and runs the same device region (and CUDA graph) as text decode. Gemma 4 only; partial-offload hybrids and Vulkan still throw.

CLI: --image now works under full CUDA offload (-g -1). RunImagePrompt routes through IForwardPass (gpuFwd ?? cpu fwd); the CPU-only gate is dropped and replaced by a backend-aware SupportsEmbeddingInput check with an actionable message for unsupported paths.

Verified on a 4070 Ti: the red image decodes "Red" at logit 25.87 (CPU was 26.08, so faithful), and a two-image marker prompt ("first <image> … second <image>") answers "red and blue" in order, at ~50 t/s prefill.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request introduces multimodal vision support for Gemma 4 models by implementing the encoder-free gemma4uv image projector. It adds a new SharpInference.Vision project containing a custom PNG decoder, an image preprocessor, and the GemmaUvVisionEmbedder to project images into soft-token embeddings, while updating the CLI and forward pass backends to support splicing these embeddings. The review feedback highlights several improvement opportunities, including addressing potential integer overflow and validating image dimensions in the PNG decoder, optimizing bilinear interpolation in the preprocessor, validating the input buffer length in the vision embedder, catching more specific file I/O exceptions, and expanding special token compatibility for other Gemma versions.

gemini-code-assist · 2026-06-15T07:36:47Z

+            if (dataOff + clen + 4 > len)
+                throw new InvalidDataException($"Truncated PNG chunk '{type}'.");


Potential integer overflow and out-of-bounds read/crash when validating chunk length clen in the PNG decoder. If clen is negative or extremely large, dataOff + clen + 4 can overflow and bypass the bounds check. Using clen > len - pos - 12 is safe from overflow.

if (clen < 0 || clen > len - pos - 12) throw new InvalidDataException($"Truncated or invalid PNG chunk '{type}' with length {clen}.");
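
Why the additive form is fragile can be shown by simulating 32-bit signed wraparound. This Python sketch is illustrative only: the helper names are hypothetical, and it assumes `dataOff = pos + 8` (4-byte length plus 4-byte type) so that a full chunk occupies `12 + clen` bytes including the CRC.

```python
def to_int32(value):
    """Wrap an integer the way unchecked 32-bit signed arithmetic would."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def naive_check_rejects(data_off, clen, length):
    """Original form: data_off + clen + 4 > length, with int32 wraparound."""
    return to_int32(data_off + clen + 4) > length


def safe_check_rejects(pos, clen, length):
    """Suggested form: compare clen with the bytes remaining, never add to it."""
    return clen < 0 or clen > length - pos - 12
```

For a 100-byte buffer with `pos = 8` and `clen = 0x7FFFFFF0`, the naive sum wraps negative and the chunk is accepted, while the subtractive check rejects it; a negative `clen` is likewise only caught by the second form.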

gemini-code-assist · 2026-06-15T07:36:47Z

+                case "IHDR":
+                    w = ReadUInt32BE(png, dataOff);
+                    h = ReadUInt32BE(png, dataOff + 4);
+                    bitDepth = png[dataOff + 8];
+                    colorType = png[dataOff + 9];
+                    interlace = png[dataOff + 12];
+                    seenIhdr = true;


Malformed PNG files with 0, negative, or extremely large dimensions can cause division-by-zero, NaN propagation, and negative array allocations in downstream preprocessing, or out-of-memory denial of service. Adding a validation check for dimensions prevents these issues.

case "IHDR":
    w = ReadUInt32BE(png, dataOff);
    h = ReadUInt32BE(png, dataOff + 4);
    if (w <= 0 || h <= 0 || w > 16384 || h > 16384)
        throw new InvalidDataException($"Invalid or unsupported PNG dimensions: {w}x{h}.");
    bitDepth = png[dataOff + 8];
    colorType = png[dataOff + 9];
    interlace = png[dataOff + 12];
    seenIhdr = true;
    break;
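
A dimension guard of this kind can be sketched in Python with `struct`. This is an assumption-laden illustration, not the decoder's code: `read_ihdr` is a hypothetical name, and the pixel budget follows the `(long)w*h > 64M` guard that a later commit in this PR mentions, with "64M" read here as 64 × 2^20 pixels.

```python
import struct

MAX_PIXELS = 64 * 1024 * 1024  # assumption: "64M" interpreted as 64 * 2**20


def read_ihdr(chunk_data):
    """Parse the 13-byte IHDR payload and reject unusable dimensions."""
    if len(chunk_data) < 13:
        raise ValueError("IHDR chunk must be 13 bytes")
    # Big-endian: width, height, bit depth, colour type,
    # compression method, filter method, interlace method.
    width, height, bit_depth, color_type, _comp, _filt, interlace = struct.unpack(
        ">IIBBBBB", chunk_data[:13]
    )
    # Python ints do not overflow, so width * height is safe to compare.
    if width == 0 or height == 0 or width * height > MAX_PIXELS:
        raise ValueError(f"Invalid or unsupported PNG dimensions: {width}x{height}.")
    return width, height, bit_depth, color_type, interlace
```

Bounding the product rather than each side still admits very wide or very tall images while capping the allocation size.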

gemini-code-assist · 2026-06-15T07:36:47Z

+            for (int x = 0; x < w; x++)
+            {
+                float px = x * xRatio;
+                int x0 = Math.Min((int)px, srcWidth - 1);
+                int x1 = Math.Min(x0 + 1, srcWidth - 1);
+                float xf = px - x0;
+
+                int i00 = (y0 * srcWidth + x0) * 3, i10 = (y0 * srcWidth + x1) * 3;
+                int i01 = (y1 * srcWidth + x0) * 3, i11 = (y1 * srcWidth + x1) * 3;


Avoid redundant multiplications in the inner loop of bilinear interpolation. y0 * srcWidth and y1 * srcWidth can be calculated once in the outer loop to improve performance.

int y0Offset = y0 * srcWidth;
int y1Offset = y1 * srcWidth;
for (int x = 0; x < w; x++)
{
    float px = x * xRatio;
    int x0 = Math.Min((int)px, srcWidth - 1);
    int x1 = Math.Min(x0 + 1, srcWidth - 1);
    float xf = px - x0;

    int i00 = (y0Offset + x0) * 3, i10 = (y0Offset + x1) * 3;
    int i01 = (y1Offset + x0) * 3, i11 = (y1Offset + x1) * 3;
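
For context, a complete align-corners bilinear resize with the row offsets hoisted might look like the following Python sketch. It is not the `ImagePreprocessor` code: the function name is hypothetical, and the ratio definition `(src - 1) / (dst - 1)` is the usual align-corners convention, assumed here because the diff does not show how `xRatio` is computed. The interleaved RGB layout (three values per pixel) matches the `* 3` indexing in the diff.

```python
def resize_bilinear_align_corners(src, src_w, src_h, dst_w, dst_h):
    """Resize an interleaved RGB float image (row-major, 3 values per pixel)."""
    x_ratio = (src_w - 1) / (dst_w - 1) if dst_w > 1 else 0.0
    y_ratio = (src_h - 1) / (dst_h - 1) if dst_h > 1 else 0.0
    dst = [0.0] * (dst_w * dst_h * 3)
    for y in range(dst_h):
        py = y * y_ratio
        y0 = min(int(py), src_h - 1)
        y1 = min(y0 + 1, src_h - 1)
        yf = py - y0
        # Row offsets depend only on y, so compute them once per row.
        y0_off = y0 * src_w
        y1_off = y1 * src_w
        for x in range(dst_w):
            px = x * x_ratio
            x0 = min(int(px), src_w - 1)
            x1 = min(x0 + 1, src_w - 1)
            xf = px - x0
            i00, i10 = (y0_off + x0) * 3, (y0_off + x1) * 3
            i01, i11 = (y1_off + x0) * 3, (y1_off + x1) * 3
            o = (y * dst_w + x) * 3
            for c in range(3):
                top = src[i00 + c] * (1 - xf) + src[i10 + c] * xf
                bottom = src[i01 + c] * (1 - xf) + src[i11 + c] * xf
                dst[o + c] = top * (1 - yf) + bottom * yf
    return dst
```

Stretching a two-pixel row from black to a value of 10 across three output pixels gives 0, 5 and 10, so both endpoints map exactly onto the source corners.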

gemini-code-assist · 2026-06-15T07:36:47Z

+        if (width % _patch != 0 || height % _patch != 0)
+            throw new ArgumentException($"image dims ({width}x{height}) must be multiples of {_patch}.");


Validate chw.Length before accessing it. If the input span is smaller than 3 * height * width, it will throw an unhandled IndexOutOfRangeException.

if (width % _patch != 0 || height % _patch != 0)
    throw new ArgumentException($"image dims ({width}x{height}) must be multiples of {_patch}.");
if (chw.Length < 3 * height * width)
    throw new ArgumentException($"chw length ({chw.Length}) is too small for dims {width}x{height}.");

gemini-code-assist · 2026-06-15T07:36:47Z

+            {
+                rgb = ImageIO.LoadRgb(imagePaths[i], out srcW, out srcH);
+            }
+            catch (Exception ex) when (ex is IOException or NotSupportedException or InvalidDataException)


Catch specific exceptions such as UnauthorizedAccessException and SecurityException in addition to IOException to ensure precise error handling and graceful application exits when permission issues occur.

catch (Exception ex) when (ex is IOException or NotSupportedException or InvalidDataException or UnauthorizedAccessException or System.Security.SecurityException)

References

When performing file I/O operations, catch specific exceptions (such as IOException, UnauthorizedAccessException, SecurityException, and NotSupportedException) instead of a generic Exception to ensure precise error handling and graceful application exits.

gemini-code-assist · 2026-06-15T07:36:48Z

+        int imgOpen = tok.SpecialTokens.TryGetValue("<|image>", out var o) ? o : 255999;
+        int imgClose = tok.SpecialTokens.TryGetValue("<image|>", out var c) ? c : 258882;


Support both Gemma 4's <|image|>/<image|> and Gemma 3's <img>/<end_of_image> special tokens to ensure compatibility across different Gemma multimodal model versions.

int imgOpen = tok.SpecialTokens.TryGetValue("<|image|>", out var o) ? o : tok.SpecialTokens.TryGetValue("<img>", out var o2) ? o2 : 255999;
int imgClose = tok.SpecialTokens.TryGetValue("<image|>", out var c) ? c : tok.SpecialTokens.TryGetValue("<end_of_image>", out var c2) ? c2 : 258882;

…ract (#250)

Addresses PR #251 review (cycle 1):
- ImageIO: reject negative chunk lengths and non-positive/absurd IHDR dimensions as InvalidDataException (signed ReadUInt32BE could wrap past the bounds check into an alloc crash that escaped the CLI's catch filter)
- VisionModel.Open: validate the two raw-pointer tensor dtypes (v.patch_embd.weight=F32, mm.input_projection.weight=BF16) so a divergent mmproj export fails loudly instead of misreading bytes into garbage
- IForwardPass.ForwardEmbedding: document the PLE-token-0 contract alongside the sqrt(d)-skip so future backend mirrors stay faithful
- tests: malformed-PNG regression (bad signature, zero dimensions)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…le 2)

Belt-and-suspenders: (long)dataOff + clen + 4 so a chunk length near int.MaxValue (only reachable with a ~2 GB in-memory PNG) can't wrap negative and defeat the truncation check. Closes the last review nit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

#250/#126)

- docs/SharpInference-Design.md: new §8.3 documenting the shipped 12B encoder-free gemma4uv image path (projector forward, ForwardEmbedding splice, CLI prompt wiring), with an explicit note that E4B is a different, unsupported architecture.
- docs/gemma4-e4b-vision-plan.md: retire the verification debt. Dumped the real E4B mmproj header; it is NOT MobileNet-V5/gemma3n as the plan assumed. E4B ships a gemma4v transformer ViT vision encoder (16 blocks, 768-dim, 12 heads, QK-norm, GeGLU, conv patch-embed) + a gemma4a audio encoder. The plan's conclusion (encoder-full, separate from the 12B) holds; the specifics are corrected.
- download-model.ps1: gemma4-e4b-qat now also pulls the ~992 MB mmproj, with accurate labeling (gemma4v ViT + gemma4a, not the 12B gemma4uv).

Confirms Gemma 4 E4B and 12B use distinct vision architectures; #250's path covers only the 12B.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…oist (#250)

PR #251 Gemini code-assist findings:
- GemmaUvVisionEmbedder.Forward: validate chw.Length >= 3*h*w before reading (clear ArgumentException instead of IndexOutOfRange on a short span)
- ImagePreprocessor: hoist y0*srcWidth / y1*srcWidth out of the bilinear inner loop
- RunImagePrompt: broaden the image-read catch to UnauthorizedAccessException / SecurityException, matching the prompt-file read path

The two security-high PNG findings (chunk-length overflow, IHDR dimension bounds) were already fixed in the earlier review cycles ((long) chunk math + the (long)w*h > 64M dimension guard). The suggestion to add Gemma 3 <img>/<end_of_image> token fallbacks was declined: this path is Gemma 4-only (arch-gated) and Gemma 3 uses a different encoder-full vision architecture the splice can't serve.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

pekkah and others added 2 commits June 15, 2026 09:41

gemini-code-assist Bot reviewed Jun 15, 2026

pekkah and others added 3 commits June 15, 2026 10:40

pekkah mentioned this pull request Jun 15, 2026

Add Gemma 4 E4B multimodal (vision / image input) support #126

Open

7 tasks

pekkah merged commit 0a1ff40 into master Jun 15, 2026
1 check passed

pekkah deleted the feat/gemma4-vision-250 branch June 15, 2026 08:31

pekkah merged 6 commits into master from feat/gemma4-vision-250
