feat(vision): encoder-free Gemma 4 image input (CPU + CUDA) (#250) #251

Merged: pekkah merged 6 commits into master from feat/gemma4-vision-250 on Jun 15, 2026

pekkah (Owner) commented Jun 15, 2026

Summary

Adds image input for Gemma 4 12B via the encoder-free "gemma4uv" projector — raw image patches projected straight into the LLM embedding space, no SigLIP ViT. Closes #250 (CPU image→text scope; GPU full-offload also included).

This is image understanding only (image in → text out). It is unrelated to the existing sharpi-cli image text→image generation (Z-Image-Turbo, SharpInference.Diffusion).

What's included

New SharpInference.Vision project

  • VisionModel: loads/validates the gemma4uv mmproj GGUF
  • GemmaUvVisionEmbedder: im2col(48) → LayerNorm×3 → linear → RMSNorm → bf16 input_projection; parity vs a numpy oracle (cosine > 0.9995)
  • ImageIO: minimal PNG decoder (no imaging dependency, AOT-safe)
  • ImagePreprocessor: calc_size_preserved_ratio + align-corners bilinear, mirroring mtmd-image.cpp
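
The first stage of the projector, im2col, can be pictured with a minimal sketch. It assumes a planar CHW float image, non-overlapping square patches, and a (channel, row, column) ordering inside each patch; the helper name Im2ColPatches and that ordering are hypothetical and may not match GemmaUvVisionEmbedder or the mmproj weight layout.

```csharp
using System;

// Hypothetical helper (not the PR's code): split a planar CHW image into
// non-overlapping patch x patch tiles. Each tile becomes one row of length
// 3 * patch * patch, ordered channel-major, then row, then column (assumed).
static float[][] Im2ColPatches(float[] chw, int width, int height, int patch)
{
    if (patch <= 0 || width % patch != 0 || height % patch != 0)
        throw new ArgumentException($"image dims ({width}x{height}) must be multiples of {patch}.");
    if (chw.Length < 3 * height * width)
        throw new ArgumentException($"chw length ({chw.Length}) is too small for dims {width}x{height}.");

    int cols = width / patch, rows = height / patch;
    var patches = new float[rows * cols][];
    for (int pr = 0; pr < rows; pr++)
    for (int pc = 0; pc < cols; pc++)
    {
        var row = new float[3 * patch * patch];
        int k = 0;
        for (int c = 0; c < 3; c++)
        for (int y = 0; y < patch; y++)
        for (int x = 0; x < patch; x++)
        {
            int srcY = pr * patch + y, srcX = pc * patch + x;
            row[k++] = chw[(c * height + srcY) * width + srcX];
        }
        patches[pr * cols + pc] = row; // patches are emitted in raster order
    }
    return patches;
}

// Usage: a 96x96 image with 48-pixel patches yields a 2x2 grid of patches.
var demoImage = new float[3 * 96 * 96];
var demoTiles = Im2ColPatches(demoImage, 96, 96, 48);
Console.WriteLine($"{demoTiles.Length} patches of length {demoTiles[0].Length}");
```

Each resulting row would then pass through the normalization and linear stages listed above to become one soft token.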

Decoder seam (IForwardPass.ForwardEmbedding)

  • Injects a precomputed embedding: skips the sqrt(d) embedding scale (raw embeddings arrive final, gemma4.cpp:182) and builds PLE from the padding token (id 0)
  • Implemented on both ForwardPass (CPU) and CudaForwardPass (CUDA full offload); gated by SupportsEmbeddingInput (default-throw elsewhere)
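
The difference between the two input paths can be sketched as follows. This is an illustration under assumptions, not the ForwardPass code: HiddenFromToken and HiddenFromEmbedding are hypothetical names, and the embedding table is modeled as a plain jagged array.

```csharp
using System;

// Text path (assumed shape): look up the token's embedding row, then apply
// the sqrt(d) embedding scale before the first decoder layer.
static float[] HiddenFromToken(float[][] embeddingTable, int tokenId)
{
    float[] row = embeddingTable[tokenId];
    float scale = MathF.Sqrt(row.Length);
    var hidden = new float[row.Length];
    for (int i = 0; i < row.Length; i++) hidden[i] = row[i] * scale;
    return hidden;
}

// Image path: the soft token arrives in final form, so it is copied as-is
// with no sqrt(d) scale. Per the PR, the per-layer embedding (PLE) for such
// a position is built from the padding token (id 0); that step is omitted.
static float[] HiddenFromEmbedding(float[] softToken, int hiddenDim)
{
    if (softToken.Length != hiddenDim)
        throw new ArgumentException($"expected {hiddenDim} values, got {softToken.Length}.");
    return (float[])softToken.Clone();
}

// Usage: the same four values give different hidden states on each path.
var demoTable = new[] { new float[] { 1f, 1f, 1f, 1f } };
Console.WriteLine(string.Join(", ", HiddenFromToken(demoTable, 0)));
Console.WriteLine(string.Join(", ", HiddenFromEmbedding(demoTable[0], 4)));
```

Applying the scale to an already-final embedding would inflate it by sqrt(d), which is why the seam needs a separate entry point rather than reusing the token path.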

CLI

  • --image (repeatable) + --mmproj
  • Reference each image with an <image> marker in -p (left-to-right), or omit markers to prepend them
  • Prompt rendered through the model's own chat template; each <|image|> placeholder (258880) expanded with its soft tokens wrapped in <|image>(255999) … <image|>(258882)
  • Works under -g 0 (CPU) and -g -1 (CUDA)
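
The placeholder expansion described above might look like the sketch below. The token ids come from this PR; the function name ExpandImagePlaceholders and the use of -1 as a stand-in for positions fed by soft-token embeddings are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: replace each placeholder id with
// open, N soft slots, close. Soft slots carry no token id (they are fed as
// embeddings), so a sentinel value marks their positions here.
static List<int> ExpandImagePlaceholders(
    IReadOnlyList<int> promptIds, IReadOnlyList<int> softTokenCounts,
    int placeholderId, int openId, int closeId, int softSlot = -1)
{
    var expanded = new List<int>();
    int image = 0;
    foreach (int id in promptIds)
    {
        if (id != placeholderId) { expanded.Add(id); continue; }
        if (image >= softTokenCounts.Count)
            throw new ArgumentException("More image placeholders than images.");
        expanded.Add(openId);
        for (int i = 0; i < softTokenCounts[image]; i++) expanded.Add(softSlot);
        expanded.Add(closeId);
        image++; // images are consumed left to right
    }
    if (image != softTokenCounts.Count)
        throw new ArgumentException("Fewer image placeholders than images.");
    return expanded;
}

// Usage: one image with two soft tokens between text tokens 10 and 11.
var demoIds = ExpandImagePlaceholders(
    new[] { 10, 258880, 11 }, new[] { 2 },
    placeholderId: 258880, openId: 255999, closeId: 258882);
Console.WriteLine(string.Join(" ", demoIds)); // 10 255999 -1 -1 258882 11
```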

Tooling

  • scripts/gemma4uv_ref.py numpy oracle (golden fixtures in tests/fixtures/gemma4uv/, .f32 gitignored)
  • download-model.ps1 now pulls the mmproj alongside the main GGUF
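
The parity criterion against the numpy oracle (cosine > 0.9995) reduces to a cosine similarity between two float vectors. A minimal sketch, assuming both the embedder output and the golden fixture are already loaded as float arrays; CosineSimilarity is a hypothetical helper, not the test code from this PR.

```csharp
using System;

// Cosine similarity accumulated in double to limit rounding error on long vectors.
static double CosineSimilarity(float[] a, float[] b)
{
    if (a.Length != b.Length || a.Length == 0)
        throw new ArgumentException("Vectors must be non-empty and of equal length.");
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += (double)a[i] * b[i];
        normA += (double)a[i] * a[i];
        normB += (double)b[i] * b[i];
    }
    if (normA == 0 || normB == 0)
        throw new ArgumentException("Cosine similarity is undefined for a zero vector.");
    return dot / Math.Sqrt(normA * normB);
}

// Usage: compare an output vector with its reference and apply the threshold.
float[] demoActual = { 0.10f, -0.20f, 0.30f };
float[] demoGolden = { 0.10f, -0.20f, 0.31f };
double demoCos = CosineSimilarity(demoActual, demoGolden);
Console.WriteLine(demoCos > 0.9995 ? "parity ok" : "parity FAILED");
```

Cosine similarity ignores overall magnitude, so a scale error (for example, a stray sqrt(d) factor) would pass this check; an absolute or relative error bound would be needed to catch that.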

Verification (4070 Ti)

Path | Result
CPU -g 0, red image | "Red", logit 26.08, clean <turn|> stop
CUDA -g -1, red image | "Red", logit 25.87 (faithful to CPU), ~50 t/s prefill
CUDA multi-image (first <image> … second <image>) | "red and blue", in the correct order

Notes / findings

  • The decisive fix: Gemma 4's turn format is <|turn>/<turn|> with a <|image|> placeholder — not Gemma 3's <start_of_turn>. Early garbage output was this format bug, not missing bidirectional attention; causal attention is sufficient.
  • The current llama.cpp build cannot run this brand-new gemma4 mmproj (crashes 0xC0000409), so the numpy projector oracle + observed correctness is the validation backbone.

Follow-ups (out of scope)

  • ForwardEmbedding on partial-offload hybrids and the Vulkan backend
  • Audio (gemma4ua)
  • Server image endpoint (content-blocks → same splice)
  • Docs: design-doc gemma4uv section (this PR's docs/ update pending)

🤖 Generated with Claude Code

pekkah and others added 2 commits June 15, 2026 09:41
Adds CPU image→text for Gemma 4 12B via the encoder-free "gemma4uv"
projector — raw image patches projected straight into the LLM embedding
space, no SigLIP ViT.

New SharpInference.Vision project:
- VisionModel: loads/validates the gemma4uv mmproj GGUF
- GemmaUvVisionEmbedder: im2col(48) -> LN×3 -> linear -> RMSNorm ->
  bf16 input_projection; parity vs a numpy oracle (cosine > 0.9995)
- ImageIO: minimal PNG decoder; ImagePreprocessor: calc_size_preserved_ratio
  + align-corners bilinear, mirroring mtmd-image.cpp

Decoder seam: ForwardPass.ForwardEmbedding injects a precomputed embedding
(no sqrt(d) scale per gemma4.cpp; PLE via padding token 0), gated by
IForwardPass.SupportsEmbeddingInput (default-throw elsewhere — CPU only).

CLI: --image (repeatable) + --mmproj. Images are referenced by <image>
markers in -p (left-to-right) or prepended when omitted; the prompt is
rendered through the model's own chat template and each <|image|>
placeholder (258880) is expanded with its soft tokens wrapped in
<|image>(255999) … <image|>(258882).

Verified end to end: a solid-red image decodes "Red" (logit 26 vs 19),
then stops cleanly on <turn|>. The earlier garbage output was a prompt-
format bug — Gemma 4 uses <|turn>/<turn|>, not Gemma 3's <start_of_turn>;
causal attention is sufficient (no bidirectional image span needed).

download-model.ps1 also pulls the mmproj alongside the main GGUF.

GPU (CUDA/Vulkan) ForwardEmbedding, audio, and the server path remain
follow-ups.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Implements SupportsEmbeddingInput/ForwardEmbedding on CudaForwardPass —
the GPU mirror of the CPU seam. Uploads the precomputed vision soft token
into the device hidden buffer, skips the sqrt(d) embedding scale (raw
embeddings arrive final, gemma4.cpp:182), builds PLE from the padding
token (id 0), and runs the same device region (and CUDA graph) as text
decode. Gemma 4 only; partial-offload hybrids and Vulkan still throw.

CLI: --image now works under full CUDA offload (-g -1). RunImagePrompt
routes through IForwardPass (gpuFwd ?? cpu fwd); the CPU-only gate is
dropped and replaced by a backend-aware SupportsEmbeddingInput check with
an actionable message for unsupported paths.

Verified on a 4070 Ti: the red image decodes "Red" at logit 25.87 (CPU
was 26.08 — faithful), and a two-image marker prompt ("first <image> …
second <image>") answers "red and blue" in order, at ~50 t/s prefill.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

gemini-code-assist (bot) left a comment

Code Review

This pull request introduces multimodal vision support for Gemma 4 models by implementing the encoder-free gemma4uv image projector. It adds a new SharpInference.Vision project containing a custom PNG decoder, an image preprocessor, and the GemmaUvVisionEmbedder to project images into soft-token embeddings, while updating the CLI and forward pass backends to support splicing these embeddings. The review feedback highlights several improvement opportunities, including addressing potential integer overflow and validating image dimensions in the PNG decoder, optimizing bilinear interpolation in the preprocessor, validating the input buffer length in the vision embedder, catching more specific file I/O exceptions, and expanding special token compatibility for other Gemma versions.

Comment thread src/SharpInference.Vision/ImageIO.cs Outdated
Comment on lines +46 to +47
if (dataOff + clen + 4 > len)
    throw new InvalidDataException($"Truncated PNG chunk '{type}'.");

security-high high

Potential integer overflow and out-of-bounds read/crash when validating chunk length clen in the PNG decoder. If clen is negative or extremely large, dataOff + clen + 4 can overflow and bypass the bounds check. Using clen > len - pos - 12 is safe from overflow.

            if (clen < 0 || clen > len - pos - 12)
                throw new InvalidDataException($"Truncated or invalid PNG chunk '{type}' with length {clen}.");
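
To see why the original check can be bypassed, the sketch below contrasts 32-bit arithmetic with the same sum widened to 64 bits. ChunkFitsNaive and ChunkFits are hypothetical names; they return true where the decoder would not throw.

```csharp
using System;

// Mirrors the original condition: the int sum wraps silently in an
// unchecked context, so a huge clen can produce a negative total.
static bool ChunkFitsNaive(int dataOff, int clen, int len) =>
    unchecked(dataOff + clen + 4) <= len;

// Rejects negative lengths and widens to long so the sum cannot wrap.
static bool ChunkFits(int dataOff, int clen, int len) =>
    clen >= 0 && (long)dataOff + clen + 4 <= len;

// Usage: a 1 KiB buffer with a chunk that claims int.MaxValue bytes.
Console.WriteLine(ChunkFitsNaive(8, int.MaxValue, 1024)); // True (wrongly accepted)
Console.WriteLine(ChunkFits(8, int.MaxValue, 1024));      // False
// A negative length also slips past the naive check.
Console.WriteLine(ChunkFitsNaive(8, -20, 1024));          // True (wrongly accepted)
Console.WriteLine(ChunkFits(8, -20, 1024));               // False
```

The reviewer's form, clen > len - pos - 12, reaches the same result by moving the arithmetic to the side of the comparison that cannot overflow.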

Comment on lines +51 to +57
case "IHDR":
    w = ReadUInt32BE(png, dataOff);
    h = ReadUInt32BE(png, dataOff + 4);
    bitDepth = png[dataOff + 8];
    colorType = png[dataOff + 9];
    interlace = png[dataOff + 12];
    seenIhdr = true;

security-high high

Malformed PNG files with 0, negative, or extremely large dimensions can cause division-by-zero, NaN propagation, and negative array allocations in downstream preprocessing, or out-of-memory denial of service. Adding a validation check for dimensions prevents these issues.

                case "IHDR":
                    w = ReadUInt32BE(png, dataOff);
                    h = ReadUInt32BE(png, dataOff + 4);
                    if (w <= 0 || h <= 0 || w > 16384 || h > 16384)
                        throw new InvalidDataException($"Invalid or unsupported PNG dimensions: {w}x{h}.");
                    bitDepth = png[dataOff + 8];
                    colorType = png[dataOff + 9];
                    interlace = png[dataOff + 12];
                    seenIhdr = true;
                    break;
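
An alternative to a fixed per-side limit is to read the fields as unsigned values and bound the pixel count. The sketch below is an assumption-laden illustration: ReadIhdrSize is a hypothetical helper, and the default cap of 64 * 1024 * 1024 pixels is a placeholder, not a value taken from ImageIO.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;

// Reads width and height from the 13-byte IHDR payload as unsigned
// big-endian values, then validates them before any allocation happens.
static (int Width, int Height) ReadIhdrSize(ReadOnlySpan<byte> ihdr, long maxPixels = 64L * 1024 * 1024)
{
    if (ihdr.Length < 13)
        throw new InvalidDataException("IHDR chunk is shorter than 13 bytes.");
    uint w = BinaryPrimitives.ReadUInt32BigEndian(ihdr);
    uint h = BinaryPrimitives.ReadUInt32BigEndian(ihdr.Slice(4));
    // Zero is invalid in PNG; values above int.MaxValue would turn negative
    // if stored in an int; the product is checked in 64-bit arithmetic.
    if (w == 0 || h == 0 || w > int.MaxValue || h > int.MaxValue || (long)w * h > maxPixels)
        throw new InvalidDataException($"Invalid or unsupported PNG dimensions: {w}x{h}.");
    return ((int)w, (int)h);
}

// Usage: a 2x3 image, bit depth 8, color type 2 (RGB), no interlace.
byte[] demoHeader = { 0, 0, 0, 2, 0, 0, 0, 3, 8, 2, 0, 0, 0 };
var (demoWidth, demoHeight) = ReadIhdrSize(demoHeader);
Console.WriteLine($"{demoWidth}x{demoHeight}");
```

Reading as uint avoids the sign wrap that a signed big-endian read can introduce for values at or above 0x80000000.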

Comment on lines +70 to +78
for (int x = 0; x < w; x++)
{
    float px = x * xRatio;
    int x0 = Math.Min((int)px, srcWidth - 1);
    int x1 = Math.Min(x0 + 1, srcWidth - 1);
    float xf = px - x0;

    int i00 = (y0 * srcWidth + x0) * 3, i10 = (y0 * srcWidth + x1) * 3;
    int i01 = (y1 * srcWidth + x0) * 3, i11 = (y1 * srcWidth + x1) * 3;

medium

Avoid redundant multiplications in the inner loop of bilinear interpolation. y0 * srcWidth and y1 * srcWidth can be calculated once in the outer loop to improve performance.

            int y0Offset = y0 * srcWidth;
            int y1Offset = y1 * srcWidth;
            for (int x = 0; x < w; x++)
            {
                float px = x * xRatio;
                int x0 = Math.Min((int)px, srcWidth - 1);
                int x1 = Math.Min(x0 + 1, srcWidth - 1);
                float xf = px - x0;

                int i00 = (y0Offset + x0) * 3, i10 = (y0Offset + x1) * 3;
                int i01 = (y1Offset + x0) * 3, i11 = (y1Offset + x1) * 3;
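
For context, a complete align-corners bilinear resize with the hoisted row offsets could look like the sketch below. It assumes interleaved RGB floats and the usual align-corners ratio (src - 1) / (dst - 1); ResizeBilinearAlignCorners is a hypothetical function and does not claim to reproduce ImagePreprocessor or mtmd-image.cpp exactly.

```csharp
using System;

static float[] ResizeBilinearAlignCorners(float[] src, int srcW, int srcH, int dstW, int dstH)
{
    if (srcW <= 0 || srcH <= 0 || dstW <= 0 || dstH <= 0)
        throw new ArgumentException("All dimensions must be positive.");
    if (src.Length < srcW * srcH * 3)
        throw new ArgumentException("Source buffer is too small for the given dimensions.");

    var dst = new float[dstW * dstH * 3];
    // Align-corners: the first and last output samples land exactly on the
    // first and last input samples.
    float xRatio = dstW > 1 ? (srcW - 1) / (float)(dstW - 1) : 0f;
    float yRatio = dstH > 1 ? (srcH - 1) / (float)(dstH - 1) : 0f;

    for (int y = 0; y < dstH; y++)
    {
        float py = y * yRatio;
        int y0 = Math.Min((int)py, srcH - 1);
        int y1 = Math.Min(y0 + 1, srcH - 1);
        float yf = py - y0;
        int y0Offset = y0 * srcW, y1Offset = y1 * srcW; // hoisted out of the x loop

        for (int x = 0; x < dstW; x++)
        {
            float px = x * xRatio;
            int x0 = Math.Min((int)px, srcW - 1);
            int x1 = Math.Min(x0 + 1, srcW - 1);
            float xf = px - x0;

            int i00 = (y0Offset + x0) * 3, i10 = (y0Offset + x1) * 3;
            int i01 = (y1Offset + x0) * 3, i11 = (y1Offset + x1) * 3;
            int o = (y * dstW + x) * 3;
            for (int c = 0; c < 3; c++)
            {
                float top = src[i00 + c] + (src[i10 + c] - src[i00 + c]) * xf;
                float bottom = src[i01 + c] + (src[i11 + c] - src[i01 + c]) * xf;
                dst[o + c] = top + (bottom - top) * yf;
            }
        }
    }
    return dst;
}

// Usage: widen a 2x1 gradient to 3x1; the middle pixel is interpolated.
float[] demoSrc = { 0f, 0f, 0f, 10f, 10f, 10f };
float[] demoDst = ResizeBilinearAlignCorners(demoSrc, 2, 1, 3, 1);
Console.WriteLine(string.Join(", ", demoDst));
```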

Comment on lines +81 to +82
if (width % _patch != 0 || height % _patch != 0)
    throw new ArgumentException($"image dims ({width}x{height}) must be multiples of {_patch}.");

medium

Validate chw.Length before accessing it. If the input span is smaller than 3 * height * width, it will throw an unhandled IndexOutOfRangeException.

        if (width % _patch != 0 || height % _patch != 0)
            throw new ArgumentException($"image dims ({width}x{height}) must be multiples of {_patch}.");
        if (chw.Length < 3 * height * width)
            throw new ArgumentException($"chw length ({chw.Length}) is too small for dims {width}x{height}.");

Comment thread src/SharpInference.Cli/RunCommand.cs Outdated
{
    rgb = ImageIO.LoadRgb(imagePaths[i], out srcW, out srcH);
}
catch (Exception ex) when (ex is IOException or NotSupportedException or InvalidDataException)

medium

Catch specific exceptions such as UnauthorizedAccessException and SecurityException in addition to IOException to ensure precise error handling and graceful application exits when permission issues occur.

            catch (Exception ex) when (ex is IOException or NotSupportedException or InvalidDataException or UnauthorizedAccessException or System.Security.SecurityException)
References
  1. When performing file I/O operations, catch specific exceptions (such as IOException, UnauthorizedAccessException, SecurityException, and NotSupportedException) instead of a generic Exception to ensure precise error handling and graceful application exits.
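
A self-contained sketch of the same exception filter, applied to a plain file read rather than ImageIO.LoadRgb. TryReadImageBytes is a hypothetical helper; the point is that only the listed exception types are turned into a user-facing message, while anything else still propagates.

```csharp
using System;
using System.IO;
using System.Security;

static bool TryReadImageBytes(string path, out byte[] bytes, out string error)
{
    try
    {
        bytes = File.ReadAllBytes(path);
        error = "";
        return true;
    }
    // The filter runs before the stack unwinds; unlisted exception types
    // (for example NullReferenceException) are not caught here.
    catch (Exception ex) when (ex is IOException or NotSupportedException or InvalidDataException
                                   or UnauthorizedAccessException or SecurityException)
    {
        bytes = Array.Empty<byte>();
        error = $"Cannot read image '{path}': {ex.Message}";
        return false;
    }
}

// Usage: a missing file produces a message instead of an unhandled exception.
string demoPath = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N") + ".png");
if (!TryReadImageBytes(demoPath, out _, out string demoError))
    Console.WriteLine(demoError);
```

FileNotFoundException and DirectoryNotFoundException derive from IOException, so they are already covered; UnauthorizedAccessException and SecurityException do not, which is why they have to be listed separately.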

Comment on lines +1190 to +1191
int imgOpen = tok.SpecialTokens.TryGetValue("<|image>", out var o) ? o : 255999;
int imgClose = tok.SpecialTokens.TryGetValue("<image|>", out var c) ? c : 258882;

medium

Support both Gemma 4's <|image>/<image|> and Gemma 3's <img>/<end_of_image> special tokens to ensure compatibility across different Gemma multimodal model versions.

        int imgOpen = tok.SpecialTokens.TryGetValue("<|image>", out var o) ? o :
                      tok.SpecialTokens.TryGetValue("<img>", out var o2) ? o2 : 255999;
        int imgClose = tok.SpecialTokens.TryGetValue("<image|>", out var c) ? c :
                       tok.SpecialTokens.TryGetValue("<end_of_image>", out var c2) ? c2 : 258882;

pekkah and others added 3 commits June 15, 2026 10:40
…ract (#250)

Addresses PR #251 review (cycle 1):
- ImageIO: reject negative chunk lengths and non-positive/absurd IHDR
  dimensions as InvalidDataException (signed ReadUInt32BE could wrap past
  the bounds check into an alloc crash that escaped the CLI's catch filter)
- VisionModel.Open: validate the two raw-pointer tensor dtypes
  (v.patch_embd.weight=F32, mm.input_projection.weight=BF16) so a divergent
  mmproj export fails loudly instead of misreading bytes into garbage
- IForwardPass.ForwardEmbedding: document the PLE-token-0 contract alongside
  the sqrt(d)-skip so future backend mirrors stay faithful
- tests: malformed-PNG regression (bad signature, zero dimensions)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…le 2)

Belt-and-suspenders: (long)dataOff + clen + 4 so a chunk length near
int.MaxValue (only reachable with a ~2 GB in-memory PNG) can't wrap
negative and defeat the truncation check. Closes the last review nit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
#250/#126)

- docs/SharpInference-Design.md: new §8.3 documenting the shipped 12B encoder-free
  gemma4uv image path (projector forward, ForwardEmbedding splice, CLI prompt wiring),
  with an explicit note that E4B is a different, unsupported architecture.
- docs/gemma4-e4b-vision-plan.md: retire the verification debt. Dumped the real E4B
  mmproj header — it is NOT MobileNet-V5/gemma3n as the plan assumed. E4B ships a
  gemma4v transformer ViT vision encoder (16 blocks, 768-dim, 12 heads, QK-norm,
  GeGLU, conv patch-embed) + a gemma4a audio encoder. The plan's conclusion
  (encoder-full, separate from the 12B) holds; the specifics are corrected.
- download-model.ps1: gemma4-e4b-qat now also pulls the ~992 MB mmproj, with accurate
  labeling (gemma4v ViT + gemma4a, not the 12B gemma4uv).

Confirms Gemma 4 E4B and 12B use distinct vision architectures; #250's path covers
only the 12B.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…oist (#250)

PR #251 Gemini code-assist findings:
- GemmaUvVisionEmbedder.Forward: validate chw.Length >= 3*h*w before reading
  (clear ArgumentException instead of IndexOutOfRange on a short span)
- ImagePreprocessor: hoist y0*srcWidth / y1*srcWidth out of the bilinear inner loop
- RunImagePrompt: broaden the image-read catch to UnauthorizedAccessException /
  SecurityException, matching the prompt-file read path

The two security-high PNG findings (chunk-length overflow, IHDR dimension bounds)
were already fixed in the earlier review cycles ((long) chunk math + the
(long)w*h > 64M dimension guard). The suggestion to add Gemma 3 <img>/<end_of_image>
token fallbacks was declined: this path is Gemma 4-only (arch-gated) and Gemma 3
uses a different encoder-full vision architecture the splice can't serve.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Gemma 4 12B encoder-free multimodal (vision/audio) input support
