Add Gemma 4 E4B multimodal (vision / image input) support #126

@pekkah

Goal

Add image input (and later audio) to the already-working Gemma 4 E4B text path. Full plan
and rationale: docs/gemma4-e4b-vision-plan.md.

E4B is NOT the 12B path. The Gemma 4 12B is encoder-free gemma4uv (raw patches → linear
projection, no ViT) and is implemented in src/SharpInference.Vision (issue #250, PR #251).
E4B uses a different, encoder-full architecture (below) and is not implemented yet.

Context

  • Gemma 4 is natively multimodal (text+image all sizes; E2B/E4B add audio). The gemma4 text
    trunk is implemented (ForwardPass.cs: embedding scale, PLE, dual-RoPE, SWA, cross-layer
    KV-share, GeGLU, final-logit softcap). The embedding-injection seam
    (IForwardPass.ForwardEmbedding, skip sqrt(d) scale, PLE from padding token 0) already exists on
    CPU + CUDA from #250 and is reusable here.
  • llama.cpp supports E4B image input (llama-mtmd-cli + mmproj) → reference for parity. (Note:
    the current local llama.cpp build crashes on the 12B gemma4 mmproj, 0xC0000409 — recheck for E4B.)

✅ Architecture — confirmed from the real mmproj header (2026-06-15)

Dumped google/gemma-4-E4B-it-qat-q4_0-gguf / gemma-4-E4B-it-mmproj.gguf (~992 MB, on disk at
E:\models\). The earlier MobileNet-V5 / Gemma-3n assumption was WRONG. Actual:

  • Vision: clip.vision.projector_type = gemma4v, a transformer ViT encoder (NOT conv):
    block_count=16, embedding_length=768, head_count=12 (head_dim 64, with QK-norm),
    GeGLU FFN (feed_forward_length=3072), per-block ln1/ln2 + attn_q/k/v/out +
    attn_q_norm/k_norm + ffn_gate/up/down + post-norms; conv patch-embed
    v.patch_embd.weight [16,16,3,768]; learned 2D position table
    v.position_embd.weight [768,10240,2]; image_size=224, patch_size=16,
    image_mean=0 / image_std=1, attn LN eps 1e-6.
  • Audio: clip.audio.projector_type = gemma4a, a separate ~12-block conformer-style encoder
    (a.* tensors, num_mel_bins=128, embedding 1024, ffn 4096, 8 heads); clip.has_audio_encoder=True.
  • Projectors into the E4B text embed dim (2560): mm.input_projection [768→2560] (vision),
    mm.a.input_projection [1536→2560] (audio).
  • Markers: the unified Gemma-4 tokens <|image|> (258880) / <|audio|>, expanded at runtime to
    <|image> (255999) + soft tokens + <image|> (258882). Same convention as the 12B, NOT Gemma 3's
    <start_of_image>.
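
A minimal sketch of the marker expansion described in the last bullet, using the token ids quoted above. The function name, the `SOFT_SLOT` sentinel and the per-image `soft_counts` argument are hypothetical and only illustrate the shape of the step; they are not the #250 implementation.

```python
from typing import List

# Token ids as quoted from the header dump above.
IMAGE_PLACEHOLDER = 258880  # <|image|>
IMAGE_BEGIN = 255999        # <|image>
IMAGE_END = 258882          # <image|>
# Hypothetical sentinel marking a soft-token slot; not a real vocabulary id.
SOFT_SLOT = -1


def expand_image_markers(tokens: List[int], soft_counts: List[int]) -> List[int]:
    """Replace each <|image|> with <|image> + n soft slots + <image|>.

    soft_counts[k] is the number of soft tokens produced for the k-th image.
    """
    out: List[int] = []
    counts = iter(soft_counts)
    for tok in tokens:
        if tok == IMAGE_PLACEHOLDER:
            try:
                n = next(counts)
            except StopIteration:
                raise ValueError("more <|image|> placeholders than images") from None
            out.append(IMAGE_BEGIN)
            out.extend([SOFT_SLOT] * n)
            out.append(IMAGE_END)
        else:
            out.append(tok)
    if next(counts, None) is not None:
        raise ValueError("more images than <|image|> placeholders")
    return out
```

The slots are filled with encoder output later, so the expansion only needs the soft-token count per image, not the embeddings themselves.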

Net: the long pole is a ViT encoder, not MobileNet-V5. Good news — a ViT reuses our existing
attention / MLP / RMSNorm kernels almost directly (this was the old plan's "SigLIP fallback", which
turns out to be the real architecture). Soft-token count is dynamic from the resized grid (confirm
the exact reduction / any pooling against llama.cpp), not a fixed 256.
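
A rough sketch of how the soft-token count could follow from the resized grid. The `pool` factor is a hypothetical stand-in for whatever reduction llama.cpp actually applies, which is still unconfirmed; with `pool=1` this is just the raw patch count.

```python
def soft_token_count(height: int, width: int, patch_size: int = 16, pool: int = 1) -> int:
    """Patch-grid size after an assumed pool x pool reduction.

    patch_size=16 comes from the mmproj header; `pool` is an assumption.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("resized image must be a multiple of patch_size")
    grid_h, grid_w = height // patch_size, width // patch_size
    if grid_h % pool or grid_w % pool:
        raise ValueError("patch grid must be divisible by the pooling factor")
    return (grid_h // pool) * (grid_w // pool)
```

For the header's image_size=224 and patch_size=16 the unreduced grid is 14 × 14 = 196 patches, so any count that differs from 196 at that size would point to a reduction step to replicate in V3.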

Mechanism to replicate

Two GGUF files (text + mmproj). Preprocess image → gemma4v ViT encoder → mm.input_projection
→ soft tokens at text embed dim (2560) → splice via ForwardEmbedding at the <|image|> placeholder
positions. Reuse the #250 splice/marker machinery and SharpInference.Vision project layout.
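
The splice step can be sketched as below, assuming the seam behaves as described in Context: text rows get the sqrt(d) embedding scale, injected rows skip it, and PLE for injected rows is looked up from padding token 0. All names here are hypothetical, and this does not reflect the actual IForwardPass.ForwardEmbedding signature.

```python
import math
from typing import List, Sequence, Tuple


def splice_soft_tokens(
    tokens: Sequence[int],
    embed_table: Sequence[Sequence[float]],
    soft_tokens: Sequence[Sequence[float]],
    soft_slot: int = -1,
) -> Tuple[List[List[float]], List[int]]:
    """Build per-position input rows plus the token ids to use for PLE.

    `soft_slot` is a hypothetical sentinel marking positions that receive
    projected encoder output instead of a looked-up text embedding.
    """
    d = len(embed_table[0])
    scale = math.sqrt(d)
    soft_iter = iter(soft_tokens)
    rows: List[List[float]] = []
    ple_ids: List[int] = []
    for tok in tokens:
        if tok == soft_slot:
            row = next(soft_iter, None)
            if row is None:
                raise ValueError("fewer soft tokens than slots")
            if len(row) != d:
                raise ValueError("soft token width does not match embed dim")
            rows.append(list(row))  # injected row: no sqrt(d) scale
            ple_ids.append(0)       # PLE taken from padding token 0
        else:
            rows.append([x * scale for x in embed_table[tok]])
            ple_ids.append(tok)
    if next(soft_iter, None) is not None:
        raise ValueError("more soft tokens than slots")
    return rows, ple_ids
```

In the real path d would be 2560 and the soft rows would come from mm.input_projection; the width check is there to catch a projector that was skipped or mis-shaped.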

Phased plan

  • V0 — extend the mmproj/clip loader to gemma4v/gemma4a (clip.* meta, v.*/a.*/mm.*
    tensors). (Verification debt retired — header above.)
  • V1 — image preprocessing for the ViT (decode, resize to 224, split into 16×16 patches, normalize; check Pan & Scan).
  • V2 — gemma4v ViT encoder forward pass (CPU-first, then GPU); stage-by-stage parity vs
    llama.cpp clip. Reuses attention/MLP/RMSNorm/QK-norm kernels; conv patch-embed + 2D pos table.
  • V3 — token reduction (confirm rule) + mm.input_projection to embed dim 2560.
  • V4 — splice via the existing ForwardEmbedding seam (#250); decide bidirectional-within-image
    mask vs causal (12B works causal; re-evaluate for E4B); greedy parity vs llama-mtmd-cli.
  • V5 — CLI --image (already exists from #250; just drop the arch gate once E4B is wired) +
    server image content blocks.
  • (deferred) V6 — E-model audio via the gemma4a encoder (a.*).
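
For the V4 masking decision, the two candidates differ only inside each image span. A small sketch, assuming a per-position `span_ids` list (hypothetical: 0 for text, k > 0 for the soft tokens of the k-th image) and ignoring the interaction with SWA:

```python
from typing import List


def build_attention_mask(span_ids: List[int], bidirectional_images: bool) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    n = len(span_ids)
    # Baseline: plain causal mask (what the 12B path uses today).
    mask = [[j <= i for j in range(n)] for i in range(n)]
    if bidirectional_images:
        # Additionally let soft tokens of the same image see each other's future.
        for i in range(n):
            for j in range(i + 1, n):
                if span_ids[i] != 0 and span_ids[i] == span_ids[j]:
                    mask[i][j] = True
    return mask
```

Text positions stay strictly causal in both variants, so greedy parity against llama-mtmd-cli with each setting should show which one the reference uses.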
