Goal
Add image input (and later audio) to the already-working Gemma 4 E4B text path. Full plan
and rationale: docs/gemma4-e4b-vision-plan.md.
E4B is NOT the 12B path. The Gemma 4 12B is encoder-free gemma4uv (raw patches → linear
projection, no ViT) and is implemented in src/SharpInference.Vision (issue #250, PR #251).
E4B uses a different, encoder-full architecture (below) and is not implemented yet.
Context
Gemma 4 is natively multimodal (text + image at all sizes; E2B/E4B add audio). The gemma4 text
trunk is implemented (ForwardPass.cs: embedding scale, PLE, dual-RoPE, SWA, cross-layer
KV-share, GeGLU, final-logit softcap). The embedding-injection seam
(IForwardPass.ForwardEmbedding, which skips the sqrt(d) scale and takes PLE from padding token 0)
already exists on CPU + CUDA from #250 (Gemma 4 12B encoder-free multimodal input support) and is
reusable here.
llama.cpp supports E4B image input (llama-mtmd-cli + mmproj), so it is the reference for parity. (Note:
the current local llama.cpp build crashes on the 12B gemma4 mmproj with 0xC0000409; recheck for E4B.)
✅ Architecture — confirmed from the real mmproj header (2026-06-15)
Dumped google/gemma-4-E4B-it-qat-q4_0-gguf → gemma-4-E4B-it-mmproj.gguf (~992 MB, on disk at E:\models\). The earlier MobileNet-V5 / Gemma-3n assumption was WRONG. Actual:
Projectors into the E4B text embed dim (2560): mm.input_projection [768→2560] (vision), mm.a.input_projection [1536→2560] (audio).
Markers: the unified Gemma-4 tokens <|image|> (258880) / <|audio|>, expanded at runtime to <|image>(255999) + soft tokens + <image|>(258882) — same convention as the 12B, NOT Gemma 3's <start_of_image>.
Net: the long pole is a ViT encoder, not MobileNet-V5. Good news — a ViT reuses our existing
attention / MLP / RMSNorm kernels almost directly (this was the old plan's "SigLIP fallback", which
turns out to be the real architecture). Soft-token count is dynamic from the resized grid (confirm
the exact reduction / any pooling against llama.cpp), not a fixed 256.
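The marker convention above can be sketched as a small token-level expansion. This is a language-neutral illustration in Python, not the SharpInference implementation: the three token ids come from the notes above, while `SOFT_SLOT` and the function name are hypothetical, and the per-image soft-token count is passed in because the exact reduction rule is still to be confirmed.

```python
# Token ids as listed in the architecture notes.
IMAGE_PLACEHOLDER = 258880  # <|image|>
IMAGE_BEGIN = 255999        # <|image>
IMAGE_END = 258882          # <image|>
# Hypothetical sentinel: "this position is filled by the vision path".
SOFT_SLOT = -1


def expand_image_placeholders(token_ids, soft_counts):
    """Replace each <|image|> with <|image> + N soft slots + <image|>.

    soft_counts holds one N per image, in prompt order.
    """
    out = []
    image_index = 0
    for tok in token_ids:
        if tok != IMAGE_PLACEHOLDER:
            out.append(tok)
            continue
        if image_index >= len(soft_counts):
            raise ValueError("more <|image|> placeholders than images")
        out.append(IMAGE_BEGIN)
        out.extend([SOFT_SLOT] * soft_counts[image_index])
        out.append(IMAGE_END)
        image_index += 1
    if image_index != len(soft_counts):
        raise ValueError("more images than <|image|> placeholders")
    return out
```

For example, a prompt `[2, 258880, 10]` with one image that yields 3 soft tokens expands to `[2, 255999, -1, -1, -1, 258882, 10]`.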
Mechanism to replicate
Two GGUF files (text + mmproj). Preprocess image → gemma4v ViT encoder → mm.input_projection
→ soft tokens at text embed dim (2560) → splice via ForwardEmbedding at the <|image|> placeholder
positions. Reuse the #250 splice/marker machinery and SharpInference.Vision project layout.
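A minimal sketch of the splice step, assuming the behaviour described in the Context section: text tokens go through the embedding table and are scaled by sqrt(d), while soft-token positions take the projected vision vectors unscaled. The function name, the dict-based embedding table, and the `soft_slot` sentinel are hypothetical; the real seam is IForwardPass.ForwardEmbedding in C#.

```python
import math


def build_input_embeddings(token_ids, embed_table, soft_embeddings, soft_slot=-1):
    """Return one embedding row per position.

    Text tokens: table lookup scaled by sqrt(d).
    Soft slots: the next projected vision vector, copied through unscaled.
    """
    d = len(next(iter(embed_table.values())))
    scale = math.sqrt(d)
    rows = []
    used = 0
    for tok in token_ids:
        if tok == soft_slot:
            if used >= len(soft_embeddings):
                raise ValueError("not enough soft-token embeddings")
            vec = soft_embeddings[used]
            if len(vec) != d:
                raise ValueError("soft token must already be at the text embed dim")
            rows.append(list(vec))  # no sqrt(d) scale on injected rows
            used += 1
        else:
            rows.append([x * scale for x in embed_table[tok]])
    if used != len(soft_embeddings):
        raise ValueError("unused soft-token embeddings")
    return rows
```

In the real model d would be 2560 and the soft rows would come from mm.input_projection; the sketch only shows which rows are scaled and which are not.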
Phased plan
V0 — extend the mmproj/clip loader to gemma4v/gemma4a (clip.* meta, v.*/a.*/mm.*
tensors). (Verification debt retired — header above.)
V1 — image preprocessing for the ViT (decode/resize to 224/patch 16/normalize; check Pan & Scan).
V2 — gemma4v ViT encoder forward pass (CPU-first, then GPU); stage-by-stage parity vs
llama.cpp clip. Reuses attention/MLP/RMSNorm/QK-norm kernels; conv patch-embed + 2D pos table.
V3 — token reduction (confirm rule) + mm.input_projection to embed dim 2560.
V4 — splice via the ForwardEmbedding seam (#250); decide bidirectional-within-image
mask vs causal (12B works causal; re-evaluate for E4B); greedy parity vs llama-mtmd-cli.
V5 — CLI --image (already exists from #250; just drop the arch gate once E4B is wired) +
server image content blocks.
V6 — audio: gemma4a encoder (a.*).
Encoder details (from the mmproj header dump in the Architecture section)
clip.vision.projector_type = gemma4v: a transformer ViT encoder (NOT conv). block_count=16, embedding_length=768, head_count=12 (head_dim 64, with QK-norm), GeGLU FFN (feed_forward_length=3072); per-block ln1/ln2 + attn_q/k/v/out + attn_q_norm/k_norm + ffn_gate/up/down + post-norms; conv patch-embed v.patch_embd.weight [16,16,3,768]; learned 2D position table v.position_embd.weight [768,10240,2]; image_size=224, patch_size=16, image_mean=0 / image_std=1, attn LN eps 1e-6.
clip.audio.projector_type = gemma4a: a separate ~12-block conformer-style encoder (a.* tensors, num_mel_bins=128, embedding 1024, ffn 4096, 8 heads); clip.has_audio_encoder=True.
Related
Gemma 4 12B (gemma4uv) image input (issue #250, PR #251): shipped (CPU + CUDA); provides the reusable SharpInference.Vision project + ForwardEmbedding seam + CLI.
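As a rough illustration of the V1 preprocessing step, the sketch below normalizes an already-resized image and cuts it into flattened patches. It uses the header values quoted in this issue (patch_size=16, image_mean=0, image_std=1). Everything else is an assumption to verify against llama.cpp: the [0, 1] pixel scaling, the row-major (y, x, channel) order inside a patch, and the function name. Resizing and Pan & Scan are out of scope here.

```python
def patchify(image, patch_size=16, mean=0.0, std=1.0):
    """Split an H x W x 3 image (nested lists, values 0..255) into flat patches.

    Returns (grid_h, grid_w, patches). Each patch holds
    patch_size * patch_size * 3 normalized values.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("resize the image to a multiple of patch_size first")
    grid_h, grid_w = height // patch_size, width // patch_size
    patches = []
    for gy in range(grid_h):
        for gx in range(grid_w):
            patch = []
            for y in range(gy * patch_size, (gy + 1) * patch_size):
                for x in range(gx * patch_size, (gx + 1) * patch_size):
                    for c in range(3):
                        # Assumed: scale to [0, 1], then apply mean/std.
                        patch.append((image[y][x][c] / 255.0 - mean) / std)
            patches.append(patch)
    return grid_h, grid_w, patches
```

A 224 x 224 input gives a 14 x 14 grid, so 196 patches of 768 values each (16 * 16 * 3), which matches the input side of v.patch_embd.weight [16,16,3,768]. How many soft tokens remain after the V3 reduction is a separate question that the plan leaves to be confirmed.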