Add Gemma 4 E4B multimodal (vision / image input) support #126

## Goal

Add **image input** (and later audio) to the already-working Gemma 4 E4B **text** path. Full plan
and rationale: `docs/gemma4-e4b-vision-plan.md`.

> **E4B is NOT the 12B path.** The Gemma 4 **12B** is encoder-free `gemma4uv` (raw patches → linear
> projection, no ViT) and is **implemented** in `src/SharpInference.Vision` (issue #250, PR #251).
> E4B uses a *different, encoder-full* architecture (below) and is **not implemented yet**.

## Context

- Gemma 4 is natively multimodal (text+image all sizes; E2B/E4B add audio). The gemma4 **text**
  trunk is implemented (`ForwardPass.cs`: embedding scale, PLE, dual-RoPE, SWA, cross-layer
  KV-share, GeGLU, final-logit softcap). The embedding-injection seam
  (`IForwardPass.ForwardEmbedding`, skip sqrt(d) scale, PLE from padding token 0) already exists on
  CPU + CUDA from #250 and is reusable here.
- llama.cpp supports E4B image input (`llama-mtmd-cli` + `mmproj`) → reference for parity. (Note:
  the current local llama.cpp build crashes on the **12B** gemma4 mmproj, 0xC0000409 — recheck for E4B.)

## ✅ Architecture — confirmed from the real mmproj header (2026-06-15)

Dumped `google/gemma-4-E4B-it-qat-q4_0-gguf` → `gemma-4-E4B-it-mmproj.gguf` (~992 MB, on disk at
`E:\models\`). The earlier MobileNet-V5 / Gemma-3n assumption was **WRONG**. Actual:

- **Vision** `clip.vision.projector_type = gemma4v` — a **transformer ViT** encoder (NOT conv):
  `block_count=16`, `embedding_length=768`, `head_count=12` (head_dim 64, with QK-norm),
  GeGLU FFN (`feed_forward_length=3072`), per-block `ln1/ln2 + attn_q/k/v/out + attn_q_norm/k_norm +
  ffn_gate/up/down + post-norms`; conv patch-embed `v.patch_embd.weight [16,16,3,768]`; learned 2D
  position table `v.position_embd.weight [768,10240,2]`; `image_size=224`, `patch_size=16`,
  image_mean=0 / image_std=1, attn LN eps 1e-6.
- **Audio** `clip.audio.projector_type = gemma4a` — separate ~12-block conformer-style encoder
  (`a.*` tensors, `num_mel_bins=128`, embedding 1024, ffn 4096, 8 heads); `clip.has_audio_encoder=True`.
- **Projectors** into the E4B text embed dim (2560): `mm.input_projection [768→2560]` (vision),
  `mm.a.input_projection [1536→2560]` (audio).
- **Markers**: the unified Gemma-4 tokens `<|image|>` (258880) / `<|audio|>`, expanded at runtime to
  `<|image>` (255999) + soft tokens + `<image|>` (258882). This is the same convention as the 12B, NOT
  Gemma 3's `<start_of_image>`.
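
To make the marker convention concrete, here is a minimal sketch of the runtime expansion, written in Python for brevity rather than in the project's own language. The three token IDs are the ones quoted above; `SOFT_SLOT` and `expand_image_markers` are hypothetical names used only for illustration, and the real expansion in #250 may be structured differently.

```python
# Token IDs as quoted in this issue; every other name here is hypothetical.
IMAGE_PLACEHOLDER = 258880  # <|image|>
IMAGE_BEGIN = 255999        # <|image>
IMAGE_END = 258882          # <image|>
SOFT_SLOT = -1              # hypothetical sentinel: a position a soft token will fill


def expand_image_markers(token_ids, soft_token_counts):
    """Replace each <|image|> with <|image> + N soft slots + <image|>.

    soft_token_counts holds one N per image, in prompt order.
    """
    counts = iter(soft_token_counts)
    out = []
    for tok in token_ids:
        if tok != IMAGE_PLACEHOLDER:
            out.append(tok)
            continue
        try:
            n = next(counts)
        except StopIteration:
            raise ValueError("more <|image|> markers than images") from None
        out.append(IMAGE_BEGIN)
        out.extend([SOFT_SLOT] * n)
        out.append(IMAGE_END)
    if next(counts, None) is not None:
        raise ValueError("more images than <|image|> markers")
    return out
```

Because the soft-token count is per image, the expansion has to run after preprocessing has fixed each image's grid, not at tokenization time.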

Net: the long pole is a **ViT** encoder, not MobileNet-V5. Good news: a ViT reuses our existing
attention / MLP / RMSNorm kernels almost directly (this was the old plan's "SigLIP fallback", which
turns out to be the real architecture). Soft-token count is dynamic, derived from the resized grid
rather than a fixed 256 (confirm the exact reduction / any pooling against llama.cpp).
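
The arithmetic behind those header values can be checked with a small Python sketch. It only computes the patch grid and per-head width before any token reduction. The reduction rule itself is still unconfirmed and is deliberately left out. `vit_geometry` is a hypothetical helper, and requiring the resized image to be an exact multiple of the patch size is an assumption.

```python
def vit_geometry(height, width, patch_size=16, embed_dim=768, head_count=12):
    """Patch grid and head width for a resized image, before any token reduction."""
    # Assumption: preprocessing resizes to an exact multiple of patch_size.
    if height % patch_size or width % patch_size:
        raise ValueError("resized image must be a multiple of patch_size")
    if embed_dim % head_count:
        raise ValueError("embed_dim must divide evenly across heads")
    grid_h = height // patch_size
    grid_w = width // patch_size
    return {
        "grid": (grid_h, grid_w),
        "patches": grid_h * grid_w,
        "head_dim": embed_dim // head_count,
    }
```

For the header's `image_size=224` and `patch_size=16` this gives a 14×14 grid (196 patches) and the stated head_dim of 64.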

## Mechanism to replicate

Two GGUF files (text + `mmproj`). Preprocess image → **gemma4v ViT** encoder → `mm.input_projection`
→ soft tokens at text embed dim (2560) → splice via `ForwardEmbedding` at the `<|image|>` placeholder
positions. Reuse the #250 splice/marker machinery and `SharpInference.Vision` project layout.
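
The splice step can be pictured with the following Python sketch. It is not the `ForwardEmbedding` implementation: `splice_soft_tokens` is a hypothetical helper that only shows projected soft-token rows overwriting the embedding rows at the image positions. It does not model the sqrt(d) scale skip or the PLE handling mentioned under Context.

```python
def splice_soft_tokens(text_rows, slot_positions, soft_rows, embed_dim=2560):
    """Return a copy of text_rows with soft_rows written at slot_positions.

    text_rows:      one embedding row (list of floats) per prompt position
    slot_positions: prompt indices reserved for image soft tokens, in order
    soft_rows:      projector output, one row per soft token
    """
    if len(slot_positions) != len(soft_rows):
        raise ValueError("soft-token count does not match reserved positions")
    out = [list(row) for row in text_rows]  # leave the caller's rows untouched
    for pos, row in zip(slot_positions, soft_rows):
        if len(row) != embed_dim:
            raise ValueError("soft token is not at the text embed dim")
        if not 0 <= pos < len(out):
            raise IndexError("slot position outside the prompt")
        out[pos] = list(row)
    return out
```

The dimension check is the useful part in practice: a projector wired to the wrong output width fails here instead of producing silently wrong logits.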

## Phased plan

- [ ] **V0** — extend the mmproj/clip loader to `gemma4v`/`gemma4a` (`clip.*` meta, `v.*`/`a.*`/`mm.*`
      tensors). (Verification debt retired — header above.)
- [ ] **V1** — image preprocessing for the ViT (decode/resize to 224/patch 16/normalize; check Pan & Scan).
- [ ] **V2** — **gemma4v ViT** encoder forward pass (CPU-first, then GPU); stage-by-stage parity vs
      llama.cpp `clip`. Reuses attention/MLP/RMSNorm/QK-norm kernels; conv patch-embed + 2D pos table.
- [ ] **V3** — token reduction (confirm rule) + `mm.input_projection` to embed dim 2560.
- [ ] **V4** — splice via the existing `ForwardEmbedding` seam (#250); decide bidirectional-within-image
      mask vs causal (12B works causal — re-evaluate for E4B); greedy parity vs `llama-mtmd-cli`.
- [ ] **V5** — CLI `--image` (already exists from #250 — just drop the arch gate once E4B is wired) +
      server image content blocks.
- [ ] (deferred) **V6** — E-model **audio** via the `gemma4a` encoder (`a.*`).
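
For the stage-by-stage parity work in V2 through V4, a comparison helper along these lines may be enough to start with. This is a Python sketch under assumptions: the function names, the stage names in the usage note, and the `1e-3` tolerance are all hypothetical, and it presumes both implementations can dump each stage as a flat list of floats.

```python
import math


def parity_report(ours, reference):
    """Return (max absolute difference, cosine similarity) for two flat tensors."""
    if len(ours) != len(reference) or not ours:
        raise ValueError("tensors must be non-empty and the same length")
    max_abs = max(abs(a - b) for a, b in zip(ours, reference))
    dot = sum(a * b for a, b in zip(ours, reference))
    norm_a = math.sqrt(sum(a * a for a in ours))
    norm_b = math.sqrt(sum(b * b for b in reference))
    # Cosine is undefined when either tensor is all zeros.
    cosine = dot / (norm_a * norm_b) if norm_a and norm_b else float("nan")
    return max_abs, cosine


def first_divergent_stage(stages, tol=1e-3):
    """stages: (name, ours, reference) tuples in forward order.

    Returns the name of the first stage whose max abs diff exceeds tol, or None.
    """
    for name, ours, reference in stages:
        max_abs, _ = parity_report(ours, reference)
        if max_abs > tol:
            return name
    return None
```

Feeding it stages in forward order (for example patch embed, then each block, then the projection) points at the first place the two encoders disagree, which is usually more informative than comparing final logits.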

## Related

- #250 / PR #251 — Gemma 4 **12B** encoder-free (`gemma4uv`) image input — **shipped** (CPU + CUDA);
  provides the reusable `SharpInference.Vision` project + `ForwardEmbedding` seam + CLI.
- #82 — Gemma 4 model family support (parent).
