Gemma 4 12B encoder-free multimodal (vision/audio) input support #250

## Summary

Add multimodal **input** support (image first, audio/video later) for **Gemma 4 12B**, which is a *unified, encoder-free* multimodal model. Today sharpi loads only the text decoder of `gemma-4-12b-it-qat-q4_0.gguf` and has no image/audio input path.

## Background

Gemma 4 12B (released 2026-06-03) drops the separate SigLIP-style vision encoder used by earlier Gemma models. Instead, it **projects raw image patches and audio waveforms directly into the LLM embedding space** through lightweight linear layers (encoder-free / "unified"). See:
- https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12b/
- https://huggingface.co/blog/gemma4
- https://ai.google.dev/gemma/docs/core

### GGUF packaging
Even though the model is encoder-free, in the GGUF/llama.cpp ecosystem the projection tensors still ship as a **small companion mmproj** (~175 MB, versus the multi-hundred-MB SigLIP encoders shipped with the encoder-based Gemma 4 sizes). The official repo `google/gemma-4-12B-it-qat-q4_0-gguf` contains exactly two files:

| File | Size | Role |
|---|---|---|
| `gemma-4-12b-it-qat-q4_0.gguf` | 6,975,877,728 B | Text LLM (decoder) — what we load today |
| `mmproj-gemma-4-12b-it-qat-q4_0.gguf` | 175,115,264 B | Multimodal projector (vision + audio) |

Inspecting the text GGUF confirms it carries **only** text metadata (`gemma4.attention.*`, `gemma4.rope.*`, `gemma4.embedding_length_per_layer_input` for PLE) and no vision/audio tensors — those live in the mmproj.
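
The check above can be partly scripted. The following is a minimal sketch, not sharpi's loader: it parses only the fixed-size GGUF header (magic, version, tensor count, metadata key/value count, assuming the GGUF v2/v3 layout with 64-bit counts) and then classifies tensor names that the caller has already extracted. The helper names and the `v.` / `a.` / `mm.` prefixes are assumptions for illustration; the actual tensor names in the Gemma 4 mmproj should be confirmed by dumping the file.

```python
import struct

GGUF_MAGIC = b"GGUF"

# Hypothetical prefixes for vision, audio, and projector tensors.
# These are an assumption; confirm against the real mmproj file.
MULTIMODAL_PREFIXES = ("v.", "a.", "mm.")


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed 24-byte GGUF header (v2/v3 layout, little-endian)."""
    if len(data) < 24:
        raise ValueError("buffer too short for a GGUF header")
    if data[:4] != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={data[:4]!r}")
    # uint32 version, uint64 tensor count, uint64 metadata key/value count
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {"version": version, "tensor_count": tensor_count, "kv_count": kv_count}


def multimodal_tensor_names(names, prefixes=MULTIMODAL_PREFIXES):
    """Return the tensor names that look like vision/audio/projector tensors."""
    return [n for n in names if n.startswith(prefixes)]
```

For a real file, reading the first 24 bytes with `open(path, "rb").read(24)` is enough for `read_gguf_header`; walking the metadata and tensor-info sections to obtain the names is left out of this sketch.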

## What's done

- `scripts/download-model.ps1`: the `gemma4-12b-qat` entry now pulls the mmproj alongside the text GGUF, and the disk-space guard counts only missing files (so the 175 MB top-up isn't blocked by the 7 GB bundle size).
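
The guard logic described above can be sketched as follows. This is an illustration in Python rather than the PowerShell in `scripts/download-model.ps1`, and the function names are hypothetical. It assumes a file counts as present if it exists at all, which ignores partially downloaded files.

```python
import os
import shutil

# Expected sizes in bytes, taken from the table in "GGUF packaging".
GEMMA4_12B_QAT_FILES = {
    "gemma-4-12b-it-qat-q4_0.gguf": 6_975_877_728,
    "mmproj-gemma-4-12b-it-qat-q4_0.gguf": 175_115_264,
}


def missing_bytes(dest_dir, expected_sizes):
    """Sum the sizes of only those files not yet present in dest_dir."""
    total = 0
    for name, size in expected_sizes.items():
        if not os.path.exists(os.path.join(dest_dir, name)):
            total += size
    return total


def has_room(dest_dir, expected_sizes, free_bytes=None):
    """True if free space covers the files that still need downloading."""
    if free_bytes is None:
        free_bytes = shutil.disk_usage(dest_dir).free
    return missing_bytes(dest_dir, expected_sizes) <= free_bytes
```

With the text GGUF already on disk, only the 175,115,264 B mmproj is counted, so a volume with a few hundred MB free passes the check.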

## Scope (engine work — not yet done)

- [ ] Parse the mmproj GGUF (projector tensors + config).
- [ ] Implement the encoder-free patch embedder: image preprocessing (variable aspect-ratio / resolution tiling) → patch projection → soft-token insertion into the embedding sequence.
- [ ] Wire image input through the CLI and the API server (OpenAI/Anthropic image-content parts).
- [ ] Forward-pass interleaving of image soft-tokens with text tokens (Gemma 4 supports freely interleaved multimodal input).
- [ ] (Later) Audio waveform projection, then video handled as a sequence of frames.
- [ ] Tests: projector parity against llama.cpp `libmtmd` on a fixed image.
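
The patch-embedder and interleaving items above can be sketched in a few lines. This is a toy illustration in pure Python under loudly stated assumptions: a fixed square patch size, image dimensions that divide evenly, one placeholder token per soft token, and no normalization, tiling, or positional handling. The real patch size, preprocessing, and placeholder token ID for Gemma 4 are not given in this issue, so every name and parameter here is hypothetical.

```python
def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch
    vectors, in row-major patch order."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for py in range(0, h, patch):
        for px in range(0, w, patch):
            vec = []
            for y in range(py, py + patch):
                for x in range(px, px + patch):
                    vec.extend(image[y][x])  # append the C channel values
            patches.append(vec)
    return patches


def project(patches, weight, bias):
    """Linear projection. weight is [d_model][patch_dim], bias is [d_model]."""
    return [
        [sum(w * p for w, p in zip(row, vec)) + b for row, b in zip(weight, bias)]
        for vec in patches
    ]


def splice_soft_tokens(token_ids, text_embed, soft_tokens, placeholder_id):
    """Build the embedding sequence, replacing each placeholder token with
    the next soft-token embedding in order."""
    remaining = iter(soft_tokens)
    out = []
    for tid in token_ids:
        if tid == placeholder_id:
            try:
                out.append(next(remaining))
            except StopIteration:
                raise ValueError("more placeholders than soft tokens") from None
        else:
            out.append(text_embed(tid))
    if next(remaining, None) is not None:
        raise ValueError("more soft tokens than placeholders")
    return out
```

Because the splice works on arbitrary positions in `token_ids`, the same function covers interleaved input: several images simply contribute several runs of placeholders, consumed in order.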

## Related

- #124 — Gemma 4 12B bring-up (text path)
- `docs/gemma4-e4b-vision-plan.md` — existing vision plan
- The `gemma4-e4b-qat` model also ships an mmproj that we currently skip; the same work would cover it.
