Summary
Add multimodal input support (image first, audio/video later) for Gemma 4 12B, which is a unified, encoder-free multimodal model. Today sharpi loads only the text decoder of gemma-4-12b-it-qat-q4_0.gguf and has no image/audio input path.
Background
Gemma 4 12B (released 2026-06-03) drops the separate SigLIP-style vision encoder used by earlier Gemma. Instead it projects raw image patches and audio waveforms directly into the LLM embedding space through lightweight linear layers (encoder-free / "unified"). See:
GGUF packaging
Even though it's encoder-free, in the GGUF/llama.cpp ecosystem the projection tensors still ship as a small companion mmproj (~175 MB, vs the multi-hundred-MB SigLIP encoders of the encoder-based Gemma 4 sizes). The official repo google/gemma-4-12B-it-qat-q4_0-gguf contains exactly two files:
| File | Size | Role |
| --- | --- | --- |
| gemma-4-12b-it-qat-q4_0.gguf | 6,975,877,728 B | Text LLM (decoder), what we load today |
| mmproj-gemma-4-12b-it-qat-q4_0.gguf | 175,115,264 B | Multimodal projector (vision + audio) |
Inspecting the text GGUF confirms it carries only text metadata (gemma4.attention.*, gemma4.rope.*, gemma4.embedding_length_per_layer_input for PLE) and no vision/audio tensors — those live in the mmproj.
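For anyone who wants to repeat that inspection without extra tooling, here is a minimal sketch of a GGUF header reader in plain Python. It follows the public GGUF layout (magic, version, tensor count, key/value count, then the key/value pairs and tensor descriptors). The helper names are made up for this issue, and the `v.` / `a.` / `mm.` prefixes used to flag multimodal tensors are an assumption based on llama.cpp's usual mmproj naming, not something verified against these two files.

```python
import io
import struct

GGUF_MAGIC = b"GGUF"

# GGUF metadata value type id -> struct format for fixed-size scalars.
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    """Read one little-endian scalar described by a struct format."""
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise EOFError("truncated GGUF file")
    return struct.unpack(fmt, data)[0]


def _read_string(f):
    """GGUF strings are a uint64 byte length followed by UTF-8 bytes."""
    length = _read(f, "<Q")
    data = f.read(length)
    if len(data) != length:
        raise EOFError("truncated GGUF string")
    return data.decode("utf-8")


def _read_value(f, vtype):
    """Read (and return) one metadata value of the given type id."""
    if vtype in _SCALARS:
        return _read(f, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        elem_type = _read(f, "<I")
        count = _read(f, "<Q")
        return [_read_value(f, elem_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def list_gguf_names(f):
    """Return (metadata_keys, tensor_names) from an open binary GGUF stream."""
    if f.read(4) != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    version = _read(f, "<I")
    if version < 2:
        raise ValueError("GGUF v1 uses 32-bit counts; not handled here")
    tensor_count = _read(f, "<Q")
    kv_count = _read(f, "<Q")

    keys = []
    for _ in range(kv_count):
        keys.append(_read_string(f))
        _read_value(f, _read(f, "<I"))  # value is parsed only to skip past it

    tensors = []
    for _ in range(tensor_count):
        tensors.append(_read_string(f))
        n_dims = _read(f, "<I")
        f.read(8 * n_dims)  # dimensions (uint64 each)
        f.read(4 + 8)       # ggml type (uint32) + data offset (uint64)
    return keys, tensors


def has_multimodal_tensors(tensor_names, prefixes=("v.", "a.", "mm.")):
    """ASSUMPTION: vision/audio/projector tensors use these name prefixes."""
    return any(name.startswith(prefixes) for name in tensor_names)
```

Opening each of the two files with `open(path, "rb")` and passing the handle to `list_gguf_names` should show the `gemma4.*` keys for the text GGUF; whether the mmproj tensors actually match the assumed prefixes is worth checking by eye on the first run.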
What's done
scripts/download-model.ps1: the gemma4-12b-qat entry now pulls the mmproj alongside the text GGUF, and the disk-space guard counts only missing files (so the 175 MB top-up isn't blocked by the 7 GB bundle size).
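The "count only missing files" rule can be sketched as follows. This is an illustration in Python, not the PowerShell in scripts/download-model.ps1; the function names are hypothetical, and treating any existing file as complete (no size or hash check) is a simplifying assumption. The sizes are the ones listed for the two files in this issue.

```python
import shutil
from pathlib import Path

# Sizes in bytes, as listed for the two files in this issue.
BUNDLE = {
    "gemma-4-12b-it-qat-q4_0.gguf": 6_975_877_728,
    "mmproj-gemma-4-12b-it-qat-q4_0.gguf": 175_115_264,
}


def missing_files(dest_dir, bundle):
    """Return {name: size} for bundle files not yet present in dest_dir."""
    dest = Path(dest_dir)
    return {name: size for name, size in bundle.items()
            if not (dest / name).is_file()}


def bytes_required(dest_dir, bundle):
    """Bytes still to download: the sum over missing files only."""
    return sum(missing_files(dest_dir, bundle).values())


def has_room(dest_dir, bundle, free_bytes=None):
    """True if the missing files fit in the free space of dest_dir's volume."""
    if free_bytes is None:
        free_bytes = shutil.disk_usage(dest_dir).free
    return bytes_required(dest_dir, bundle) <= free_bytes
```

With the text GGUF already on disk, `bytes_required` drops to 175,115,264, so a machine with a few hundred MB free passes the guard even though the full bundle is about 7 GB.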
Scope (engine work, not yet done)
libmtmd on a fixed image.
Related
docs/gemma4-e4b-vision-plan.md: existing vision plan
The gemma4-e4b-qat model also ships an mmproj we currently skip; the same work would cover it.