Gemma 4 12B encoder-free multimodal (vision/audio) input support #250

@pekkah


Summary

Add multimodal input support (image first, audio/video later) for Gemma 4 12B, which is a unified, encoder-free multimodal model. Today sharpi loads only the text decoder of gemma-4-12b-it-qat-q4_0.gguf and has no image/audio input path.

Background

Gemma 4 12B (released 2026-06-03) drops the separate SigLIP-style vision encoder used by earlier Gemma models. Instead, it projects raw image patches and audio waveforms directly into the LLM embedding space through lightweight linear layers (the "encoder-free" or "unified" design). See:

GGUF packaging

Even though it's encoder-free, in the GGUF/llama.cpp ecosystem the projection tensors still ship as a small companion mmproj (~175 MB, vs the multi-hundred-MB SigLIP encoders of the encoder-based Gemma 4 sizes). The official repo google/gemma-4-12B-it-qat-q4_0-gguf contains exactly two files:

| File | Size | Role |
| --- | --- | --- |
| gemma-4-12b-it-qat-q4_0.gguf | 6,975,877,728 B | Text LLM (decoder), which is what we load today |
| mmproj-gemma-4-12b-it-qat-q4_0.gguf | 175,115,264 B | Multimodal projector (vision + audio) |

Inspecting the text GGUF confirms it carries only text metadata (gemma4.attention.*, gemma4.rope.*, gemma4.embedding_length_per_layer_input for PLE) and no vision/audio tensors — those live in the mmproj.
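For anyone who wants to repeat that inspection without extra tooling, here is a rough sketch of a metadata reader in plain Python. It assumes the little-endian GGUF v2/v3 header layout (magic, version, tensor count, key/value count, then typed key/value pairs). The function names `read_gguf_metadata` and `keys_with_prefix` are made up for this illustration and are not part of sharpi or llama.cpp.

```python
import io
import struct

# GGUF scalar value type id -> struct format (little-endian).
# Assumes the GGUF v2/v3 layout; v1 used 32-bit counts and is not handled.
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    """Read one fixed-size value described by a struct format string."""
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise EOFError("truncated GGUF stream")
    return struct.unpack(fmt, data)[0]


def _read_string(f):
    """GGUF strings are a uint64 byte length followed by UTF-8 bytes."""
    length = _read(f, "<Q")
    data = f.read(length)
    if len(data) != length:
        raise EOFError("truncated GGUF string")
    return data.decode("utf-8")


def _read_value(f, vtype):
    if vtype in _SCALARS:
        return _read(f, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        elem_type = _read(f, "<I")
        count = _read(f, "<Q")
        return [_read_value(f, elem_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def read_gguf_metadata(f):
    """Return (version, tensor_count, metadata dict) from a binary stream."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    version = _read(f, "<I")
    tensor_count = _read(f, "<Q")
    kv_count = _read(f, "<Q")
    meta = {}
    for _ in range(kv_count):
        key = _read_string(f)
        meta[key] = _read_value(f, _read(f, "<I"))
    return version, tensor_count, meta


def keys_with_prefix(meta, prefix):
    """List metadata keys under one namespace, e.g. 'gemma4.'."""
    return sorted(k for k in meta if k.startswith(prefix))
```

Opening the text GGUF with `open(path, "rb")`, passing it to `read_gguf_metadata`, and calling `keys_with_prefix(meta, "gemma4.")` should list the text-side keys. Checking for vision/audio tensors would additionally require walking the tensor-info section, which this sketch does not do.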

What's done

  • scripts/download-model.ps1: the gemma4-12b-qat entry now pulls the mmproj alongside the text GGUF, and the disk-space guard counts only missing files (so the 175 MB top-up isn't blocked by the 7 GB bundle size).
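The real guard lives in the PowerShell script. The Python below is only a sketch of the same idea, to make the "count only missing files" rule concrete. The helper names `missing_bytes` and `has_room` are hypothetical, and treating any existing file as complete (no size or checksum validation) is an assumption of this sketch, not a statement about what the script does.

```python
import os
import shutil

# Expected file sizes in bytes, taken from the table in this issue.
GEMMA4_12B_QAT = {
    "gemma-4-12b-it-qat-q4_0.gguf": 6_975_877_728,
    "mmproj-gemma-4-12b-it-qat-q4_0.gguf": 175_115_264,
}


def missing_bytes(files, dest_dir):
    """Sum the sizes of files that are not yet present in dest_dir.

    Assumption: a file that exists is treated as fully downloaded.
    """
    return sum(
        size
        for name, size in files.items()
        if not os.path.exists(os.path.join(dest_dir, name))
    )


def has_room(files, dest_dir, free_bytes=None):
    """True if the still-missing files fit in the free space of dest_dir."""
    if free_bytes is None:
        free_bytes = shutil.disk_usage(dest_dir).free
    return missing_bytes(files, dest_dir) <= free_bytes
```

With the text GGUF already on disk, `missing_bytes` returns only the 175,115,264 B of the mmproj, so a machine with, say, 1 GB free passes the check even though the full bundle is about 7 GB.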

Scope (engine work — not yet done)

  • Parse the mmproj GGUF (projector tensors + config).
  • Implement the encoder-free patch embedder: image preprocessing (variable aspect-ratio / resolution tiling) → patch projection → soft-token insertion into the embedding sequence.
  • Wire image input through the CLI and the API server (OpenAI/Anthropic image-content parts).
  • Forward-pass interleaving of image soft-tokens with text tokens (Gemma 4 supports freely interleaved multimodal input).
  • (Later) audio waveform projection; video = frame sequence.
  • Tests: projector parity against llama.cpp libmtmd on a fixed image.
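To make the patch-embedder item above more concrete, here is a toy sketch of the data flow: cut an image into patches, push each patch through one linear layer, and splice the resulting soft tokens into the text embedding sequence. Everything here is an assumption for illustration: the function names, the single weight/bias pair, the row-major patch order, and the absence of normalization, tiling, or positional information. The actual Gemma 4 preprocessing and projector layout have to come from the mmproj and the reference implementation.

```python
def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened square patches.

    Patches are emitted row-major; H and W must be multiples of `patch`.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [
                channel
                for y in range(top, top + patch)
                for x in range(left, left + patch)
                for channel in image[y][x]
            ]
            patches.append(flat)
    return patches


def project(patches, weight, bias):
    """Apply one linear layer: weight is d_model x patch_dim, bias is d_model."""
    return [
        [sum(w * v for w, v in zip(row, p)) + b for row, b in zip(weight, bias)]
        for p in patches
    ]


def splice(text_embeds, soft_tokens, position):
    """Insert image soft tokens into the embedding sequence at `position`."""
    if not 0 <= position <= len(text_embeds):
        raise ValueError("position out of range")
    return text_embeds[:position] + soft_tokens + text_embeds[position:]
```

In the engine, `position` would correspond to wherever the image placeholder sits in the tokenized prompt, and interleaving several images amounts to repeating the splice per placeholder. A parity test against libmtmd could then compare the output of the projection step element-wise within a tolerance.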
