
feat(vision): Gemma 4 audio input (gemma4ua / gemma4a) #254

@pekkah

Follow-up to #250 / PR #251 (image input shipped). Gemma 4 is natively multimodal, including audio, but audio input is not implemented yet.

Weight-availability findings (from dumped mmproj headers, 2026-06-15)

  • 12B QAT mmproj (mmproj-gemma-4-12b-it-qat-q4_0.gguf): clip.audio.projector_type=gemma4ua and has_audio_encoder=True are declared, but the file ships ZERO a.* audio tensors (only v.* vision + mm.*). So 12B audio is not possible with this QAT mmproj — it would need an audio-carrying 12B mmproj (verify one exists upstream).
  • E4B mmproj (gemma-4-E4B-it-mmproj.gguf): ships a real audio encoder — clip.audio.projector_type=gemma4a, 751 a.* tensors (~12-block conformer-style: num_mel_bins=128, embedding 1024, ffn 4096, 8 heads), mm.a.input_projection [1536→2560].
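
A minimal sketch of the check behind these findings, assuming the tensor names and header metadata have already been dumped from the mmproj into plain Python values. The helper names (`tensor_inventory`, `audio_weights_present`), the bare `has_audio_encoder` key spelling, and the example tensor names are illustrative assumptions, not taken from the repo or the GGUF files.

```python
from collections import Counter


def tensor_inventory(tensor_names):
    """Count tensors by top-level prefix, e.g. 'a' (audio), 'v' (vision), 'mm'."""
    counts = Counter()
    for name in tensor_names:
        prefix = name.split(".", 1)[0]
        counts[prefix] += 1
    return counts


def audio_weights_present(metadata, tensor_names):
    """True only if the header declares audio AND a.* tensors are actually shipped.

    The key name 'has_audio_encoder' is written as it appears in this issue;
    the real GGUF key may carry a namespace prefix (assumption).
    """
    declared = bool(metadata.get("has_audio_encoder", False))
    shipped = tensor_inventory(tensor_names).get("a", 0) > 0
    return declared and shipped


# Illustrative inputs only: the tensor names below are made up for the example.
qat_meta = {"clip.audio.projector_type": "gemma4ua", "has_audio_encoder": True}
qat_names = ["v.blk.0.attn_q.weight", "mm.input_projection.weight"]

e4b_meta = {"clip.audio.projector_type": "gemma4a", "has_audio_encoder": True}
e4b_names = ["a.blk.0.norm_conv.weight", "mm.a.input_projection.weight"]

print(audio_weights_present(qat_meta, qat_names))  # header says audio, no a.* tensors
print(audio_weights_present(e4b_meta, e4b_names))  # header says audio, a.* tensors present
```

Checking the declared flag and the shipped tensors separately is what catches the 12B QAT case, where the header advertises audio but the file carries no a.* weights.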

Scope

  • Audio preprocessing (mel-spectrogram, 128 mel bins) for the gemma4a path.
  • gemma4a audio encoder forward pass (conformer blocks incl. norm_conv) — part of the E4B encoder-full effort.
  • Project audio features → text embed dim and splice via the existing ForwardEmbedding seam at <|audio|> placeholders (<|audio>/<audio|> wrap).
  • CLI --audio + (later) server audio content blocks.
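
The splice step in the third bullet could look roughly like the sketch below. It assumes the prompt is tokenized as a begin marker, one `<|audio|>` placeholder per projected audio frame, and an end marker, and that projected audio rows already match the text embedding width. The function name, the token ids, and the tiny embedding width are hypothetical; the actual ForwardEmbedding seam in the repo may be shaped differently.

```python
def splice_audio_embeddings(token_ids, text_embeds, audio_embeds, placeholder_id):
    """Return a copy of text_embeds with placeholder rows replaced by audio rows.

    token_ids:      list of int, one per sequence position
    text_embeds:    list of rows (lists of float), one per token
    audio_embeds:   list of projected audio rows, one per placeholder, in order
    placeholder_id: token id of the <|audio|> placeholder (hypothetical value)
    """
    if len(token_ids) != len(text_embeds):
        raise ValueError("token_ids and text_embeds must have the same length")

    positions = [i for i, tok in enumerate(token_ids) if tok == placeholder_id]
    if len(positions) != len(audio_embeds):
        raise ValueError(
            f"{len(positions)} placeholders but {len(audio_embeds)} audio rows"
        )

    out = [list(row) for row in text_embeds]  # copy so the input is not mutated
    for pos, audio_row in zip(positions, audio_embeds):
        if len(audio_row) != len(out[pos]):
            raise ValueError("audio row width does not match text embed width")
        out[pos] = list(audio_row)
    return out


# Hypothetical ids: 1 = <|audio>, 2 = <|audio|>, 3 = <audio|>, 9 = ordinary text.
BOA, AUDIO, EOA, TEXT = 1, 2, 3, 9
tokens = [TEXT, BOA, AUDIO, AUDIO, EOA, TEXT]
text_rows = [[0.0] * 4 for _ in tokens]          # width 4 stands in for 2560
audio_rows = [[1.0] * 4, [2.0] * 4]              # two projected audio frames

spliced = splice_audio_embeddings(tokens, text_rows, audio_rows, AUDIO)
```

Failing loudly on a placeholder/frame count mismatch keeps a preprocessing bug (wrong number of audio frames for the prompt) from silently shifting embeddings.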

Relationship

  • The E4B audio encoder (gemma4a) is part of the E4B encoder-full work; see #126, "Add Gemma 4 E4B multimodal (vision / image input) support" (its deferred V6). This issue can either subsume that or stay the cross-cutting audio tracker.
  • 12B audio (gemma4ua, encoder-free like its vision) is blocked on weight availability (above).

Metadata


Labels: enhancement
