Add Gemma 4 12B (dense gemma4_unified) support — CUDA-Hybrid/CUDA first, Q4_K_M #124

@pekkah

Track support for the dense Gemma 4 12B model (gemma4_unified, released 2026-05-23). Full research + phased plan: docs/gemma4-12b-implementation-plan.md.

Context

The merged gemma4 path (E4B, #82) is metadata/tensor-driven and already implements the hard parts the 12B needs: per-layer head-dim variance, dual-RoPE, SWA, shared-KV, post-attn/post-ffn norms, GeLU-tanh FFN, final-logit softcap, and proportional RoPE (rope_freqs). The dense 12B is essentially E4B minus PLE / layer_output_scale, plus a 1024 sliding window, plus a possible arch-string rename. The work is therefore mostly validating the never-exercised dense (no-PLE) path and decode parity, not writing new kernels.

Target

  • Primary model: google/gemma-4-12B-it-qat-q4_0-gguf (~7.0 GB) — Google's official quantization-aware-trained (QAT) 4-bit weights. Best quality-per-byte at 4-bit; fits full-GPU offload on 12 GB. → https://hf.co/google/gemma-4-12B-it-qat-q4_0-gguf
  • Fallback model: unsloth/gemma-4-12b-it-GGUF @ Q4_K_M (~7.3 GB) — community K-quant of the non-QAT weights, for cross-checking. → https://hf.co/unsloth/gemma-4-12b-it-GGUF
  • Parity oracle: ggml-org/gemma-4-12B-it-GGUF (llama.cpp reference quant) for Phase 3 decode parity.
  • Hardware: 12 GB VRAM (RTX 3060/4070), full-GPU offload + KV at ~8K-32K context; 64 GB system RAM for CUDA-Hybrid spill if going higher than Q6.
  • Backends (in order): CUDA-Hybrid + CUDA → CPU → Vulkan.
  • Out of scope (iter 1): encoder-free multimodal (vision/audio) — separate mmproj, follow-up epic.

Gap analysis

  • G0 (NEW) — QAT weights are stored as q4_0, not Q4_K_M. Confirm the q4_0 dequant path is wired across CPU/CUDA/CUDA-Hybrid for the gemma4 tensors (token-embd, attn/ffn weights). If only K-quants were exercised for gemma4 so far, validate/enable q4_0. (The unsloth Q4_K_M fallback covers the K-quant path.)
  • G1 — Verify general.architecture: gemma4 vs gemma4_unified. If the latter, alias it in the NEOX allowlist, isGemma4, _isGemma4Like, and metadata-key prefix logic (ModelGraph.cs:289,352).
  • G2 — Dense (no-PLE) path has never been exercised; all gemma4 tests are E4B/PLE. Validate HasPerLayerTokenEmbd == false end-to-end across CPU/CUDA/CUDA-Hybrid.
  • G3 — layer_output_scale absence: confirm clean skip on the HasLayerOutputScale == false branch.
  • G4 — Sliding window = 1024 (vs E4B's 512); confirm nothing hard-codes 512.
  • G5 — Proportional RoPE (rope_freqs) on global layers; validate at long positions, not just position 0.
  • G6 — KV/context budget: no PLE table to subtract; size per-layer KV by min(maxSeqLen, window); a 256K context is not realistic on 12 GB.
  • G7 — Tokenizer / chat template / EOG smoke for the 12B GGUF.
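The G6 sizing rule can be sketched as follows. This is a minimal Python illustration, not the project's code: `LayerSpec`, `kv_cache_bytes`, and all field names are hypothetical, and it assumes that global layers keep the full context, that shared-KV layers allocate nothing of their own, and that K and V are stored at 2 bytes per element. Real layer counts and head shapes should come from the Phase 0 header dump.

```python
from dataclasses import dataclass


@dataclass
class LayerSpec:
    """Hypothetical per-layer description; fill from the GGUF header dump."""
    is_sliding: bool       # True for sliding-window (local) attention layers
    n_kv_heads: int
    head_dim: int          # may differ per layer (per-layer head-dim variance)
    owns_kv: bool = True   # False for layers that reuse another layer's KV


def kv_cache_bytes(layers, max_seq_len, window=1024, bytes_per_elem=2):
    """Estimate total KV cache size in bytes.

    Sliding layers hold min(max_seq_len, window) tokens; global layers are
    assumed to hold max_seq_len tokens. The factor 2 covers K and V.
    """
    total = 0
    for layer in layers:
        if not layer.owns_kv:
            continue  # shared-KV layer: no allocation of its own
        tokens = min(max_seq_len, window) if layer.is_sliding else max_seq_len
        total += 2 * tokens * layer.n_kv_heads * layer.head_dim * bytes_per_elem
    return total


if __name__ == "__main__":
    # Made-up shape for illustration only, NOT the real 12B layout.
    demo = [LayerSpec(True, 4, 256)] * 5 + [LayerSpec(False, 4, 512)]
    for ctx in (8192, 32768):
        print(ctx, kv_cache_bytes(demo, ctx) / 2**20, "MiB")
```

Under these assumptions the sliding layers stop growing once the context exceeds the window, so the global layers dominate the budget at long contexts.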

Phases

  • Phase 0 — Acquire + dump real 12B GGUF headers → tests/fixtures/gemma4_12b_header.md for both the QAT q4_0 primary and the unsloth Q4_K_M fallback (note quant-type differences) (needs HF egress)
  • Phase 1 — Arch-string + metadata wiring (G1) + Gemma4_12B_Dense_PopulatesAllFields test
  • Phase 2 — Dense (no-PLE) CUDA + CUDA-Hybrid decode (G0/G2/G3/G4/G5)
  • Phase 3 — CUDA / CUDA-Hybrid decode parity vs llama.cpp (acceptance gate) → tests/fixtures/gemma4_12b_greedy.json
  • Phase 4 — Server + CLI smoke (G7)
  • Phase 5 — CPU backend parity
  • Phase 6 — Vulkan backend parity
  • Phase 7 (future epic) — Encoder-free multimodal (vision + audio)
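For the G0 part of Phase 2, a tiny reference dequantizer may help when cross-checking the CPU/CUDA q4_0 paths on a single tensor. The sketch below is Python and reflects my understanding of the ggml q4_0 block layout (32 weights per block: one little-endian fp16 scale followed by 16 bytes of packed 4-bit values, low nibbles first, each offset by 8). Treat that layout as an assumption and confirm it against the llama.cpp reference before using it as an oracle. `dequantize_q4_0` is a hypothetical helper name.

```python
import struct

QK4_0 = 32                     # weights per q4_0 block
BLOCK_BYTES = 2 + QK4_0 // 2   # fp16 scale (2 bytes) + 16 bytes of nibbles


def dequantize_q4_0(data):
    """Dequantize raw q4_0 bytes into a flat list of floats.

    Assumed layout per block: byte j carries element j in its low nibble
    and element j + 16 in its high nibble; value = (nibble - 8) * scale.
    """
    if len(data) % BLOCK_BYTES != 0:
        raise ValueError("q4_0 data must be a whole number of 18-byte blocks")
    out = []
    for off in range(0, len(data), BLOCK_BYTES):
        (scale,) = struct.unpack_from("<e", data, off)  # little-endian fp16
        qs = data[off + 2: off + BLOCK_BYTES]
        low = [((b & 0x0F) - 8) * scale for b in qs]    # elements 0..15
        high = [((b >> 4) - 8) * scale for b in qs]     # elements 16..31
        out.extend(low + high)
    return out
```

Comparing this output against each backend's dequantized row for one token-embd or attn/ffn block is a cheap way to localize a q4_0 wiring bug before running full decode parity.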

Open questions (resolve from Phase 0 dump)

  1. general.architecture string — gemma4 or gemma4_unified?
  2. block_count, embedding_length, feed_forward_length, head counts.
  3. Confirm per_layer_* and layer_output_scale absent (dense).
  4. rope_freqs.weight present? length (maxHeadDim/2)?
  5. sliding_window == 1024? pattern length == block_count, final entry global?
  6. shared_kv_layers count.
  7. Does the text GGUF embed any multimodal projection tensors, or is mmproj fully separate?
  8. QAT GGUF tensor quant types — all q4_0, or are norms/embeddings kept at higher precision (f16/q8_0)?
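Several of the questions above can be turned into an automated check over the Phase 0 dump. The Python sketch below is illustrative only: `check_dense_header` is a hypothetical helper, the metadata key names (`<arch>.attention.sliding_window`, `<arch>.block_count`, `<arch>.attention.sliding_window_pattern`) are assumed by analogy with other GGUF architectures, and the pattern is assumed to be a per-layer list where a truthy entry means "sliding". Adjust all of these to whatever the real header shows.

```python
def check_dense_header(meta, tensor_names):
    """Return a list of problems; an empty list means the dump looks dense.

    meta: dict of GGUF metadata key -> value (key names are assumptions).
    tensor_names: iterable of tensor names found in the file.
    """
    problems = []
    names = list(tensor_names)

    # Q1: accept either arch string and derive the metadata-key prefix from it.
    arch = meta.get("general.architecture")
    if arch not in ("gemma4", "gemma4_unified"):
        return [f"unexpected general.architecture: {arch!r}"]
    prefix = arch + "."

    # Q5: window size, pattern length, and a global final layer.
    window = meta.get(prefix + "attention.sliding_window")
    if window != 1024:
        problems.append(f"sliding_window is {window!r}, expected 1024")
    blocks = meta.get(prefix + "block_count")
    pattern = meta.get(prefix + "attention.sliding_window_pattern")
    if pattern is None:
        problems.append("sliding_window_pattern missing")
    elif len(pattern) != blocks:
        problems.append(f"pattern length {len(pattern)} != block_count {blocks!r}")
    elif pattern[-1]:
        problems.append("final layer is sliding, expected global")

    # Q3: a dense model should carry no PLE or layer_output_scale tensors.
    for name in names:
        if "per_layer" in name or "layer_output_scale" in name:
            problems.append(f"unexpected non-dense tensor: {name}")

    # Q4: proportional RoPE frequencies should be present.
    if "rope_freqs.weight" not in names:
        problems.append("rope_freqs.weight missing")
    return problems
```

Running this against both the QAT q4_0 and the Q4_K_M dumps would give one pass/fail list per file, which could be recorded alongside the header fixture.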

Plan doc PR: developed on branch claude/gemma-4-12b-support-4Reex.
