Track support for the dense Gemma 4 12B model (gemma4_unified, released 2026-05-23). Full research + phased plan: docs/gemma4-12b-implementation-plan.md.
Context
The merged gemma4 path (E4B, #82) is metadata/tensor-driven and already implements the hard parts the 12B needs: per-layer head-dim variance, dual-RoPE, SWA, shared-KV, post-attn/post-ffn norms, GeLU-tanh FFN, final-logit softcap, and proportional RoPE (rope_freqs). The dense 12B is essentially E4B minus PLE / layer_output_scale, plus a 1024 sliding window, plus a possible arch-string rename. So this is mostly validation of the never-exercised dense (no-PLE) path + decode parity, not new kernels.
Target
- Primary model: google/gemma-4-12B-it-qat-q4_0-gguf (~7.0 GB), Google's official quantization-aware-trained (QAT) 4-bit weights. Best quality-per-byte at 4-bit; fits full-GPU offload on 12 GB. → https://hf.co/google/gemma-4-12B-it-qat-q4_0-gguf
- Fallback model: unsloth/gemma-4-12b-it-GGUF @ Q4_K_M (~7.3 GB), a community K-quant of the non-QAT weights, for cross-checking. → https://hf.co/unsloth/gemma-4-12b-it-GGUF
- Parity oracle: ggml-org/gemma-4-12B-it-GGUF (llama.cpp reference quant) for Phase 3 decode parity.
- Hardware: 12 GB VRAM (RTX 3060/4070), full-GPU offload + KV at ~8K-32K context; 64 GB system RAM for CUDA-Hybrid spill if going higher than Q6.
- Backends (in order): CUDA-Hybrid + CUDA → CPU → Vulkan.
- Out of scope (iter 1): encoder-free multimodal (vision/audio); separate mmproj, follow-up epic.
Gap analysis
- G0 (NEW): QAT weights are stored as q4_0, not Q4_K_M. Confirm the q4_0 dequant path is wired across CPU/CUDA/CUDA-Hybrid for the gemma4 tensors (token-embd, attn/ffn weights). If only K-quants have been exercised for gemma4 so far, validate/enable q4_0. (The unsloth Q4_K_M fallback covers the K-quant path.)
- G1: Verify whether general.architecture is gemma4 or gemma4_unified. If the latter, alias it in the NEOX allowlist, isGemma4, _isGemma4Like, and the metadata-key prefix logic (ModelGraph.cs:289,352).
- G2: The dense (no-PLE) path has never been exercised; all gemma4 tests are E4B/PLE. Validate HasPerLayerTokenEmbd == false end-to-end across CPU/CUDA/CUDA-Hybrid.
- G3: layer_output_scale is absent; confirm a clean skip on the HasLayerOutputScale == false branch.
- G4: Sliding window = 1024 (vs E4B's 512); confirm nothing hard-codes 512.
- G5: Proportional RoPE (rope_freqs) on global layers; validate at long positions, not just position 0.
- G6: KV/context budget. There is no PLE table to subtract; size per-layer KV by min(maxSeqLen, window); 256K context is not realistic on 12 GB.
- G7: Tokenizer / chat template / EOG smoke test for the 12B GGUF.
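For G6, the budget arithmetic can be sketched roughly as below. This is a standalone estimate, not code from the repo. It assumes an f16 KV cache, a uniform head dim (the real model has per-layer head-dim variance), no shared-KV savings, and that global layers hold maxSeqLen positions while sliding layers hold min(maxSeqLen, window). Every shape in the example run (48 layers, every 6th layer global, 8 KV heads, head dim 256) is a placeholder, because the real values are still open questions pending the Phase 0 dump.

```python
def kv_cache_bytes(max_seq_len, window, layer_is_global, n_kv_heads, head_dim,
                   bytes_per_elem=2):
    """Estimate total KV-cache size in bytes.

    layer_is_global: one bool per layer that owns a KV cache.
    Sliding layers store min(max_seq_len, window) positions; global layers
    are assumed (in this sketch) to store max_seq_len positions.
    """
    total = 0
    for is_global in layer_is_global:
        positions = max_seq_len if is_global else min(max_seq_len, window)
        # K and V each hold n_kv_heads * head_dim elements per position.
        total += 2 * positions * n_kv_heads * head_dim * bytes_per_elem
    return total


if __name__ == "__main__":
    # Placeholder shapes only; replace with values from the header dump.
    pattern = [(i + 1) % 6 == 0 for i in range(48)]
    for ctx in (8_192, 32_768, 262_144):
        gib = kv_cache_bytes(ctx, 1024, pattern, 8, 256) / 2**30
        print(f"ctx={ctx:>7}: ~{gib:.2f} GiB KV")
```

With a 1024 window the sliding layers stop growing once the context passes 1024 tokens, so the global layers dominate the budget at long context.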
Phases
Open questions (resolve from Phase 0 dump)
- general.architecture string: gemma4 or gemma4_unified?
- block_count, embedding_length, feed_forward_length, head counts.
- Confirm per_layer_* and layer_output_scale absent (dense).
- rope_freqs.weight present? length (maxHeadDim/2)?
- sliding_window == 1024? pattern length == block_count, final entry global?
- shared_kv_layers count.
- Does the text GGUF embed any multimodal projection tensors, or is mmproj fully separate?
- QAT GGUF tensor quant types: all q4_0, or are norms/embeddings kept at higher precision (f16/q8_0)?
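Several of these questions could be answered mechanically once the header dump exists. The sketch below is a hypothetical checker over an already-parsed header, not part of the repo: it takes a plain dict of metadata and a dict mapping tensor names to element counts. Apart from general.architecture, the metadata key names (the arch-prefixed block_count, attention.sliding_window, attention.sliding_window_pattern) and the convention that True in the pattern marks a sliding layer are assumptions that must be confirmed against the real dump.

```python
def check_dense_header(meta, tensors, max_head_dim):
    """Answer the yes/no open questions from a parsed GGUF header.

    meta: metadata key -> value; tensors: tensor name -> element count.
    Key names other than general.architecture are assumed, not confirmed.
    """
    arch = meta["general.architecture"]
    # Assumed encoding: list of bools, True = sliding-window layer.
    pattern = meta.get(f"{arch}.attention.sliding_window_pattern", [])
    # Any PLE or output-scale tensor means this is not the dense variant.
    ple_or_scale = [name for name in tensors
                    if "per_layer_" in name or "layer_output_scale" in name]
    return {
        "arch": arch,
        "arch_known": arch in ("gemma4", "gemma4_unified"),
        "dense": not ple_or_scale,
        "rope_freqs_len_ok": tensors.get("rope_freqs.weight") == max_head_dim // 2,
        "window_is_1024": meta.get(f"{arch}.attention.sliding_window") == 1024,
        "pattern_len_ok": len(pattern) == meta.get(f"{arch}.block_count"),
        "final_layer_global": bool(pattern) and not pattern[-1],
    }
```

Running it on both the QAT q4_0 and the Q4_K_M headers and comparing the two result dicts would show whether the files differ in anything beyond quant types.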
Phase artifacts:
- tests/fixtures/gemma4_12b_header.md for both the QAT q4_0 primary and the unsloth Q4_K_M fallback (note quant-type differences) (needs HF egress)
- Gemma4_12B_Dense_PopulatesAllFields test
- tests/fixtures/gemma4_12b_greedy.json
Plan doc PR: developed on branch claude/gemma-4-12b-support-4Reex.