Add Gemma 4 12B (dense gemma4_unified) support — CUDA-Hybrid/CUDA first, Q4_K_M

Track support for the **dense Gemma 4 12B** model (`gemma4_unified`, released 2026-05-23). Full research + phased plan: `docs/gemma4-12b-implementation-plan.md`.

## Context
The merged `gemma4` path (E4B, #82) is metadata/tensor-driven and already implements the hard parts the 12B needs: per-layer head-dim variance, dual-RoPE, SWA, shared-KV, post-attn/post-ffn norms, GeLU-tanh FFN, final-logit softcap, and proportional RoPE (`rope_freqs`). The dense 12B is essentially **E4B minus PLE / `layer_output_scale`, plus a 1024 sliding window, plus a possible arch-string rename**. So this is mostly **validation of the never-exercised dense (no-PLE) path + decode parity**, not new kernels.

## Target
- **Primary model:** `google/gemma-4-12B-it-qat-q4_0-gguf` (~7.0 GB) — Google's official **quantization-aware-trained (QAT)** 4-bit weights. Best quality-per-byte at 4-bit; fits full-GPU offload on 12 GB. → https://hf.co/google/gemma-4-12B-it-qat-q4_0-gguf
- **Fallback model:** `unsloth/gemma-4-12b-it-GGUF` @ **Q4_K_M** (~7.3 GB) — community K-quant of the non-QAT weights, for cross-checking. → https://hf.co/unsloth/gemma-4-12b-it-GGUF
- **Parity oracle:** `ggml-org/gemma-4-12B-it-GGUF` (llama.cpp reference quant) for Phase 3 decode parity.
- **Hardware:** 12 GB VRAM (RTX 3060/4070), full-GPU offload + KV at ~8K-32K context; 64 GB system RAM for CUDA-Hybrid spill if going higher than Q6.
- **Backends (in order):** CUDA-Hybrid + CUDA → CPU → Vulkan.
- **Out of scope (iter 1):** encoder-free multimodal (vision/audio) — separate `mmproj`, follow-up epic.

## Gap analysis
- **G0 (NEW)** — QAT weights are stored as **`q4_0`**, not `Q4_K_M`. Confirm the q4_0 dequant path is wired across CPU/CUDA/CUDA-Hybrid for the gemma4 tensors (token-embd, attn/ffn weights). If only K-quants were exercised for gemma4 so far, validate/enable q4_0. (The unsloth Q4_K_M fallback covers the K-quant path.)
- **G1** — Verify `general.architecture`: `gemma4` vs `gemma4_unified`. If the latter, alias it in the NEOX allowlist, `isGemma4`, `_isGemma4Like`, and metadata-key prefix logic (`ModelGraph.cs:289,352`).
- **G2** — Dense (no-PLE) path has never been exercised; all gemma4 tests are E4B/PLE. Validate `HasPerLayerTokenEmbd == false` end-to-end across CPU/CUDA/CUDA-Hybrid.
- **G3** — `layer_output_scale` absence: confirm clean skip on the `HasLayerOutputScale == false` branch.
- **G4** — Sliding window = 1024 (vs E4B&#39;s 512); confirm nothing hard-codes 512.
- **G5** — Proportional RoPE (`rope_freqs`) on global layers; validate at long positions, not just position 0.
- **G6** — KV/context budget: no PLE table to subtract; size per-layer KV by `min(maxSeqLen, window)`; 256K context not realistic on 12 GB.
- **G7** — Tokenizer / chat template / EOG smoke for the 12B GGUF.

## Phases
- [ ] **Phase 0** — Acquire + dump real 12B GGUF headers → `tests/fixtures/gemma4_12b_header.md` for **both** the QAT q4_0 primary and the unsloth Q4_K_M fallback (note quant-type differences) *(needs HF egress)*
- [ ] **Phase 1** — Arch-string + metadata wiring (G1) + `Gemma4_12B_Dense_PopulatesAllFields` test
- [ ] **Phase 2** — Dense (no-PLE) CUDA + CUDA-Hybrid decode (G0/G2/G3/G4/G5)
- [ ] **Phase 3** — CUDA / CUDA-Hybrid decode parity vs llama.cpp *(acceptance gate)* → `tests/fixtures/gemma4_12b_greedy.json`
- [ ] **Phase 4** — Server + CLI smoke (G7)
- [ ] **Phase 5** — CPU backend parity
- [ ] **Phase 6** — Vulkan backend parity
- [ ] **Phase 7** (future epic) — Encoder-free multimodal (vision + audio)

## Open questions (resolve from Phase 0 dump)
1. `general.architecture` string — `gemma4` or `gemma4_unified`?
2. `block_count`, `embedding_length`, `feed_forward_length`, head counts.
3. Confirm `per_layer_*` and `layer_output_scale` absent (dense).
4. `rope_freqs.weight` present? length (`maxHeadDim/2`)?
5. `sliding_window == 1024`? pattern length == block_count, final entry global?
6. `shared_kv_layers` count.
7. Does the text GGUF embed any multimodal projection tensors, or is `mmproj` fully separate?
8. QAT GGUF tensor quant types — all `q4_0`, or are norms/embeddings kept at higher precision (`f16`/`q8_0`)?

Plan doc PR: develops on `claude/gemma-4-12b-support-4Reex`.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Gemma 4 12B (dense gemma4_unified) support — CUDA-Hybrid/CUDA first, Q4_K_M #124

Context

Target

Gap analysis

Phases

Open questions (resolve from Phase 0 dump)

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Add Gemma 4 12B (dense gemma4_unified) support — CUDA-Hybrid/CUDA first, Q4_K_M #124

Description

Context

Target

Gap analysis

Phases

Open questions (resolve from Phase 0 dump)

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions