Evaluate NVIDIA Model-Optimizer techniques for low-VRAM (12GB/64GB) perf #104

## Summary

Investigated whether [NVIDIA Model-Optimizer](https://github.com/NVIDIA/Model-Optimizer) contains techniques we can adapt to improve performance on low-VRAM systems (target: **12 GB VRAM / 64 GB system RAM**).

**Verdict:** Mostly orthogonal. Model-Optimizer is a Python/PyTorch *model-preparation* toolkit that emits checkpoints for GPU serving runtimes (NVIDIA's TensorRT-LLM, plus vLLM and SGLang). It is not a runtime library we can link into a C# GGUF engine, so "adapting" it means porting an *algorithm or format*, not reusing code. Most of its surface (QAT, distillation, pruning) is training-time and out of scope for an engine that loads pre-quantized GGUF.

Critically, we already have the levers that matter most on a 12 GB box (KV-cache compression, expert offload, paged cache, speculative decode), **and in some cases, such as KV-cache quantization, ours are more aggressive than Model-Optimizer's.**

## What it offers vs. what we already have

| Model-Optimizer technique | SharpInference status | Adapt? |
|---|---|---|
| FP8 / INT4-AWQ / INT8-SmoothQuant PTQ | We load pre-quantized GGUF (Q4_K…IQ-series, fp8 E4M3); llama.cpp imatrix quant ≈ AWQ | No — different layer |
| **NVFP4** (FP4 E2M1 + FP8 E4M3 block-scale/16 + per-tensor FP32) | Not supported | Maybe (conditional) |
| KV-cache FP8/INT8 quant | TurboQuant **3–4 bit** Lloyd-Max + Hadamard — *more* aggressive | No — we're ahead |
| **AutoQuantize** per-layer mixed-precision sensitivity search | Not present; `TierPlanner` places layers by size only | **Yes — best idea** |
| EAGLE-3 / Medusa draft modules | `SpeculativeDecoder` + `MtpDecoder` (NEXTN) already | Partial (consume their heads) |
| Minitron depth/width pruning + distillation | None (training-side, out of scope) | No — just run their output GGUFs |
| 2:4 structured sparsity | None | No — needs Ampere+ sparse-tensor HW path |

## Adaptable ideas, ranked

### 1. AutoQuantize-style per-layer sensitivity → smarter offload (highest value)
Model-Optimizer's most portable *idea* (not code) is ranking layers by quantization sensitivity, then keeping sensitive layers high-precision and pushing insensitive ones lower. On a 12 GB box the binding constraint is *what stays in VRAM*. Today `src/SharpInference.Engine/TierPlanner.cs` greedily fills the GPU from layer 0 upward by byte size. A sensitivity-aware variant would keep high-impact layers (often the earliest and last) resident and high-precision, and spill the flat-sensitivity middle layers to CPU or a lower bit-width, which would directly improve quality-per-VRAM-GB. This maps cleanly onto the existing `TierPlanner` + GGUF mixed-quant support.

Suggested approach: start with a cheap proxy metric (per-layer activation/weight magnitude, or a small calibration pass) rather than Model-Optimizer's full search, and ship it opt-in behind a flag.
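To make the difference concrete, here is a minimal, language-neutral sketch in Python (not the engine's C#) comparing the two placement policies. Everything in it is an assumption for illustration: the `Layer` record, the `sensitivity` proxy score, and both function names are hypothetical, and `plan_by_size` is only one reading of "fill from layer 0 upward", not a copy of what `TierPlanner` does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    index: int
    size_bytes: int
    # Hypothetical proxy score: higher means more quality loss if the
    # layer is spilled to CPU or demoted to a lower bit-width.
    sensitivity: float


def plan_by_size(layers, vram_budget_bytes):
    """Baseline reading: fill the GPU from layer 0 upward until the budget is hit."""
    resident, used = [], 0
    for layer in sorted(layers, key=lambda l: l.index):
        if used + layer.size_bytes > vram_budget_bytes:
            break
        resident.append(layer.index)
        used += layer.size_bytes
    return resident


def plan_by_sensitivity(layers, vram_budget_bytes):
    """Greedy variant: take layers in order of sensitivity per byte, skipping any that do not fit."""
    resident, used = [], 0
    ranked = sorted(layers, key=lambda l: l.sensitivity / l.size_bytes, reverse=True)
    for layer in ranked:
        if used + layer.size_bytes <= vram_budget_bytes:
            resident.append(layer.index)
            used += layer.size_bytes
    return sorted(resident)


# Toy model: six equal-size layers with a U-shaped sensitivity profile
# (early and last layers matter most), and room for three of them.
layers = [
    Layer(i, 100, s) for i, s in enumerate([0.9, 0.3, 0.1, 0.1, 0.2, 0.8])
]
size_plan = plan_by_size(layers, 300)
sens_plan = plan_by_sensitivity(layers, 300)
print("by size:       ", size_plan)
print("by sensitivity:", sens_plan)
```

With this toy profile the size-only plan keeps layers 0 to 2, while the sensitivity-aware plan keeps the first two and the last layer. A greedy ratio ranking is not an optimal knapsack solution; it is just the cheapest thing that could sit behind an opt-in flag.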

### 2. NVFP4 dequant path (conditional)
NVFP4 is reported to preserve quality better than INT4 at the same 4-bit weight width, and the fp8 E4M3 plumbing we already have in `CudaBackend` gives us a starting point for the block-scale decode. **But** it only pays off if models actually ship in NVFP4, it's tuned for Blackwell, and it overlaps what Q4_K/IQ4 already give us on CPU/Vulkan. Park it until the GGUF ecosystem (or llama.cpp's MXFP4 / gpt-oss format) forces our hand.
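For scoping, the decode itself is small. The sketch below, in Python for readability rather than as engine code, dequantizes one 16-element block from the layout named in the table: FP4 E2M1 values, one FP8 E4M3 scale per block, and a per-tensor FP32 scale. The function names are hypothetical, and the nibble order (low nibble first) is an assumption that would need checking against whatever file format we ended up loading.

```python
# Magnitudes representable by the 3 low bits of an E2M1 value
# (2 exponent bits, 1 mantissa bit); bit 3 is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(nibble: int) -> float:
    magnitude = E2M1_MAGNITUDES[nibble & 0x7]
    return -magnitude if nibble & 0x8 else magnitude


def decode_e4m3(byte: int) -> float:
    """Decode FP8 E4M3 (finite-only variant: bias 7, no infinities)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if exponent == 0xF and mantissa == 0x7:
        return float("nan")
    if exponent == 0:  # subnormal
        return sign * (mantissa / 8.0) * 2.0 ** -6
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


def dequant_block(packed: bytes, scale_byte: int, tensor_scale: float) -> list[float]:
    """Dequantize 16 FP4 values stored as 8 bytes, two values per byte."""
    if len(packed) != 8:
        raise ValueError("an NVFP4 block holds 16 values in 8 bytes")
    scale = decode_e4m3(scale_byte) * tensor_scale
    out = []
    for b in packed:
        out.append(decode_e2m1(b & 0xF) * scale)  # low nibble first (assumed)
        out.append(decode_e2m1(b >> 4) * scale)
    return out


# Block scale 0x40 decodes to 2.0; with a tensor scale of 0.5 the
# effective scale is 1.0, so the raw E2M1 values come through unchanged.
block = dequant_block(bytes([0x21, 0xF7, 0, 0, 0, 0, 0, 0]), 0x40, 0.5)
print(block[:4])
```

Storage works out to 4 bits per weight plus 8 scale bits per 16 weights, so about 4.5 bits per weight before the single per-tensor scale.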

### 3. EAGLE-3 draft heads (niche)
Speculative + MTP infra already exists; the only adaptable piece is *consuming* an EAGLE-3-trained draft head for a better accept rate. That's a loader/format task, gated on someone publishing EAGLE-3 heads for the models we run.
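Why accept rate is the number to watch can be shown with the standard speculative-decoding estimate: if each drafted token is accepted independently with probability `a` and the draft proposes `k` tokens, one verification pass yields `(1 - a^(k+1)) / (1 - a)` tokens on average. The sketch below just evaluates that formula; the independence assumption is a simplification, the function name is hypothetical, and the accept rates are illustrative values, not measurements of our decoders or of any EAGLE-3 head.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model pass under an i.i.d. accept model."""
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        # Every drafted token is accepted, plus the one token the target adds.
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


# Illustrative accept rates only, with a draft length of 4.
for rate in (0.6, 0.8):
    print(rate, round(expected_tokens_per_step(rate, 4), 4))
```

Under these assumptions, moving the accept rate from 0.6 to 0.8 raises the yield from roughly 2.3 to roughly 3.4 tokens per pass, which is the kind of gain a better-trained draft head would have to deliver to justify the loader work.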

Everything else (pruning, distillation, QAT, SmoothQuant calibration) is model-production work that belongs in their Python pipeline — the right way to "use" it is to run their pre-optimized checkpoints (Nemotron, pruned Llamas) once converted to GGUF.

## Recommendation

The one item worth real engineering is **#1 — a sensitivity-aware `TierPlanner`**, because it leverages infrastructure we already have and attacks the actual 12 GB constraint. The rest is either already covered better by TurboQuant, or gated on external model availability.

## References
- [Model-Optimizer repo](https://github.com/NVIDIA/Model-Optimizer)
- [AutoQuantize mixed-precision search](https://deepwiki.com/NVIDIA/Model-Optimizer/6.1-autoquantize-mixed-precision-search)
- [llm_ptq examples (NVFP4/AWQ/W4A8)](https://github.com/NVIDIA/Model-Optimizer/blob/main/examples/llm_ptq/README.md)
