Evaluate NVIDIA Model-Optimizer techniques for low-VRAM (12GB/64GB) perf #104

@pekkah

Summary

Investigated whether NVIDIA Model-Optimizer contains techniques we can adapt to improve performance on low-VRAM systems (target: 12 GB VRAM / 64 GB system RAM).

Verdict: Mostly orthogonal. Model-Optimizer is a Python/PyTorch model-preparation toolkit that emits checkpoints for NVIDIA's own runtimes (TensorRT-LLM, vLLM, SGLang). It is not a runtime library we can link into a C# GGUF engine — "adapting" it means porting an algorithm or format, not reusing code. Most of its surface (QAT, distillation, pruning) is training-time and out of scope for an engine that loads pre-quantized GGUF.

Critically, the levers that matter most on a 12 GB box — KV-cache compression, expert offload, paged cache, speculative decode — we already have, and in a couple of cases they're more aggressive than Model-Optimizer's.

What it offers vs. what we already have

| Model-Optimizer technique | SharpInference status | Adapt? |
| --- | --- | --- |
| FP8 / INT4-AWQ / INT8-SmoothQuant PTQ | We load pre-quantized GGUF (Q4_K…IQ-series, fp8 E4M3); llama.cpp imatrix quant ≈ AWQ | No (different layer) |
| NVFP4 (FP4 E2M1 + FP8 E4M3 block-scale/16 + per-tensor FP32) | Not supported | Maybe (conditional) |
| KV-cache FP8/INT8 quant | TurboQuant 3–4 bit Lloyd-Max + Hadamard (more aggressive) | No (we're ahead) |
| AutoQuantize per-layer mixed-precision sensitivity search | Not present; TierPlanner places layers by size only | Yes (best idea) |
| EAGLE-3 / Medusa draft modules | SpeculativeDecoder + MtpDecoder (NEXTN) already | Partial (consume their heads) |
| Minitron depth/width pruning + distillation | None (training-side, out of scope) | No (just run their output GGUFs) |
| 2:4 structured sparsity | None | No (needs Ampere+ sparse-tensor HW path) |

Adaptable ideas, ranked

1. AutoQuantize-style per-layer sensitivity → smarter offload (highest value)

Model-Optimizer's most portable idea (not code) is ranking layers by quantization sensitivity, then keeping sensitive layers high-precision and pushing insensitive ones lower. On a 12 GB box the binding constraint is what stays in VRAM. Today src/SharpInference.Engine/TierPlanner.cs greedily fills the GPU layer-0-upward by byte size. A sensitivity-aware variant — keep high-impact layers (often early + last) resident/high-precision, spill the flat-sensitivity middle layers to CPU or a lower bit-width — would directly improve quality-per-VRAM-GB. Maps cleanly onto the existing TierPlanner + GGUF mixed-quant support.
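To make the contrast concrete, here is a minimal sketch of the two placement policies. It is written in Python only for brevity and is not the TierPlanner API: `Layer`, `plan_by_size`, `plan_by_sensitivity`, and the `sensitivity` score are all hypothetical names, and the baseline reflects one reading of "fill from layer 0 upward" (stop at the first layer that does not fit).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    index: int
    size_bytes: int      # assumed > 0
    sensitivity: float   # hypothetical score: higher = more quality lost if demoted


def plan_by_size(layers, vram_budget):
    """Baseline (assumed behavior): walk layers from index 0 upward and keep
    each one in the top tier until the next layer no longer fits."""
    kept, used = [], 0
    for layer in sorted(layers, key=lambda l: l.index):
        if used + layer.size_bytes > vram_budget:
            break
        kept.append(layer.index)
        used += layer.size_bytes
    return kept


def plan_by_sensitivity(layers, vram_budget):
    """Greedy knapsack heuristic: consider layers in order of sensitivity per
    byte and keep every layer that still fits in the remaining budget."""
    kept, used = [], 0
    ranked = sorted(layers, key=lambda l: l.sensitivity / l.size_bytes, reverse=True)
    for layer in ranked:
        if used + layer.size_bytes <= vram_budget:
            kept.append(layer.index)
            used += layer.size_bytes
    return sorted(kept)


# Toy model: six equally sized layers, sensitive at both ends, flat in the middle.
TOY_LAYERS = [Layer(i, 2, s) for i, s in enumerate([9.0, 3.0, 1.0, 1.0, 2.0, 8.0])]
```

With a budget of 6 units, `plan_by_size(TOY_LAYERS, 6)` keeps layers `[0, 1, 2]`, while `plan_by_sensitivity(TOY_LAYERS, 6)` keeps `[0, 1, 5]`: the last layer displaces a low-sensitivity middle layer. Greedy-by-ratio is not an optimal knapsack solution, but it is cheap and easy to reason about, which seems appropriate for a first opt-in version.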

Suggested approach: start with a cheap proxy metric (per-layer activation/weight magnitude, or a small calibration pass) rather than Model-Optimizer's full search, and ship it as opt-in behind a flag.
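One possible cheap proxy, sketched below, scores each layer by the relative error that block-wise round-to-nearest quantization introduces into its weights. This is a stand-in chosen for illustration, not Model-Optimizer's AutoQuantize search and not an existing SharpInference routine; the function names, the symmetric 4-bit scheme, and the block size of 32 are all assumptions.

```python
def fake_quantize(block, bits=4):
    """Symmetric round-to-nearest quantize then dequantize one block of
    weights that share a single scale."""
    qmax = 2 ** (bits - 1) - 1          # 7 for 4-bit
    peak = max(abs(w) for w in block)
    if peak == 0.0:
        return list(block)
    scale = peak / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in block]


def quant_error_score(weights, bits=4, block_size=32):
    """Relative squared error from quantizing `weights` block by block.
    Higher means the layer loses more information at this bit width."""
    err = 0.0
    energy = 0.0
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        deq = fake_quantize(block, bits)
        err += sum((w - d) ** 2 for w, d in zip(block, deq))
        energy += sum(w * w for w in block)
    return err / energy if energy > 0.0 else 0.0


def rank_layers(layer_weights, bits=4):
    """Return layer names ordered from most to least sensitive under the proxy."""
    scores = {name: quant_error_score(w, bits) for name, w in layer_weights.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

A layer whose blocks contain one large outlier among many small weights scores higher than a layer with evenly spread weights, because the outlier stretches the shared scale and the small values round to zero. A weight-only proxy ignores activations entirely, so it would need checking against a small calibration pass before it drives placement decisions.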

2. NVFP4 dequant path (conditional)

NVFP4 is expected to beat INT4 on quality at the same 4 bits, and the fp8 E4M3 plumbing we already have in CudaBackend gives us a base for the block-scale decode. But it only pays off if models actually ship in NVFP4, it's tuned for Blackwell, and it overlaps what Q4_K/IQ4 already give us on CPU/Vulkan. Park until the GGUF ecosystem (or llama.cpp's MXFP4 / gpt-oss format) forces our hand.
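For a sense of how small the decode itself is, here is a reference-style sketch of dequantizing one 16-element block as the table describes it: FP4 E2M1 values, one FP8 E4M3 block scale, one per-tensor FP32 scale. The nibble order (low nibble first), the byte layout, and the convention that the tensor scale is multiplied in are assumptions; a real loader would have to match whatever the checkpoint format specifies.

```python
# Magnitudes representable by FP4 E2M1 (2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(nibble):
    """Decode a 4-bit E2M1 code: bit 3 is the sign, bits 0-2 index the magnitude."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0x7]


def decode_e4m3(byte):
    """Decode FP8 E4M3 (finite-only variant): 1 sign bit, 4 exponent bits
    with bias 7, 3 mantissa bits. Exponent 15 with mantissa 7 is NaN."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return float("nan")
    if exp == 0:                                   # subnormal
        return sign * (man / 8.0) * 2.0 ** -6
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)


def dequant_nvfp4_block(packed, scale_byte, tensor_scale):
    """Dequantize one block: 8 packed bytes -> 16 floats.
    Assumes the low nibble holds the earlier element."""
    if len(packed) != 8:
        raise ValueError("expected 8 packed bytes for a 16-element block")
    block_scale = decode_e4m3(scale_byte) * tensor_scale
    out = []
    for b in packed:
        out.append(decode_e2m1(b & 0xF) * block_scale)
        out.append(decode_e2m1(b >> 4) * block_scale)
    return out
```

For example, `dequant_nvfp4_block(bytes([0x21] * 8), 0x40, 0.5)` decodes the block scale `0x40` to 2.0, multiplies by the tensor scale 0.5, and returns eight repetitions of `0.5, 1.0`. The arithmetic is a table lookup plus two multiplies per element, which supports the point above: the cost of NVFP4 is in format support and model availability, not in the kernel math.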

3. EAGLE-3 draft heads (niche)

Speculative + MTP infra already exists; the only adaptable piece is consuming an EAGLE-3-trained draft head for a better accept rate. That's a loader/format task, gated on someone publishing EAGLE-3 heads for the models we run.
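To gauge how much a better accept rate is worth, a common simplified model treats each drafted token as accepted independently with the same probability. Under that assumption (which real drafts only approximate), the expected number of tokens produced per target-model pass has a closed form. The function name below is hypothetical.

```python
def expected_tokens_per_target_pass(accept_rate, draft_len):
    """Expected tokens emitted per target forward pass when `draft_len` tokens
    are drafted and each is accepted independently with prob. `accept_rate`.

    Closed form of 1 + a + a^2 + ... + a^draft_len: the target always
    contributes one token, plus one per consecutively accepted draft token.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)
```

With 4 drafted tokens, moving the accept rate from 0.6 to 0.8 raises the expectation from about 2.31 to about 3.36 tokens per pass. The draft model's own cost is not included, so this is an upper bound on the speedup rather than an estimate of it.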

Everything else (pruning, distillation, QAT, SmoothQuant calibration) is model-production work that belongs in their Python pipeline — the right way to "use" it is to run their pre-optimized checkpoints (Nemotron, pruned Llamas) once converted to GGUF.

Recommendation

The one item worth real engineering is item 1, a sensitivity-aware TierPlanner: it leverages infrastructure we already have and attacks the actual 12 GB constraint. The rest is either already covered better by TurboQuant or gated on external model availability.
