Getting Started
This isn't a step-by-step guide. You can find those everywhere. This is the decision guide — what you need to figure out before you start clicking, and what actually matters.
Decision 1: Hardware
VRAM matters most. Period. Everything else is secondary.
RAM matters too, but only for CPU offloading. If the model doesn't fit in VRAM, the framework moves layers to CPU RAM and pays a PCIe penalty on every token. A 27B model with 16GB VRAM can work — but you lose speed for the layers on CPU. That's exactly what happened with the first tower experiments: the GPU had 16GB and only 3.7GB was in use because the framework was routing expert tensors through CPU RAM.
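A quick way to spot this situation is to compare used and total GPU memory while the model is loaded. The sketch below assumes an NVIDIA card with `nvidia-smi` on the PATH. The 50% threshold and the helper names are illustrative choices, not part of any framework.

```python
import subprocess

def parse_gpu_memory(csv_text):
    """Parse 'used, total' rows (MiB) as printed by nvidia-smi into (used, total) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows

def underused_gpus(rows, threshold=0.5):
    """Return indices of GPUs using less than `threshold` of their memory.

    The threshold is an arbitrary assumption; tune it for your setup.
    """
    return [i for i, (used, total) in enumerate(rows) if used / total < threshold]

def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory usage (requires an NVIDIA driver install)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_memory(out)
```

Calling `underused_gpus(query_gpu_memory())` while the model is serving returns the GPUs that are mostly empty. A loaded model on a mostly empty card usually means the weights are sitting in CPU RAM.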
CPU matters for llama.cpp. For vLLM, the CPU is mostly idle during decode. If you're running llama.cpp, get a CPU with AVX-VNNI or equivalent. On Intel, that means Alder Lake or newer. The E-cores matter more than you'd think — they're load-bearing for MoE expert dispatch.
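To check whether a CPU advertises AVX-VNNI, you can read the feature flags the kernel reports. This is a minimal sketch that assumes Linux on x86, where `/proc/cpuinfo` lists the flag as `avx_vnni`; other operating systems need a different lookup.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()

def has_avx_vnni(cpuinfo_text):
    """True if the exact flag 'avx_vnni' is present (avx512_vnni is a different flag)."""
    return "avx_vnni" in cpu_flags(cpuinfo_text)
```

On a Linux machine, `has_avx_vnni(open("/proc/cpuinfo").read())` gives the answer for the running CPU.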
Decision 2: Backend
Two real options. Not ten. Two.
llama.cpp — The universal backend. Runs on CPU, GPU, anything with GGUF support. Slower for some architectures because it doesn't have CUDA kernels for everything. A model with GDN/DeltaNet hybrid layers ran at 22 tok/s on an RTX 5060 Ti because the SSM state updates had no optimized path. No flag fixes a missing kernel. It reads GGUF format only. Good for: CPU inference, MoE models, small models, experimentation.
vLLM — The performance backend. Needs GPU. Has optimized kernels for architectures that llama.cpp doesn't. Qwen3.6-27B with GDN layers ran at 22 tok/s in llama.cpp and 83 tok/s in vLLM with the right patches. The difference is kernel support. Good for: anything that has CUDA kernels, anything where speed matters.
The rule: if the model has optimized CUDA kernels in vLLM, use vLLM. If it doesn't, or if you're on CPU, use llama.cpp. That's the whole decision.
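Written as code, the rule is a two-input function. This is only a sketch: `choose_backend` and its parameters are hypothetical names, and whether vLLM has kernels for a given model is something you have to find out yourself.

```python
def choose_backend(has_gpu: bool, vllm_has_kernels: bool) -> str:
    """Apply the rule above: vLLM only when a GPU and optimized kernels are both available."""
    if has_gpu and vllm_has_kernels:
        return "vLLM"
    return "llama.cpp"
```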
Decision 3: Model
This is the hardest decision. The model that's fastest on one hardware setup might be the slowest on another. The model that works for chat might fail for tool calling. The model that passes benchmarks might produce JSON errors 38% of the time.
The real constraint is VRAM. Calculate how big the model is at your chosen quantization, add KV cache overhead, and check if it fits. If it doesn't fit, it doesn't matter how good it is on paper.
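The fit check can be sketched as back-of-the-envelope arithmetic. Everything here is an estimate: the weight figure is a lower bound (real quantized files also carry scales and some unquantized tensors), the KV cache formula assumes standard attention layers, and the 1GB runtime overhead is a guess. GB means 10^9 bytes.

```python
def weight_gb(params_billion, bits_per_weight):
    """Lower-bound weight size in GB: parameters times bits, divided by 8 bits per byte."""
    return params_billion * bits_per_weight / 8

def kv_cache_gb(layers, kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """KV cache size in GB for standard attention layers.

    The leading 2 counts one K and one V tensor per layer;
    bytes_per_value=2 assumes FP16/BF16 cache entries.
    """
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value / 1e9

def fits(vram_gb, params_billion, bits_per_weight, kv_gb, overhead_gb=1.0):
    """True if weights + KV cache + an assumed runtime overhead fit in VRAM."""
    total = weight_gb(params_billion, bits_per_weight) + kv_gb + overhead_gb
    return total <= vram_gb
```

For example, `weight_gb(27, 4)` gives 13.5, a floor for a 27B model at 4 bits. Take the layer count, KV head count, and head dimension from the model's config file rather than guessing them. In hybrid architectures, only the attention layers contribute to the KV cache.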
For 16GB VRAM: Qwen3.6-27B at INT4 is the current best option. It needs about 14–15GB. The specific quantization matters — the Lorbus AutoRound INT4 preserves the MTP projection weights that enable speculative decoding. Other quantizations of the same model that drop those weights run 4.5x slower.
For 12GB or less: You're looking at 7B–8B models. GGUF format, llama.cpp backend. We haven't tested specific models at this tier — pick one that fits your VRAM and start there.
For 24GB+: MoE models become practical. Qwen3-35B-A3B and GLM-4.7-Flash both have 30B+ total parameters with only 3B active per token. We tested these on 32GB total VRAM (2× 16GB cards) and measured 100.24 and 95.9 tok/s respectively. A single 24GB card is untested territory.
Decision 4: Quantization
Not all quantizations are equal. The format matters more than the bit count.
GGUF (Q4_K_M, Q5_K_M, IQ4_XS) — llama.cpp native. Works everywhere. Some formats break on certain GPU architectures (CUDA 13.0 breaks Q4_K_M on Blackwell).
AutoRound INT4 — Needs vLLM with Marlin kernel support. The specific quant matters: one that preserves the MTP heads gives you speculative decoding, and one that drops them runs 4.5x slower. Check what the quantization includes.
NVFP4 — NVIDIA's native FP4 format. It only quantizes the MLP layers; attention stays in BF16, so the loaded model is larger than you'd expect. On 2× 16GB GPUs, the NVFP4 version of Gemma 4 (30.6GB) won't fit with room for the KV cache.
AWQ — Doesn't have efficient kernel paths on Blackwell. 32.77 tok/s on the tower vs. 83 tok/s for the winning stack. The format itself is the bottleneck.
Decision 5: What You Actually Need It For
Speed isn't the only metric. It's the easiest one, but it's not the only one.
Running agents that need tool calling? vLLM has bugs with tool_choice: required combined with reasoning parsers. The streaming parser drops tool calls intermittently. Thinking mode plus tool calls breaks roughly 60% of the time. These are infrastructure bugs, not model failures, but they matter if you're running production agents.
Running an agent that needs to remember things? Check the model's memory-recall performance. AEON caught adversarial manipulation well but burned context budget on reasoning chains that hurt memory precision. Fast doesn't mean right for your workload.
Just want it to run? Pick the model that fits your VRAM, use the backend that has kernels for it, and don't overthink it. The experiments will tell you what actually works.
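For the tool-calling case, it helps to measure how often calls are dropped or malformed on your own prompts before trusting a stack. The sketch below assumes responses shaped like OpenAI-style chat completions (`choices[0].message.tool_calls[].function.arguments` as a JSON string). The function names are hypothetical, and the check only covers responses where a tool call was expected.

```python
import json

def tool_call_ok(response):
    """True if the response has at least one tool call and all arguments parse as JSON objects."""
    try:
        calls = response["choices"][0]["message"].get("tool_calls") or []
    except (KeyError, IndexError, TypeError, AttributeError):
        return False
    if not calls:
        return False  # the tool call was dropped entirely
    for call in calls:
        try:
            args = json.loads(call["function"]["arguments"])
        except (KeyError, TypeError, json.JSONDecodeError):
            return False  # missing fields or malformed JSON
        if not isinstance(args, dict):
            return False
    return True

def failure_rate(responses):
    """Fraction of responses that failed the tool-call check."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(1 for r in responses if not tool_call_ok(r)) / len(responses)
```

Run the same set of tool-requiring prompts through each candidate stack and compare the rates. Use enough samples that an intermittent failure has a chance to show up.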
The Shortest Path
If you want the fastest path from zero to something useful:
Get a GPU with 16GB+ VRAM. Install llama.cpp. Download a GGUF model that fits. Run it. See how fast it goes. If it's slow, figure out why — is the model mostly on CPU? Are the kernels missing? Is the architecture unsupported? Answer those questions and you'll know what to do next.
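"See how fast it goes" can be as simple as timing a token stream. This is a minimal sketch: `token_stream` stands for any iterable that yields one item per generated token (for example, a streaming client wrapper you write yourself), and the measured time includes prompt processing before the first token.

```python
import time

def tokens_per_second(token_stream, clock=time.perf_counter):
    """Consume the stream and return tokens per second of wall-clock time.

    `clock` is injectable so the function can be tested without real waiting.
    """
    start = clock()
    count = sum(1 for _ in token_stream)
    elapsed = clock() - start
    return count / elapsed if elapsed > 0 else 0.0
```

Use a few hundred generated tokens per run so that startup cost does not dominate the number.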
The alternative path — buying expensive hardware and spending weeks tuning — is what most people do. It's not what you need to do.