Getting Started

Not a tutorial. The decisions you need to make.

This isn't a step-by-step guide. You can find those everywhere. This is the decision guide — what you need to figure out before you start clicking, and what actually matters.

Decision 1: Hardware

VRAM is the constraint that matters most. Everything else is secondary.

12GB VRAM or less: You're running small models. 7B–8B dense. Quantized. llama.cpp is your backend. We don't have GPU benchmarks at this tier. Our CPU-only Beelink ran at 10.44 tok/s, but a GPU changes the picture significantly.

16GB VRAM: You can fit a 27B dense model at INT4, but barely. Qwen3.6-27B needs 14–15GB, leaving 1–2GB for KV cache. vLLM is the better backend if it has optimized kernels for the model. We measured 83 tok/s on a dual-GPU setup with 32GB total VRAM and 12GB of KV headroom. On a single 16GB card the number will be lower. We haven't tested that configuration.

16–24GB VRAM: Larger models become possible. MoE models with small active parameter counts are the best value here — 30B+ total parameters, only 3B active per token.

24GB+ VRAM: You have choices. The ceiling is higher but the rules are the same.

RAM matters too, but only for CPU offloading. If the model doesn't fit in VRAM, the framework moves layers to CPU RAM and pays a PCIe penalty on every token. A 27B model with 16GB VRAM can work — but you lose speed for the layers on CPU. That's exactly what happened with the first tower experiments: the GPU had 16GB and only 3.7GB was in use because the framework was routing expert tensors through CPU RAM.
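One way to catch this kind of silent offloading is to compare used and total VRAM while the model is loaded. The sketch below parses the output of nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits. The function names and the 50% threshold are illustrative assumptions, not a tested rule.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' rows (MiB) from nvidia-smi CSV output, one row per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def looks_offloaded(used_mib, total_mib, threshold=0.5):
    """Heuristic (assumed threshold): a loaded model using under half the VRAM
    suggests that a large share of the weights is sitting in CPU RAM."""
    return used_mib / total_mib < threshold


def query_gpu_memory():
    """Run nvidia-smi and return [(used_mib, total_mib), ...]."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_memory(out)
```

Call query_gpu_memory() after the model has finished loading. A card showing something like 3.7GB used out of 16GB is the pattern described above.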

CPU matters for llama.cpp. For vLLM, the CPU is mostly idle during decode. If you're running llama.cpp, get a CPU with AVX-VNNI or equivalent. On Intel, that means Alder Lake or newer. The E-cores matter more than you'd think — they're load-bearing for MoE expert dispatch.

Don't buy a GPU for local inference without checking kernel support. Take the RTX 5060 Ti: it uses the Blackwell architecture (SM_120). CUDA 12.8 targets it, but CUDA 13.0 breaks Q4_K_M quantization on it. The quantization format you pick needs kernel support on your specific GPU architecture. No kernel, no speed. You can't tune around a missing kernel.

Decision 2: Backend

Two real options. Not ten. Two.

llama.cpp: The universal backend. Runs on CPU or GPU, on anything with GGUF support. Slower for some architectures because it lacks optimized CUDA kernels for them. Qwen3.6-27B, with its GDN/DeltaNet hybrid layers, ran at 22 tok/s on an RTX 5060 Ti because the SSM state updates had no optimized path. No flag fixes a missing kernel. GGUF format only. Good for: CPU inference, MoE models, small models, experimentation.

vLLM: The performance backend. Needs a GPU. Has optimized kernels for architectures that llama.cpp lacks. Qwen3.6-27B with GDN layers ran at 22 tok/s in llama.cpp and 83 tok/s in vLLM with the right patches. The difference is kernel support. Good for: anything that has CUDA kernels, anything where speed matters.

The rule: if the model has optimized CUDA kernels in vLLM, use vLLM. If it doesn't, or if you're on CPU, use llama.cpp. That's the whole decision.
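Written out as code, the rule is two inputs and one branch. This is a minimal sketch. The function name and both flags are hypothetical, and you still have to find out for yourself whether vLLM has optimized kernels for your model and GPU.

```python
def pick_backend(has_gpu, vllm_has_kernels):
    """Return 'vllm' only when a GPU is present and vLLM has optimized
    kernels for the model; otherwise fall back to llama.cpp."""
    if has_gpu and vllm_has_kernels:
        return "vllm"
    return "llama.cpp"
```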

Decision 3: Model

This is the hardest decision. The model that's fastest on one hardware setup might be the slowest on another. The model that works for chat might fail for tool calling. The model that passes benchmarks might produce JSON errors 38% of the time.

The real constraint is VRAM. Calculate how big the model is at your chosen quantization, add KV cache overhead, and check if it fits. If it doesn't fit, it doesn't matter how good it is on paper.
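The arithmetic can be sketched in a few lines. The weight estimate is a raw lower bound (parameters × bits / 8), and real quantized files run larger because they carry scales and keep some layers in higher precision. The KV cache formula is the standard one for attention layers. The function names, the 1GB runtime overhead, and the architecture numbers in the usage note are assumptions for illustration, not measurements.

```python
def weight_gb(params_billion, bits_per_weight):
    """Lower bound on weight size in GB: parameters x bits / 8."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(layers, kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """KV cache for one sequence: 2 (K and V) x layers x kv_heads x head_dim
    x tokens x bytes per value. Count only the attention layers for hybrids."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value / 1e9


def fits(vram_gb, params_billion, bits_per_weight, kv_gb, overhead_gb=1.0):
    """Return (fits, needed_gb). overhead_gb is an assumed allowance for
    activations and runtime buffers."""
    needed = weight_gb(params_billion, bits_per_weight) + kv_gb + overhead_gb
    return needed <= vram_gb, needed
```

For example, weight_gb(27, 4) gives 13.5, a little under the 14–15GB a real 27B INT4 model needs. With a hypothetical 32-layer, 8-KV-head, 128-dim model at 8192 tokens, kv_cache_gb(32, 8, 128, 8192) comes to roughly 1.07GB in 16-bit precision.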

For 16GB VRAM: Qwen3.6-27B at INT4 is the current best option. It needs about 14–15GB. The specific quantization matters — the Lorbus AutoRound INT4 preserves the MTP projection weights that enable speculative decoding. Other quantizations of the same model that drop those weights run 4.5x slower.

For 12GB or less: You're looking at 7B–8B models. GGUF format, llama.cpp backend. We haven't tested specific models at this tier — pick one that fits your VRAM and start there.

For 24GB+: MoE models become practical. Qwen3-35B-A3B and GLM-4.7-Flash are both 30B+ total with only 3B active per token. We tested these on 32GB total VRAM (2× 16GB cards) — 100.24 and 95.9 tok/s respectively. A single 24GB card is untested territory.

Decision 4: Quantization

Not all quantizations are equal. The format matters more than the bit count.

GGUF (Q4_K_M, Q5_K_M, IQ4_XS): llama.cpp native. Works almost everywhere, but some formats break on certain GPU architectures (CUDA 13.0 breaks Q4_K_M on Blackwell).

AutoRound INT4: Needs vLLM with Marlin kernel support. The specific quant matters. One that preserves the MTP heads gives you speculative decoding. One that drops them runs 4.5x slower. Check what the quantization includes.

NVFP4: Nvidia's native FP4 format. Only quantizes MLP layers, attention stays BF16. The loaded model is larger than you'd expect. On 2× 16GB GPUs, the NVFP4 version of Gemma 4 (30.6GB) doesn't leave enough room for KV cache.

AWQ: Doesn't have efficient kernel paths on Blackwell. We measured 32.77 tok/s on the tower vs. 83 tok/s for the winning stack. The format itself is the bottleneck.

Decision 5: What You Actually Need It For

Speed isn't the only metric. It's the easiest one, but it's not the only one.

Running agents that need tool calling? vLLM has bugs with tool_choice: required combined with reasoning parsers. The streaming parser drops tool calls intermittently. Thinking mode plus tool calls breaks roughly 60% of the time. These are infrastructure bugs, not model failures, but they matter if you're running production agents.
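Failure rates like these are worth measuring on your own stack before you trust it. The sketch below assumes you have already collected the raw tool-call argument strings from a batch of responses, with None standing in for a call that was dropped entirely. The function names and the required_keys check are illustrative, not part of any vLLM API.

```python
import json


def is_valid_tool_call(raw_arguments, required_keys=()):
    """True if the arguments string parses as a JSON object containing
    every required key. None means the tool call never arrived."""
    if raw_arguments is None:
        return False
    try:
        parsed = json.loads(raw_arguments)
    except (json.JSONDecodeError, TypeError):
        return False
    return isinstance(parsed, dict) and all(k in parsed for k in required_keys)


def failure_rate(samples, required_keys=()):
    """Fraction of samples that are not valid tool calls."""
    if not samples:
        raise ValueError("no samples to evaluate")
    failures = sum(not is_valid_tool_call(s, required_keys) for s in samples)
    return failures / len(samples)
```

Run the same prompts with and without thinking mode and compare the two rates. That tells you whether the breakage comes from the configuration or from the model.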

Running an agent that needs to remember things? Check the model's memory-recall performance. AEON caught adversarial manipulation well but burned context budget on reasoning chains that hurt memory precision. Fast doesn't mean right for your workload.

Just want it to run? Pick the model that fits your VRAM, use the backend that has kernels for it, and don't overthink it. The experiments will tell you what actually works.

The Shortest Path

If you want the fastest path from zero to something useful:

Get a GPU with 16GB+ VRAM. Install llama.cpp. Download a GGUF model that fits. Run it. See how fast it goes. If it's slow, figure out why — is the model mostly on CPU? Are the kernels missing? Is the architecture unsupported? Answer those questions and you'll know what to do next.
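"See how fast it goes" means tokens generated divided by wall-clock time. A minimal sketch follows, assuming a hypothetical generate(prompt) wrapper around whichever backend you chose that returns the number of tokens it produced. The clock parameter exists only so the function can be tested without a model.

```python
import time


def tokens_per_second(generate, prompt, clock=time.perf_counter):
    """Time one call to generate(prompt) and return tokens per second.
    generate is a hypothetical callable returning the generated token count."""
    start = clock()
    n_tokens = generate(prompt)
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return n_tokens / elapsed
```

The timing includes prompt processing, so with long prompts this understates pure decode speed. Keep the prompt short, or time the two phases separately, when comparing against published tok/s numbers.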

The alternative path — buying expensive hardware and spending weeks tuning — is what most people do. It's not what you need to do.