Resources

Models, tools, and quantizations. What works, what doesn't, why.

Every model here was tested on real hardware. The notes are from actual runs, not readme descriptions. If a model needs NVLink to hit its advertised speed, it says so. If a quantization format has no efficient kernel paths on your GPU, it says that too.

Models

Qwen3.6-27B

27B dense · GDN hybrid layers · AutoRound INT4 · GGUF · NVFP4

[winner] The production model. Dense, not MoE. GDN hybrid layers mix traditional attention with state-space model computations. In llama.cpp: 22 tok/s (no optimized CUDA kernel for GDN layers). In vLLM with Genesis patches + MTP n=3: 83 tok/s. The AutoRound INT4 quant from Lorbus is the specific one that preserves the MTP projection weights. Without them, speculative decoding has nothing to work with. The kaitchup quant of Qwen3-32B drops the MTP heads and runs at 17.9 tok/s. Check your quantization.
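One way to check a quant before downloading all of it is to read the tensor names out of a safetensors header. The sketch below uses only the standard library and relies on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header). The "mtp" substring is an assumption about how MTP tensors are named; confirm it against the actual checkpoint, and for sharded checkpoints run it over every shard.

```python
import json
import struct


def tensor_names(path):
    """Return the tensor names listed in a .safetensors file header."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional non-tensor entry in the header.
    return [name for name in header if name != "__metadata__"]


def has_mtp_weights(names, marker="mtp"):
    """True if any tensor name contains the marker.

    The default marker is an assumption, not a documented naming rule.
    """
    return any(marker in name.lower() for name in names)
```

Calling `has_mtp_weights(tensor_names("model.safetensors"))` (the path is a placeholder) returns False when nothing matches, which is a hint that speculative decoding may have no MTP weights to use.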

Gemma 4 26B (supergemma4-26b)

26B MoE · A4B active · GGUF · NVFP4 · AWQ

[tested] Fast on llama.cpp, dead on vLLM. MoE architecture: 26B total, roughly 4B active per token (the A4B in the name). On single GPU llama.cpp: 71.1 tok/s ceiling. Dual GPU: 107.23 tok/s. The SWA (sliding-window attention) architecture means context is essentially free: 25 of 30 layers use a fixed 5,120-token ring buffer regardless of context size. NVFP4 version (30.6GB) won't fit on 2x16GB GPUs with room for KV cache. AWQ version: 32.77 tok/s, no efficient Blackwell kernels. Google announced a speculative decoding drafter with a claimed 3x speedup. It didn't materialize in testing.
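The "context is essentially free" claim can be made concrete by counting KV cache slots. This is a rough sketch using the 25-of-30 layer split and the 5,120-token window quoted above. It assumes every layer stores the same amount of KV data per token, which may not hold for the real model, and the function names are illustrative.

```python
def kv_token_slots(context_len, total_layers=30, swa_layers=25, window=5120):
    """Token slots held in the KV cache across all layers.

    SWA layers cap out at the ring-buffer window; full-attention
    layers keep one slot per context token.
    """
    full_layers = total_layers - swa_layers
    return swa_layers * min(context_len, window) + full_layers * context_len


def kv_token_slots_full_attention(context_len, total_layers=30):
    """Same count if every layer used full attention."""
    return total_layers * context_len


if __name__ == "__main__":
    for ctx in (5_120, 32_768, 131_072):
        swa = kv_token_slots(ctx)
        full = kv_token_slots_full_attention(ctx)
        print(f"{ctx:>7} tokens: {swa:>9} slots vs {full:>9} ({swa / full:.0%})")
```

Under these assumptions, a 131,072-token context needs roughly one fifth of the slots a full-attention stack would, because only the 5 full-attention layers grow with context.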

Qwen3-32B

32B dense · AutoRound INT4 (kaitchup) · GGUF

[17.9 tok/s] Bigger, slower, wrong. The kaitchup AutoRound INT4 quant doesn't include MTP heads. Without them you're doing one token per pass with 32B parameters instead of two tokens per pass with 27B. About 4.6x slower than the Genesis setup at 83 tok/s. Tool calls failed on a format mismatch. The quantization you pick matters more than the model size.

Qwen3-35B-A3B

35B MoE · 3B active · GGUF · 131K context variant

[tested] Already at ceiling on dual GPU. 100.24 tok/s baseline. 20 experiments found zero improvements. The 131K context variant degrades badly with depth: 32K depth at 71.62 tok/s, 65K at 56.50, 131K at 39.72. The full-attention layers dominate at depth. If you need long context, the model architecture matters more than the advertised window size.
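The depth numbers are easier to compare as a fraction of baseline. A small sketch, assuming the 100.24 tok/s baseline and the depth runs were measured with the same setup (the source does not say so explicitly):

```python
BASELINE_TOK_S = 100.24

# Depth label -> measured tok/s, copied from the notes above.
DEPTH_TOK_S = {"32K": 71.62, "65K": 56.50, "131K": 39.72}


def retained_fraction(tok_s, baseline=BASELINE_TOK_S):
    """Share of baseline throughput left at a given context depth."""
    return tok_s / baseline


if __name__ == "__main__":
    for depth, tok_s in DEPTH_TOK_S.items():
        print(f"{depth:>5}: {tok_s:6.2f} tok/s, {retained_fraction(tok_s):.0%} of baseline")
```

By this measure the model keeps about 71% of its speed at 32K, 56% at 65K, and 40% at 131K.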

Llama 3.3 70B

70B dense · IQ4_XS · GGUF

[VRAM wall] A VRAM problem, not a config problem. 60 layers on GPU, 20 on CPU, 65K context. Baseline 55.6 tok/s. Every experiment required a server restart. VRAM doesn't free instantly when a process dies, so a new process can OOM before the old memory is released. Zero improvements found. The 70B is the wrong choice for 32GB total VRAM. You need more GPU memory or a smaller model.
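One workaround for the restart race is to poll used VRAM and only launch the next server once the old allocation is gone. A minimal sketch, assuming `nvidia-smi` is on the PATH. The 500 MiB threshold, the timeout, and the function names are hypothetical choices, not values from the experiments.

```python
import subprocess
import time


def used_vram_mib():
    """Used memory per GPU in MiB, as reported by nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [int(line) for line in out.splitlines() if line.strip()]


def wait_for_vram_release(max_used_mib=500, timeout_s=60.0, poll_s=1.0,
                          query=used_vram_mib, sleep=time.sleep):
    """Block until every GPU reports at most max_used_mib in use.

    Returns True once memory is released, False if the timeout expires.
    query and sleep are injectable so the loop can be tested without a GPU.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if all(used <= max_used_mib for used in query()):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_s)
```

Call `wait_for_vram_release()` between killing the old server and starting the new one. If it returns False, something else is still holding memory.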

GLM-4.7-Flash

30B MoE · 3B active · Q5_K_M · GGUF

[tested] Fast, but infrastructure problems. 95.9 tok/s baseline. One real improvement: q5_0 KV over q4_0 gave +1.17 tok/s. Final: 97.07 tok/s. The research loop was running on the same inference server it was testing, so restarts killed the loop mid-experiment. Good model, same workload problem as everything else.

Backends

llama.cpp

C/C++ · GGUF · CPU + GPU · single and multi-GPU

[universal] Works everywhere. Has limits. The default backend for consumer hardware. Excellent for MoE models: Gemma 4 hit 107 tok/s on dual GPU. The hard limit: no optimized CUDA kernel for GDN/DeltaNet hybrid layers. Qwen3.6-27B ran at 22 tok/s. You can't tune around a missing kernel. CUDA 13.0 breaks Q4_K_M on Blackwell. Stay on CUDA 12.8.

vLLM

Python · GPU-only · Marlin/AWQ/GPTQ · speculative decoding

[required for GDN models] Fast when it works. Fragile when it doesn't. The only backend with kernel support for GDN hybrid layers. With the right patches and quantization, Qwen3.6-27B hits 83 tok/s. Without MTP speculative decoding, it loses 35 tok/s. Known bugs: tool_choice: required combined with the reasoning parser crashes (issue #19051), the streaming parser drops tool calls, and thinking mode combined with tool calls breaks ~60% of the time. These are infrastructure bugs, not model failures. Use with awareness.

Quantization Formats

GGUF (Q4_K_M, Q5_K_M, IQ4_XS)

llama.cpp native · CPU + GPU · widely supported

[safe default] Works everywhere. Bit count matters less than architecture match. Q5_K_M was the Beelink production quant. Q4_K_M is the sweet spot for GPU because it fits more model in VRAM. CUDA 13.0 breaks Q4_K_M on Blackwell with BLACKWELL_NATIVE_FP4. IQ4_XS exists, but the 70B at that quantization was already at the VRAM edge.

AutoRound INT4

vLLM + Marlin kernel · requires specific quantization

[best for GDN models] The format that won. But check what it includes. The Lorbus quant of Qwen3.6-27B preserves MTP projection weights (280MB). The kaitchup quant of Qwen3-32B doesn't. Same format, different outcome: 83 tok/s vs 17.9 tok/s. The quantization file matters more than the format name.

NVFP4

Nvidia native FP4 · Blackwell hardware · vLLM

[larger than expected] Hardware-native but doesn't fit. Only the MLP layers are quantized; attention stays BF16. The loaded model is bigger than the bit count suggests. Gemma 4 NVFP4: 30.6GB loaded, won't fit on 2x16GB with room for KV cache. Qwen3.6-27B NVFP4 (AEON): 68.86 tok/s, about 14 tok/s below Genesis, and 13 experiments found nothing. The 122K context window is real but rarely relevant.
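The sizing problem comes down to two pieces of arithmetic: mixed precision inflates the loaded size, and the KV cache needs whatever is left over. The sketch below is a back-of-the-envelope estimate only. The parameter split passed to `estimate_gb` is something you would need to look up per model, the KV reserve is a hypothetical figure, and the estimate ignores quantization scale factors, CUDA context overhead, and uneven splits across GPUs.

```python
def estimate_gb(quant_params_b, other_params_b, quant_bits=4, other_bits=16):
    """Rough weight size in GB when only part of the model is quantized.

    Parameter counts are in billions, so billions * bits / 8 gives GB.
    """
    return (quant_params_b * quant_bits + other_params_b * other_bits) / 8


def headroom_gb(model_gb, gpu_count, gb_per_gpu):
    """Total VRAM left after loading the weights."""
    return gpu_count * gb_per_gpu - model_gb


def fits(model_gb, gpu_count, gb_per_gpu, kv_reserve_gb):
    """True if weights plus a reserved KV cache budget fit in total VRAM."""
    return model_gb + kv_reserve_gb <= gpu_count * gb_per_gpu


if __name__ == "__main__":
    # 30.6 GB loaded on 2x16 GB, with a hypothetical 4 GB KV reserve.
    print(f"headroom: {headroom_gb(30.6, 2, 16):.1f} GB")
    print("fits with 4 GB KV reserve:", fits(30.6, 2, 16, kv_reserve_gb=4))
```

With the 30.6GB figure from above, 2x16GB leaves about 1.4GB in total before any KV cache is allocated, which is why the model technically loads on paper but is not usable.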

AWQ

Activation-aware Weight Quantization · vLLM

[no Blackwell kernels] Dead on Blackwell. No efficient kernel paths on RTX 5060 Ti. Gemma 4 AWQ: 32.77 tok/s vs Genesis at 83. On this hardware the format itself is the bottleneck. Not a bad format, just the wrong one here.
