Resources

Models, tools, and quantizations. What works, what doesn't, why.

Every model here was tested on real hardware. The notes are from actual runs, not readme descriptions. If a model needs NVLink to hit its advertised speed, it says so. If a quantization format has no efficient kernel paths on your GPU, it says that too.

Models

27B dense · GDN hybrid layers · AutoRound INT4 · GGUF · NVFP4
winner The production model. Dense, not MoE. GDN hybrid layers mix traditional attention with state-space model computations. In llama.cpp: 22 tok/s (no optimized CUDA kernel for GDN layers). In vLLM with Genesis patches + MTP n=3: 83 tok/s. The AutoRound INT4 quant from Lorbus is the specific one that preserves the MTP projection weights. Without them, speculative decoding has nothing to work with. The kaitchup quant of the 32B model drops the MTP heads and runs at 17.9 tok/s. Check your quantization.
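One way to check a quantization before downloading tens of gigabytes is to look at the tensor names in the checkpoint index. This is a minimal sketch, assuming a sharded safetensors checkpoint with the usual `model.safetensors.index.json` file, and assuming the MTP weights carry "mtp" somewhere in their tensor names. That naming is a guess and may differ per model; the helper names and the example tensor names are hypothetical.

```python
import json
from pathlib import Path


def find_mtp_tensors(weight_map, needle="mtp"):
    """Return sorted tensor names that contain `needle` (case-insensitive)."""
    return sorted(name for name in weight_map if needle in name.lower())


def load_weight_map(checkpoint_dir):
    """Read the tensor-name -> shard-file map from a sharded checkpoint index."""
    index_path = Path(checkpoint_dir) / "model.safetensors.index.json"
    with index_path.open() as fh:
        return json.load(fh)["weight_map"]


# Illustrative tensor names only; real checkpoints may name these differently.
example_map = {
    "model.layers.0.mlp.down_proj.qweight": "model-00001-of-00002.safetensors",
    "mtp.fc.weight": "model-00002-of-00002.safetensors",
}
mtp_names = find_mtp_tensors(example_map)
print(mtp_names or "no MTP tensors found")
```

Against a real download, `find_mtp_tensors(load_weight_map("path/to/quant"))` returning an empty list would be the warning sign.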
26B MoE · A4B active · GGUF · NVFP4 · AWQ
tested Fast on llama.cpp, dead on vLLM. MoE architecture: 26B total, about 4B active per token. On single GPU llama.cpp: 71.1 tok/s ceiling. Dual GPU: 107.23 tok/s. Sliding-window attention (SWA) means context is essentially free: 25 of 30 layers use a fixed 5,120-token ring buffer regardless of context size. The NVFP4 version (30.6GB) won't fit on 2x16GB GPUs. The AWQ version runs at 32.77 tok/s because there are no efficient Blackwell kernels for it. Google announced a speculative decoding drafter with a claimed 3x speedup; it didn't materialize in testing.
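The "context is essentially free" claim can be sanity-checked with rough arithmetic. This sketch counts KV token slots per layer using the 25-of-30 split and the 5,120-token window from the notes above. It assumes every layer stores the same amount of KV per token, which may not hold for the real model; the function names are hypothetical.

```python
def kv_token_slots(context_len, total_layers=30, swa_layers=25, window=5120):
    """Token slots held in KV cache when SWA layers cap at `window` tokens."""
    full_layers = total_layers - swa_layers
    swa_slots = swa_layers * min(context_len, window)
    full_slots = full_layers * context_len
    return swa_slots + full_slots


def full_attention_slots(context_len, total_layers=30):
    """Token slots if every layer kept the whole context."""
    return total_layers * context_len


for ctx in (8192, 32768, 131072):
    ratio = kv_token_slots(ctx) / full_attention_slots(ctx)
    print(f"{ctx:>7} tokens: {ratio:.0%} of the all-full-attention KV footprint")
```

Under these assumptions the five full-attention layers are the only part that grows with context, which is why the footprint stays small relative to a model where all 30 layers scale.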
32B dense · AutoRound INT4 (kaitchup) · GGUF
17.9 tok/s Bigger, slower, wrong. The kaitchup AutoRound INT4 quant doesn't include MTP heads. Without them you're doing one token per pass with 32B parameters instead of two tokens per pass with 27B. About 4.6x slower than Genesis (17.9 vs 83 tok/s). Tool calls failed on a format mismatch. The quantization you pick matters more than the model size.
35B MoE · 3B active · GGUF · 131K context variant
tested Already at ceiling on dual GPU. 100.24 tok/s baseline. 20 experiments found zero improvements. The 131K context variant degrades badly with depth: 32K depth at 71.62 tok/s, 65K at 56.50, 131K at 39.72. The full-attention layers dominate at depth. If you need long context, the model architecture matters more than the advertised window size.
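To put the depth numbers on one scale, this short sketch expresses each measurement as a share of the 100.24 tok/s baseline. It uses only the figures quoted above and assumes the baseline was measured at shallow context depth; the names are hypothetical.

```python
BASELINE_TOK_S = 100.24
depth_results = {"32K": 71.62, "65K": 56.50, "131K": 39.72}


def retained_fraction(tok_s, baseline=BASELINE_TOK_S):
    """Share of baseline throughput that survives at a given depth."""
    return tok_s / baseline


for depth, tok_s in depth_results.items():
    share = retained_fraction(tok_s)
    print(f"{depth:>4} depth: {tok_s:6.2f} tok/s = {share:.0%} of baseline")
```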
70B dense · IQ4_XS · GGUF
VRAM wall A VRAM problem, not a config problem. 60 layers on GPU, 20 on CPU, 65K context. Baseline 55.6 tok/s. Every experiment required a server restart. VRAM doesn't free instantly when a process dies, so new processes OOM before the old memory is released. Zero improvements found. The 70B is the wrong choice for 32GB total VRAM. You need more GPU memory or a smaller model.
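One way to work around the restart race is to poll free VRAM before launching the next server process. This is a sketch, not what the original runs used. It assumes `nvidia-smi` is on the PATH, and the function names, threshold, and timeout are hypothetical.

```python
import subprocess
import time


def read_free_vram_mib():
    """Return free memory per GPU in MiB, as reported by nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [int(line) for line in out.splitlines() if line.strip()]


def wait_for_free_vram(min_free_mib, timeout_s=60.0, poll_s=1.0,
                       reader=read_free_vram_mib, sleep=time.sleep):
    """Block until every GPU reports at least `min_free_mib` free, or time out."""
    waited = 0.0
    while True:
        if all(free >= min_free_mib for free in reader()):
            return True
        if waited >= timeout_s:
            return False
        sleep(poll_s)
        waited += poll_s
```

Calling `wait_for_free_vram(14000)` after killing the old server and before starting the new one would hold the restart until both 16GB cards report roughly 14GB free. The `reader` and `sleep` parameters exist so the loop can be exercised without a GPU.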
30B MoE · 3B active · Q5_K_M · GGUF
tested Fast, but infrastructure problems. 95.9 tok/s baseline. One real improvement: q5_0 KV over q4_0 gave +1.17 tok/s. Final: 97.07 tok/s. The research loop was running on the same inference server it was testing — restarts killed the loop mid-experiment. Good model, same workload problem as everything else.

Backends

C/C++ · GGUF · CPU + GPU · single and multi-GPU
universal Works everywhere. Has limits. The default backend for consumer hardware. Excellent for MoE models — Gemma 4 hit 107 tok/s on dual GPU. The hard limit: no optimized CUDA kernel for GDN/DeltaNet hybrid layers. Qwen3.6-27B ran at 22 tok/s. You can't tune around a missing kernel. CUDA 13.0 breaks Q4_K_M on Blackwell. Stay on CUDA 12.8.
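A small guard can catch the toolkit mismatch before a build. The sketch below parses the release number out of `nvcc --version` text and flags anything at or above 13.0. Treating every release below 13.0 as safe is an assumption drawn from the note above, not something that was tested version by version; the function names and the sample string are illustrative.

```python
import re


def parse_cuda_release(nvcc_output):
    """Extract (major, minor) from `nvcc --version` text, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def is_known_good_toolkit(nvcc_output, first_bad=(13, 0)):
    """True when the toolkit is older than the release reported as broken."""
    release = parse_cuda_release(nvcc_output)
    return release is not None and release < first_bad


# Illustrative line in the shape nvcc prints; not captured from a real run.
sample = "Cuda compilation tools, release 12.8, V12.8.61"
print(parse_cuda_release(sample), is_known_good_toolkit(sample))
```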
Python · GPU-only · Marlin/AWQ/GPTQ · speculative decoding
required for GDN models Fast when it works. Fragile when it doesn't. The only backend with kernel support for GDN hybrid layers. With the right patches and quantization, Qwen3.6-27B hits 83 tok/s. Without MTP speculative decoding you lose 35 tok/s. Known bugs: tool_choice: required combined with a reasoning parser crashes (issue #19051), the streaming parser drops tool calls, and thinking mode + tool calls breaks ~60% of the time. These are infrastructure bugs, not model failures. Use with awareness.

Quantization Formats

GGUF (Q4_K_M, Q5_K_M, IQ4_XS)
llama.cpp native · CPU + GPU · widely supported
safe default Works everywhere. Bit count matters less than architecture match. Q5_K_M was the Beelink production quant. Q4_K_M is the sweet spot for GPU — fits more model in VRAM. CUDA 13.0 breaks Q4_K_M on Blackwell with BLACKWELL_NATIVE_FP4. IQ4_XS exists but the 70B at that quantization was already at the VRAM edge.
AutoRound INT4
vLLM + Marlin kernel · requires specific quantization
best for GDN models The format that won. But check what it includes. The Lorbus quant of Qwen3.6-27B preserves MTP projection weights (280MB). The kaitchup quant of Qwen3-32B doesn't. Same format, different outcome. 83 tok/s vs 17.9 tok/s. The quantization file matters more than the format name.
NVFP4
Nvidia native FP4 · Blackwell hardware · vLLM
larger than expected Hardware-native but doesn't fit. Only the MLP layers are quantized; attention stays BF16, so the loaded model is bigger than the bit count suggests. Gemma 4 NVFP4: 30.6GB loaded, won't fit on 2x16GB with room for KV cache. Qwen3.6-27B NVFP4 (AEON): 68.86 tok/s, about 14 tok/s below Genesis; 13 experiments found nothing. The 122K context window is real but rarely relevant.
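The size surprise follows from simple arithmetic: whatever stays in BF16 costs four times as much per parameter as the 4-bit part. This sketch uses round, made-up numbers (10B parameters, 80% quantized) purely to show the effect. It ignores quantization scales and other overhead, and it is not an attempt to reproduce the 30.6GB figure.

```python
def mixed_precision_size_gb(total_params, quantized_fraction,
                            quantized_bits=4, unquantized_bits=16):
    """Approximate weight size in GB when only part of the model is quantized."""
    q_bytes = total_params * quantized_fraction * quantized_bits / 8
    u_bytes = total_params * (1 - quantized_fraction) * unquantized_bits / 8
    return (q_bytes + u_bytes) / 1e9


params = 10e9  # hypothetical parameter count, for illustration only
uniform = mixed_precision_size_gb(params, 1.0)
mixed = mixed_precision_size_gb(params, 0.8)
print(f"all weights at 4-bit: {uniform:.1f} GB")
print(f"80% at 4-bit, 20% at BF16: {mixed:.1f} GB")
```

With these illustrative inputs, leaving a fifth of the weights in BF16 makes the model noticeably larger than the "4-bit" label implies.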
AWQ
Activation-aware Weight Quantization · vLLM
no Blackwell kernels Dead on Blackwell. No efficient kernel paths on RTX 5060 Ti. Gemma 4 AWQ: 32.77 tok/s vs Genesis at 83. The format itself is the bottleneck. Not a bad format — just the wrong one for this hardware.

Other Links

The Genesis quantization · preserves MTP heads
production The specific quantization that makes Genesis work. Preserves the 280MB MTP projection weights. Without these, the speculative decoding head has nothing to predict and you lose 87% of your throughput. This is the one that matters.
P60–P82 · TurboQuant hybrid gate + 19 downstream fixes
required Without these patches, GDN models don't run. They monkey-patch the TurboQuant hybrid gate and apply 19 downstream fixes. You also need a CUDA graph guard for the .tolist() calls that crash during warmup. Three things to align: the quant, the patches, the CUDA graph guard. Get all three right and the model runs at 80+ tok/s.