The Stack
This is the exact production setup. Not a reference config. Not a demo. This is what runs. The model all agents talk to, the benchmark that beats everything else on this hardware, the thing that has been running since April 27th without manual intervention.
Hardware
Total cost: around $2,000. That's including the case, both GPUs, the CPU, and memory. Not a workstation. Not a server. A consumer PC that happens to run inference.
Model
The model name is Genesis. Not Qwen3.6-27B. Genesis is the combination of the model, the quantization, the patches, and the flags. You can't get these numbers with just the base model. The AutoRound INT4 quant from Lorbus is the specific one that preserves the MTP projection weights. Other quantizations of the same model that drop those weights are slower — a 32B variant without MTP heads ran at 17.9 tok/s on the same hardware. Genesis runs at 83.
Software
vLLM is mandatory here. llama.cpp has no optimized CUDA kernel path for GDN/DeltaNet hybrid layers. The same model in llama.cpp ran at 22 tok/s. You can't tune around a missing kernel. The Genesis patches are monkey-patches to the TurboQuant hybrid gate plus 19 downstream fixes. Without them, the model crashes or runs at garbage speeds.
CUDA 12.8 is the version. CUDA 13.0 enables BLACKWELL_NATIVE_FP4 which is incompatible with Q4_K_M quantization as currently implemented. No benefit from upgrading, and it actively breaks things.
Proxy
Port 8010 is the single swap point. Every agent, every client, everything talks to port 8010. The proxy points to the active backend. Switching models means changing the proxy target, not reconfiguring every client. The bind address is 127.0.0.1 — a local coding agent caught the security exposure when the network topology changed from WiFi to a direct wire and changed it without being prompted.
Numbers
Steady throughput: ~83 tok/s. Peak measured: 86.83 tok/s. The MTP speculative decoding head contributes roughly 87% of the throughput lift over non-speculative. That single component is the difference between 22 and 83 tok/s. It's not a feature. It's the whole game.
Context window: 32K tokens in production. The model can go to 128K with a 0.8% speed penalty at GMU 0.90. Most workloads don't need 128K. The profile run is cleaner at 32K.
What Didn't Work
These are the alternatives that lost, on the same hardware:
AEON (Qwen3.6-27B in Nvidia's native FP4): 68.86 tok/s, 13 experiments, zero improvement. 15 tok/s below Genesis. The 122K context window is real but rarely relevant.
Gemma 4 NVFP4: 30.6GB loaded size, won't fit on 32GB total VRAM. NVFP4 only quantizes MLP layers, attention stays BF16.
Gemma 4 AWQ: 32.77 tok/s. No efficient kernel paths on Blackwell for the AWQ format.
Qwen3-32B AutoRound INT4: 17.9 tok/s. No MTP heads in the quant. One token per pass with more parameters. 4.5x slower than Genesis.
The lesson is the same as the chronicles: the model that wins isn't the biggest or the newest. It's the one where the full stack aligns.