The Stack

Hardware · Software · Configuration · What actually works

This is the exact production setup. Not a reference config. Not a demo. This is what runs: the model every agent talks to, the configuration that beat everything else benchmarked on this hardware, and the setup that has been running since April 27th without manual intervention.

Hardware

Machine

Host: cha0tiktower

Case: CyberPowerPC GXi3400BSTV17

CPU: Intel Core Ultra 7 265F (Arrow Lake)

RAM: 32GB DDR5

GPU: 2× RTX 5060 Ti 16GB GDDR7 (Blackwell SM_120)

Total VRAM: 32GB

Network: Direct wire to cha0tikhome · 0.76ms ping

Total cost: around $2,000. That includes the case, both GPUs, the CPU, and memory. Not a workstation. Not a server. A consumer PC that happens to run inference.

Model

Name: Genesis

Base: Qwen3.6-27B (dense, not MoE)

Quant: AutoRound INT4 by Lorbus

MTP heads: preserved (280MB of projection weights)

Architecture: GDN hybrid layers (attention + state-space model)

Context: 32K tokens (128K possible, <1% speed cost)

The model name is Genesis. Not Qwen3.6-27B. Genesis is the combination of the model, the quantization, the patches, and the flags. You can't get these numbers with just the base model. The AutoRound INT4 quant from Lorbus is the specific one that preserves the MTP projection weights. Quants that drop those weights are slower: a Qwen3-32B AutoRound INT4 quant without MTP heads ran at 17.9 tok/s on the same hardware. Genesis runs at about 83.

Software

Inference Engine

Backend: vLLM 0.19.2rc1

Patches: Genesis patches (P60–P82) — TurboQuant hybrid gate + 19 downstream fixes

CUDA: 12.8 (NOT 13.0 — breaks Q4_K_M with BLACKWELL_NATIVE_FP4)

KV dtype: fp16 (not fp8 — eliminates per-layer dequant overhead)

MTP speculative decoding: n=3 (goldilocks: n=2 loses 4.3 tok/s, n=4 loses 6 tok/s)

Flag: VLLM_MARLIN_USE_ATOMIC_ADD=1 (+6.25 tok/s, undocumented)

vLLM is mandatory here. llama.cpp has no optimized CUDA kernel path for GDN/DeltaNet hybrid layers. The same model in llama.cpp ran at 22 tok/s. You can't tune around a missing kernel. The Genesis patches are monkey-patches to the TurboQuant hybrid gate plus 19 downstream fixes. Without them, the model crashes or runs at garbage speeds.

CUDA 12.8 is the pinned version. CUDA 13.0 enables BLACKWELL_NATIVE_FP4, which is incompatible with Q4_K_M quantization as currently implemented. Upgrading brings no benefit and actively breaks things.
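As a rough sketch of how the settings listed above might be assembled into a launch command, the snippet below builds an environment and argument list for vLLM's `vllm serve` entry point. Several things here are assumptions rather than facts from this setup: the model path is a placeholder, "32K" is read as 32768 tokens, the two GPUs are assumed to be used via `--tensor-parallel-size 2`, the backend is assumed to bind to loopback like the proxy, and the 0.90 memory fraction is the figure quoted for the 128K test. The Genesis patches are not reproduced, and the speculative config omits its `method` entry because the text does not name one.

```python
import json
import os
import shlex

# Hypothetical placeholder: the text does not give the on-disk model path.
MODEL_PATH = "/path/to/genesis-autoround-int4"


def build_launch(model_path: str = MODEL_PATH):
    """Return (env, argv) for a vLLM server process. Nothing is started here."""
    env = dict(os.environ)
    # Environment flag named in the text.
    env["VLLM_MARLIN_USE_ATOMIC_ADD"] = "1"

    # n=3 draft tokens comes from the text. vLLM's speculative config normally
    # also carries a "method" entry; it is left out because the text does not
    # say which value is used.
    speculative = {"num_speculative_tokens": 3}

    argv = [
        "vllm", "serve", model_path,
        "--host", "127.0.0.1",               # assumption: backend is loopback-only
        "--port", "8022",                    # backend port from the text
        "--max-model-len", "32768",          # assumption: "32K" means 32768
        "--gpu-memory-utilization", "0.90",  # assumption: same GMU as the 128K test
        "--tensor-parallel-size", "2",       # assumption: one shard per GPU
        "--speculative-config", json.dumps(speculative),
        # --kv-cache-dtype is left at its default so the cache is not fp8.
    ]
    return env, argv


env, argv = build_launch()
print(shlex.join(argv))
```

Passing `argv` and `env` to `subprocess.run` would start the server. Keeping the command in one function makes it easy to compare flag combinations one change at a time.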

Proxy

Routing

Proxy: Single swap point at port 8010

Backend: vLLM on port 8022

Bind: 127.0.0.1 (not 0.0.0.0)

Service: systemd with Restart=always

Port 8010 is the single swap point. Every agent, every client, everything talks to port 8010. The proxy points to the active backend. Switching models means changing the proxy target, not reconfiguring every client. The bind address is 127.0.0.1. When the network topology changed from WiFi to a direct wire, a local coding agent caught the resulting security exposure and changed the bind address without being prompted.
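The swap-point idea can be illustrated with a minimal TCP relay. This is not the proxy that runs in production, which the text does not describe; it is a sketch using Python's asyncio, with the function names (`pipe`, `make_handler`, `serve`) invented for illustration. Only the ports and the loopback bind address come from the text.

```python
import asyncio

LISTEN_HOST = "127.0.0.1"   # loopback only, as described in the text
LISTEN_PORT = 8010          # the single swap point
BACKEND_HOST = "127.0.0.1"
BACKEND_PORT = 8022         # active backend; changing this swaps the model


async def pipe(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    """Copy bytes from reader to writer until EOF, then signal EOF onward."""
    try:
        while data := await reader.read(65536):
            writer.write(data)
            await writer.drain()
        if writer.can_write_eof():
            writer.write_eof()
    except OSError:
        pass  # the peer went away; the handler closes both sides


def make_handler(backend_host: str, backend_port: int):
    """Build a per-connection handler that relays to the given backend."""
    async def handle(client_r: asyncio.StreamReader, client_w: asyncio.StreamWriter) -> None:
        try:
            backend_r, backend_w = await asyncio.open_connection(backend_host, backend_port)
        except OSError:
            client_w.close()  # backend is down: drop the client connection
            return
        try:
            # Relay both directions at once until each side reaches EOF.
            await asyncio.gather(pipe(client_r, backend_w), pipe(backend_r, client_w))
        finally:
            client_w.close()
            backend_w.close()
    return handle


async def serve() -> None:
    """Listen on the swap point and relay every connection to the backend."""
    server = await asyncio.start_server(
        make_handler(BACKEND_HOST, BACKEND_PORT), LISTEN_HOST, LISTEN_PORT
    )
    async with server:
        await server.serve_forever()

# To run the relay: asyncio.run(serve())
```

Because clients only ever know about port 8010, pointing `BACKEND_PORT` at a different server is the whole migration. Binding to 127.0.0.1 means nothing off the machine can reach the relay directly.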

Numbers

Steady throughput: ~83 tok/s. Peak measured: 86.83 tok/s. The MTP speculative decoding head contributes roughly 87% of the throughput lift over non-speculative decoding. That single component is the difference between 22 and 83 tok/s. It's not a feature. It's the whole game.
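Why speculative decoding can lift throughput this much, and why adding draft tokens eventually stops paying, can be sketched with the standard expected-acceptance formula. This is a toy model, not a measurement of this stack: the acceptance rate and per-draft cost below are hypothetical values chosen for illustration, acceptances are assumed independent, and the model ignores the extra verification cost of longer drafts, so it does not reproduce the measured n=3 optimum.

```python
def expected_tokens_per_pass(accept_rate: float, n_draft: int) -> float:
    """Expected tokens emitted per target-model pass.

    Assumes each drafted token is accepted independently with probability
    accept_rate and that acceptance stops at the first rejection. The target
    model always contributes one token itself, so the expectation is
    1 + p + p^2 + ... + p^n.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    return sum(accept_rate ** k for k in range(n_draft + 1))


def relative_throughput(accept_rate: float, n_draft: int, draft_cost: float) -> float:
    """Throughput relative to plain decoding (one token per pass).

    draft_cost is a hypothetical cost of drafting one token, expressed as a
    fraction of one full target-model pass.
    """
    return expected_tokens_per_pass(accept_rate, n_draft) / (1.0 + n_draft * draft_cost)


# Hypothetical inputs: 80% acceptance, each draft token costs 10% of a pass.
for n in range(1, 6):
    print(n, round(relative_throughput(0.8, n, 0.1), 3))
```

Each extra draft token adds less expected output than the one before it (p^n shrinks) while its cost stays constant, which is the general shape behind a goldilocks setting.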

Context window: 32K tokens in production. The model can go to 128K with a 0.8% speed penalty at a GPU memory utilization (GMU) of 0.90. Most workloads don't need 128K. The profile run is cleaner at 32K.

What Didn't Work

These are the alternatives that lost, on the same hardware:

AEON (Qwen3.6-27B in NVIDIA's native FP4): 68.86 tok/s, 13 experiments, zero improvement. About 14 tok/s below Genesis's steady ~83. The 122K context window is real but rarely relevant.

Gemma 4 NVFP4: 30.6GB loaded size, won't fit in 32GB total VRAM. NVFP4 quantizes only the MLP layers; attention stays BF16.

Gemma 4 AWQ: 32.77 tok/s. No efficient kernel paths on Blackwell for the AWQ format.

Qwen3-32B AutoRound INT4: 17.9 tok/s. No MTP heads in the quant. One token per pass with more parameters. About 4.6x slower than Genesis.
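The gaps quoted above can be checked with a few lines of arithmetic. The sketch below uses only throughput figures that appear in the text, takes the approximate steady rate of 83 tok/s as the Genesis baseline, and adds the llama.cpp figure from the Software section for comparison. The `compare` helper is a name made up for this example.

```python
GENESIS_TOK_S = 83.0  # approximate steady throughput quoted in the text

# Measured throughput of each alternative on the same hardware, in tok/s.
ALTERNATIVES = {
    "AEON (native FP4)": 68.86,
    "Gemma 4 AWQ": 32.77,
    "Qwen3-32B AutoRound INT4": 17.9,
    "llama.cpp, same model": 22.0,  # from the Software section
}


def compare(baseline: float, other: float):
    """Return (absolute gap in tok/s, speed ratio), both rounded to 2 places."""
    return round(baseline - other, 2), round(baseline / other, 2)


for name, tok_s in ALTERNATIVES.items():
    gap, ratio = compare(GENESIS_TOK_S, tok_s)
    print(f"{name}: {gap} tok/s behind, Genesis is {ratio}x faster")
```

Gemma 4 NVFP4 is left out because it never loaded, so it has no throughput figure to compare.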

The lesson is the same as in the Chronicles: the model that wins isn't the biggest or the newest. It's the one where the full stack aligns.

Full experimental history with dates, per-experiment results, and the autoresearch methodology: The Local Inference Chronicles. Visual speed progression: Timeline.