Local Inference Timeline

Starting tok/s (CPU): 10.44
Production tok/s
Experiments run: 100+
Total improvement: 8×

Speed progression · tok/s (peak per milestone)

Beelink · CPU · Gemma-4 26B: 10.44
Tower · first contact: 32.4
GPU offload breakthrough: 70.7
Single-GPU ceiling: 71.1
Dual GPU · no PCIe limit: 107
Architecture wall (llama.cpp): 22
Genesis · vLLM · MTP n=3: 86.8
AEON NVFP4 (stuck): 68.9
32B challenger (no MTP): 17.9

Beelink EQI12 · i5-1235U · CPU Only · llama.cpp

Pre-Apr 17

Gemma-4 26B MoE — production inference server 10.44 tok/s

Q5_K_M · MoE · 8 active experts · Port :8081. mlock is required; without it the model pages out. Key research finding: E-cores are load-bearing for MoE expert dispatch. Restricting inference to P-cores dropped throughput to 7.73 tok/s. Optimal: threads=8, cpu_mask=3F5. The autoresearch loop ran 15+ experiments. This was the production server for all agents before the tower existed.
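To see which logical CPUs the mask `3F5` selects, it can be decoded bit by bit. This is a minimal sketch; the helper name `cpus_in_mask` is hypothetical, and which indices correspond to P-cores or E-cores depends on how the OS enumerates the i5-1235U, which the notes above do not state.

```python
def cpus_in_mask(mask_hex: str) -> list[int]:
    """Return the logical CPU indices whose bits are set in a hex affinity mask."""
    mask = int(mask_hex, 16)
    # Bit k set means logical CPU k is allowed.
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


cpus = cpus_in_mask("3F5")
print(cpus)       # [0, 2, 4, 5, 6, 7, 8, 9]
print(len(cpus))  # 8, the same count as threads=8
```

The mask enables exactly eight logical CPUs, which is consistent with running one thread per allowed CPU.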

RTX 5060 Ti 16GB · Blackwell SM_120 · GDDR7 · cha0tiktower

Apr 17

Tower arrives — first contact 32.4 tok/s

Supergemma4-26B, same model family as the Beelink. GPU mostly idle — expert tensors routing through CPU RAM across PCIe on every token. 3.7GB of 16GB GDDR7 in use.

Apr 18

GPU offload breakthrough 70.7 tok/s +118%

Expert tensors moved to GPU (last 6 of 30 layers stayed on CPU). Then: Gemma 4 SWA architecture discovery — 25 of 30 layers use a fixed 5,120-token ring buffer regardless of ctx size. Context window is essentially free. Restored to 32K ctx, q8 KV. Final: 69.8 tok/s at full context and full quality.
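Why a larger context costs so little can be sketched by counting KV-cache token slots per layer. This is an illustration under a stated assumption: the 5 non-SWA layers are assumed to keep KV entries for the full context, which the entry above does not confirm. Bytes per slot depend on head dimensions and KV precision, so only slot counts are compared.

```python
def kv_token_slots(ctx: int, total_layers: int = 30,
                   swa_layers: int = 25, window: int = 5120) -> int:
    """Count KV-cache token slots summed over all layers.

    SWA layers hold at most `window` tokens (ring buffer).
    Assumption: the remaining layers hold the full context.
    """
    swa = swa_layers * min(ctx, window)
    full = (total_layers - swa_layers) * ctx
    return swa + full


ctx = 32 * 1024
with_swa = kv_token_slots(ctx)   # 291840
no_swa = 30 * ctx                # 983040 if every layer scaled with ctx
print(with_swa, no_swa, round(with_swa / no_swa, 3))
```

Under these assumptions, a 32K context needs roughly 30% of the slots that a model with 30 full-attention layers would need.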

Apr 19

Single-GPU autoresearch — ceiling found 71.1 tok/s

15 experiments. Real gains from ubatch 512, 4 CPU layers (from 6), threads 4. Everything else noise or regression. PCIe is the hard ceiling on one GPU.
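One way to separate real gains from noise in a run like this is a greedy loop that keeps a change only if it beats the current best by more than a noise margin. This is a hypothetical sketch, not the actual autoresearch tool: the function name, the `bench` callback, and the 0.5 tok/s margin are all assumptions.

```python
from typing import Callable


def autoresearch(baseline: dict, candidates: list[dict],
                 bench: Callable[[dict], float], noise: float = 0.5):
    """Greedy search: apply each candidate change on top of the best config
    so far and keep it only if throughput improves by more than `noise`."""
    best_cfg = dict(baseline)
    best = bench(best_cfg)
    log = []
    for change in candidates:
        trial = {**best_cfg, **change}   # candidate overrides current values
        score = bench(trial)
        kept = score > best + noise      # smaller deltas are treated as noise
        log.append((change, score, kept))
        if kept:
            best_cfg, best = trial, score
    return best_cfg, best, log
```

Here `bench` would launch the server with a config and return measured tok/s. The log records every experiment, including rejected ones, so regressions stay visible.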

Apr 20

Second GPU — PCIe bottleneck eliminated 107.23 tok/s +51%

All 30 expert layers on GPU. 12GB headroom — KV upgraded to f16, zero dequant overhead. 20 experiments, two real wins (KV f16 +7.17, ubatch 4096 +0.86), then a hard wall. llama.cpp peak.
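The percentage gains quoted on the milestone headings can be reproduced from the peak figures. A minimal sketch; `pct_gain` is a hypothetical helper name, and the inputs are the tok/s values given in this timeline.

```python
def pct_gain(old: float, new: float) -> float:
    """Relative improvement from old to new, in percent."""
    return (new - old) / old * 100.0


print(round(pct_gain(32.4, 70.7)))    # 118  (first contact -> GPU offload)
print(round(pct_gain(71.1, 107.23)))  # 51   (single GPU -> dual GPU)
print(round(pct_gain(22, 83)))        # 277  (llama.cpp wall -> Genesis)
```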

Apr 22

Field exploration — five models tested

35B MoE 100 tok/s — 20 experiments, zero improvements, already at ceiling
70B VRAM cliff 55.6 tok/s — model at VRAM edge, OOM killed most experiments before they ran
GLM-4.7-Flash 97 tok/s — found one real gain (+1.17 tok/s), infrastructure issues halted research

Apr 23

Architecture wall — Qwen3.6-27B in llama.cpp 22 tok/s

No CUDA kernel: GDN/DeltaNet hybrid layers have no optimized CUDA path. SSM state updates run via fallback code. No config change fixes a missing kernel. The stack had to change entirely.

Apr 27

Genesis — three things aligned 83 tok/s +277%

AutoRound INT4: preserves the 280MB MTP projection weights (without it the speculative head finds nothing)
vLLM Genesis patches: TurboQuant hybrid gate + 19 downstream fixes
MTP n=3: self-speculative decoding, +87% throughput lift. It's the whole game.
Pass 2 found VLLM_MARLIN_USE_ATOMIC_ADD=1: +6.25 tok/s (undocumented). Peak: 86.83.
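Why n=3 draft tokens can lift throughput so much follows from the usual speculative-decoding estimate. The sketch below assumes each draft token is accepted independently with the same probability, which is a simplification and not something measured in this log; `expected_tokens_per_pass` is a hypothetical name.

```python
def expected_tokens_per_pass(accept_rate: float, n_draft: int = 3) -> float:
    """Expected tokens emitted per verification pass.

    Assumes i.i.d. acceptance: draft token k survives only if all earlier
    drafts were accepted, so the expectation is 1 + a + a^2 + ... + a^n.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    return sum(accept_rate ** k for k in range(n_draft + 1))


for a in (0.0, 0.5, 1.0):
    print(a, expected_tokens_per_pass(a))  # 1.0, 1.875, 4.0
```

Tokens per pass is an upper bound on the speedup, since running the draft heads and verifying the extra tokens adds cost to every pass.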

Apr 28

AEON NVFP4 — 13 experiments, zero improvement 68.86 tok/s

Same model in Nvidia's native FP4 format. 122K context window. Never moved — already at its ceiling, 15 tok/s below Genesis.

Apr 29–30

Wire swap — and jr catches a security issue autonomously

Tower moved from WiFi to a direct point-to-point wire (0.76ms). While the session was offline during the swap, the local coding agent (AEON via Crush) recognized that vLLM was now bound to 0.0.0.0 on a direct wire and changed it to 127.0.0.1 without being prompted. Also: the MESA benchmark showed AEON regressed Mike's memory-recall by 0.10–0.16. Wrong model for the job.
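The difference between the two bind addresses can be checked mechanically with the standard library. A minimal sketch; the function name `exposure` and its labels are hypothetical, and it only classifies an address string without inspecting a running server.

```python
import ipaddress


def exposure(bind_host: str) -> str:
    """Classify how widely a server bound to `bind_host` is reachable."""
    addr = ipaddress.ip_address(bind_host)
    if addr.is_loopback:
        return "local-only"        # e.g. 127.0.0.1: same machine only
    if addr.is_unspecified:
        return "all-interfaces"    # e.g. 0.0.0.0: every network interface
    return "single-interface"


print(exposure("0.0.0.0"))    # all-interfaces
print(exposure("127.0.0.1"))  # local-only
```

A check like this could run before a server starts, so that a wildcard bind is an explicit decision and not a leftover default.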

May 3

Switch confirmed — Genesis as default 85.8 tok/s

AEON: 38% structured JSON failures on LME tasks. Genesis: clean. Proxy switched. AEON stopped.
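A structured-JSON failure rate like the one above can be measured by attempting to parse each model output. This sketch counts only syntactic failures; whether the original figure also included schema violations is not stated, and `json_failure_rate` is a hypothetical name.

```python
import json


def json_failure_rate(outputs: list[str]) -> float:
    """Fraction of outputs that fail to parse as JSON."""
    if not outputs:
        raise ValueError("need at least one output")
    failures = 0
    for text in outputs:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            failures += 1
    return failures / len(outputs)


samples = ['{"a": 1}', "not json", "[1, 2]", '{"a":']
print(json_failure_rate(samples))  # 0.5
```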

May 5

Reboot — VRAM conflict settles the question

Both models auto-started, both claimed VRAM, neither got enough. 32GB total, each needs ~14–15GB. AEON disabled from auto-start permanently. Genesis is the only thing that runs on the tower.

May 6–9

Challengers fail

Gemma 4 NVFP4 30.6GB — won't fit. NVFP4 keeps attention BF16, loaded size ~28GB.
Gemma 4 AWQ 32.77 tok/s — no efficient Blackwell kernels for AWQ format.
Qwen3-32B 17.9 tok/s — no MTP heads in quant. One token per pass, 32B parameters. 4.5× slower than Genesis.

Production · Genesis · Always On

The Local Inference Chronicles
