Local LLM Research Log · cha0tikhome + cha0tiktower

The Local Inference Chronicles

Every model. Every experiment. Every failure. Two machines, three backends, one winner.

10.44
Starting tok/s (CPU)
83
Production tok/s
100+
Experiments run
Total improvement
Speed progression · tok/s (peak per milestone)
Beelink · CPU · Gemma-4 26B
10.44
Tower · first contact
32.4
GPU offload breakthrough
70.7
Single-GPU ceiling
71.1
Dual GPU · no PCIe limit
107
Architecture wall (llama.cpp)
22
Genesis · vLLM · MTP n=3
86.8
AEON NVFP4 (stuck)
68.9
32B challenger (no MTP)
17.9
Beelink EQI12 · i5-1235U · CPU Only · llama.cpp
Pre
Apr 17
Gemma-4 26B MoE — production inference server 10.44 tok/s
Q5_K_MMoE · 8 active experts Port :8081. mlock required — without it the model pages out. Key research finding: E-cores are load-bearing for MoE expert dispatch. Restricting to P-cores dropped to 7.73 tok/s. Optimal: threads=8, cpu_mask=3F5. Autoresearch loop ran 15+ experiments. This was the production server for all agents before the tower existed.
RTX 5060 Ti 16GB · Blackwell SM_120 · GDDR7 · cha0tiktower
Apr
17
Tower arrives — first contact 32.4 tok/s
Supergemma4-26B, same model family as the Beelink. GPU mostly idle — expert tensors routing through CPU RAM across PCIe on every token. 3.7GB of 16GB GDDR7 in use.
Apr
18
GPU offload breakthrough 70.7 tok/s +118%
Expert tensors moved to GPU (last 6 of 30 layers stayed on CPU). Then: Gemma 4 SWA architecture discovery — 25 of 30 layers use a fixed 5,120-token ring buffer regardless of ctx size. Context window is essentially free. Restored to 32K ctx, q8 KV. Final: 69.8 tok/s at full context and full quality.
Apr
19
Single-GPU autoresearch — ceiling found 71.1 tok/s
15 experiments. Real gains from ubatch 512, 4 CPU layers (from 6), threads 4. Everything else noise or regression. PCIe is the hard ceiling on one GPU.
Apr
20
Second GPU — PCIe bottleneck eliminated 107.23 tok/s +51%
All 30 expert layers on GPU. 12GB headroom — KV upgraded to f16, zero dequant overhead. 20 experiments, two real wins (KV f16 +7.17, ubatch 4096 +0.86), then a hard wall. llama.cpp peak.
Apr
22
Field exploration — five models tested
35B MoE 100 tok/s — 20 experiments, zero improvements, already at ceiling
70B VRAM cliff 55.6 tok/s — model at VRAM edge, OOM killed most experiments before they ran
GLM-4.7-Flash 97 tok/s — found one real gain (+1.17 tok/s), infrastructure issues halted research
Apr
23
Architecture wall — Qwen3.6-27B in llama.cpp 22 tok/s
No CUDA kernel GDN/DeltaNet hybrid layers have no optimized CUDA path. SSM state updates run via fallback code. No config change fixes a missing kernel. The stack had to change entirely.
Apr
27
Genesis — three things aligned 83 tok/s +277%
AutoRound INT4 preserves 280MB MTP projection weights (without it: speculative head finds nothing)
vLLM Genesis patches TurboQuant hybrid gate + 19 downstream fixes
MTP n=3 self-speculative decoding — +87% throughput lift. It's the whole game.
Pass 2 found VLLM_MARLIN_USE_ATOMIC_ADD=1: +6.25 tok/s (undocumented). Peak: 86.83.
Apr
28
AEON NVFP4 — 13 experiments, zero improvement 68.86 tok/s
Same model in Nvidia's native FP4 format. 122K context window. Never moved — already at its ceiling, 15 tok/s below Genesis.
Apr
29–30
Wire swap — and jr catches a security issue autonomously
Tower moved from WiFi to direct point-to-point wire (0.76ms). While the session was offline during the swap, the local coding agent (AEON via Crush) recognized vLLM was now bound to 0.0.0.0 on a direct wire — changed it to 127.0.0.1 without being prompted. Also: MESA benchmark showed AEON regressed Mike's memory-recall 0.10–0.16. Wrong model for the job.
May
3
Switch confirmed — Genesis as default 85.8 tok/s
AEON: 38% structured JSON failures on LME tasks. Genesis: clean. Proxy switched. AEON stopped.
May
5
Reboot — VRAM conflict settles the question
Both models auto-started, both claimed VRAM, neither got enough. 32GB total, each needs ~14–15GB. AEON disabled from auto-start permanently. Genesis is the only thing that runs on the tower.
May
6–9
Challengers fail
Gemma 4 NVFP4 30.6GB — won't fit. NVFP4 keeps attention BF16, loaded size ~28GB.
Gemma 4 AWQ 32.77 tok/s — no efficient Blackwell kernels for AWQ format.
Qwen3-32B 17.9 tok/s — no MTP heads in quant. One token per pass, 32B parameters. 4.5× slower than Genesis.
Production · Genesis · Always On

Genesis

Qwen3.6-27B AutoRound INT4 · MTP speculative decoding n=3
vLLM 0.19.2rc1 Genesis patches (P60–P82 applied)
VLLM_MARLIN_USE_ATOMIC_ADD=1 · Port 8022 · systemd service

83
tok/s · stable