localfamo.us

The Local Inference Chronicles: Everything We Tried

2026-05-09 · 20 min read · 100+ experiments · 2 machines · 3 backends

The Beelink EQI12 was the constraint. Intel Core i5-1235U, 32GB DDR4-3200, integrated Iris Xe graphics. No discrete GPU — no CUDA, no ROCm worth using, inference by CPU alone. But we didn't sit on it. We ran real experiments.

Before the Tower: The Beelink Chapter

The model was Gemma-4 26B — the A4B MoE variant at Q5_K_M quantization. Mixture-of-experts: 26 billion total parameters, only a fraction firing per token. On CPU-only hardware, MoE is the only way a 26B model runs at all. The production llama-server lived on port 8081. That's what all the agents talked to.

Peak throughput: 10.44 tok/s. Single user, --mlock enabled. The mlock flag was mandatory — without it the model pages out to swap and generation becomes non-deterministic chaos. Model loaded at 18.15 GiB, leaving roughly 10GB for KV cache and OS. Backend was libggml-cpu-alderlake.so, the CPU_ALL_VARIANTS build with AVX-VNNI — the right llama.cpp build for Alder Lake specifically.

We ran the autoresearch loop on it. LLM proposes a config change, run the benchmark, if composite score improves keep the commit, if not discard it. Git history as research memory. The loop found things worth knowing.

The most interesting discovery: E-cores are load-bearing for MoE inference. Gemma-4 has 8 experts active per token. The natural assumption is that the fast P-cores (5.3 GHz boost) should do all the work. Tested it: restricting to P-cores only (cpu_mask=F) dropped throughput from 10.44 to 7.73 tok/s. The E-cores are doing the expert dispatch. Take them away and the MoE topology punishes you. This was a real architectural insight, not a tuning variable.

Other settled questions from the Beelink research:

ctx_size 16384optimal — 32768 requires quantized KV to avoid OOM
threads=8, cpu_mask=3F5goldilocks — 4 threads loses expert throughput, 12 adds overhead
ubatch / batch tuningqueued, never ran

The research queue was still open when the tower arrived. It closed the same day.

The llama-server on port 8081 was stopped when the tower's proxy at 8010 came online. It stayed stopped. SYSTEM-STATE.md still showed it as LIVE for the next 16 days — Frank's tower watchdog was firing false alerts because it thought a permanently retired service was down. Cleaned that up May 9th. The Beelink is still the best machine for running agents. It's just not running models anymore.

April 17First Contact

The tower arrived. CyberPowerPC GXi3400BSTV17 — Intel Core Ultra 7 265F, 32GB DDR5, RTX 5060 Ti 16GB GDDR7, Blackwell architecture, SM_120. Same model as the Beelink: supergemma4-26b, now on a GPU. Baseline: 32.4 tok/s generation, 74.1 prompt processing. Usable, not impressive. More than 3x the Beelink — but that's a low bar.

First hypothesis: CPU affinity. Arrow Lake has P-cores at 5.3 GHz and E-cores at 4.6 GHz. Force the work to the fast cores. Used taskset -cp 0-7 on the running process.

Result: 32.6 tok/s. Two tenths. Revert.

April 18The Day Everything Changed

The second experiment was the one that mattered. Not because it was clever — because it required actually understanding what was happening.

supergemma4-26b is MoE. llama.cpp was routing all the expert tensors to CPU RAM and pulling them across the PCIe bus on every single token. The GPU had 16GB of GDDR7 and 3.7GB of it was doing anything. The model needed 16,003 MiB for full GPU load. Available: 15,223 MiB. It didn't fit, so the framework punted the expensive tensors to RAM and paid the bus penalty on every decode step.

The fix was accepting an imperfect load. Keep only the last 6 of 30 expert layers on CPU. Drop context from 32K to 8K temporarily. Cut KV precision from q8 to q4 to recover headroom. Pack the rest on GPU.

First run: 70.7 tok/s. Prompt processing jumped from 74 to 222. The PCIe bottleneck was gone. This is the thing that mattered — not a clever flag, not a patch, just moving the data to where the compute was.

Then came the architecture discovery that made full context possible. Gemma 4 uses sliding window attention on 25 of its 30 layers. Those 25 layers use a fixed 5,120-token ring buffer regardless of what context size you set. Only 5 layers use full-context KV that scales with ctx. Total KV at 32,768 context in q4 precision: 461 megabytes. Not gigabytes. Megabytes. Context was essentially free to expand. Context went back to 32K, KV quality back to q8.

Final result: 69.8 tok/s at full context, full KV quality. The improvement required understanding the architecture, not tuning flags.

Six more experiments ran that day. None moved the needle in the right direction:

Layer ordering — first 6 vs last 6 on CPUno change / worse prompt
Process priority to high67.4 tok/s — starves CUDA driver threads
P-core threading only (cpu-strict)+74 tok/s warm prompt throughput — kept it
ubatch 512 and 2048no meaningful difference
CUDA 13.0 rebuildcrashed — BLACKWELL_NATIVE_FP4 breaks Q4_K_M

The CUDA 13.0 failure is worth naming. It enables BLACKWELL_NATIVE_FP4 — native Blackwell FP4 matrix multiply instructions — which is incompatible with Q4_K_M as currently implemented. CUDA 12.8 already targets SM_120a and nothing breaks. No benefit from upgrading. Stayed on 12.8.

Single-GPU config locked: 69.8 tok/s. Last 6 expert layers on CPU, 32K context, q8 KV, P-core threading.

April 19Autoresearch — Ceiling at 71.1

15 experiments, Frank as the research brain, controlled benchmark between each change. The loop found real things:

ubatch 512 (from 1024)+1.86 tok/s
4 CPU layers instead of 6 (blk.26-29)+5.03 tok/s — less PCIe overhead
threads 4 (from 8)+2.01 tok/s — less coordination overhead
--parallel 2+0.51 tok/s

The threading result is counterintuitive until it isn't. The CPU expert workload is small — you don't need a squad, you need two people who don't get in each other's way. Threads 6 regressed. Threads 12 was catastrophic.

Everything else — defrag thresholds, no-mmap, more CPU layers, E-core inclusion, ubatch 4096 — either had no effect or hurt. Single-GPU ceiling: 71.1 tok/s. That was the wall.

April 20Second GPU — The Problem Changes Shape

The second RTX 5060 Ti arrived. 32GB total VRAM. This changed the problem entirely.

The model that had required careful expert-layer surgery now fit on GPU with 12GB to spare. All 30 expert layers on GPU. No PCIe bottleneck — not reduced, eliminated. Autoresearch ran 20 experiments, found two real wins:

KV q4_0 → f16+7.17 tok/s — 12GB headroom makes f16 trivial, eliminates dequant overhead
ubatch 4096 (from 1024)+0.86 tok/s

Everything after that was noise. Threading tweaks, tensor split ratios, main-gpu assignments, poll modes — marginal at best, regressive at worst.

Dual-GPU ceiling: 107.23 tok/s.

April 22Exploring the Field

Three models. Three different kinds of ceilings.

Qwen3.6-35B-A3B MoE — 35B total, 3B active. Baseline: 100.24 tok/s. Ran 20 experiments, found nothing. Already at the ceiling. The interesting result was context depth degradation on the 131K window variant: 32K depth ran at 71.62 tok/s, 65K at 56.50, 131K at 39.72. The full-attention layers dominate at depth. O(n) beats O(1) when O(n) shows up enough times.

Llama 3.3 70B IQ4_XS — 60 layers on GPU, 20 on CPU, 65K context. Baseline 55.6 tok/s. The autoresearch was a disaster. The model was at the absolute edge of VRAM. Every experiment required a server restart. VRAM doesn't free instantaneously when a process dies — the new process started before the old one's memory fully released and immediately OOMed. Killed most experiments before they ran. Zero improvements found.

The 70B was a VRAM problem, not a config problem. Recognizing that distinction saved days of work.

GLM-4.7-Flash Q5_K_M — 30B MoE, 3B active, 95.9 tok/s baseline. One real improvement: q5_0 KV over q4_0 gave +1.17 tok/s. Then infrastructure problems took over — the research loop was running on the same inference server it was testing, so restarts killed the loop mid-experiment. Net: 97.07 tok/s.

April 23The Architecture Wall

Qwen3.6-27B. Dense model, not MoE — all parameters fire on every token — with GDN hybrid layers that mix traditional attention with state-space model computations. In llama.cpp it ran at 22 tokens per second.

Not 50. Not 40. Twenty-two.

The reason was architectural, not tunable. GDN/DeltaNet hybrid layers have no optimized CUDA kernel path in llama.cpp. The SSM state updates that define the architecture's efficiency are computed via fallback code. No flag changes this. You can't tune your way around a missing kernel.

A Medium post from Wasif Basharat documented reaching 85 tok/s on a single RTX 3090 with the same model. The stack required three things to align: the Lorbus AutoRound INT4 quantization (which preserves the MTP projection weights at 280MB — without them, the speculative decoding head finds nothing), the Genesis vLLM patches (monkey-patches the TurboQuant hybrid gate and 19 downstream fixes), and a CUDA graph guard for the .tolist() calls that crash during warmup. Get all three right, set MTP speculative decoding to n=3, and the model runs at 80+ tok/s.

The path was clear. Switch stacks.

April 27vLLM — Building Genesis

Pass 1, 21 experiments. Baseline: 75.63 tok/s.

KV dtype fp16 (over fp8)+4.95 tok/s — eliminates per-layer dequant overhead
MTP speculative decoding n=3goldilocks — n=2 loses 4.3, n=4 loses 6, no MTP loses 35
GMU, sampler, seqs, CUDA connection countsmarginal or negative

The MTP number deserves a sentence. No speculative decoding: −35 tok/s. That single head is contributing roughly 87% of the throughput lift. It's not a feature. It's the whole game.

Pass 1 canonical: 80.58 tok/s.

Pass 2, 12 experiments. One win:

VLLM_MARLIN_USE_ATOMIC_ADD=1+6.25 tok/s

This flag enables atomic-add reduction in the gptq_marlin decode kernel for small-n batches in tensor-parallel mode. Not in the standard documentation. Found by running experiments. This is the kind of thing you only discover by having a research loop that actually runs and logs.

Pass 3, context ceiling sweep. 128K context at GMU 0.90 costs 0.8% speed. The window can go to 131K with essentially no penalty. Production stayed at 32K — most use cases don't need 128K and the profile run is cleaner.

Genesis promoted to production April 27th. ~83 tok/s steady, port 8022.

April 28AEON NVFP4 — 13 Experiments, Nothing

AEON was the same model — Qwen3.6-27B — in Nvidia's native FP4 quantization format. Hardware-native on Blackwell, 122,880-token context window. Baseline: 68.86 tok/s.

Ran 13 experiments. MTP n=4 and n=5 both timed out. n=2 lost 9%. Everything else was noise. AEON's ceiling was 68.86 — roughly 15 tok/s lower than Genesis, and the context window advantage was rarely relevant for actual workload.

AEON went to standby. Genesis was the production stack.

April 29–30jr Catches Something We Missed

Built a local coding agent — jr — using Crush wired to AEON via a thin proxy. Hit three separate unfixed bugs in the vLLM/Qwen3 tool-calling stack: tool_choice: required combined with the reasoning parser has a known bug (issue #19051, fixed in 0.9.0, we were on 0.19.2rc1). Streaming parser drops tool calls intermittently. Thinking mode plus tool calls breaks roughly 60% of the time even with thinking disabled. Five of eight coding tasks passed. Infrastructure bugs, not model failures.

April 30th the tower moved from WiFi to a direct point-to-point wire to cha0tikhome — 10.10.10.1 to 10.10.10.2, ~0.76ms Tailscale direct path. While the main session was offline during the wire swap, jr was left running autonomously.

It noticed something.

The topology change meant vLLM was now bound to 0.0.0.0 on a direct wire rather than behind NAT. Security exposure that hadn't existed before. jr changed the bind address to 127.0.0.1 without being prompted. Then rewrote two systemd service files with proper dependency ordering and added the Hermes security scorer as an MCP pre-execution check.

First time a local agent caught a security issue we hadn't thought of.

Same day: MESA benchmark against AEON on Mike's relay pipeline. Composite score 0.3765, 17.9% pass rate — down from 0.4413 / 42.0% on the previous model three weeks earlier. AEON's thinking mode was catching adversarial manipulation well but burning context budget on reasoning chains that hurt memory-recall precision. Wrong model for that workload.

May 3The Switch Confirmed

AEON was producing structured JSON failures at ~38% rate on LME tasks. Genesis with thinking disabled server-side — a vLLM server flag, not a per-request setting, due to bug #17609 — ran cleanly at 85.8 tok/s. Proxy switched to Genesis as default. AEON stopped.

May 5The Two-Model Problem, Made Concrete

cha0tikhome rebooted. SSH unreachable because sshd was waiting on the wrong tailscale unit name. Tower had also rebooted. AEON and Genesis both tried to start. Both claimed VRAM. Neither got enough.

32GB total VRAM. Each model needs ~14–15GB. They can't coexist at full capacity. This was always true — the reboot just forced it into the open. AEON stopped, disabled from auto-start. Genesis is now the only thing that runs on the tower.

May 6Gemma 4 — Two Attempts

Google announced a speculative decoding drafter for Gemma 4 claiming up to 3x speedup. Worth testing.

Attempt 1: NVFP4 version, 30.6GB. Wouldn't fit — NVFP4 only quantizes MLP layers, attention stays BF16, loaded model occupies ~28GB with no room for KV cache on 2×16GB. Dead on arrival.

Attempt 2: AWQ 4-bit, ~19.6GB, compressed-tensors format. Baseline: 32.77 tok/s. Genesis was at 80. Two of the first four experiments timed out. No meaningful improvement in any that ran. AWQ doesn't have efficient kernel paths on Blackwell. Called it.

May 932B Challenger — Gone in 500 Tokens

The Qwen3-32B AutoRound INT4 quant from kaitchup. 18.3GB in four shards. Stopped Genesis to clear VRAM — which required the mollydog mechanism to write a systemd sudoers rule because Genesis had Restart=always and kept coming back. Started the 32B. 500 tokens in 27.994 seconds.

17.9 tok/s.

Genesis was 80.3. The gap is 4.5x. Root cause: the kaitchup AutoRound quant doesn't include MTP heads. Those heads are what give Genesis its self-speculative decoding — same model predicting multiple tokens per forward pass. Without them you're doing one token per pass with a 32B model instead of two tokens per pass with a 27B model. Bigger. Slower. Wrong in every direction. Genesis wins by a factor too large to argue with.


Where It Stands

CPU inference on the Beelink. Single-GPU llama.cpp. Dual-GPU llama.cpp. The architecture wall. vLLM with Genesis patches. Four research passes on vLLM alone. 100+ individual experiments across seven models, two machines, and three inference backends.

The Beelink ran the same model family that eventually won on the tower. Gemma-4 26B on CPU at 10 tok/s pointed toward the same truth we kept relearning on GPU: MoE matters, data placement matters, and you can't tune around a ceiling imposed by hardware. The tower made the ceiling much higher. The lessons were the same.

The model that won isn't the biggest or the newest or the one with the most context. It's the one where the full stack aligned — quantization format preserves the MTP heads, speculative decoding fires, vLLM patches handle the hybrid layers, one undocumented Marlin kernel flag gets set correctly. Genesis at 83 tok/s, running as a systemd service on hardware that cost around $2,000, never needs attention.

The things that turned out not to matter: process priority, CPU affinity for generation, tensor split ratios on PCIe, CUDA 13.0, everything involving the 70B experiments, AWQ format on Blackwell, anything requiring NVLink for NCCL P2P.

The things that turned out to matter: where the data lives relative to the compute, whether the speculative decoding head survived quantization, one undocumented environment variable in the Marlin kernel, and knowing when a model's architecture is the ceiling rather than its configuration.

That last one is the hard one. You can feel your way toward the right flags eventually. Recognizing that you're fighting a missing kernel — that no flag will ever fix a missing kernel — that's what saves you weeks.