localfamo.us

The Upgrade That Wasn't

2026-05-10 · 12 min read · 1 upgrade attempt · 1 crash loop · 3 compounding failures · back to baseline

Someone posted about DFlash. New speculative decoding technique. Real paper, real GitHub, already merged into vLLM and SGLang. The claim: 8.5× throughput improvement. 48.5 tokens/sec becomes 415 tokens/sec on the same model with zero accuracy loss.

That number is real — on a datacenter GPU, with a specific MoE model, at batch size 1. But it says nothing about whether it applies to your stack. Before touching anything, I read the logs.

Why DFlash Doesn't Apply Here

DFlash replaces the external draft model in speculative decoding with a block diffusion model that generates 16 tokens in one parallel shot. The main model then verifies the whole block at once. On hardware where verification is memory-bandwidth-bound (H100, A100, TPU), the speedup is massive. On consumer GPUs, where verification is compute-bound, it is more like 2–2.5×.

Genesis — the production stack here — uses self-MTP. Qwen3.6-27B has native multi-token prediction heads baked directly into the model architecture. One forward pass, multiple tokens out. No separate draft model. No additional VRAM. No overhead beyond the base inference cost.

DFlash has nothing to replace. Self-MTP is already doing the thing DFlash is designed to do, and it's doing it for free because the weights are already loaded. We tested disabling MTP a day earlier. Throughput dropped from 80.8 t/s to 40.1 t/s. Half. That's how much the native prediction heads are contributing.

The 8.5× headline is real. It's just for a different setup than ours.

The Actual Upgrade

While investigating DFlash, found something worth trying: vLLM 0.20.2 shipped CUTLASS kernel fixes for Blackwell (SM_120). Our stack runs on dual RTX 5060 Ti — Blackwell architecture. The fixes were potentially relevant.

The Genesis patch system is drift-aware. It detects when upstream vLLM has absorbed a backported fix and skips that patch automatically. An upgrade should be low-risk: install new vLLM, re-run patch apply, patches auto-skip anything already fixed upstream. Clean in theory.
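
One way to picture the drift-aware skip is a marker check: each patch names a string that only exists once the fix is present, and the patch is skipped when the target file already contains it. This is a minimal sketch under that assumption, not the Genesis implementation. The function name patch_status, the marker convention, and the three status words are all hypothetical.

```shell
# Hypothetical sketch of a drift-aware decision for a single patch.
# patch_status and the marker convention are illustrative assumptions,
# not how the Genesis patch system is actually written.
patch_status() {
    target_file=$1   # file the patch would modify
    marker=$2        # literal string that appears only in fixed code

    if [ ! -f "$target_file" ]; then
        echo "failed"      # target is gone: upstream moved or removed the file
        return 1
    fi
    if grep -qF -- "$marker" "$target_file"; then
        echo "skipped"     # upstream already carries the fix
    else
        echo "apply"       # fix is missing, so the patch is still needed
    fi
}
```

A run over a patch set would call this once per patch and tally the three outcomes, which is the shape of the applied/skipped/failed summary the patch tool prints.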

In practice:

pip install vLLM 0.20.2: installed cleanly (244MB)
Genesis patch apply: 23 applied, 36 skipped, 0 failed
Service restart: NCCL crash, TP=2 init fails on SM_120

The error: RuntimeError: NCCL error: unhandled cuda error. Multi-GPU tensor parallel initialization — the thing that splits the model across two GPUs — fails on Blackwell in vLLM 0.20.2. Not a configuration error. Not a patch conflict. The backend doesn't support SM_120 at TP>1 yet.
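
A pre-upgrade gate could have flagged this combination before the service restart. The sketch below reads compute capabilities in the form nvidia-smi reports for its compute_cap query field (for example "12.0" for SM_120) and the planned tensor-parallel size. The function name and the rule it encodes are assumptions drawn from this one incident, not an official compatibility check.

```shell
# Sketch of a pre-upgrade gate. Reads one compute capability per line on
# stdin and takes the planned tensor-parallel size as $1.
# blackwell_tp_risk is a hypothetical helper; the "risky" rule reflects only
# the failure described in this post.
blackwell_tp_risk() {
    tp_size=$1
    # Strip stray spaces and carriage returns, then look for an exact 12.0 line.
    if tr -d ' \r' | grep -qx '12\.0' && [ "$tp_size" -gt 1 ]; then
        echo "risky"
    else
        echo "ok"
    fi
}

# Usage on a live machine (not run here):
#   nvidia-smi --query-gpu=compute_cap --format=csv,noheader | blackwell_tp_risk 2
```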

The service entered a crash loop. Each restart attempt failed at the same point. And this is where the compound problem started.

The Compound Problem

A crash loop in vLLM doesn't cleanly release GPU memory between attempts. Each failed init leaves CUDA contexts alive in the kernel even after the process dies. After four crash-restart cycles, both GPUs showed ~15GB occupied with nothing running. New instances couldn't allocate memory to start. The crash loop had made itself unrecoverable without manual intervention.

Fix: kill the zombie processes by PID, wait for VRAM to drain to near-zero, then proceed with rollback. Had a 241MB tar backup of the original vLLM install. Restored it.
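
The "wait for VRAM to drain" step can be scripted rather than eyeballed. This is a rough sketch: it takes per-GPU used memory in MiB, one integer per line, and succeeds only when every GPU is at or below a threshold. The function name and the 500 MiB default are arbitrary assumptions, and whether leftover processes show up in the compute-apps listing depends on how they died.

```shell
# Sketch: succeed only if every GPU reports used memory <= threshold (MiB).
# vram_drained is a hypothetical helper; the 500 MiB default is an assumption.
vram_drained() {
    threshold_mib=${1:-500}
    awk -v limit="$threshold_mib" '
        { seen = 1; if ($1 + 0 > limit) busy = 1 }
        END { exit (seen && !busy) ? 0 : 1 }   # empty input counts as "not drained"
    '
}

# Usage on a live machine (not run here): list leftover compute PIDs,
# kill them, then poll until memory is released.
#   nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader
#   kill -9 <pid>
#   until nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | vram_drained 500; do
#       sleep 2
#   done
```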

One catch: pip uninstall removes the vllm binary from bin/. The tar restore puts the Python package back but not the entry point script. Service failed with exit code 127 — command not found — until the entry point was recreated manually. One printf command. Easy to fix, annoying to debug at 10pm.
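
For reference, recreating a console entry point by hand looks roughly like the sketch below. The helper name is hypothetical, and the module and function in the usage line (vllm.entrypoints.cli.main and main) are an assumption about this vLLM version. The authoritative value is in the entry_points.txt file inside the package's dist-info directory.

```shell
# Sketch: write a minimal Python console script and mark it executable.
# write_entry_point is a hypothetical helper, not part of pip or vLLM.
write_entry_point() {
    python_bin=$1    # interpreter of the environment that owns the package
    script_path=$2   # where the launcher should live, e.g. <venv>/bin/vllm
    module=$3        # module that holds the CLI function (assumed, see above)
    func=$4          # name of the CLI function inside that module

    printf '#!%s\nimport sys\nfrom %s import %s\nif __name__ == "__main__":\n    sys.exit(%s())\n' \
        "$python_bin" "$module" "$func" "$func" > "$script_path"
    chmod +x "$script_path"
}

# Usage (paths are placeholders, not run here):
#   write_entry_point <venv>/bin/python <venv>/bin/vllm vllm.entrypoints.cli.main main
```

Including the entry point script in the backup avoids this step entirely.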

Then: The Kernel

Recovery required a reboot. The system came up on 6.17.0-23-generic — apt had staged a kernel update that landed as the new default. NVIDIA modules don't exist for -23. The new kernel requires nvidia-kernel-common-580 >= 580.142; installed version is 580.126.09. Ubuntu's package repos haven't caught up yet.

The system booted fine. Just without any GPU driver. nvidia-smi failed. Genesis wouldn't start.

Fix: pin GRUB to the previous kernel.

sudo sed -i 's/^GRUB_DEFAULT=.*/GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 6.17.0-22-generic"/' /etc/default/grub
sudo update-grub

Done. Reboot into 6.17.0-22. Driver loads. But the system still needed to reach the BIOS.
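
GRUB_DEFAULT matches menu entries by exact title, so a typo in the sed line above fails silently at boot. A small check before rebooting is cheap. The sketch below extracts menuentry titles from a grub.cfg passed on stdin and tests for an exact match. The function name is hypothetical, and reading /boot/grub/grub.cfg may require root on some systems.

```shell
# Sketch: succeed if a menuentry with exactly this title exists in the
# grub.cfg text supplied on stdin. grub_has_entry is a hypothetical helper.
grub_has_entry() {
    title=$1
    # menuentry lines look like: menuentry 'Title' --class ... {
    sed -n "s/^[[:space:]]*menuentry '\([^']*\)'.*/\1/p" | grep -qxF -- "$title"
}

# Usage (not run here):
#   grub_has_entry "Ubuntu, with Linux 6.17.0-22-generic" < /boot/grub/grub.cfg
```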

Then: The Headless Problem

The tower runs headless — no monitor attached. Intel Core Ultra 7 265F is the F-series variant: no integrated graphics. There's no iGPU fallback. The MSI MAG Z890 Tomahawk halts POST when it can't find a display output and there's no iGPU to initialize.

The machine boots fine with a monitor plugged in. Without one, it shuts down at POST. This had never been an issue before because the power had never been hard-cycled mid-reboot.

Required: hauling out a monitor and keyboard, physical access, BIOS navigation to diagnose. What we found and changed:

AC Power Recovery: already set to Power On (correct)
ErP Ready: disabled (was blocking WoL standby power)
Resume by PCI-E/Network: enabled (Wake-on-LAN now functional)

The original plug-cycle failure wasn't a BIOS misconfiguration. The TP-Link smart plug switches off and back on in under 100 milliseconds, too fast for the board to register a real AC loss event. The capacitors hold charge through a sub-second cut, so the board never saw the power go away. It needs 3+ seconds off for AC loss detection to trigger.
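
If the plug is driven from a script, the minimum off-time can be enforced there. The sketch below is generic on purpose: the off and on commands are passed in as placeholders for whatever controls the plug, and the 3-second floor comes from the observation above rather than from any board specification.

```shell
# Sketch: refuse a power cycle whose off-time is too short to count as AC loss.
# power_cycle is a hypothetical helper; off_cmd and on_cmd stand in for
# whatever command actually toggles the smart plug.
power_cycle() {
    off_cmd=$1
    on_cmd=$2
    hold_seconds=${3:-5}
    min_seconds=3    # observed on this board; the real threshold is hardware-specific

    if [ "$hold_seconds" -lt "$min_seconds" ]; then
        echo "hold time ${hold_seconds}s is below ${min_seconds}s" >&2
        return 1
    fi
    $off_cmd && sleep "$hold_seconds" && $on_cmd
}

# Usage (placeholders, not run here):
#   power_cycle plug_off plug_on 5
```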

Permanent fix pending: a dummy HDMI adapter (an $8 dongle that tricks the GPU into thinking a monitor is attached). Eliminates the headless POST failure entirely without requiring any BIOS change.

Where It Landed

Back on vLLM 0.19.2rc1. Kernel pinned to 6.17.0-22. Genesis patches all applied. Wake-on-LAN configured for remote wakes without touching the power plug.

Post-recovery throughput: 71.4 t/s. Baseline was 79.4 t/s. About 10% below — expected after multiple forced kills and CUDA context corruption. CUDA graph caches rebuild over time. It'll normalize.
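
The "about 10%" figure is simple to reproduce. A small helper, with a hypothetical name, computes how far a measurement sits below a baseline:

```shell
# Sketch: percentage by which `current` falls below `baseline`, one decimal.
pct_below() {
    baseline=$1
    current=$2
    awk -v b="$baseline" -v c="$current" 'BEGIN { printf "%.1f\n", (b - c) / b * 100 }'
}

pct_below 79.4 71.4   # post-recovery vs. baseline: prints 10.1
pct_below 80.8 40.1   # MTP disabled vs. enabled:   prints 50.4
```

The second call is the MTP experiment from earlier in the post: disabling the native heads cost almost exactly half the throughput.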

vLLM 0.20.x on Blackwell SM_120 with tensor-parallel > 1 is broken as of 0.20.2. Do not upgrade until the genesis-vllm-patches repo explicitly documents SM_120 + TP=2 support for a specific release.

The Actual Lessons

Read your own logs before chasing external benchmarks. The answer to "should I try DFlash" was in yesterday's experiment results. Self-MTP was already doing speculative decoding. The research was already done.

Infrastructure failures compound. A crash loop creates stuck VRAM. Stuck VRAM blocks rollback. Rollback requires a reboot. Reboot lands on a kernel without GPU drivers. Kernel mismatch requires BIOS access. No monitor requires physical access. Three hours to recover from what should have been a ten-minute rollback.

Take a backup before you upgrade. Not a pip freeze — a tar of the actual installed files. Pip freeze can't reinstall a dev nightly that's been rotated out of the wheel server. The tar can.
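
A backup along those lines, including the entry point script that the restore in this post missed, might look like the sketch below. The function name and the example paths are hypothetical. The -P flag keeps absolute paths so a restore with tar -xzPf puts files back where they came from.

```shell
# Sketch: archive a set of installed paths, refusing to run if any is missing.
# backup_paths is a hypothetical helper; pass the package directory, its
# dist-info directory, and the bin/ entry point script.
backup_paths() {
    archive=$1
    shift
    for p in "$@"; do
        [ -e "$p" ] || { echo "missing: $p" >&2; return 1; }
    done
    # -P preserves absolute member names instead of stripping the leading slash.
    tar -czPf "$archive" "$@"
}

# Usage (placeholders, not run here):
#   backup_paths vllm-backup.tar.gz <venv>/lib/python3.x/site-packages/vllm <venv>/bin/vllm
```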

Consumer hardware at the bleeding edge of architecture support (Blackwell is four months old) means software compatibility lags. vLLM 0.20.x hasn't finished catching up to SM_120 in multi-GPU configurations. That's not a bug — it's just where the work is. Check the compatibility matrix, not just the release notes.