A 10 year old Xeon is all you need

204 points • by cafkafk • today at 6:38 AM • 72 comments • view on HN

Comments

Nice post and technically impressive work. I agree we need to understand the build pipeline and be able to do things locally. However, depending on your electricity cost, it might not make sense financially. These old servers are not energy efficient at all (I'm guessing that old Xeon server will easily pull 200W on load), and that model is currently at 0.1$/0.3$ per 1M tokens (with 76 tps and 262k context) in Openrouter (also, these servers are LOUD).

EDIT: I stand corrected, 200W is apparently way too high of an estimate. I used to run a bunch of old Xeon servers and they slurped watts like crazy, but I can't remember which ones exactly those were.

➕ show 2 replies

cafkafk • today at 6:42 AM

Hi HN. I wrote this post after getting frustrated by the lack of ways to run the new Gemma 4 Drafter models, and mainstream tools not prioritizing this, and hiding all the performance levers.

I ended up getting a modern 26B MoE model (Gemma 4) running at reading speed on an old recycled server with a single Xeon E5-2620 v4 and 128GB of DDR3 RAM (and no GPU). It took a lot of work, but it actually worked out somehow.

I've also linked the quants at the end, but they're not gonna run unless you use the ik_llama-cpp fork I mention, see other posts for more details.

I'm not an ML engineer, so I'm by no means an expert, and the server is busy acting as a Nix cache, but if you have any question, I can try to answer, but best effort.

➕ show 5 replies

throwaway2027 • today at 10:09 AM

Glad to see other people realizing this. I've been running Gemma 26B-A4B Q4 on a 2012 Xeon with 16GB to 24GB of RAM in a container. It's getting around 8 to 12 tokens per second. Obviously it's not comparable to huge contexts and running it on a GPU and the image decoder in llama.cpp is super slow compared to a GPU but for some small automation tasks and general trivia questions it's decent. The speed is just enough to not have to wait for it to finish so you can read along.

Here's my setup. You may want to figure out what the best optimizations are for your specific CPU like AVX2 because mine didn't have most of them. I did try MTP briefly but I wasn't getting performance improvements. You could play around with the batch sizes for cache or context or go even lower for Q2 and don't overcommit on threads either, but I would suggest either defaults or trying out llama-bench. This isn't by any means the best I assume but it worked decently for me and I sometimes swap out Gemma for Qwen. You could also lower q8_0 to q4_0 for more context but it could hurt quality some say, altough I have noticed it too on some models.

# Building

cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DGGML_OPENMP=ON

# Running

export OPENBLAS_NUM_THREADS=4

export OMP_NUM_THREADS=4

OPENBLAS_NUM_THREADS=4 OMP_NUM_THREADS=4 \

llama.cpp/build/bin/llama-server -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL --temp 1.0 --top-p 0.95 --top-k 64 --min-p 0.00 --jinja --host 0.0.0.0 --port 8080 --cache-type-k q8_0 --cache-type-v q8_0 --threads 4 --threads-batch 4 --ctx-size 8192 -n 8192 --batch-size 2048 --ubatch-size 512 --no-mmap --mlock --chat-template-kwargs '{"enable_thinking":false}' --no-mmproj -np 1 -fa 1

phaser • today at 9:42 AM

What intrigues me the most about AI progress, is not AGI or the model du jour by $AI_UNICORN, but rather what can be run locally. I remember having an amusing, but rather useless model in a beefy gaming PC that I had 6 years ago; and now, something that’s a hundred times better on my M5 laptop.

Should the market react to the memory shortage, the progress of the Apple silicon continue at the same pace, and what we’ll be able to run locally in 6 years will be very exciting. or frightening.

Also I don’t know what this means for the valuation of the AI companies. I remember asking about this very idea to one of their employees at an event and instead of answering he bailed out to grab a cocktail.

➕ show 4 replies

jansommer • today at 10:27 AM

The E5-2620 v4 is great. Have been using it for 10 years now. Wanted to upgrade until I saw current prices. I have 64 GB ddr4. Paired it with rx 9060 xt 16 GB and games run as fast as ever. Perhaps the cpu is a slight bottleneck in DOOM The Dark Ages, but i'm at 60 fps, so no problem. Light llm on the gpu is a nobrainer, and it's cool to see that things can be tuned to run ok on the cpu. I bought 2667 v4 a month ago for 30$. I'd expect it to give a decent performance boost but I just haven't had the need for it yet, but pushing into llm like in the article I'd probably upgrade because 2667 can handle slightly faster ram.

➕ show 1 reply

robotswantdata • today at 11:17 AM

Granite or sapphire rapids are very under rated for MoE inference loads. But you need a GPU for the KV cache.

Plus many boards also support CXL for RAM expansion over PCI 5!

Source: building a hybrid inference business for regulated industry workloads.

vhaudiquet • today at 8:48 AM

The E5 2620-v4 only supports DDR4.

FartyMcFarter • today at 11:15 AM

I may have missed this in the article, but:

What was the net effect of the optimisations? How much faster did it get?

cykros • today at 9:44 AM

Does this mean my 15 year old Phenom is too old? But it has 16 gb of DDR3 RAM!

Admittedly web browsers and it don't get along that well. Literally the only thing that drags though on my Slackware 15 system, and even then usually only when it gets to around 15 or so open tabs.

car • today at 9:43 AM

Similar recent posting with optimizations for older Xeon:

High-Performance AI on a Budget: Optimizing llama.cpp for Qwen3.5 Inference on a Dual-GPU HP Z440

https://news.ycombinator.com/item?id=47320244

haunter • today at 10:01 AM

And this is one of those CPUs which had dual slot motherboards so you can have double the fun (and power bill)

https://pcpartpicker.com/products/motherboard/#s=20028,20029...

Hasan121212 • today at 10:59 AM

I think one overlooked advantage of older Xeon systems is their availability. Many people can experiment with local AI deployments at a fraction of the cost of building a brand-new setup.

anon-3988 • today at 10:03 AM

I tried to run gemma 4 on this CPU and it did not go well

https://www.techpowerup.com/cpu-specs/ryzen-7-4800u.c2281

It is way too slow

potus_kushner • today at 7:20 AM

@cafkafk got a recommendation for a good model that fits into 64GB and leaves a couple GB free for other tasks ?

➕ show 1 reply

asimovDev • today at 8:31 AM

I have an ancient DDR3 Xeon that doesn't support any AVX (dual x5690 and 96GB 1333 MHz RAM). You reckon it would even build / run at all?

➕ show 4 replies

NSUserDefaults • today at 8:53 AM

How about the iMac Pro? Would that work? I was able to put 128gb in it (not as easy as the regular iMac but possible).

➕ show 1 reply

egorfine • today at 10:00 AM

This and the previous one are insanely good articles. Thank you!

rvba • today at 10:39 AM

As someone doing this for fun on a windows 11 machine (96gb ram, 5090 24gb) I wonder if I need any flags to keep the model in memory and avoid swapping to ssd?

I use LM studio and qwen3.5 35B - but never figured out if it is swapping or not.

Om am unrelated note, does anyone know a model that can help with this use case:

https://news.ycombinator.com/item?id=48301635

gigatexal • today at 9:32 AM

What kind of tokens per second did the op get I saw nothing of this written.

➕ show 2 replies

hparadiz • today at 9:10 AM

I'm now staring at a 10 year old 4U with 256 GB of DDR4 and thinking hmmmmm

Eonexus • today at 7:18 AM

I wonder what the tokens per second actually are. Yes, it does say "reading speed" but that varies for everyone, no?

➕ show 1 reply

ezconnect • today at 10:48 AM

When you use page up and page down key when reading that blog the first line on the screen is obscured by the floating bar or what ever it is. It is not even needed for reading.

christkv • today at 8:04 AM

Makes you wonder if its possible to squeeze more tps out of a strix halo system using the 16 zen5 cores as well as the gpu.

➕ show 2 replies

nurettin • today at 9:06 AM

I also run a Qwen 3.6 moe A4B on old hardware. I set it up with

numactl --membind=1

so it is constrained to one of the memory sticks which speeds up token generation a little.

shevy-java • today at 10:13 AM

The webpage's layout is just horrible. Scrolling is also non-default - and thus rather annoying; I had to stop after two scroll events. Why do people think they need so much fancy effects or non-standard behaviour, if their alleged goal is to get information across to other people?

SXX • today at 9:44 AM

Now we need someone try run Kimi K2.6 on old Xeon and DDR3. After all these platforms do support up to 768GB RAM.

➕ show 1 reply

hypfer • today at 9:31 AM

> The argument for speculative decoding is stronger on CPU than on GPU.

Uh. Uuuh.

No?

___

Also

> While a GPU has a massive pool of ultra-fast High-Bandwidth Memory (HBM), a CPU relies on small, lightning-fast “caches” (L1, L2, L3) built directly onto the processor chip.

What purpose does the quoting of "caches" serve there? Is this AI writing written by that model running on that host?

bflesch • today at 9:17 AM

Might consider going for even older CPUs which don't have the Intel ME ring -3 thing which is full of backdoors

➕ show 1 reply

alt Hacker News

A 10 year old Xeon is all you need

Comments