So, I'm interested in how many people are running higher-end AI models locally. I figure that if I'm spending $800/month on tokens, I can build a pretty beefy local machine for the cost of a few months' spend. What is people's experience with, say, a $5k custom-built server used only for running an AI model?
You will likely have to compromise on memory bandwidth or capacity at under $10k. The Radeon R9700 has 32 GB of VRAM and is pretty cheap (~$1500 right now), and it's what I primarily use. My home desktop has 128 GB RAM and my laptop has 96 GB RAM, but bandwidth limits make most models slow on those CPUs. Models with multi-token prediction (MTP) are somewhat usable on them: Nemotron 3 Super runs reasonably well on my desktop but does poorly on the agentic coding tasks I've given it, and my laptop can run Qwen3.6-27B reasonably well with a version of llama.cpp that is patched for MTP support. Usually, though, I run Qwen3.6-27B on my R9700. vLLM might support two or three R9700s on some OS, but I've not been able to get it to run at all on Ubuntu 26.04: the system ROCm version is apparently different from what's in the container images, and the system OpenMPI v5.0 finally removed the C++ bindings that were deprecated in 2005 but are still linked from some Python wheel that vLLM (probably indirectly) imports.
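To put rough numbers on the bandwidth point: during single-stream decoding, every generated token has to read (roughly) all active weights from memory once, so tokens/s is capped at about bandwidth divided by the size of the active weights. Here is a minimal back-of-envelope sketch of that. The bandwidth figures (640 GB/s for the GPU, dual-channel DDR5-5600 for the desktop) and the 4 bits/weight quantization are illustrative assumptions, not measurements from my machines, and the function names are made up for this example.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (1e9 params * bits / 8 bits per byte)."""
    return params_billion * bits_per_weight / 8


def max_decode_tokens_per_sec(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Upper bound on decode speed when every token reads all active weights once.

    Real throughput is lower (KV cache reads, kernel overhead); this is only a ceiling.
    """
    if active_weights_gb <= 0:
        raise ValueError("active_weights_gb must be positive")
    return bandwidth_gb_s / active_weights_gb


# Assumed numbers, for illustration only.
GPU_BANDWIDTH_GB_S = 640.0                      # assumed VRAM bandwidth
DDR5_BANDWIDTH_GB_S = 2 * 8 * 5600 / 1000       # 2 channels * 8 bytes * 5600 MT/s

dense_27b = weights_gb(27, 4)                   # 27B dense model at an assumed 4 bits/weight

gpu_ceiling = max_decode_tokens_per_sec(GPU_BANDWIDTH_GB_S, dense_27b)
cpu_ceiling = max_decode_tokens_per_sec(DDR5_BANDWIDTH_GB_S, dense_27b)

print(f"weights: {dense_27b:.1f} GB")
print(f"GPU ceiling: {gpu_ceiling:.1f} tok/s")
print(f"CPU ceiling: {cpu_ceiling:.1f} tok/s")
```

Under those assumptions the ceiling is about 47 tokens/s on the GPU versus under 7 on dual-channel DDR5. That gap is why MTP helps on the CPU machines: if several tokens are accepted per pass over the weights, the ceiling scales up by roughly that factor. For a mixture-of-experts model, pass only the active parameters per token, not the total.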
If you are spending $800/month on tokens, you are likely to notice a drop in quality with local models compared to near-frontier models. The models I can run locally are consistently worse than Claude Sonnet 4.6 (again, for the work I give them), although Qwen3.6 does feel almost like magic for its size because it can do a lot. The really big open-weight models should be better, but they want 200+ GB of RAM, which will need a correspondingly expensive CPU.
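For a sense of where a 200+ GB figure comes from, weight memory is roughly parameter count times bits per weight divided by 8. The sketch below checks that against the memory sizes mentioned above. The 400B model is a hypothetical stand-in for "really big", and the 4-bit quantization and 4 GB allowance for KV cache and runtime are assumptions I picked for illustration.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB for a model with the given parameter count."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         capacity_gb: float, overhead_gb: float = 4.0) -> bool:
    """True if weights plus an assumed fixed overhead fit in the given memory."""
    return weights_gb(params_billion, bits_per_weight) + overhead_gb <= capacity_gb


# Memory sizes from the discussion above.
machines = {
    "R9700 (32 GB VRAM)": 32,
    "laptop (96 GB RAM)": 96,
    "desktop (128 GB RAM)": 128,
}

# (params in billions, bits per weight); the 400B entry is hypothetical.
models = {
    "27B @ 4-bit": (27, 4),
    "hypothetical 400B @ 4-bit": (400, 4),
}

for model_name, (params, bits) in models.items():
    size = weights_gb(params, bits)
    for machine_name, capacity in machines.items():
        verdict = "fits" if fits(params, bits, capacity) else "does not fit"
        print(f"{model_name} ({size:.1f} GB) {verdict} on {machine_name}")
```

With these assumptions the 27B model fits everywhere, while the hypothetical 400B model needs 200 GB for weights alone and fits on none of the three. Long contexts make the KV cache much larger than the 4 GB assumed here, so treat the result as a lower bound on what you need.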