If you're trying to build AI-based applications, you can and should compare the costs of vendor-hosted solutions against hosting open models on your own hardware.
On the hardware side, you can run some benchmarks yourself (or use other people's) to get an idea of the tokens/second the machine delivers. Normalize that for your usage pattern (and implement batch processing wherever you can, which saves money under either approach) and you have a rough cost per token.
Then you compare that to the cost of something like GPT-5, which is simpler because the cost per million tokens is something you can grab straight off a website.
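Back-of-envelope, the comparison looks something like this (every number below is a made-up placeholder, not real pricing; plug in your own benchmark results and the vendor's current rate card):

    # Back-of-envelope self-hosting cost vs. API pricing.
    # All inputs are hypothetical; substitute your own measurements.
    HARDWARE_COST = 30_000.0       # purchase price, USD
    LIFETIME_HOURS = 3 * 365 * 24  # amortize over ~3 years
    POWER_KW = 2.0                 # draw under load
    POWER_PRICE = 0.15             # USD per kWh
    TOKENS_PER_SEC = 900           # your benchmarked throughput
    UTILIZATION = 0.4              # fraction of time the GPUs are busy

    hourly = HARDWARE_COST / LIFETIME_HOURS + POWER_KW * POWER_PRICE
    tokens_per_hour = TOKENS_PER_SEC * 3600 * UTILIZATION
    self_host_per_mtok = hourly / tokens_per_hour * 1e6

    API_PER_MTOK = 10.0            # example vendor price, USD per 1M tokens
    print(f"self-hosted: ${self_host_per_mtok:.2f}/Mtok "
          f"vs API: ${API_PER_MTOK:.2f}/Mtok")

Note how much the utilization factor matters; it comes up again below.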
You'd be surprised how much money running something like DeepSeek (or, if you prefer a more established company, Qwen3) will save you over the cloud systems.
That's just one factor though. Another is what hardware you can actually run things on. DeepSeek and Qwen will function on cheap GPUs that other models will simply choke on.
Well, the seemingly cheap option comes with significantly degraded performance, particularly for agentic use. Have you tried replacing Claude Code with some locally deployed model, say on a 4090 or 5090? I have. It is not usable.
Mixture-of-Experts models benefit from economies of scale: they can process queries in parallel, and different queries are expected to hit different experts at a given layer, which leads to higher utilization of GPU resources. So unless your application is already getting a lot of use, you're probably under-utilizing your hardware.
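A toy sketch of why, assuming uniform top-1 routing over 64 experts (real routers use learned gating and real models have their own expert counts, so this is purely illustrative): small batches leave most experts, and thus most of the GPU, idle.

    # Toy MoE routing: fraction of experts that see any work, by batch size.
    # Uniform random routing over 64 experts is an arbitrary illustrative
    # config, not any particular model's.
    import random

    NUM_EXPERTS = 64
    for batch in (1, 8, 64, 512):
        hits = {random.randrange(NUM_EXPERTS) for _ in range(batch)}
        print(f"batch={batch:4d}: {len(hits)}/{NUM_EXPERTS} experts active")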
>That's just one factor though. Another is what hardware you can actually run things on. DeepSeek and Qwen will function on cheap GPUs that other models will simply choke on.
What's cheap nowadays? I'm out of the loop. Does anything run on the integrated AMD Ryzen AI chips that ship on Framework motherboards? Is under $1k cheap?
> DeepSeek and Qwen will function on cheap GPUs that other models will simply choke on.
Uh, DeepSeek will not (unless you are referring to one of the older R1-distilled finetuned variants). Any flagship DeepSeek model will require 16x A100/H100+ with NVLink, even in FP8.
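Napkin math on the memory footprint, using the published ~671B total parameter count for DeepSeek-V3/R1 (KV cache and activation overheads are deliberately ignored here):

    # Why a flagship DeepSeek model needs a multi-GPU node: weights alone.
    PARAMS = 671e9          # DeepSeek-V3/R1 total parameters (published)
    BYTES_PER_PARAM = 1.0   # FP8
    GPU_MEM_GB = 80         # A100/H100 80GB

    weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
    print(f"weights alone: ~{weights_gb:.0f} GB "
          f"-> {weights_gb / GPU_MEM_GB:.1f} GPUs minimum")
    # ~671 GB of weights already exceeds a single 8x80GB node (640 GB),
    # before counting KV cache and activations; hence the 16-GPU figure.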
> with your own hardware
Or with somebody else's.
If you don't have strict data residency requirements, and if you aren't doing this at an extremely large scale, doing it on somebody else's hardware makes much more economic sense.
If you use MoE models (most modern >70B models are MoE), GPU utilization increases with batch size. If you don't have enough requests to keep the GPUs properly fed 24/7, they will end up underutilized.
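And idle time goes straight into your effective per-token cost. With made-up numbers (hourly cost and throughput are placeholders):

    # Effective cost per million tokens at different utilization levels.
    HOURLY_COST = 8.0         # USD/hour for the node, amortized
    PEAK_TOKENS_PER_SEC = 900

    for util in (1.0, 0.5, 0.1):
        cost = HOURLY_COST / (PEAK_TOKENS_PER_SEC * 3600 * util) * 1e6
        print(f"utilization {util:>4.0%}: ${cost:.2f}/Mtok")

At 10% utilization you're paying 10x per token what the fully loaded cluster would cost.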
Sometimes underutilization is okay, if your system needs to be airgapped for example, but that's not an economics discussion any more.
Unlike e.g. video streaming workloads, LLMs can be hosted on the other side of the world from where the user is, and the difference is barely going to be noticeable. This means you can keep GPUs fed by bringing in workloads from other timezones when your cluster would otherwise be idle. Unless you're a large, worldwide organization, that is difficult to do if you're using your own hardware.