What inference performance are you getting on this with llama?
How long would it take to recoup the cost if you made the model available for others to run inference at the same price as the big players?
Running LLMs directly might not be cost-effective.
I think there are probably law firms and doctors' offices that would gladly pay ~3-4K euro a month to have this thing delivered and run truly "on-prem" to work with documents they can't risk leaking (patent filings, patient records, etc.).
For a company with 20-30 people, the legal and privacy protection is worth the small premium over using cloud providers.
Just a hunch though! That would pay it off in 3-4 months?
He has GLM 4.5 running at ~100 tokens per second.
Assumptions:
Batch 4x to get 400 tokens per second, pushing power consumption to 900W instead of the underutilized 300W.
Electricity around €0.2/kWh.
Tokens valued at €1/1M out.
Assume ~70% utilization.
Result:
You get ~1M tokens per hour, which nets ~€0.8/hr after electricity. That works out to a payback time of a bit over a year on the €9K investment.
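The arithmetic above can be sketched out directly (a back-of-envelope calculation using the assumed numbers, not measured data):

```python
# Back-of-envelope payback estimate from the assumptions above.
tokens_per_sec = 400           # assumed batched throughput (4x single-stream ~100 tok/s)
utilization = 0.70             # assumed fraction of time actually serving load
power_kw = 0.9                 # ~900W under batched load
electricity_eur_per_kwh = 0.2
price_eur_per_mtok = 1.0       # tokens valued at €1 per 1M out
investment_eur = 9000

tokens_per_hour = tokens_per_sec * 3600 * utilization      # ~1.0M tokens/hr
revenue_per_hour = tokens_per_hour / 1e6 * price_eur_per_mtok
power_cost_per_hour = power_kw * electricity_eur_per_kwh   # €0.18/hr
net_per_hour = revenue_per_hour - power_cost_per_hour      # ~€0.83/hr

payback_days = investment_eur / net_per_hour / 24
print(f"net €{net_per_hour:.2f}/hr, payback ~{payback_days:.0f} days")
```

Which lands at roughly 450 days, i.e. the "bit over a year" figure.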
Honestly though there is a lot of handwaving here. The most significant unknown is getting high utilization with aggressive batching and 24/7 load.
Also, the demand for privacy can make these tokens worth much more than typical API prices for open-source models.
As a somewhat orthogonal comparison, renting 2 H100s costs around $6 per hour, so the hardware pays for itself versus renting in a bit over a couple of months.
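That rental comparison checks out with the same kind of rough arithmetic (assuming approximate dollar/euro parity for simplicity, which the original figures don't specify):

```python
# How long would €9K last if spent on renting 2x H100 instead?
rental_usd_per_hour = 6.0
budget = 9000                                  # treating €9K ≈ $9K as a rough estimate
rental_hours = budget / rental_usd_per_hour    # 1500 hours
rental_months = rental_hours / (24 * 30)       # ~2.1 months of continuous rental
print(f"~{rental_months:.1f} months of 24/7 rental")
```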