lumost • today at 6:32 AM

The big question for local LLMs is whether there is a 100 tok/s model which requires less than 16 GB of memory and is competitive on most tasks with the cloud models.

There is some signal that this is possible through both hardware innovation and training/data improvements.

Cloud models have their own constraints: I can't have opus4.8 spend 4 hours on a deep research question I had in the shower without spending money. I can't do real-time video game upscaling and graphics work in the cloud, period.

A laptop is about an order of magnitude cheaper than a cloud server thanks to economies of scale, uptime requirements, and other factors.
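
For the "100 tok/s in under 16 GB" question, a rough back-of-envelope sketch may help. It assumes a dense model whose decoding is memory-bandwidth-bound, so every generated token reads all the weights once; mixture-of-experts models, KV cache, and activations are ignored. The function names and the example numbers (8B parameters, 4-bit weights, 400 GB/s) are illustrative assumptions, not measurements.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes).

    Ignores KV cache, activations, and runtime overhead.
    """
    return params_billion * bits_per_weight / 8


def decode_tok_s_upper_bound(bandwidth_gb_s: float, model_gb: float) -> float:
    """Rough ceiling on decode speed if each token reads all weights once."""
    return bandwidth_gb_s / model_gb


# Hypothetical example: an 8B-parameter model quantized to 4 bits per weight
# on a machine with 400 GB/s of memory bandwidth.
size = weights_gb(8, 4)
ceiling = decode_tok_s_upper_bound(400, size)
print(f"{size:.1f} GB of weights, at most ~{ceiling:.0f} tok/s")
```

Under these assumptions the memory budget is the easier half of the question; hitting 100 tok/s mostly comes down to memory bandwidth relative to model size.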

Replies

nok22kon • today at 7:51 AM

if you do the electricity math, you'll see that you pay more for local models while getting less (local is more heavily quantized) than you would with OpenRouter.

I'm not comparing local Gemma/Qwen with cloud Opus, but with the same Gemma/Qwen models served through OpenRouter.

there are reasons to run local (privacy, availability), but cost is not one of them
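
The electricity math mentioned above can be sketched as follows. This is only a template: the wattage, throughput, and electricity price are placeholder assumptions to be replaced with your own measurements, and hardware purchase cost is left out entirely. The function name `local_cost_per_mtok` is hypothetical.

```python
def local_cost_per_mtok(watts: float, tok_per_s: float, usd_per_kwh: float) -> float:
    """Electricity cost in USD to generate one million tokens locally.

    watts:       average power draw of the machine while generating
    tok_per_s:   sustained generation speed
    usd_per_kwh: local electricity price
    """
    seconds = 1_000_000 / tok_per_s
    kwh = watts * seconds / 3_600_000  # watt-seconds -> kilowatt-hours
    return kwh * usd_per_kwh


# Placeholder inputs (assumptions, not measurements):
# 360 W draw, 100 tok/s, 0.10 USD per kWh.
cost = local_cost_per_mtok(watts=360, tok_per_s=100, usd_per_kwh=0.10)
print(f"~${cost:.2f} per million tokens in electricity")
```

Compare the result with the per-million-token price the hosted provider lists for the same model; which side wins depends on the inputs you plug in.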

re-thc • today at 7:57 AM

> The big question for local LLMs is whether there is a 100 tok/s model which requires less than 16 GB of memory and is competitive on most tasks with the cloud models.

Benchmarks maybe? Real world, no.

You just need the context otherwise. There's no way around it.
