Hacker News

topherhunt | last Friday at 8:03 PM | 1 reply

> In 5 years consumer chips and model inference will be so good you won't need a server for SOTA.

Naw man, you crazy. If you tell me that in 5 years consumer chips will be so good that I can run GPT-5.4-level AI on my phone, I'd find that plausible (I buy cheap phones). But if you're telling me that in 5 years we won't need _servers_ because our _phones and/or desktops_ will be powerful enough to run the biggest, newest LLMs in existence, I question your judgment; that prediction shows a deep lack of imagination about how massively compute-hungry SOTA models will get.

The valuable things to do with inference will keep being a server niche, because they'll keep being 1-2 OOM more compute-hungry than whatever consumer hardware can handle. Like gaming: my laptop can run games from 2015 at max settings no problem, but the 2026 games actually worth getting excited about still melt a $2k GPU, because whatever headroom the hardware gains, developers immediately spend on ray tracing, Nanite, and modelling individual skin cells or whatever. I see no plausible reason to expect the ceiling on valuable server-side compute, or on inference capacity, to rise any more slowly than on-device capability is rising.

My assumption is that in 2031, SOTA top-intelligence AI will still be hosted on cloud servers like it is today, offering dirt-cheap access to capabilities we can't even dream of now, while your Android runs some open-source GPT-5+ equivalent.


Replies

0xbadcafebee | last Friday at 10:20 PM

The thing is, SOTA has a plateau. All LLMs work on the same principle: training data goes in, and outputs get reinforced by human feedback. There is only so much input (all recorded human knowledge) and only so many human tweaks, which can produce only so much increased signal-to-noise in the output. The machine can't read your mind, and there is no one truthful answer to most questions, so there will always be a limit on how accurate or correct any response can get. At some point, you just can't make a better response. The agent harness, prompts, etc. become the only way to get better, and that's gonna be open source.

Add to that the algorithmic improvements that are making inference faster, with more context and higher quality. TurboQuant is just one example; more methods come out all the time. So inference is getting more efficient.
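To make the efficiency point concrete: here's a toy sketch of plain symmetric int8 weight quantization, just to show why quantization cuts the bytes you have to move per token. (This is the generic textbook idea, not TurboQuant itself, which is fancier.)

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor quantization: map float32 weights to int8
    # using a single scale, storing them in 1/4 the bytes.
    scale = float(np.abs(w).max()) / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 1024)).astype(np.float32)
q, scale = quantize_int8(w)

# 4x fewer bytes to stream per matmul -> roughly 4x less memory
# bandwidth per generated token on a bandwidth-bound decoder.
print(w.nbytes // q.nbytes)  # -> 4
# Reconstruction error is bounded by half the quantization step.
print(float(np.abs(dequantize(q, scale) - w).max()))
```

Real schemes quantize per-channel or per-block and go down to 4 bits or less, but the bandwidth arithmetic is the same: fewer bytes per weight means more tokens per second on the same memory bus.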

At the same time, hardware can keep getting better more or less indefinitely. Even if you can't shrink it further, you can make it more energy efficient, improve multitasking, add more GPU cores, RAM, or iGPUs, pack in more chips, improve cooling, use new materials... the sky's the limit.

Add all three together and at some point you'll get Opus 4.7 on a phone at 40 t/s. At that point there's no way I'm paying for inference on a server. You can do RAG on-device, and image/video/voice is handled by multi-modal models. I want my agent chats replicated, but that's Google Drive. I want the agent to search the web, but that's Google Search. So eventually we're back to doing what we do today (pre-AI), only with more automation.
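Whether a phone ever hits 40 t/s depends mostly on memory bandwidth, not FLOPs, since decoding streams the whole weight set per token. A back-of-envelope with purely illustrative numbers (a 70B-parameter dense model at 4-bit, and a made-up LPDDR5X-class phone bandwidth):

```python
# All numbers are illustrative assumptions, not measurements.
params = 70e9            # assume a 70B-parameter dense (non-MoE) model
bytes_per_param = 0.5    # 4-bit quantized weights
weight_bytes = params * bytes_per_param   # 35 GB of weights to stream per token
target_tps = 40

# Bandwidth needed to hit the target, if decode is bandwidth-bound:
needed_bandwidth = weight_bytes * target_tps
print(needed_bandwidth / 1e12)  # -> ~1.4 (TB/s, datacenter-GPU territory today)

# What an assumed ~77 GB/s phone memory bus gets you instead:
phone_bandwidth = 77e9
print(phone_bandwidth / weight_bytes)  # -> ~2.2 t/s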

The really advanced shit will come in 10 years, when we finally crack real memory and learning. That will absolutely be locked up in the cloud. But that's not an LLM; it's something else entirely. (Slight caveat: WW3 would delay progress by 10-20 years.)