
SwellJoe, today at 12:16 AM

Yeah, that's probably true, but we're also seeing that there are still tons of inefficiencies in how LLMs are run. It seems like every couple of months there's some new technique to squeeze more performance out of less hardware: KV caching improvements, fast attention, speculative decoding, dynamic quantization, quantization-aware training, etc.

That said, I recently replaced my five-year-old self-built PC (with a top-of-the-line desktop CPU, chipset, memory, and GPU of the time) with a new everything-the-best build, and while it's clear we're not keeping up with Moore's Law anymore, it's still 4-5 times faster for compute-intensive stuff, especially parallelizable tasks. We're still getting faster/cheaper. So, the time scale is maybe ten years rather than five.


Replies

ethbr1, today at 12:33 PM

AI inference will very likely follow the same path as general-purpose computing: variety and innovation in software lead to standardization on the highest-performance approaches.

As that transition happens, hardware evolves from general-purpose (because nobody knows what's needed and hardware design is slow) to fixed-function, high-performance designs (once requirements are better defined).

GPUs (and TPUs) are a weird middle ground here, as they're already fairly specialized, but I wouldn't bet against next-gen, inference-optimized hardware architectures dominating that use case in ~10 years if the pace of AI architecture tweaking slows.

The efficiency/power/cost gains from fixed-function optimization are always too great to pass up, and the only thing that holds that approach back is rapidly mutating requirements.