logoalt Hacker News

hu3today at 2:32 PM3 repliesview on HN

What about real workloads? Because as context gets larger, these local LLMs aproxiate the useless end of the spectrum with regards to t/s.


Replies

zozbot234today at 8:01 PM

The thing about context/KV cache is that you can swap it out efficiently, which you can't with the activations because they're rewritten for every token. It will slow down as context grows (decode is often compute-limited when context is large) but it will run.

Someone1234today at 2:53 PM

I strongly agree. People see local "GPT-4 level" responses, and get excited, which I totally get. But how quickly is the fall-off as the context size grows? Because if it cannot hold and reference a single source-code file in its context, the efficiency will absolutely crater.

That's actually the biggest growth area in LLMs, it is no longer about smart, it is about context windows (usable ones, note spec-sheet hypotheticals). Smart enough is mostly solved, combating larger problems is slowly improving with every major release (but there is no ceiling).

satvikpendemtoday at 2:55 PM

That should be covered by the harness rather than the LLM itself, no? Compaction and summarization should be able to allow the LLM to still run smoothly even on large contexts.

show 1 reply