lelanthran • today at 5:35 AM

> any case, loading a gigantic model just to use system RAM is absurdly slow (due to mem bandwidth), like 1-5 t/s, so it's not practical. It'd take a whole day to process one 86k token request

So don't use it for large requests. Ideal for when you just want to categorise things, for example, "does this task need a shell" or "bucket this email into one of help request, bill due or personal comms".
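
A rough back-of-the-envelope check of the quoted numbers, as a minimal sketch. The 86k-token request and the 1-5 t/s range come from the quote; the 200-token size for a short classification prompt is an assumption made up for illustration.

```python
def hours_to_process(tokens: int, tokens_per_second: float) -> float:
    """Wall-clock hours needed to get through `tokens` at a fixed rate."""
    return tokens / tokens_per_second / 3600


LARGE_REQUEST = 86_000  # token count from the quoted comment
SHORT_REQUEST = 200     # assumed size of a short "bucket this email" prompt

for rate in (1, 5):  # the quoted 1-5 t/s range
    large_hours = hours_to_process(LARGE_REQUEST, rate)
    short_seconds = SHORT_REQUEST / rate
    print(f"{rate} t/s: large request ~{large_hours:.1f} h, "
          f"short request ~{short_seconds:.0f} s")
```

At 1 t/s the large request works out to roughly a day, which matches the quote, while a short categorisation prompt stays in the range of seconds to a few minutes under the same assumptions.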

Replies

zozbot234 • today at 6:41 AM

The best use is actually for a layer that "almost fits" into VRAM, such that automated offloading to system RAM will be rare enough that it doesn't impact performance.
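
One way to see how rare "rare enough" has to be is a simple mixture model, sketched below. It assumes each token is generated either at a fast in-VRAM rate or at a slow offloaded rate, and it ignores transfer overhead and everything else a real runtime does. The 50 t/s figure is hypothetical; the 2 t/s figure is picked from the 1-5 t/s range quoted upthread.

```python
def effective_rate(fast_tps: float, slow_tps: float, offload_fraction: float) -> float:
    """Average tokens/s when `offload_fraction` of tokens run at the slow rate.

    Per-token time is a weighted average of the two per-token times,
    so the combined rate is a weighted harmonic mean of the two rates.
    """
    seconds_per_token = (1 - offload_fraction) / fast_tps + offload_fraction / slow_tps
    return 1 / seconds_per_token


FAST_TPS = 50.0  # hypothetical rate when everything needed is in VRAM
SLOW_TPS = 2.0   # assumed rate when work spills to system RAM

for fraction in (0.0, 0.001, 0.01, 0.1):
    rate = effective_rate(FAST_TPS, SLOW_TPS, fraction)
    print(f"offload fraction {fraction:.3f}: ~{rate:.1f} t/s")
```

Under these assumptions the slow path dominates quickly, so the offloading has to be very rare before the overall rate stays close to the in-VRAM rate.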
