
azeirah, last Monday at 12:15 PM

I'm active in the /r/localllama community and on the llama.cpp GitHub. For this use case you absolutely do not need a big LLM. Even an 8B model will suffice; smaller models perform extremely well when the task is very clear and you provide a few-shot prompt.
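
As a minimal sketch of what I mean by a few-shot prompt against a small local model (assuming a llama.cpp llama-server running locally with its OpenAI-compatible endpoint; the port, model name, and extraction task are just placeholders):

```python
# Few-shot prompt against a local llama.cpp server.
# llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint;
# URL, model name, and the example task are placeholders.
import requests

FEW_SHOT = [
    {"role": "system", "content": "Extract the product name from the ticket. Reply with the name only."},
    {"role": "user", "content": "My Widget Pro 3000 stopped charging yesterday."},
    {"role": "assistant", "content": "Widget Pro 3000"},
    {"role": "user", "content": "The Acme Toaster X burns every slice."},
    {"role": "assistant", "content": "Acme Toaster X"},
    # The real input goes last; the examples above pin down the output format.
    {"role": "user", "content": "I can't pair my SoundBar Mini with my phone."},
]

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"model": "llama-3.1-8b-instruct", "messages": FEW_SHOT, "temperature": 0},
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```

With two or three worked examples in the prompt, even an 8B model rarely deviates from the expected format.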

I've experimented in the past with running an LLM like this on a CPU-only VPS, and it just works.
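
For a CPU-only setup, something like this is all it takes (a sketch using the llama-cpp-python bindings; the model file and thread count are assumptions, not a recommendation):

```python
from llama_cpp import Llama

# Load a small quantized GGUF model on CPU only.
llm = Llama(
    model_path="models/qwen2.5-7b-instruct-q4_k_m.gguf",  # placeholder path
    n_ctx=4096,
    n_threads=8,       # roughly match the VPS core count
    n_gpu_layers=0,    # CPU-only inference
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a one-sentence summary of why local inference is viable on CPUs."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```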

If you host it on a server with a single GPU, you'll likely be able to easily fulfil all generation tasks for all customers. What many people don't know about inference is that it's _heavily_ memory-bandwidth bottlenecked, meaning there is a lot of spare compute left over. In practice, that means even a single GPU can serve many parallel chats at once. Think 10 "threads" of inference, each at around 20 tokens/s.
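
A rough illustration of fanning out parallel chats to one GPU-backed server (assuming llama-server was started with multiple slots for continuous batching, e.g. `-np 10`; URL and prompts are placeholders, and actual throughput depends entirely on your hardware):

```python
# Fire several chat requests concurrently at one llama.cpp server.
# The server batches the decode steps, so the requests share the GPU's
# spare compute instead of running one after another.
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "http://localhost:8080/v1/chat/completions"
PROMPTS = [f"Write a one-line product blurb for item #{i}." for i in range(10)]

def chat(prompt: str) -> str:
    r = requests.post(
        URL,
        json={"messages": [{"role": "user", "content": prompt}], "max_tokens": 64},
        timeout=120,
    )
    return r.json()["choices"][0]["message"]["content"]

with ThreadPoolExecutor(max_workers=10) as pool:
    for reply in pool.map(chat, PROMPTS):
        print(reply)
```

The aggregate throughput across the ten streams ends up far higher than serving the same requests sequentially, precisely because decoding is bandwidth-bound rather than compute-bound.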

Not only that, but there are also LLMs trained only on commons data.