
pseufaux · last Monday at 11:25 AM · 1 reply

Environmental cost is a concern, though not the main one for me. In my case it comes down to two things.

1. AI interactions cost the service money, which is inevitably passed on to the consumer. If it's a feature I do not wish to use, I like to have the option to avoid paying for it. So in this case, avoiding AI use is a purely economic decision.

2. I am concerned about the content LLMs are trained on. Every major AI has (in my opinion) stolen content as training material. I prefer not to support products which I believe are unethically built. In the future, if models can be trained solely on ethically sourced material whose authors have been properly compensated, I would rethink this position.


Replies

azeirah · last Monday at 12:15 PM

I'm active in the /r/localllama community and on the llama.cpp GitHub. For this use-case you absolutely do not need a big LLM. Even an 8B model will suffice; smaller models perform extremely well when the task is very clear and you provide a few-shot prompt.
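
To make "clear task plus few-shot prompt" concrete, here's a minimal Python sketch against llama.cpp's llama-server, which exposes an OpenAI-compatible chat endpoint. The port, the title-suggestion task, and the example pairs are all assumptions for illustration, not anything from the original comment:

    # Few-shot prompt to a small local model served by llama.cpp's llama-server.
    # Assumes the server is already running on localhost:8080 with an ~8B model.
    import json
    import urllib.request

    FEW_SHOT = [
        {"role": "system", "content": "Suggest a short title for the user's note. Reply with the title only."},
        {"role": "user", "content": "bought flour, yeast, and a dutch oven to try sourdough this weekend"},
        {"role": "assistant", "content": "Sourdough baking plans"},
        {"role": "user", "content": "call dentist, renew passport, cancel unused streaming subscription"},
        {"role": "assistant", "content": "Personal admin to-dos"},
    ]

    def suggest_title(note: str, url: str = "http://localhost:8080/v1/chat/completions") -> str:
        payload = {
            "messages": FEW_SHOT + [{"role": "user", "content": note}],
            "temperature": 0.2,  # low temperature: we want consistent, boring output
            "max_tokens": 32,
        }
        req = urllib.request.Request(
            url,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            body = json.load(resp)
        return body["choices"][0]["message"]["content"].strip()

    print(suggest_title("ideas for the garden: raised beds, drip irrigation, maybe a small greenhouse"))

With a task this constrained, the few-shot examples do most of the work, which is why a small model is enough.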

I've experimented in the past with running an LLM like this on a CPU-only VPS, and that actually just works.

If you host it on a server with a single GPU, you'll likely be able to easily fulfil all generation tasks for all customers. What many people don't know about inference is that it's _heavily_ memory-bandwidth bottlenecked, meaning there is a lot of spare compute left over. In practice that means even a single GPU can serve many parallel chats at once, because batching requests together costs very little extra. Think 10 "threads" of inference at 20 tok/s each.
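
As a rough client-side sketch of those parallel chats, the snippet below fires ten concurrent requests at the same assumed local llama-server (which needs to be started with its parallel-slots option, e.g. --parallel, for the requests to actually be batched):

    # Concurrent requests against one local llama.cpp server.
    # URL and prompts are illustrative assumptions.
    import json
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    URL = "http://localhost:8080/v1/chat/completions"  # assumed local server

    def one_chat(i: int) -> str:
        payload = {
            "messages": [{"role": "user", "content": f"Write a one-line summary of note #{i}."}],
            "max_tokens": 64,
        }
        req = urllib.request.Request(
            URL,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["choices"][0]["message"]["content"]

    # 10 client threads map onto the server's batch slots, so total throughput
    # stays high even though every request shares the same single GPU.
    with ThreadPoolExecutor(max_workers=10) as pool:
        for answer in pool.map(one_chat, range(10)):
            print(answer)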

Not only that, but there are also LLMs trained only on commons data.