
pseufaux · last Monday at 11:25 AM · 1 reply

Environmental cost is a concern, though not the main one for me. In my case it comes down to two things.

1. AI interactions cost the service money, which is inevitably passed on to the consumer. If it's a feature I do not wish to use, I like to have the option to avoid paying for it. So in this case, avoiding AI use is a purely economic decision.

2. I am concerned about the content LLMs are trained on. Every major AI has (in my opinion) stolen content as training material. I prefer not to support products which I believe are unethically built. In the future, if models can be trained solely on ethically sourced material whose authors have been properly compensated, I would rethink this position.


Replies

azeirah · last Monday at 12:15 PM

I'm active in the /r/localllama community and on the llama.cpp GitHub. For this use-case you absolutely do not need a big LLM. Even an 8B model will suffice; smaller models perform extremely well when the task is very clear and you provide a few-shot prompt.
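
To make "clear task plus few-shot prompt" concrete, here's a minimal Python sketch against llama.cpp's llama-server, which exposes an OpenAI-compatible chat endpoint. The port, the title-suggestion task, and the example pairs are all assumptions for illustration, not anything from the original comment:

    # Few-shot prompt to a small local model served by llama.cpp's llama-server.
    # Assumes the server is already running on localhost:8080 with an ~8B model.
    import json
    import urllib.request

    FEW_SHOT = [
        {"role": "system", "content": "Suggest a short title for the user's note. Reply with the title only."},
        {"role": "user", "content": "bought flour, yeast, and a dutch oven to try sourdough this weekend"},
        {"role": "assistant", "content": "Sourdough baking plans"},
        {"role": "user", "content": "call dentist, renew passport, cancel unused streaming subscription"},
        {"role": "assistant", "content": "Personal admin to-dos"},
    ]

    def suggest_title(note: str, url: str = "http://localhost:8080/v1/chat/completions") -> str:
        payload = {
            "messages": FEW_SHOT + [{"role": "user", "content": note}],
            "temperature": 0.2,  # low temperature: we want consistent, boring output
            "max_tokens": 32,
        }
        req = urllib.request.Request(
            url,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            body = json.load(resp)
        return body["choices"][0]["message"]["content"].strip()

    print(suggest_title("ideas for the garden: raised beds, drip irrigation, maybe a small greenhouse"))

With a task this constrained, the few-shot examples do most of the work, which is why a small model is enough.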

I've experimented in the past with running an LLM like this on a CPU-only VPS, and that actually just works.

If you host it on a server with a single GPU, you'll likely be able to easily fulfil all generation tasks for all customers. What many people don't know about inference is that it's _heavily_ memory-bandwidth bottlenecked, meaning there is a lot of spare compute left over. In practice that means even a single GPU can serve many parallel chats at once, because batching requests together costs very little extra. Think 10 "threads" of inference at 20 tok/s each.
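
As a rough client-side sketch of those parallel chats, the snippet below fires ten concurrent requests at the same assumed local llama-server (which needs to be started with its parallel-slots option, e.g. --parallel, for the requests to actually be batched):

    # Concurrent requests against one local llama.cpp server.
    # URL and prompts are illustrative assumptions.
    import json
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    URL = "http://localhost:8080/v1/chat/completions"  # assumed local server

    def one_chat(i: int) -> str:
        payload = {
            "messages": [{"role": "user", "content": f"Write a one-line summary of note #{i}."}],
            "max_tokens": 64,
        }
        req = urllib.request.Request(
            URL,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["choices"][0]["message"]["content"]

    # 10 client threads map onto the server's batch slots, so total throughput
    # stays high even though every request shares the same single GPU.
    with ThreadPoolExecutor(max_workers=10) as pool:
        for answer in pool.map(one_chat, range(10)):
            print(answer)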

Not only that, but there are also LLMs trained only on commons data.