
yjftsjthsd-h 10/11/2024

If your goal is

> I want to peel back the layers of the onion and other gluey-mess to gain insight into these models.

Then this is great.

If your goal is

> Run and explore Llama models locally with minimal dependencies on CPU

then I recommend https://github.com/Mozilla-Ocho/llamafile which ships as a single file with no dependencies and runs on CPU with great performance. Like, such great performance that I've mostly given up on GPU for LLMs. It was a game changer.
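For anyone who hasn't tried it: the whole workflow is basically download, mark executable, run. A rough sketch (the filename is just an example; use whichever .llamafile you grab from Hugging Face):

    # example filename -- substitute the .llamafile you actually downloaded
    chmod +x Llama-3.2-3B-Instruct.Q6_K.llamafile
    ./Llama-3.2-3B-Instruct.Q6_K.llamafile   # runs on CPU and drops you into the bundled chat interface

(On Windows you rename the file so it ends in .exe instead of running chmod.)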


Replies

hedgehog 10/11/2024

Ollama (also wrapping llama.cpp) has GPU support; unless you're really in love with the idea of bundling weights into the inference executable, it's probably a better choice for most people.
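For comparison, the usual Ollama flow is roughly this (the model tag here is just an example; check the Ollama library for current names):

    # sketch of the typical Ollama workflow -- model tag is an example
    ollama pull llama3.1:8b   # weights are downloaded and stored separately from the binary
    ollama run llama3.1:8b    # interactive chat; uses the GPU automatically when one is available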

jart 10/11/2024

A great place to start is with the LLaMA 3.2 q6 llamafile I posted a few days ago. https://huggingface.co/Mozilla/Llama-3.2-3B-Instruct-llamafi... We have a new CLI chatbot interface that's really fun to use. Syntax highlighting and all. You can also use GPU by passing the -ngl 999 flag.
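For example, once it's downloaded and marked executable (the filename below is a placeholder for the one on that Hugging Face page):

    ./Llama-3.2-3B-Instruct.Q6_K.llamafile -ngl 999   # -ngl 999 offloads as many layers as possible to the GPU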

seu 10/12/2024

> then I recommend https://github.com/Mozilla-Ocho/llamafile which ships as a single file with no dependencies and runs on CPU with great performance. Like, such great performance that I've mostly given up on GPU for LLMs. It was a game changer.

First time I've had an "it just works" experience with LLMs on my computer. Amazing. Thanks for the recommendation!

rmbyrro 10/11/2024

Do you have a ballpark idea of how much RAM would be needed to run Llama 3.1 8B and 70B at 8-bit quantization?

anordin95 10/12/2024

Thanks for the suggestion; I've added a link to llamafile in the repo's README. That said, my focus was on exploring the model itself.

yumraj 10/11/2024

Can it use the GPU if available, say on Apple Silicon Macs?

bagels 10/11/2024

How great is the performance? Tokens/s?

AlfredBarnes 10/11/2024

Thanks for posting this!
