Fast enough for RPI5 ARM?
PyTorch has a native LLM solution, torchchat: https://github.com/pytorch/torchchat. It supports all the Llama models and runs on CPU, MPS, and CUDA. I'm getting 4.5 tokens/second with Llama 3.1 8B at full precision, CPU-only, on my M1.
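For anyone benchmarking their own setup, tokens/second is just generated tokens over wall-clock time. A hypothetical harness (the generator here is a toy stand-in, not torchchat's actual API):

```python
import time

# Hypothetical harness: count generated tokens over wall-clock time.
# `generate` is a stand-in for whatever streaming API your runner
# exposes, not torchchat's real interface.
def tokens_per_second(generate, prompt):
    start = time.perf_counter()
    n_tokens = sum(1 for _ in generate(prompt))
    return n_tokens / (time.perf_counter() - start)

def generate(prompt):
    # Toy generator so the sketch runs as-is; swap in a real model.
    for token in prompt.split():
        time.sleep(0.01)  # simulate per-token decode latency
        yield token

print(f"{tokens_per_second(generate, 'the quick brown fox jumps'):.1f} tok/s")
```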
Does anyone know what the easiest way to fine-tune a model locally is today?
> from llama_models.llama3.reference_impl.model import Transformer
FYI, this just imports the Llama reference implementation and patches the device.
There are more robust implementations out there.
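For context, the device patch boils down to something like this; a minimal sketch of the general PyTorch mechanism, not the project's exact code:

```python
import torch

# Tensors and parameters created after this call land on the chosen
# device, so reference code written with CUDA in mind can run on CPU
# (or "mps") without editing the model itself.
torch.set_default_device("cpu")

layer = torch.nn.Linear(4096, 4096)
print(layer.weight.device)  # cpu
```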
With the same mindset, but without even PyTorch as a dependency, there's a straightforward CPU implementation of Llama/Gemma in Rust: https://github.com/samuel-vitorino/lm.rs/
It's impressive to realize how little code is needed to run these models at all.
Peeling back the layers of the onion and the other gluey mess really does give insight into these models.
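To make that concrete, here's roughly the core of one decoder sub-block in plain PyTorch; a sketch only (single-head, no rotary embeddings or KV cache), not lm.rs's actual Rust:

```python
import math
import torch

def rmsnorm(x, eps=1e-5):
    # Llama-style normalization: scale each token vector by its RMS.
    return x / torch.sqrt((x * x).mean(-1, keepdim=True) + eps)

def attention(x, wq, wk, wv, wo):
    # Single-head causal self-attention over a sequence of embeddings.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = (q @ k.T) / math.sqrt(q.shape[-1])
    mask = torch.triu(torch.full_like(scores, float("-inf")), diagonal=1)
    return torch.softmax(scores + mask, dim=-1) @ v @ wo

d = 64
x = torch.randn(8, d)  # 8 token embeddings of width d
wq, wk, wv, wo = (torch.randn(d, d) / math.sqrt(d) for _ in range(4))
h = x + attention(rmsnorm(x), wq, wk, wv, wo)  # one residual sub-block
print(h.shape)  # torch.Size([8, 64])
```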
If your goal is
> I want to peel back the layers of the onion and other gluey-mess to gain insight into these models.
Then this is great.
If your goal is
> Run and explore Llama models locally with minimal dependencies on CPU
then I recommend https://github.com/Mozilla-Ocho/llamafile, which ships as a single file with no dependencies and runs on CPU with great performance. Like, such great performance that I've mostly given up on GPU for LLMs. It was a game changer.
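If you want to script against it, llamafile also serves an OpenAI-compatible HTTP API. A sketch assuming a llamafile is running in server mode on its default port 8080:

```python
import requests

# Assumes a llamafile is already running locally in server mode, e.g.
# `./Meta-Llama-3-8B-Instruct.llamafile --server`; the endpoint and
# port below are its defaults, but check your version's --help.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "local",  # most OpenAI-compatible servers accept any name
        "messages": [{"role": "user", "content": "Say hello in five words."}],
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```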