Fast enough for RPI5 ARM?
PyTorch has a native LLM solution, torchchat: https://github.com/pytorch/torchchat. It supports all the Llama models and runs on CPU, MPS, and CUDA. I'm getting 4.5 tokens/second with Llama 3.1 8B at full precision, CPU-only, on my M1.
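For anyone benchmarking their own setup, tokens/second is just generated tokens over wall-clock time. A hypothetical harness (the generator here is a toy stand-in, not torchchat's actual API):

```python
import time

# Hypothetical harness: count generated tokens over wall-clock time.
# `generate` is a stand-in for whatever streaming API your runner
# exposes, not torchchat's real interface.
def tokens_per_second(generate, prompt):
    start = time.perf_counter()
    n_tokens = sum(1 for _ in generate(prompt))
    return n_tokens / (time.perf_counter() - start)

def generate(prompt):
    # Toy generator so the sketch runs as-is; swap in a real model.
    for token in prompt.split():
        time.sleep(0.01)  # simulate per-token decode latency
        yield token

print(f"{tokens_per_second(generate, 'the quick brown fox jumps'):.1f} tok/s")
```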
Does anyone know what the easiest way to fine-tune a model locally is today?
> from llama_models.llama3.reference_impl.model import Transformer
FYI, this just imports the Llama reference implementation and patches the device.
There are more robust implementations out there.
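For context, the device patch boils down to something like this; a minimal sketch of the general PyTorch mechanism, not the project's exact code:

```python
import torch

# Tensors and parameters created after this call land on the chosen
# device, so reference code written with CUDA in mind can run on CPU
# (or "mps") without editing the model itself.
torch.set_default_device("cpu")

layer = torch.nn.Linear(4096, 4096)
print(layer.weight.device)  # cpu
```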
With the same mindset, but without even PyTorch as a dependency, there's a straightforward CPU implementation of Llama/Gemma in Rust: https://github.com/samuel-vitorino/lm.rs/
It's impressive to realize how little code is needed to run these models at all.
Peeling back the layers of the onion and the other gluey mess really does give insight into these models.
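To make that concrete, here's roughly the core of one decoder sub-block in plain PyTorch; a sketch only (single-head, no rotary embeddings or KV cache), not lm.rs's actual Rust:

```python
import math
import torch

def rmsnorm(x, eps=1e-5):
    # Llama-style normalization: scale each token vector by its RMS.
    return x / torch.sqrt((x * x).mean(-1, keepdim=True) + eps)

def attention(x, wq, wk, wv, wo):
    # Single-head causal self-attention over a sequence of embeddings.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = (q @ k.T) / math.sqrt(q.shape[-1])
    mask = torch.triu(torch.full_like(scores, float("-inf")), diagonal=1)
    return torch.softmax(scores + mask, dim=-1) @ v @ wo

d = 64
x = torch.randn(8, d)  # 8 token embeddings of width d
wq, wk, wv, wo = (torch.randn(d, d) / math.sqrt(d) for _ in range(4))
h = x + attention(rmsnorm(x), wq, wk, wv, wo)  # one residual sub-block
print(h.shape)  # torch.Size([8, 64])
```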
If your goal is
> I want to peel back the layers of the onion and other gluey-mess to gain insight into these models.
Then this is great.
If your goal is
> Run and explore Llama models locally with minimal dependencies on CPU
then I recommend https://github.com/Mozilla-Ocho/llamafile, which ships as a single file with no dependencies and runs on CPU with great performance. Like, such great performance that I've mostly given up on GPU for LLMs. It was a game changer.
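If you want to script against it, llamafile also serves an OpenAI-compatible HTTP API. A sketch assuming a llamafile is running in server mode on its default port 8080:

```python
import requests

# Assumes a llamafile is already running locally in server mode, e.g.
# `./Meta-Llama-3-8B-Instruct.llamafile --server`; the endpoint and
# port below are its defaults, but check your version's --help.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "local",  # most OpenAI-compatible servers accept any name
        "messages": [{"role": "user", "content": "Say hello in five words."}],
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```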