PyTorch has a native LLM solution: https://github.com/pytorch/torchchat. It supports all the Llama models and runs on CPU, MPS, and CUDA. I'm getting 4.5 tokens per second with Llama 3.1 8B at full precision, CPU-only, on my M1.
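For reference, getting started looks roughly like this (commands taken from the repo README at the time of writing; check the repo for current flags, and note the download step needs a Hugging Face token with Llama access):

```
# grab the repo and its dependencies
git clone https://github.com/pytorch/torchchat.git
cd torchchat
pip install -r requirements.txt

# download model weights
python3 torchchat.py download llama3.1

# interactive chat, or one-shot generation
python3 torchchat.py chat llama3.1
python3 torchchat.py generate llama3.1 --prompt "Write a haiku about autumn"
```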
> I was a bit surprised Meta didn't publish an example way to simply invoke one of these LLM's with only torch (or some minimal set of dependencies)
Seems like torchchat is exactly what the author was looking for.
> And the 8B model typically gets killed by the OS for using too much memory.
Torchchat also provides some quantization options so you can reduce the model size to fit into memory.
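A sketch of what that looks like (the `--quantize` JSON schema here follows the examples in the repo README; verify the exact keys against the current docs):

```
# 4-bit weight quantization cuts the 8B model's memory
# footprint to roughly a quarter of full precision
python3 torchchat.py generate llama3.1 \
  --quantize '{"linear:int4": {"groupsize": 256}}' \
  --device cpu \
  --prompt "Hello"
```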