Hacker News

Ship_Star_1010 · 10/11/2024

PyTorch has a native LLM solution, torchchat. It supports all the Llama models, and it runs on CPU, MPS, and CUDA: https://github.com/pytorch/torchchat. I'm getting 4.5 tokens a second with Llama 3.1 8B at full precision, CPU only, on my M1.
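
As a rough sketch of how torch code typically picks among those backends (this is generic PyTorch, not torchchat's actual selection logic):

    import torch

    # Prefer CUDA, then Apple's MPS, then fall back to CPU.
    if torch.cuda.is_available():
        device = torch.device("cuda")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
    else:
        device = torch.device("cpu")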


Replies

ajaksalad · 10/11/2024

> I was a bit surprised Meta didn't publish an example way to simply invoke one of these LLM's with only torch (or some minimal set of dependencies)

Seems like torchchat is exactly what the author was looking for.
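
For anyone wondering what "invoke with only torch" could look like, here is a minimal greedy-decoding sketch; `model` (token ids in, logits out) and `tokenizer` (encode/decode plus an `eos_id`) are hypothetical stand-ins, not torchchat's API:

    import torch

    @torch.inference_mode()
    def greedy_generate(model, tokenizer, prompt, max_new_tokens=64, device="cpu"):
        # model: (1, seq) token ids -> (1, seq, vocab) logits; tokenizer is assumed.
        ids = torch.tensor([tokenizer.encode(prompt)], device=device)
        for _ in range(max_new_tokens):
            logits = model(ids)                     # (1, seq, vocab)
            next_id = logits[0, -1].argmax()        # greedy: most likely next token
            ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
            if next_id.item() == tokenizer.eos_id:  # stop at end-of-sequence
                break
        return tokenizer.decode(ids[0].tolist())

A real implementation would also cache key/value states instead of re-running the full sequence each step, which is one of the things projects like torchchat handle for you.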

> And the 8B model typically gets killed by the OS for using too much memory.

Torchchat also provides some quantization options so you can reduce the model size to fit into memory.
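
To see why that matters for the 8B model getting killed, the back-of-the-envelope weight memory (weights only; activations and the KV cache add more) looks like this:

    # Approximate weight footprint of an 8B-parameter model at various precisions.
    params = 8e9
    for name, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
        print(f"{name}: {params * bytes_per_param / 2**30:.1f} GiB")
    # fp32: ~29.8 GiB, fp16/bf16: ~14.9 GiB, int8: ~7.5 GiB, int4: ~3.7 GiB

So a 4-bit quantized 8B model needs roughly a quarter of the memory of the bf16 weights, which can be the difference between fitting on a 16 GB machine and getting killed by the OS.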