bodegajed today at 8:44 AM

1.5B-parameter models can run CPU-only inference at around 12 tokens per second, if I remember correctly.


Replies

moffkalast today at 8:47 AM

Ingesting multiple code files will take forever in prompt processing without a GPU though; tg (token generation) will be the least of your worries. Especially when you don't append to the prompt but change it in random places, so the cached prompt prefix can't be reused.
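
To make the caching point concrete, here is a minimal sketch. It assumes a simple prefix-only cache, where previously processed state is reusable only up to the first token that differs from the last prompt. The function names and the toy integer "tokens" are made up for illustration; real runtimes may behave differently.

```python
def reusable_prefix_len(cached, new):
    """Count the leading tokens shared by the cached prompt and the new one."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


def tokens_to_reprocess(cached, new):
    """Tokens of the new prompt that fall outside the reusable prefix."""
    return len(new) - reusable_prefix_len(cached, new)


# Toy "tokens": integers stand in for real token IDs (assumption).
cached = list(range(1000))

appended = cached + [2000, 2001, 2002]       # new text added at the end
edited = cached[:100] + [-1] + cached[101:]  # one token changed near the start

print(tokens_to_reprocess(cached, appended))  # 3
print(tokens_to_reprocess(cached, edited))    # 900
```

Under that assumption, appending three tokens costs three tokens of prompt processing, while changing a single token at position 100 forces the remaining 900 to be processed again.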