Hacker News

bigyabai, yesterday at 3:32 PM

Seconding this. You can get A3B/A4B models to run at 10+ tok/sec on a modern 6 GB or 8 GB GPU with 32k context if you optimize things well. The cheapest way to run this model at larger contexts is probably a 12 GB RTX 3060.
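
For a rough sense of why context length eats into VRAM, here's a back-of-the-envelope sketch of KV cache size for a standard attention stack. The layer count, KV head count, and head dimension below are made-up placeholders, not the config of any particular model, so plug in the real numbers from whatever model you're running. It also assumes an unquantized 16-bit cache and ignores weights and activations entirely.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Estimate KV cache size in bytes for a plain (grouped-query) attention stack."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# Hypothetical config for illustration only (not taken from a real model card).
size = kv_cache_bytes(n_layers=48, n_kv_heads=4, head_dim=128, context_len=32 * 1024)
print(f"{size / 2**30:.2f} GiB")  # prints "3.00 GiB" for these placeholder numbers
```

The cache grows linearly with context, so doubling the context under the same assumptions doubles this figure, which is roughly where the extra VRAM on a bigger card starts to matter.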