Hacker News

bjackman · yesterday at 8:14 AM

Does a cache help with inference workloads anyway?

I don't know much about it, but my mental model is that transformers need random access to billions of parameters.


Replies

fc417fc802 · yesterday at 11:02 AM

It's streaming access, and no, not as far as I'm aware. APUs have always been hilariously bottlenecked on memory bandwidth as soon as a task actually needs to pull in data. The only exception I know of is the PS5, because it uses GDDR instead of desktop memory.
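A rough sketch of why this makes decode bandwidth-bound: during autoregressive inference, every weight is streamed from memory once per generated token, so throughput is roughly memory bandwidth divided by model size. The function below is a back-of-the-envelope estimate, not a benchmark; the bandwidth figures are illustrative assumptions (dual-channel DDR5 around 90 GB/s, PS5-class GDDR6 around 448 GB/s).

```python
# Back-of-the-envelope decode throughput: each generated token requires
# streaming all model weights through memory once, so tokens/sec is
# approximately bandwidth / model size. Illustrative sketch, not a benchmark.

def decode_tokens_per_sec(params_billions: float, bytes_per_param: int,
                          bandwidth_gb_s: float) -> float:
    """Upper-bound tokens/sec if decode is purely memory-bandwidth-bound."""
    model_bytes = params_billions * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / model_bytes

# 7B-parameter model with fp16 weights (2 bytes per parameter)
ddr5_apu = decode_tokens_per_sec(7, 2, 90)    # assumed dual-channel DDR5
gddr6 = decode_tokens_per_sec(7, 2, 448)      # assumed PS5-class GDDR6

print(f"DDR5 APU: ~{ddr5_apu:.1f} tok/s")
print(f"GDDR6:    ~{gddr6:.1f} tok/s")
```

Since each weight is touched exactly once per token, a cache smaller than the model buys essentially nothing: everything is evicted before it is reused.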