
nullc, 10/13/2024 (0 replies)

They're memory-bandwidth limited: you can basically estimate the performance from the time it takes to read the entire model from RAM for each token.
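
As a rough sketch of that estimate: if every token needs one full pass over the weights, tokens per second is about memory bandwidth divided by model size. The function name and the numbers below are made up for illustration (a 4 GB model and 50 GB/s of bandwidth are assumptions, not measurements), and the result is an upper bound that ignores compute, caches, and any other overhead.

```python
def estimate_tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-bound estimate: one full read of the model per token."""
    if model_bytes <= 0 or bandwidth_bytes_per_s <= 0:
        raise ValueError("model size and bandwidth must be positive")
    # Time to stream the whole model from RAM once.
    seconds_per_token = model_bytes / bandwidth_bytes_per_s
    return 1.0 / seconds_per_token


# Hypothetical inputs, chosen only to show the arithmetic.
model_bytes = 4e9   # assumed: 4 GB of weights
bandwidth = 50e9    # assumed: 50 GB/s memory bandwidth
print(f"{estimate_tokens_per_second(model_bytes, bandwidth):.1f} tokens/s")
```

Halving the model size (say, by quantizing more aggressively) or doubling the bandwidth doubles the estimate, which is the whole point of the rule of thumb.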