nullc • 10/13/2024

They're memory-bandwidth limited: you can basically estimate the performance from the time it takes to read the entire model from RAM for each token.
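
A rough sketch of that back-of-envelope estimate, in case it helps. The function name and the example figures (a 4 GB model, 50 GB/s of memory bandwidth) are made up for illustration, not measurements. It assumes every weight is read exactly once per token and ignores compute time, caches, and batching, so treat the result as an upper bound rather than a prediction.

```python
def estimate_tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on generation speed when each token requires
    streaming the whole model from RAM once.

    Time per token is model_bytes / bandwidth, so tokens per second
    is the inverse: bandwidth / model_bytes.
    """
    if model_bytes <= 0 or bandwidth_bytes_per_s <= 0:
        raise ValueError("model size and bandwidth must be positive")
    return bandwidth_bytes_per_s / model_bytes


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show the arithmetic.
    model_bytes = 4e9   # assumed 4 GB of weights
    bandwidth = 50e9    # assumed 50 GB/s of memory bandwidth
    rate = estimate_tokens_per_second(model_bytes, bandwidth)
    print(f"~{rate:.1f} tokens/s at best")
```

Halving the model size (say, through heavier quantization) or doubling the bandwidth doubles the estimate, which is the whole point of the rule of thumb.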
