I wonder how LLM inference performs at higher core counts?
With recent DDR generations and many core CPUs, perhaps CPUs will give GPUs a run for their money.
They're memory-bandwidth limited; you can basically estimate the performance from the time it takes to read the entire model from RAM for each token.
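A quick back-of-the-envelope sketch of that estimate: if every generated token requires streaming all the weights once, tokens/sec is just bandwidth divided by model size. The bandwidth and model-size numbers below are illustrative assumptions, not measurements.

```python
def tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound for a bandwidth-bound decoder: one full weight read per token."""
    return bandwidth_bytes_per_s / model_bytes

# e.g. a 70B-parameter model at 8 bits (~70 GB of weights)
# vs. an assumed ~600 GB/s for a dual-socket DDR5 server
# and the H100's ~3.35 TB/s HBM3 spec figure
cpu = tokens_per_second(70e9, 600e9)   # ~8.6 tok/s
gpu = tokens_per_second(70e9, 3.35e12) # ~47.9 tok/s

print(f"CPU: ~{cpu:.1f} tok/s, GPU: ~{gpu:.1f} tok/s")
```

Batching changes the picture (the weights are read once per batch, not per token), but for single-stream decoding this ceiling is usually close to what you see in practice.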
The H100 has roughly 16,000 CUDA cores at 1.2 GHz; my rough calculation is that it can handle about 230k concurrent calculations. A 192-core AVX-512 chip (assuming it operates on 16-bit data) can handle about 6k concurrent calculations, at roughly 4x the frequency. That's about a 10x difference on compute alone, not to mention that memory bandwidth is an even stronger advantage for GPUs.
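Checking that arithmetic: the 230k figure and the 4x clock ratio are the estimates from above; the CPU lane count falls out of the vector width.

```python
gpu_concurrent = 230_000          # H100 estimate from the comment above
cpu_lanes = 192 * (512 // 16)     # 192 cores x 32 fp16 lanes per AVX-512 vector = 6144
clock_ratio = 4                   # CPU assumed to clock ~4x higher

print(cpu_lanes)                                   # 6144, i.e. ~6k
print(gpu_concurrent / (cpu_lanes * clock_ratio))  # ~9.4, i.e. roughly 10x
```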