AVX-512, originally called the "Larrabee New Instructions" has been the only decent vector...

adrian_b • today at 12:24 PM • 1 reply • view on HN

AVX-512, originally called the "Larrabee New Instructions" has been the only decent vector extension of the Intel-AMD ISA, which has been coherently planned instead of being a heap of more or less randomly chosen instructions, each being thought to be useful to accelerate some particular benchmark or a certain workload of one of the big customers.

MMX (Pentium MMX, 1997) sucked badly (because in designing it ease of implementation was prioritized over usefulness), SSE (Pentium III, 1999) was much worse than the simultaneously launched Motorola AltiVec, and AVX (Sandy Bridge, 2011) was much worse than the simultaneously developed Larrabee New Instructions (despite the fact that Sandy Bridge was developed by the A-team, while Larrabee was developed by the C- or D-team, which however had hired competent consultants from outside Intel, experienced in programming games and graphic applications).

AVX-512 is for now better than any competitive vector ISA, both in the achievable energy efficiency and in the achievable performance. Obviously, it is possible that some future Aarch64 (Arm) or even RISC-V CPUs will change this, by implementing wider registers and execution units and by adding any missing operations.

The SME ISA extension (Scalable Matrix Extension), which is available in the latest Apple CPUs and in the current 2026 generation of Arm C1 CPUs, has the potential to be more efficient than AVX-512, exploiting the fact that the current Intel AMX ISA is intended only for ML/AI and not also for general-purpose computing. Nonetheless this may happen only in a rather distant future, because neither Apple nor Qualcomm nor Arm seem interested to make products suitable for the needs of technical and scientific computing, like Intel and AMD. Because of that, in the existing CPUs with SME the ratio between SME execution units and the general-purpose CPU cores is low, resulting in a low total throughput.

Replies

fweimer • today at 1:14 PM

SME (like AMX) are easier in this regard because there is a clear expectation that they are used in dedicated code blocks only, so run-time dispatch becomes feasible. In contrast, with auto-vectorization, general-purposes vector ISAs such as AVX-512 and SVE tend to get used all over the place.

alt Hacker News

Replies