
Demystifying ARM SME to Optimize General Matrix Multiplications

65 points | by matt_d, yesterday at 8:05 PM | 14 comments

Comments

bee_rider | yesterday at 8:52 PM

I don’t get why they didn’t compare against BLIS. I know you can only do so many benchmarks, and people will often complain no matter what, but BLIS is the obvious comparison. Maybe BLIS doesn’t have kernels for their platform, but they’d be well served by just mentioning that fact to get that question out of the reader’s head.

BLIS even has mixed-precision interfaces, though it might not cover more exotic stuff like low-precision ints. So this paper had a chance to “put some points on the board” against a real top-tier competitor.
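
For reference, a minimal sketch of what a mixed-datatype GEMM looks like through the BLIS object API. The matrix sizes and the float-A / double-B/C pairing are illustrative choices, and it assumes a BLIS build with mixed-datatype support enabled:

    /* Hedged sketch: mixed-datatype GEMM via the BLIS object API.
       Assumes a BLIS build with mixed-datatype support enabled. */
    #include "blis.h"

    int main(void) {
        dim_t m = 256, n = 256, k = 256;
        obj_t a, b, c, alpha, beta;

        /* Operand datatypes may differ in the object API:
           float A, double B and C (illustrative choice). */
        bli_obj_create(BLIS_FLOAT,  m, k, 0, 0, &a);   /* 0,0 = default strides */
        bli_obj_create(BLIS_DOUBLE, k, n, 0, 0, &b);
        bli_obj_create(BLIS_DOUBLE, m, n, 0, 0, &c);
        bli_obj_create_1x1(BLIS_DOUBLE, &alpha);
        bli_obj_create_1x1(BLIS_DOUBLE, &beta);

        bli_randm(&a);
        bli_randm(&b);
        bli_setsc(1.0, 0.0, &alpha);   /* alpha = 1 */
        bli_setsc(0.0, 0.0, &beta);    /* beta  = 0 */

        /* C := beta*C + alpha*A*B, with BLIS typecasting operands internally. */
        bli_gemm(&alpha, &a, &b, &beta, &c);

        bli_obj_free(&a);  bli_obj_free(&b);  bli_obj_free(&c);
        bli_obj_free(&alpha);  bli_obj_free(&beta);
        return 0;
    }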

anematode | yesterday at 9:49 PM

ARM SME as implemented on the Apple M4 is quite interesting. It's super useful for matrix math (as this paper illustrates well), but my attempts at using streaming SVE (SSVE) for vector math were an utter failure performance-wise, despite the increased vector width (512 bits vs. 128 bits for NEON). Possibly the switch into/out of streaming mode is too expensive, but my microbenchmarks indicated that the SSVE instructions themselves just didn't have great throughput.
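
For context, a minimal streaming-SVE sketch using ACLE intrinsics (the header, attribute, and -march spelling follow the ACLE SME spec as implemented in recent clang; the function name and loop are illustrative). The __arm_locally_streaming attribute makes the compiler wrap the body in an smstart/smstop pair, so every call pays the streaming-mode switch suspected above of being expensive:

    /* Hedged sketch: a streaming-SVE (SSVE) vector add. Assumes a recent
       clang targeting SME, e.g. -march=armv9-a+sme (adjust for the target). */
    #include <arm_sme.h>

    /* The compiler emits smstart/smstop around the body, so each call
       switches into and out of streaming mode. */
    __arm_locally_streaming
    void vadd_ssve(float *dst, const float *a, const float *b, int n) {
        /* svcntw() = streaming vector length in 32-bit lanes:
           16 floats (512 bits) in streaming mode on M4 vs. 4 for NEON. */
        for (int i = 0; i < n; i += (int)svcntw()) {
            svbool_t pg = svwhilelt_b32_s32(i, n);   /* predicate for the tail */
            svfloat32_t va = svld1_f32(pg, a + i);
            svfloat32_t vb = svld1_f32(pg, b + i);
            svst1_f32(pg, dst + i, svadd_f32_x(pg, va, vb));
        }
    }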

Archit3ch | yesterday at 10:48 PM

Is there a version of this that supports sparse LU solves?

starkeeper | yesterday at 11:32 PM

This will save us from the nvidia monster! And then we can have our DRAM back!!!