Multiplatform Matrix Multiplication Kernels

58 points • by homarp • yesterday at 7:59 PM • 23 comments

Comments

burnt-resistor • today at 4:13 AM

GPUs came about because of the need for faster floating-point math on 4x4 and 3x3 matrices and on 3- and 4-component vectors (multiply, multiply-accumulate, and such), and for faster pushing of pixels with things like texture mapping. All hail OpenGL and dual Voodoo2 SLI. ;)

Lerc • yesterday at 10:58 PM

Has there been much research into slightly flawed matrix multiplications?

If you have a measure of correctness and a measure of performance, is there a maximum of correctness per unit of processing that sits below a full matrix multiply?

Obviously it can be done with precision, since that is what floating point is. But is there anything where you can save x% of computation and have fewer than x% incorrect values in a matrix multiplication?

Gradient descent wouldn't really care about a few (reliably) dud values.
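
For illustration only, and not something proposed in this thread: one known way to trade accuracy for compute is randomized (Monte Carlo) matrix multiplication, which sums only a sample of the column-by-row outer products that make up the full product. Note that it spreads a small error over every entry rather than producing a few dud values, so it is a neighbouring idea, not an exact answer to the question. The function names, the pure-Python list-of-rows layout, and the norm-proportional sampling weights below are assumptions chosen for this sketch.

```python
import random


def exact_matmul(a, b):
    """Reference product of an n x k matrix a and a k x m matrix b (lists of rows)."""
    k, m = len(b), len(b[0])
    return [[sum(row[t] * b[t][j] for t in range(k)) for j in range(m)] for row in a]


def sampled_matmul(a, b, samples, rng):
    """Estimate a @ b from `samples` randomly drawn inner indices.

    Inner index t contributes the outer product of column t of a and row t
    of b. It is drawn with probability probs[t], proportional to
    norm(column t of a) * norm(row t of b), and its contribution is scaled
    by 1 / (samples * probs[t]) so that the estimate is unbiased.
    """
    n, k, m = len(a), len(b), len(b[0])
    col_norm = [sum(a[i][t] ** 2 for i in range(n)) ** 0.5 for t in range(k)]
    row_norm = [sum(x * x for x in b[t]) ** 0.5 for t in range(k)]
    weights = [col_norm[t] * row_norm[t] for t in range(k)]
    total = sum(weights)
    out = [[0.0] * m for _ in range(n)]
    if total == 0.0:
        return out  # every outer product is zero, so the product is zero
    probs = [w / total for w in weights]
    for t in rng.choices(range(k), weights=probs, k=samples):
        scale = 1.0 / (samples * probs[t])
        for i in range(n):
            scaled = a[i][t] * scale
            for j in range(m):
                out[i][j] += scaled * b[t][j]
    return out


def frobenius_distance(x, y):
    """Square root of the sum of squared entrywise differences."""
    return sum((p - q) ** 2 for r, s in zip(x, y) for p, q in zip(r, s)) ** 0.5
```

Calling `sampled_matmul(a, b, samples, random.Random(0))` with `samples` well below the inner dimension does proportionally less multiply-add work than `exact_matmul(a, b)`, and `frobenius_distance` between the two results gives one possible "measure of correctness" to weigh against that saving.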

nathanielsimard • yesterday at 9:56 PM

One of the authors here, don't hesitate if you have any questions or comments!

semessier • today at 1:04 AM

Two years ago I had bet that matmul would by now run on transformer-optimized hardware costing a fraction of a GPU, with first-class support in torch and no reason to use GPUs any more. Wrong.

raphaelty • yesterday at 9:19 PM

Very interesting, willing to try burn

apitman • yesterday at 11:16 PM

Could something like this be done in WebGPU?

airstrike • yesterday at 10:47 PM

burn is awesome

almostgotcaught • yesterday at 10:16 PM

I'm sorry, this is a lowbrow comment, but this is the dumbest thing you can do in this space:

> Unit (thread in CUDA, invocation in Vulkan/Wgpu): the smallest execution entity performing computations.

> Plane (warp in CUDA, subgroup in Vulkan/Wgpu): a group of (typically 32) units executing in lockstep and able to share data efficiently through registers.

> Cube (thread block in CUDA, workgroup in Vulkan/Wgpu): a group of units that execute on the same SM, sharing memory and able to synchronize

It's already bad enough that the vendors themselves insisted on different names, but why in the bejesus would you rename these concepts and diverge from literally all existing naming conventions when you're providing middleware? I.e., when using your tool I'm still going to reference NVIDIA's or AMD's docs to understand how the hardware actually works. Like, do you really think otherwise, that your thing is gonna be the end of the line???

FYI the word warp isn't random technobabble but a very clever pun that fits very well conceptually:

https://en.m.wikipedia.org/wiki/Warp_and_weft
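
To make the three quoted terms concrete, here is a minimal sketch of how they nest, written as plain index arithmetic. The name table comes straight from the quoted definitions. The helper names are hypothetical, the plane size of 32 is only the "typical" value from the quote, and the assumption that planes are formed from consecutive linear unit indices matches how CUDA partitions a thread block into warps but is not guaranteed on every backend.

```python
# Generic term -> (CUDA name, Vulkan/Wgpu name), as given in the quoted text.
TERMS = {
    "unit": ("thread", "invocation"),
    "plane": ("warp", "subgroup"),
    "cube": ("thread block", "workgroup"),
}


def planes_per_cube(cube_size, plane_size=32):
    """Number of planes needed to cover a cube of `cube_size` units.

    Rounds up, because a final, partially filled plane still occupies a
    full plane's worth of execution lanes.
    """
    return -(-cube_size // plane_size)


def locate_unit(unit_index, cube_size, plane_size=32):
    """Return (plane_id, lane_id) for a unit's linear index within its cube.

    Assumes planes are made of consecutive linear indices (hypothetical
    for backends other than CUDA).
    """
    if not 0 <= unit_index < cube_size:
        raise ValueError("unit_index must lie inside the cube")
    return divmod(unit_index, plane_size)
```

For a cube of 256 units, `planes_per_cube(256)` counts the planes the cube splits into, and `locate_unit(70, 256)` tells you which plane unit 70 runs in and its position within that plane, which is the same bookkeeping a CUDA programmer does with thread, warp, and lane indices.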
