Interesting that OpenBLAS and MPS are reportedly nearly the same speed although the README sounds like only MPS uses the GPU.
I think that this is because the current code does a terrible job at taking the activations in the GPU and fusing the kernels. This is the next thing to fix in this implementation indeed.
I think that this is because the current code does a terrible job at taking the activations in the GPU and fusing the kernels. This is the next thing to fix in this implementation indeed.