Hacker News

lostmsu · yesterday at 12:48 PM

Comparison with a vanilla transformer of the same size/FLOPs budget?


Replies

Lerc · yesterday at 1:07 PM

I'm not sure if that is the right calculation.

Provided the FLOPs are not prohibitive, output quality per model byte might be the better metric. In general, people run the largest model they can.
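
To make that metric concrete, here's a toy budget-matching sketch (all numbers and names are illustrative, not from the paper): comparing "per model byte" means matching memory footprint rather than FLOPs.

```python
# Toy "per model byte" budget matching (illustrative numbers only).
BYTES_PER_PARAM = 2                       # assuming fp16 weights

def footprint_mib(n_params: float) -> float:
    return n_params * BYTES_PER_PARAM / 2**20

proposed = 30e6                           # the 30M model under discussion
print(f"byte budget: {footprint_mib(proposed):.0f} MiB")  # -> 57 MiB
# A fair per-byte comparison pits it against a ~30M-param vanilla model,
# even if the vanilla one spends fewer FLOPs per token.
```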

I certainly think trading speed for quality at the same size is worth looking at, especially if it uses methods that can benefit from the broader community's efforts to improve speed in general.

That said, the performance difference at 30M parameters may not be representative of the difference at 30B.

There are probably a lot of really good ideas out there waiting for someone to drop a few million on training to reveal how good they are at large scale.

oofbey · yesterday at 8:11 PM

The big bet with this technique is having a fixed (non-learned) matrix that converts the tokens' latent space into the linear-attention space. So you can kinda cheat and say your model is small, because a bunch of the smarts live in this big fixed graph Laplacian matrix L.
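
The thread doesn't give the actual architecture, so the following is only a minimal sketch of the shape of the idea: linear attention where queries and keys come from a frozen Laplacian-derived map and only the value projection is learned. All names, shapes, and the choice of a ring-graph Laplacian are assumptions.

```python
import torch

d_model, seq_len = 64, 128

# Fixed graph Laplacian L = D - A of a ring graph over the latent
# dimensions; a stand-in for whatever structured matrix the work
# actually uses. Frozen: it is never trained.
A = torch.diag(torch.ones(d_model - 1), diagonal=1)
A = A + A.T
A[0, -1] = A[-1, 0] = 1.0
L = torch.diag(A.sum(dim=1)) - A

W_v = torch.nn.Linear(d_model, d_model)  # the only learned projection here

def linear_attention(x: torch.Tensor) -> torch.Tensor:
    """Non-causal linear attention with a fixed featurization for q/k."""
    q = k = torch.relu(x @ L)            # fixed map into attention space
    v = W_v(x)                           # learned values
    kv = k.T @ v                         # (d, d) summary: O(n*d^2), not O(n^2)
    z = q @ k.sum(dim=0)                 # per-query normalizer
    return (q @ kv) / z.clamp(min=1e-6).unsqueeze(-1)

out = linear_attention(torch.randn(seq_len, d_model))
print(out.shape)                         # torch.Size([128, 64])
```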

So how do you scale this up from a toy problem? Well, that L would have to get bigger. And it's hard to imagine it being useful if L stays untrained. Then it starts to look a lot more like a conventional transformer, but probably harder to train, with the benefit of a smaller KV cache (half the size, which is not a massive win).
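
For the cache claim, a back-of-envelope calculation, assuming the fixed map makes keys recomputable so only values need caching (my reading of the comment, and my numbers):

```python
# "Half the size" KV-cache arithmetic (illustrative model dimensions).
d_model, n_layers, bytes_fp16 = 4096, 32, 2

vanilla_kv = 2 * n_layers * d_model * bytes_fp16   # cache K and V per token
values_only = 1 * n_layers * d_model * bytes_fp16  # cache V only

print(vanilla_kv // 1024, "KiB/token vs", values_only // 1024, "KiB/token")
# -> 512 KiB/token vs 256 KiB/token: exactly half, as the parent says.
```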

So overall it doesn't seem to me like it's gonna amount to anything.