I think this mostly comes down to (multi-headed) scaled dot-product attention being very easy to parallelize on GPUs. You can then make up for the (relative) lack of expressivity / flexibility by simply stacking layers.
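To make the parallelism point concrete, here's a minimal sketch of scaled dot-product attention (tensor names, shapes, and the toy sizes are my own illustration choices, not anything from the discussion above): the whole operation reduces to a couple of batched matmuls plus a softmax, which is exactly what GPUs are good at.

```python
# Minimal sketch of (multi-head) scaled dot-product attention in PyTorch.
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    d = q.shape[-1]
    # One batched matmul computes every query-key score for every head at once,
    # which is why this maps so well onto GPU hardware.
    scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, v)  # (batch, heads, seq_len, head_dim)

# Toy usage: 2 sequences, 4 heads, 16 tokens, 32-dim heads.
q = torch.randn(2, 4, 16, 32)
k = torch.randn(2, 4, 16, 32)
v = torch.randn(2, 4, 16, 32)
out = scaled_dot_product_attention(q, k, v)
```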
A neural-GP could probably be trained with the same parallelization efficiency via consistent discretization of the input space. I think the absence of neural-GPs owes more to the fact that discrete data (namely, text) has dominated AI applications. I imagine they could be extremely useful for scale-free interpolation of continuous data (e.g. images), or for other non-autoregressive generative models (scale-free diffusion?).
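A hedged sketch of what I mean by "consistent discretization" (the grid size, RBF kernel, and jitter term are arbitrary illustration choices, not a real neural-GP implementation): if continuous inputs are snapped to one shared grid, the GP kernel matrix over that grid is fixed and shared across the batch, so interpolation reduces to batched matmuls, the same primitive attention relies on.

```python
# Hypothetical sketch: GP-style interpolation over a fixed, shared grid.
import torch

grid = torch.linspace(0.0, 1.0, 64)                          # shared discretization
K = torch.exp(-(grid[:, None] - grid[None, :]) ** 2 / 0.02)  # RBF Gram matrix on the grid
K_inv = torch.linalg.inv(K + 1e-4 * torch.eye(64))           # small jitter for stability

def gp_interpolate(y, query):
    # y: (batch, 64) function values observed on the grid
    # query: (m,) new locations to interpolate at
    k_star = torch.exp(-(query[:, None] - grid[None, :]) ** 2 / 0.02)  # (m, 64)
    # Posterior mean as two batched matmuls; every example shares K_inv.
    return y @ K_inv @ k_star.T                               # (batch, m)

y = torch.sin(6.28 * grid).expand(8, -1)                      # toy batch of signals
pred = gp_interpolate(y, torch.rand(10))
```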