> But on a tangent, why do you believe in mixture of experts?
The fact that most big SoTA models use MoE is certainly a strong argument in its favor. They are harder to train, but the efficiency gains seem to be worth it.
> Everything I know about them makes me believe they're a dead end architecturally.
Something better will come along eventually, but I do not think we need much architectural change to reach consumer-grade AI. Someone just has to come up with the right loss function for training, then one of the major research labs trains a large model with it, and we are set.
I just checked Google Scholar for a paper with a title like "Temporally Persistent Mixture of Experts" and could not find one yet, but the idea seems straightforward, so it will probably show up soon.