Hacker News

athrowaway3z · today at 10:57 AM · 2 replies

I agree the parent is a bit too pessimistic, especially because we care about logical skills and context size more than remembering random factoids.

But on a tangent, why do you believe in mixture of experts?

Everything I know about them makes me believe they're a dead end architecturally.


Replies

xml · today at 12:05 PM

> But on a tangent, why do you believe in mixture of experts?

The fact that all big SoTA models use MoE is certainly a strong reason. They are more difficult to train, but the efficiency gains seem to be worth it.
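The efficiency gain comes from sparse activation: the router sends each token to only a few of the experts, so compute per token scales with the number of selected experts while total capacity scales with the number of experts. A minimal sketch of top-k routing (all sizes and weights are illustrative, not from any particular model):

```python
import numpy as np

rng = np.random.default_rng(0)

D, H = 8, 16               # model dim, expert hidden dim (toy sizes)
NUM_EXPERTS, TOP_K = 4, 2  # illustrative: compute scales with TOP_K, not NUM_EXPERTS

# Each expert is a small two-layer ReLU MLP with random weights.
experts = [(rng.standard_normal((D, H)), rng.standard_normal((H, D)))
           for _ in range(NUM_EXPERTS)]
router = rng.standard_normal((D, NUM_EXPERTS))  # learned gate in a real model

def moe_forward(x):
    """Route token vector x to its top-k experts and mix their outputs."""
    logits = x @ router
    top = np.argsort(logits)[-TOP_K:]      # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()               # softmax over the selected experts only
    out = np.zeros_like(x)
    for w, i in zip(weights, top):
        w1, w2 = experts[i]
        out += w * (np.maximum(x @ w1, 0) @ w2)  # run only the chosen experts
    return out

token = rng.standard_normal(D)
y = moe_forward(token)
```

Here only 2 of the 4 expert MLPs execute per token, which is the whole trick: a model can hold many experts' worth of parameters while paying the FLOPs of a much smaller dense model.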

> Everything I know about them makes me believe they're a dead end architecturally.

Something better will come around eventually, but I do not think that we need much change in architecture to achieve consumer-grade AI. Someone just has to come up with the right loss function for training, then one of the major research labs has to train a large model with it and we are set.

I just checked Google Scholar for a paper with a title like "Temporally Persistent Mixture of Experts" and could not find it yet, but the idea seems straightforward, so it will probably show up soon.

amelius · today at 11:37 AM

> But on a tangent, why do you believe in mixture of experts?

With a hardware inference approach you can do tens of thousands of tokens per second and run your agents in a breadth-first style. It is all very simple conceptually, and not more than a few years away.