alecco • today at 11:56 AM

Yeah, not a great apples-to-apples comparison.

I think the point stands, though: there's MoE, a myriad of complex attention approaches, shared layers, you name it. And making it all work together well is a huge trial-and-error pain even for small models, never mind getting to efficient hardware utilization.
