alecco, today at 11:56 AM

Yeah, not a great apples-to-apples comparison.

I think the point stands: there's MoE, a myriad of complex attention approaches, shared layers, you name it. Making it all work together well is a huge trial-and-error pain even for small models, never mind getting to efficient hardware utilization.